Efficient Parallel Algorithms
Alexander Tiskin
Department of Computer Science, University of Warwick
http://go.warwick.ac.uk/alextiskin
Contents:
1 Computation by circuits
2 Parallel computation models
3 Basic parallel algorithms
4 Further parallel algorithms
5 Parallel matrix algorithms
6 Parallel graph algorithms
Computation by circuits: Computation models and algorithms

Model: an abstraction of reality allowing qualitative and quantitative reasoning
E.g. atom, galaxy, biological cell, Newton's universe, Einstein's universe...

Computation model: an abstract computing device used to reason about computations and algorithms
E.g. scales+weights, Turing machine, von Neumann machine ("ordinary computer"), JVM, quantum computer...

An algorithm in a specific model: input → (computation steps) → output
The input/output encoding must be specified

Algorithm complexity (worst-case): T(n) = max over inputs of size n of the number of computation steps
Computation by circuits: Computation models and algorithms

Algorithm complexity depends on the model
E.g. sorting n items:
- Ω(n log n) in the comparison model
- O(n) in the arithmetic model (by radix sort)
E.g. factoring large numbers:
- hard in a von Neumann-type (standard) model
- not so hard on a quantum computer
E.g. deciding if a program halts on a given input:
- impossible in a standard (or even quantum) model
- can be added to the standard model as an oracle, to create a more powerful model
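For instance, the O(n) arithmetic-model bound is achieved by least-significant-digit radix sort, which makes one O(n) bucket pass per digit. A minimal sketch (the function name and the base parameter are illustrative, not from the slides):

```python
def radix_sort(xs, base=10):
    """LSD radix sort of non-negative integers: one stable bucket pass per digit."""
    xs = list(xs)
    digits = 1
    while xs and base ** digits <= max(xs):
        digits += 1
    for d in range(digits):
        buckets = [[] for _ in range(base)]
        for x in xs:
            buckets[x // base ** d % base].append(x)
        xs = [x for bucket in buckets for x in bucket]  # stable concatenation
    return xs
```

For a fixed number of digits, each pass is O(n), so the whole sort is O(n) arithmetic operations.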
Computation by circuits: The circuit model

Basic special-purpose parallel model: a circuit
[Example circuit diagrams; node labels include a, b, a² + 2ab + b², a² − b², x², 2xy, y², x + y + z, x − y]

Directed acyclic graph (dag)
Fixed number of inputs/outputs
Oblivious computation: control sequence independent of the input
Computation by circuits: The circuit model

Bounded or unbounded fan-in/fan-out
Elementary operations: arithmetic/Boolean/comparison, each (usually) constant time
size = number of nodes
depth = maximum path length from input to output
Timed circuits with feedback: systolic arrays
Computation by circuits: The comparison network model

A comparison network is a circuit of comparator nodes
[A comparator node takes inputs x, y and outputs min(x, y), max(x, y)]
The input and output sequences have the same length
Examples: [two networks on n = 4 wires: one of size 5, depth 3; one of size 6, depth 3]
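A comparison network can be represented simply as an ordered list of wire pairs; applying it, and computing its size and depth, then takes a few lines. A sketch (the representation and names are illustrative; the 5-comparator network below is the standard size-5, depth-3 sorting network on 4 wires):

```python
def apply_network(comparators, xs):
    """Run a comparison network: each comparator puts the min on the
    first wire of the pair and the max on the second."""
    xs = list(xs)
    for i, j in comparators:
        if xs[i] > xs[j]:
            xs[i], xs[j] = xs[j], xs[i]
    return xs

def network_depth(comparators, n):
    """Depth = longest chain of comparators along any wire-to-wire path."""
    finish = [0] * n                   # time at which each wire's value is ready
    for i, j in comparators:
        t = max(finish[i], finish[j]) + 1
        finish[i] = finish[j] = t
    return max(finish)

# a sorting network on n = 4 wires: size 5, depth 3
NET4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
```

Here `apply_network(NET4, [3, 1, 4, 1])` yields `[1, 1, 3, 4]`, and `network_depth(NET4, 4)` yields 3, matching the example above.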
Computation by circuits: The comparison network model

A merging network is a comparison network that takes two sorted input sequences of lengths n′, n″, and produces a sorted output sequence of length n = n′ + n″
A sorting network is a comparison network that takes an arbitrary input sequence, and produces a sorted output sequence
A sorting (or merging) network is equivalent to an oblivious sorting (or merging) algorithm; the network's size/depth determine the algorithm's sequential/parallel complexity
General merging: O(n) comparisons, non-oblivious
General sorting: O(n log n) comparisons by mergesort, non-oblivious
What is the complexity of oblivious sorting?
Computation by circuits: The zero-one principle

Zero-one principle: a comparison network is sorting if and only if it sorts all input sequences of 0s and 1s

Proof. "Only if": trivial. "If": by contradiction.
Assume a given network does not sort input x = x1, ..., xn:
x1, ..., xn → y1, ..., yn with ∃k, l: k < l: yk > yl
Let Xi = 0 if xi < yk, and Xi = 1 if xi ≥ yk; run the network on input X = X1, ..., Xn
For all i, j we have xi ≤ xj ⇒ Xi ≤ Xj, therefore each Xi follows the same path through the network as xi:
X1, ..., Xn → Y1, ..., Yn with Yk = 1 > 0 = Yl
We have k < l but Yk > Yl, so the network does not sort 0s and 1s either
Computation by circuits: The zero-one principle

The zero-one principle applies to sorting, merging and other comparison problems (e.g. selection)
It allows one to test:
- a sorting network by checking only 2^n input sequences, instead of a much larger number n! ≈ (n/e)^n
- a merging network by checking only (n′ + 1) · (n″ + 1) pairs of input sequences, instead of an exponentially larger number, the binomial coefficient C(n, n′)
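A sketch of such a test, assuming a network is given as an ordered list of wire pairs (a hypothetical representation, not fixed by the slides): by the zero-one principle it suffices to check the 2^n binary inputs rather than all n! orderings:

```python
from itertools import product

def apply_network(comparators, xs):
    """Apply each comparator in order: min to the first wire, max to the second."""
    xs = list(xs)
    for i, j in comparators:
        if xs[i] > xs[j]:
            xs[i], xs[j] = xs[j], xs[i]
    return xs

def is_sorting_network(comparators, n):
    """Zero-one principle: the network sorts everything
    iff it sorts all 2^n inputs of 0s and 1s."""
    return all(apply_network(comparators, list(bits)) == sorted(bits)
               for bits in product((0, 1), repeat=n))
```

For example, the size-5 network on 4 wires passes the test, while dropping its final comparator makes it fail (the binary input 0, 1, 1, 0 is left unsorted).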
Computation by circuits: Efficient merging and sorting networks

General merging: O(n) comparisons, non-oblivious
How fast can we merge obliviously?
x1 ≤ ... ≤ xn′ , y1 ≤ ... ≤ yn″ → z1 ≤ ... ≤ zn

Odd-even merging
When n′ = n″ = 1, compare (x1, y1); otherwise
- merge x1, x3, ... and y1, y3, ... → u1 ≤ u2 ≤ ... ≤ u_{⌈n′/2⌉+⌈n″/2⌉}
- merge x2, x4, ... and y2, y4, ... → v1 ≤ v2 ≤ ... ≤ v_{⌊n′/2⌋+⌊n″/2⌋}
- compare pairwise: (u2, v1), (u3, v2), ...

size_OEM(n′, n″) ≤ 2 · size_OEM(n′/2, n″/2) + O(n) = O(n log n)
depth_OEM(n′, n″) ≤ depth_OEM(n′/2, n″/2) + 1 = O(log n)
Computation by circuits: Efficient merging and sorting networks

OEM(n′, n″), n′ ≤ n″:
[diagram: the odd-indexed and even-indexed inputs are fed to two recursive networks OEM(⌈n′/2⌉, ⌈n″/2⌉) and OEM(⌊n′/2⌋, ⌊n″/2⌋), whose outputs are combined by a final stage of comparators]
size O(n log n), depth O(log n)

Example: OEM(4, 4), size 9, depth 3
Computation by circuits: Efficient merging and sorting networks

Correctness proof of odd-even merging (sketch): by induction and the zero-one principle
Induction base: trivial (2 inputs, 1 comparator)
Inductive step. By the inductive hypothesis, we have for all k, l:
- 0^⌈k/2⌉ 11... , 0^⌈l/2⌉ 11... → 0^{⌈k/2⌉+⌈l/2⌉} 11...
- 0^⌊k/2⌋ 11... , 0^⌊l/2⌋ 11... → 0^{⌊k/2⌋+⌊l/2⌋} 11...
We need 0^k 11... , 0^l 11... → 0^{k+l} 11...
(⌈k/2⌉ + ⌈l/2⌉) − (⌊k/2⌋ + ⌊l/2⌋) =
- 0 or 1: result already sorted: 0^{k+l} 11...
- 2: a single pair is wrong: 0^{k+l−1} 1011...
The final stage of comparators corrects the wrong pair
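The case analysis rests on the arithmetic fact that the difference between the "ceiling" and "floor" zero counts is always 0, 1 or 2 (0 when k and l are both even, 2 when both are odd, 1 otherwise). A throwaway numerical check:

```python
from math import ceil, floor

# difference between the zero counts of the two recursive merges,
# for all small k (zeros in one input) and l (zeros in the other)
deltas = {(ceil(k / 2) + ceil(l / 2)) - (floor(k / 2) + floor(l / 2))
          for k in range(100) for l in range(100)}
assert deltas == {0, 1, 2}
```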
Computation by circuits: Efficient merging and sorting networks

Sorting an arbitrary input x1, ..., xn
Odd-even merge sorting [Batcher: 1968]
When n = 1 we are done; otherwise
- sort x1, ..., x⌈n/2⌉ recursively
- sort x⌈n/2⌉+1, ..., xn recursively
- merge the results by OEM(⌈n/2⌉, ⌊n/2⌋)

size_OEMSORT(n) ≤ 2 · size_OEMSORT(n/2) + size_OEM(n) = 2 · size_OEMSORT(n/2) + O(n log n) = O(n (log n)²)
depth_OEMSORT(n) ≤ depth_OEMSORT(n/2) + depth_OEM(n) = depth_OEMSORT(n/2) + O(log n) = O((log n)²)
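The two recursions (merge, then sort) can be sketched at the level of values rather than comparators; the compare-exchanges below follow the oblivious pattern described above. A Python sketch (names illustrative):

```python
def oem(xs, ys):
    """Odd-even merge of two sorted sequences (Batcher's recursion)."""
    if not xs or not ys:
        return list(xs) + list(ys)
    if len(xs) == len(ys) == 1:
        return [min(xs[0], ys[0]), max(xs[0], ys[0])]
    u = oem(xs[0::2], ys[0::2])        # merge the odd-indexed elements
    v = oem(xs[1::2], ys[1::2])        # merge the even-indexed elements
    z = [u[0]]                         # u1 goes straight through
    for i in range(len(v)):            # compare (u2, v1), (u3, v2), ...
        if i + 1 < len(u):
            z += [min(u[i + 1], v[i]), max(u[i + 1], v[i])]
        else:
            z.append(v[i])
    z += u[len(v) + 1:]                # any trailing elements of u pass through
    return z

def oem_sort(xs):
    """Odd-even merge sorting: sort the two halves recursively, then OEM."""
    if len(xs) <= 1:
        return list(xs)
    mid = (len(xs) + 1) // 2           # ceil(n/2)
    return oem(oem_sort(xs[:mid]), oem_sort(xs[mid:]))
```

Since the compare-exchange pattern depends only on the input lengths, not on the values, this mirrors an oblivious network; by the zero-one principle, exhaustively testing it on binary inputs validates it for all inputs.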
Computation by circuits: Efficient merging and sorting networks

OEM-SORT(n):
[diagram: two recursive networks OEM-SORT(⌈n/2⌉) and OEM-SORT(⌊n/2⌋), followed by OEM(⌈n/2⌉, ⌊n/2⌋)]
size O(n (log n)²), depth O((log n)²)

Example: OEM-SORT(8), size 19, depth 6
Computation by circuits: Efficient merging and sorting networks

A bitonic sequence: x1 ≥ ... ≥ xm ≤ ... ≤ xn
Bitonic merging: sorting a bitonic sequence
When n = 1 we are done; otherwise
- sort the bitonic subsequence x1, x3, ... recursively
- sort the bitonic subsequence x2, x4, ... recursively
- compare pairwise: (x1, x2), (x3, x4), ...
Correctness proof: by the zero-one principle (exercise)
(Note: cannot exchange ≥ and ≤ in the definition of bitonic!)
Bitonic merging is more flexible than odd-even merging, since a single circuit applies to all values of m
size_BM(n) = O(n log n)   depth_BM(n) = O(log n)
Computation by circuits: Efficient merging and sorting networks

BM(n):
[diagram: two recursive networks BM(⌈n/2⌉) and BM(⌊n/2⌋) on the odd- and even-indexed inputs, followed by a stage of pairwise comparators]
size O(n log n), depth O(log n)

Example: BM(8), size 12, depth 3
Computation by circuits: Efficient merging and sorting networks

Bitonic merge sorting [Batcher: 1968]
When n = 1 we are done; otherwise
- sort x1, ..., x⌈n/2⌉ → y1 ≥ ... ≥ y⌈n/2⌉ in reverse, recursively
- sort x⌈n/2⌉+1, ..., xn → y⌈n/2⌉+1 ≤ ... ≤ yn recursively
- sort the bitonic sequence y1 ≥ ... ≥ ym ≤ ... ≤ yn, where m = ⌈n/2⌉ or ⌈n/2⌉ + 1

Sorting in reverse seems to require "inverted comparators", however
- comparators are actually nodes in a circuit, which can always be drawn using "standard comparators"
- a network drawn with "inverted comparators" can be converted into one with only "standard comparators" by a top-down rearrangement

size_BM-SORT(n) = O(n (log n)²)   depth_BM-SORT(n) = O((log n)²)
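As with odd-even merging, the bitonic recursions can be sketched at value level (a sketch; names illustrative). `bitonic_merge` assumes its input is bitonic, decreasing then increasing, exactly as defined above:

```python
def bitonic_merge(xs):
    """Sort a bitonic sequence x1 >= ... >= xm <= ... <= xn."""
    n = len(xs)
    if n == 1:
        return list(xs)
    u = bitonic_merge(xs[0::2])        # odd-indexed subsequence, also bitonic
    v = bitonic_merge(xs[1::2])        # even-indexed subsequence, also bitonic
    z = []
    for a, b in zip(u, v):             # compare (x1, x2), (x3, x4), ...
        z += [min(a, b), max(a, b)]
    if n % 2:
        z.append(u[-1])                # odd n: the last element is uncompared
    return z

def bitonic_sort(xs):
    """Sort the first half in reverse, the second half forwards, then bitonic-merge."""
    if len(xs) <= 1:
        return list(xs)
    mid = (len(xs) + 1) // 2           # ceil(n/2)
    left = bitonic_sort(xs[:mid])[::-1]   # decreasing prefix
    right = bitonic_sort(xs[mid:])        # increasing suffix
    return bitonic_merge(left + right)    # the concatenation is bitonic
```

The reversal `[::-1]` stands in for the "inverted comparators" discussed above; in an actual circuit it would be eliminated by the top-down rearrangement.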
Computation by circuits: Efficient merging and sorting networks

BM-SORT(n):
[diagram: two recursive networks BSORT(⌈n/2⌉) (sorting in reverse) and BSORT(⌊n/2⌋), followed by BM(n)]
size O(n (log n)²), depth O((log n)²)

Example: BM-SORT(8), size 24, depth 6
Computation by circuits: Efficient merging and sorting networks

Both OEM-SORT and BM-SORT have size Θ(n (log n)²)
Is it possible to sort obliviously in size o(n (log n)²)? O(n log n)?

AKS sorting [Ajtai, Komlós, Szemerédi: 1983]
Sorting network: size O(n log n), depth O(log n)
Uses sophisticated graph theory (expanders)
Asymptotically optimal, but has huge constant factors
Parallel computation models: The PRAM model

Parallel Random Access Machine (PRAM) [Fortune, Wyllie: 1978]
Simple, idealised general-purpose parallel model
[diagram: processors P0, P1, P2, ... connected to a shared MEMORY]
Contains
- an unlimited number of processors (1 time unit/operation)
- global shared memory (1 time unit/access)
Operates in full synchrony
Parallel computation models: The PRAM model

PRAM computation: a sequence of parallel steps
Communication and synchronisation taken for granted
Not scalable in practice!
PRAM variants:
- concurrent/exclusive read
- concurrent/exclusive write
CRCW, CREW, EREW, (ERCW) PRAM
E.g. a linear system solver: O((log n)²) steps using n⁴ processors :-O
PRAM algorithm design: minimising the number of steps, sometimes also the number of processors
Parallel computation models: The BSP model

Bulk-Synchronous Parallel (BSP) computer [Valiant: 1990]
Simple, realistic general-purpose parallel model
[diagram: processors P0, ..., P(p−1), each with local memory M, attached to a communication environment with parameters (g, l)]
Contains
- p processors, each with local memory (1 time unit/operation)
- a communication environment, including a network and an external memory (g time units/data unit communicated)
- a barrier synchronisation mechanism (l time units/synchronisation)
Parallel computation models: The BSP model

Some elements of a BSP computer can be emulated by others, e.g.
- external memory by local memory + communication
- the barrier synchronisation mechanism by the network
Parameter g corresponds to the network's communication gap (inverse bandwidth): the time for a data unit to enter/exit the network
Parameter l corresponds to the network's latency: the worst-case time for a data unit to get across the network
Every parallel computer can be (approximately) described by the parameters p, g, l
E.g. for a Cray T3E: p = 64, g ≈ 78, l ≈ 1825
Parallel computation models: The BSP model

[Figure: measured times of h-relations on a 64-processor Cray T3E, with a least-squares fit; the fitted line yields the benchmarked BSP parameters g and l]
Parallel computation models: The BSP model

BSP computation: a sequence of parallel supersteps
[diagram: processors 0, ..., p−1 computing asynchronously within each superstep, separated by barrier synchronisations]
Asynchronous computation/communication within supersteps (includes data exchange with the external memory)
Synchronisation before/after each superstep
Cf. CSP: a parallel collection of sequential processes
Parallel computation models: The BSP model

Compositional cost model
For an individual processor proc in superstep sstep:
- comp(sstep, proc): the amount of local computation and local memory operations by processor proc in superstep sstep
- comm(sstep, proc): the amount of data sent and received by processor proc in superstep sstep
For the whole BSP computer in one superstep sstep:
- comp(sstep) = max_{0 ≤ proc < p} comp(sstep, proc)
- comm(sstep) = max_{0 ≤ proc < p} comm(sstep, proc)
- cost(sstep) = comp(sstep) + comm(sstep) · g + l
Parallel computation models: The BSP model

For the whole BSP computation with sync supersteps:
- comp = Σ_{0 ≤ sstep < sync} comp(sstep)
- comm = Σ_{0 ≤ sstep < sync} comm(sstep)
- cost = Σ_{0 ≤ sstep < sync} cost(sstep) = comp + comm · g + sync · l
The input/output data are stored in the external memory; the cost of input/output is included in comm
E.g. for a particular linear system solver with an n × n matrix:
comp O(n³/p)   comm O(n²/p^{1/2})   sync O(p^{1/2})
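The cost formulas translate directly into code. A sketch (names illustrative): given one (comp, comm) pair per processor for each superstep, take maxima within each superstep and sum across supersteps:

```python
def superstep_cost(work, g, l):
    """work: one (comp, comm) pair per processor for a single superstep.
    cost(sstep) = comp(sstep) + comm(sstep) * g + l"""
    comp = max(c for c, m in work)     # slowest processor's computation
    comm = max(m for c, m in work)     # heaviest processor's communication
    return comp + comm * g + l

def bsp_cost(supersteps, g, l):
    """Total cost = comp + comm * g + sync * l, summed over supersteps."""
    return sum(superstep_cost(work, g, l) for work in supersteps)
```

For example, with p = 2 processors and two supersteps, `bsp_cost([[(10, 2), (8, 4)], [(5, 1), (7, 3)]], g=2, l=100)` gives (10 + 4·2 + 100) + (7 + 3·2 + 100) = 231.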
Parallel computation models: The BSP model

BSP computation: scalable, portable, predictable
BSP algorithm design: minimising comp, comm, sync
Main principles:
- load balancing minimises comp
- data locality minimises comm
- coarse granularity minimises sync
Data locality exploited, network locality ignored!
Typically, problem size n ≫ p (slackness)
Parallel computation models: Network routing

BSP network model: a complete graph, uniformly accessible (access efficiency described by the parameters g, l)
Has to be implemented on concrete networks
Parameters of a network topology (i.e. the underlying graph):
- degree: the number of links per node
- diameter: the maximum distance between nodes
Low degree: easier to implement
Low diameter: more efficient
Parallel computation models: Network routing

Network     Degree   Diameter
1D array    2        1/2 · p
2D array    4        p^{1/2}
3D array    6        3/2 · p^{1/3}
Butterfly   4        log p
Hypercube   log p    log p
...         ...      ...

BSP parameters g, l depend on degree, diameter, routing strategy
Assume store-and-forward routing (alternative: wormhole)
Assume distributed routing: no global control
Oblivious routing: the path is determined only by the source and destination
E.g. greedy routing: a packet always takes a shortest path
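The hypercube row of the table can be checked directly for small p: each node of a d-dimensional hypercube (p = 2^d nodes) has degree d = log p, and a breadth-first search from node 0 (which, by symmetry, has eccentricity equal to the diameter) confirms the diameter is also d. A small sketch:

```python
from collections import deque

def hypercube_diameter(d):
    """BFS from node 0 of the d-dimensional hypercube (2^d nodes);
    by symmetry, the eccentricity of node 0 equals the diameter."""
    dist = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for i in range(d):             # degree = d = log p neighbours
            v = u ^ (1 << i)           # flip one address bit
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())
```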
Parallel computation models: Network routing

h-relation (h-superstep): every processor sends and receives ≤ h packets
It is sufficient to consider permutations (1-relations): once we can route any permutation in k steps, we can route any h-relation in hk steps
Any routing method may be forced to make Ω(diameter) steps
Any oblivious routing method may be forced to make Ω(p^{1/2}/degree) steps
Many practical patterns force such "hot spots" on traditional networks
Parallel computation modelsNetwork routingRouting based on sorting networksEach processor corresponds to a wireEach link corresponds to (possibly several) comparatorsRouting corresponds to sorting by destination addressEach stage of routing corresponds to a stage of sortingSuch routing is non-oblivious (for individual packets)!Network Degree DiameterOEM-SORT/BM-SORT O (log p)2 O (log p)2AKS O(log p) O(log p)No “hot spots”: can always route a permutation in O(diameter ) stepsRequires a specialised network, too messy and impractical Alexander Tiskin (Warwick) Eﬃcient Parallel Algorithms 41 / 174
Parallel computation models
Network routing

Two-phase randomised routing [Valiant: 1980]:
- send every packet to a random intermediate destination
- forward every packet to its final destination
Both phases oblivious (e.g. greedy), but non-oblivious overall due to randomness
Hot spots very unlikely: on a 2D array, butterfly, hypercube, can route a permutation in O(diameter) steps with high probability
On a hypercube, the same holds even for a log p-relation
Hence constant g, l in the BSP model
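As an illustration, Valiant's two phases can be sketched on a hypercube, with greedy bit-fixing as the oblivious routing method (a minimal Python sketch; the function names are illustrative, not part of the slides):

```python
import random

def bit_fix_path(src, dst, d):
    """Greedy (oblivious) routing on a d-dimensional hypercube:
    fix the differing address bits one by one; path length is at most d."""
    path, cur = [src], src
    for k in range(d):
        bit = 1 << k
        if (cur ^ dst) & bit:   # bit k differs: traverse that hypercube edge
            cur ^= bit
            path.append(cur)
    return path

def two_phase_path(src, dst, d, rng=random):
    """Valiant's trick: route greedily to a random intermediate node,
    then greedily on to the final destination."""
    mid = rng.randrange(1 << d)
    return bit_fix_path(src, mid, d) + bit_fix_path(mid, dst, d)[1:]

# each hop of a greedy path flips exactly one address bit
assert bit_fix_path(0, 15, 4) == [0, 1, 3, 7, 15]
```

The randomisation breaks obliviousness only through the choice of the intermediate node; both phases individually remain greedy.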
Parallel computation models
Network routing

BSP implementation: processes placed at random, communication delayed until end of superstep
All packets with same source and destination sent together, hence message overhead absorbed in l

Network     g            l
1D array    O(p)         O(p)
2D array    O(p^{1/2})   O(p^{1/2})
3D array    O(p^{1/3})   O(p^{1/3})
Butterfly   O(log p)     O(log p)
Hypercube   O(1)         O(log p)
···         ···          ···

Actual values of g, l obtained by running benchmarks
1 Computation by circuits
2 Parallel computation models
3 Basic parallel algorithms
4 Further parallel algorithms
5 Parallel matrix algorithms
6 Parallel graph algorithms
Basic parallel algorithms
Broadcast/combine

The broadcasting problem:
- initially, one designated processor holds a value a
- at the end, every processor must hold a copy of a
The combining problem (complementary to broadcasting):
- initially, every processor holds a value a_i, 0 ≤ i < p
- at the end, one designated processor must hold a_0 • ··· • a_{p−1} for a given associative operator • (e.g. +)
By symmetry, we only need to consider broadcasting
Basic parallel algorithms
Broadcast/combine

Direct broadcast: designated processor makes p − 1 copies of a and sends them directly to destinations

(diagram: the designated processor sending a copy of a to every other processor)

comp O(p)   comm O(p)   sync O(1)

(from now on, cost components will be shaded when they are optimal, i.e. cannot be improved under reasonable assumptions)
Basic parallel algorithms
Broadcast/combine

Binary tree broadcast:
- initially, only the designated processor is awake
- processors are woken up in log p rounds
- in every round, every awake processor makes a copy of a and sends it to a sleeping processor, waking it up
In round k = 0, …, log p − 1, the number of awake processors is 2^k

comp O(log p)   comm O(log p)   sync O(log p)
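The doubling of the awake set can be checked with a small sketch (illustrative Python, not part of the slides):

```python
def tree_broadcast_rounds(p):
    """Count the rounds of binary tree broadcast: in each round every
    awake processor wakes one sleeping processor, doubling the awake set."""
    awake, rounds = 1, 0           # only the designated processor is awake
    while awake < p:
        awake = min(2 * awake, p)  # each awake processor sends one copy of a
        rounds += 1
    return rounds                  # = ceil(log2 p) rounds in total

assert tree_broadcast_rounds(16) == 4
assert tree_broadcast_rounds(5) == 3
```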
Basic parallel algorithms
Broadcast/combine

The array broadcasting/combining problem: broadcast/combine an array of size n ≥ p elementwise
(effectively, n independent instances of broadcasting/combining)
Basic parallel algorithms
Broadcast/combine

Two-phase array broadcast:
- partition array into p blocks of size n/p
- scatter blocks, then total-exchange blocks

(diagram: an array of blocks A, B, C, D is scattered across four processors, which then exchange blocks so that every processor holds A, B, C, D)

comp O(n)   comm O(n)   sync O(1)
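A sketch of the two phases on a single machine (illustrative Python; plain lists stand in for the per-processor memories):

```python
def two_phase_broadcast(array, p):
    """Sketch of two-phase array broadcast: the designated processor
    scatters one block of size n/p to each processor; a total exchange
    then gives every processor all p blocks, i.e. the whole array."""
    n = len(array)
    # Phase 1 (scatter): block i goes to processor i
    blocks = [array[i * n // p:(i + 1) * n // p] for i in range(p)]
    # Phase 2 (total exchange): every processor sends its block to all
    # others; each processor reassembles the array from the p blocks
    return [[x for blk in blocks for x in blk] for _ in range(p)]

copies = two_phase_broadcast(list(range(8)), 4)
assert all(c == list(range(8)) for c in copies)
```

Each phase moves O(n) data in one superstep, which is how the O(n) comm and O(1) sync costs arise.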
Basic parallel algorithms
Balanced tree and prefix sums

The balanced binary tree dag tree(n):
- a generalisation of broadcasting/combining
- can be defined top-down (root the input, leaves the outputs) or bottom-up
1 input, n outputs; size n − 1, depth log n

(diagram: a balanced binary tree over leaves b_0, …, b_{n−1})

Sequential work O(n)
Basic parallel algorithms
Balanced tree and prefix sums

Parallel balanced tree computation
From now on, we always assume that a problem's input/output is stored in the external memory
Partition tree(n) into:
- one top block, isomorphic to tree(p)
- a bottom layer of p blocks, each isomorphic to tree(n/p)

(diagram: tree(p) on top, feeding p copies of tree(n/p))
Basic parallel algorithms
Balanced tree and prefix sums

Parallel balanced tree computation (contd.)
- a designated processor is assigned the top block; the processor reads the input from external memory, computes the block, and writes the p outputs back to external memory
- every processor is assigned a different bottom block; a processor reads the input from external memory, computes the block, and writes the n/p outputs back to external memory
For bottom-up computation, reverse the steps

n ≥ p^2:   comp O(n/p)   comm O(n/p)   sync O(1)
Basic parallel algorithms
Balanced tree and prefix sums

The described parallel balanced tree algorithm is fully optimal:
- comp O(n/p) = O(sequential work / p), optimal
- comm O(n/p) = O(input/output size / p), optimal
- sync O(1), optimal
For other problems, we may not be so lucky. However, we are typically interested in algorithms that are optimal in comp (under reasonable assumptions). Optimality in comm and sync is considered relative to that.
For example, we are not allowed to run the whole computation in a single processor, sacrificing comp and comm to guarantee optimal sync O(1)!
Basic parallel algorithms
Balanced tree and prefix sums

Let • be an associative operator, computable in time O(1): a • (b • c) = (a • b) • c
E.g. numerical +, ·, min, …

The prefix sums problem:
a_0, a_1, a_2, …, a_{n−1}  →  a_0, a_0 • a_1, a_0 • a_1 • a_2, …, a_0 • a_1 • ··· • a_{n−1}

Sequential work O(n)
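Sequentially, prefix sums for any associative operator take a single pass (illustrative Python sketch):

```python
from operator import add

def prefix_sums(a, op=add):
    """One-pass prefix sums for an associative operator op: O(n) work.
    out[i] = a[0] op a[1] op ... op a[i]."""
    out, acc = [], None
    for x in a:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

assert prefix_sums([1, 2, 3, 4]) == [1, 3, 6, 10]
assert prefix_sums([3, 1, 2], op=min) == [3, 1, 1]
```

This sequential loop is inherently serial; the point of the prefix circuit below is to trade it for O(log n) depth at the same O(n) size.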
Basic parallel algorithms
Balanced tree and prefix sums

The prefix circuit [Ladner, Fischer: 1980]

(diagram: prefix(n) built recursively — adjacent inputs are combined into the pairwise sums a_{0:1}, a_{2:3}, …; prefix(n/2) is applied to these, producing a_{0:1}, a_{0:3}, …, a_{0:n−1}; combining with the even-indexed inputs then gives the remaining prefixes a_0, a_{0:2}, a_{0:4}, …)

where a_{k:l} = a_k • a_{k+1} • ··· • a_l, and "∗" is a dummy value
The underlying dag is called the prefix dag
Basic parallel algorithms
Balanced tree and prefix sums

Parallel prefix computation
The dag prefix(n) consists of:
- a dag similar to bottom-up tree(n), but with an extra output per node (total n inputs, n outputs)
- a dag similar to top-down tree(n), but with an extra input per node (total n inputs, n outputs)
Both trees can be computed by the previous algorithm. Extra inputs/outputs are absorbed into the O(n/p) communication cost.

n ≥ p^2:   comp O(n/p)   comm O(n/p)   sync O(1)
Basic parallel algorithms
Balanced tree and prefix sums

Application: binary addition via Boolean logic
x + y = z
Let x = x_{n−1}, …, x_0; y = y_{n−1}, …, y_0; z = z_n, z_{n−1}, …, z_0 be the binary representations of x, y, z
The problem: given x_i, y_i, compute z_i using bitwise ∧ ("and"), ∨ ("or"), ⊕ ("xor")
Let c = c_{n−1}, …, c_0, where c_i is the i-th carry bit
We have: x_i + y_i + c_{i−1} = z_i + 2c_i, 0 ≤ i < n
Basic parallel algorithms
Balanced tree and prefix sums

x + y = z
Let u_i = x_i ∧ y_i, v_i = x_i ⊕ y_i, 0 ≤ i < n
Arrays u = u_{n−1}, …, u_0 and v = v_{n−1}, …, v_0 can be computed in size O(n) and depth O(1)

z_0 = v_0                       c_0 = u_0
z_1 = v_1 ⊕ c_0                 c_1 = u_1 ∨ (v_1 ∧ c_0)
···                             ···
z_{n−1} = v_{n−1} ⊕ c_{n−2}     c_{n−1} = u_{n−1} ∨ (v_{n−1} ∧ c_{n−2})
z_n = c_{n−1}

Resulting circuit has size and depth O(n). Can we do better?
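The recurrences above can be checked directly (illustrative Python; bit 0 is the least significant bit):

```python
def ripple_add(xbits, ybits):
    """Carry recurrence from the slides: u_i = x_i AND y_i, v_i = x_i XOR y_i,
    z_i = v_i XOR c_{i-1}, c_i = u_i OR (v_i AND c_{i-1}), z_n = c_{n-1}."""
    z, c = [], 0
    for x, y in zip(xbits, ybits):
        u, v = x & y, x ^ y
        z.append(v ^ c)
        c = u | (v & c)
    z.append(c)      # the final carry becomes the top output bit z_n
    return z

# 5 + 3 = 8: little-endian 101 + 110 -> 0001
assert ripple_add([1, 0, 1], [1, 1, 0]) == [0, 0, 0, 1]
```

The loop body is O(1) per bit, but the carry chain makes the straightforward circuit O(n) deep, matching the slide's size/depth claim.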
Basic parallel algorithms
Balanced tree and prefix sums

We have c_i = u_i ∨ (v_i ∧ c_{i−1})
Let F_{u,v}(c) = u ∨ (v ∧ c), so that c_i = F_{u_i,v_i}(c_{i−1})
We have c_i = F_{u_i,v_i}(··· F_{u_0,v_0}(0) ···) = (F_{u_0,v_0} ∘ ··· ∘ F_{u_i,v_i})(0)
Function composition ∘ is associative:
(F_{u′,v′} ∘ F_{u,v})(c) = F_{u,v}(F_{u′,v′}(c)) = u ∨ (v ∧ (u′ ∨ (v′ ∧ c))) = u ∨ (v ∧ u′) ∨ (v ∧ v′ ∧ c) = F_{u∨(v∧u′), v∧v′}(c)
Hence, F_{u′,v′} ∘ F_{u,v} = F_{u∨(v∧u′), v∧v′} is computable from u, v, u′, v′ in time O(1)
We compute F_{u_0,v_0} ∘ ··· ∘ F_{u_i,v_i} for all i by prefix(n)
Then compute c_i, z_i in size O(n) and depth O(1)
Resulting circuit has size O(n) and depth O(log n)
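A sketch of the carries as prefix sums over the pairs (u_i, v_i) (illustrative Python; `then` composes in the left-to-right application order used above):

```python
def then(f, g):
    """Compose carry functions: apply f = (u1, v1) first, then g = (u2, v2).
    Since F_{u,v}(c) = u OR (v AND c), the composition is again of this
    form, with pair (u2 OR (v2 AND u1), v1 AND v2) -- hence associative."""
    (u1, v1), (u2, v2) = f, g
    return (u2 | (v2 & u1), v1 & v2)

def carries(xbits, ybits):
    """All carry bits via prefix 'sums' of (u_i, v_i) under 'then';
    c_i is the prefix composition applied to 0, i.e. its u-component."""
    out, acc = [], None
    for x, y in zip(xbits, ybits):
        f = (x & y, x ^ y)
        acc = f if acc is None else then(acc, f)
        out.append(acc[0])   # F(0) = u OR (v AND 0) = u
    return out

# carries for 5 + 3 (little-endian): every position propagates a carry
assert carries([1, 0, 1], [1, 1, 0]) == [1, 1, 1]
```

Replacing the serial loop by a prefix(n) circuit over `then` is exactly what yields size O(n) and depth O(log n).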
Basic parallel algorithms
Fast Fourier Transform and the butterfly dag

A complex number ω is called a primitive root of unity of degree n if ω, ω^2, …, ω^{n−1} ≠ 1 and ω^n = 1

The Discrete Fourier Transform problem: F_{n,ω}(a) = F_{n,ω} · a = b, where F_{n,ω} = [ω^{ij}]_{i,j=0}^{n−1}:

[ 1  1         1         ···  1        ]   [ a_0     ]   [ b_0     ]
[ 1  ω         ω^2       ···  ω^{n−1}  ]   [ a_1     ]   [ b_1     ]
[ 1  ω^2       ω^4       ···  ω^{n−2}  ] · [ a_2     ] = [ b_2     ]
[ ⋮  ⋮         ⋮         ⋱    ⋮        ]   [ ⋮       ]   [ ⋮       ]
[ 1  ω^{n−1}   ω^{n−2}   ···  ω        ]   [ a_{n−1} ]   [ b_{n−1} ]

i.e. Σ_j ω^{ij} a_j = b_i, for i, j = 0, …, n − 1

Sequential work O(n^2) by matrix-vector multiplication
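The definition translates directly into an O(n^2) sketch (illustrative Python, taking ω = e^{2πi/n} as the primitive root):

```python
import cmath

def naive_dft(a):
    """Direct evaluation of b_i = sum_j w^(i*j) * a_j with w = e^(2*pi*i/n):
    O(n^2) work, as for any matrix-vector multiplication."""
    n = len(a)
    w = cmath.exp(2j * cmath.pi / n)   # a primitive n-th root of unity
    return [sum(w ** (i * j) * a[j] for j in range(n)) for i in range(n)]

# DFT of the unit impulse is the all-ones vector
b = naive_dft([1, 0, 0, 0])
assert all(abs(x - 1) < 1e-9 for x in b)
```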
Basic parallel algorithms
Fast Fourier Transform and the butterfly dag

The Fast Fourier Transform (FFT) algorithm ("four-step" version)
Assume n = 2^{2r}; let m = n^{1/2} = 2^r
Let A_{u,v} = a_{mu+v}, B_{s,t} = b_{ms+t}, for s, t, u, v = 0, …, m − 1
Matrices A, B are vectors a, b written out as m × m matrices

B_{s,t} = Σ_{u,v} ω^{(ms+t)(mu+v)} A_{u,v} = Σ_{u,v} ω^{msv+tv+mtu} A_{u,v} = Σ_v (ω^m)^{sv} · ω^{tv} · Σ_u (ω^m)^{tu} A_{u,v},
thus B = F_{m,ω^m}(T_{m,ω}(F_{m,ω^m}(A)))
F_{m,ω^m}(A) is m independent DFTs of size m, one on each column of A: F_{m,ω^m}(A) = F_{m,ω^m} · A, i.e. F_{m,ω^m}(A)_{t,v} = Σ_u (ω^m)^{tu} A_{u,v}
T_{m,ω}(A) is the transposition of matrix A, with twiddle-factor scaling: T_{m,ω}(A)_{v,t} = ω^{tv} · A_{t,v}
Basic parallel algorithms
Fast Fourier Transform and the butterfly dag

The Fast Fourier Transform (FFT) algorithm (contd.)
We have B = F_{m,ω^m}(T_{m,ω}(F_{m,ω^m}(A))), thus DFT of size n in four steps:
- m independent DFTs of size m
- transposition and twiddle-factor scaling
- m independent DFTs of size m
We reduced DFT of size n = 2^{2r} to DFTs of size m = 2^r. Similarly, can reduce DFT of size n = 2^{2r+1} to DFTs of sizes m = 2^r and 2m = 2^{r+1}.
By recursion, we have the FFT circuit

size_FFT(n) = O(n) + 2 · n^{1/2} · size_FFT(n^{1/2}) = O(1 · n · 1 + 2 · n^{1/2} · n^{1/2} + 4 · n^{3/4} · n^{1/4} + ··· + log n · n · 1) = O(n + 2n + 4n + ··· + log n · n) = O(n log n)
depth_FFT(n) = 1 + 2 · depth_FFT(n^{1/2}) = O(1 + 2 + 4 + ··· + log n) = O(log n)
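The four steps can be prototyped and checked against the naive DFT (illustrative Python sketch, assuming n is a perfect square at every level of the recursion, e.g. n = 16):

```python
import cmath

def naive_dft(a):
    # direct O(n^2) evaluation of b_i = sum_j w^(i*j) * a_j
    n = len(a)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(w ** (i * j) * a[j] for j in range(n)) for i in range(n)]

def fft4step(a):
    """Four-step FFT sketch: reshape a as an m x m matrix A (n = m^2),
    do m column DFTs, a twiddle-scaled transposition, m column DFTs."""
    n = len(a)
    if n <= 2:
        return naive_dft(a)
    m = int(round(n ** 0.5))
    assert m * m == n, "sketch assumes n = m^2 at every level"
    w = cmath.exp(2j * cmath.pi / n)
    A = [[a[m * u + v] for v in range(m)] for u in range(m)]
    # step 1: m independent DFTs of size m on the columns of A
    F1 = [fft4step([A[u][v] for u in range(m)]) for v in range(m)]  # F1[v][t]
    # step 2: transposition with twiddle-factor scaling w^(t*v)
    T = [[w ** (t * v) * F1[v][t] for t in range(m)] for v in range(m)]
    # steps 3-4: m independent DFTs of size m over v, then read out B[s][t]
    B = [fft4step([T[v][t] for v in range(m)]) for t in range(m)]   # B[t][s]
    return [B[t][s] for s in range(m) for t in range(m)]            # b[m*s+t]

a = [complex(j % 3, -j) for j in range(16)]
assert max(abs(x - y) for x, y in zip(fft4step(a), naive_dft(a))) < 1e-6
```

The sketch handles only the n = 2^{2r} case; the slides' reduction for n = 2^{2r+1} would split into blocks of sizes 2^r and 2^{r+1} instead.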
Basic parallel algorithms
Fast Fourier Transform and the butterfly dag

The FFT circuit and the butterfly dag (contd.)

bfly(n): n inputs, n outputs; size (n log n)/2, depth log n

(diagram: the butterfly dag on inputs a_0, …, a_{n−1} and outputs b_0, …, b_{n−1})

Applications: Fast Fourier Transform; sorting bitonic sequences
Basic parallel algorithms
Fast Fourier Transform and the butterfly dag

The FFT circuit and the butterfly dag (contd.)
Dag bfly(n) consists of:
- a top layer of n^{1/2} blocks, each isomorphic to bfly(n^{1/2})
- a bottom layer of n^{1/2} blocks, each isomorphic to bfly(n^{1/2})
The data exchange pattern between the top and bottom layers corresponds to n^{1/2} × n^{1/2} matrix transposition
Basic parallel algorithms
Fast Fourier Transform and the butterfly dag

Parallel butterfly computation
To compute bfly(n):
- every processor is assigned n^{1/2}/p blocks from the top layer; the processor reads the total of n/p inputs, computes the blocks, and writes back the n/p outputs
- every processor is assigned n^{1/2}/p blocks from the bottom layer; the processor reads the total of n/p inputs, computes the blocks, and writes back the n/p outputs

n ≥ p^2:   comp O((n log n)/p)   comm O(n/p)   sync O(1)
Basic parallel algorithms
Ordered grid

The ordered 2D grid dag grid^2(n):
- nodes arranged in an n × n grid
- edges directed top-to-bottom, left-to-right
- ≤ 2n inputs (to left/top borders)
- ≤ 2n outputs (from right/bottom borders)
- size n^2, depth 2n − 1
Applications: Gauss–Seidel iteration (single step); triangular system solution; dynamic programming; 1D cellular automata
Sequential work O(n^2)
Basic parallel algorithms
Ordered grid

Parallel ordered 2D grid computation
grid^2(n) consists of a p × p grid of blocks, each isomorphic to grid^2(n/p)
The blocks can be arranged into 2p − 1 layers orthogonal to the main diagonal, with ≤ p independent blocks in each layer
Basic parallel algorithms
Ordered grid

Parallel ordered 2D grid computation (contd.)
The computation proceeds in 2p − 1 stages, each computing a layer of blocks. In a stage:
- every processor is either assigned a block or is idle
- a non-idle processor reads the 2n/p block inputs, computes the block, and writes back the 2n/p block outputs
comp: (2p − 1) · O((n/p)^2) = O(p · n^2/p^2) = O(n^2/p)
comm: (2p − 1) · O(n/p) = O(n)

n ≥ p:   comp O(n^2/p)   comm O(n)   sync O(p)
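The anti-diagonal schedule can be sketched as follows (illustrative Python; blocks are indexed by their position in the p × p block grid):

```python
def wavefront_layers(p):
    """Group the blocks (i, j) of the p x p block grid into the 2p - 1
    anti-diagonal layers in which they can be computed in parallel:
    block (i, j) depends only on (i-1, j) and (i, j-1), both in layer i+j-1."""
    layers = [[] for _ in range(2 * p - 1)]
    for i in range(p):
        for j in range(p):
            layers[i + j].append((i, j))
    return layers

layers = wavefront_layers(4)
assert len(layers) == 7                            # 2p - 1 stages
assert max(len(layer) for layer in layers) == 4    # <= p blocks per stage
```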
Basic parallel algorithms
Ordered grid

Application: string comparison
Let a, b be strings of characters
A subsequence of string a is obtained by deleting some (possibly none, or all) characters from a
The longest common subsequence (LCS) problem: find the longest string that is a subsequence of both a and b
a = "define"   b = "design"
LCS(a, b) = "dein"
In computational molecular biology, the LCS problem and its variants are referred to as sequence alignment
Basic parallel algorithms
Ordered grid

LCS computation by dynamic programming
Let lcs(a, b) denote the LCS length
lcs(a, "") = lcs("", b) = 0
lcs(aα, bβ) = max(lcs(aα, b), lcs(a, bβ)) if α ≠ β
lcs(aα, bβ) = lcs(a, b) + 1 if α = β

    ∗  d  e  f  i  n  e
∗   0  0  0  0  0  0  0
d   0  1  1  1  1  1  1
e   0  1  2  2  2  2  2
s   0  1  2  2  2  2  2
i   0  1  2  2  3  3  3
g   0  1  2  2  3  3  3
n   0  1  2  2  3  4  4

lcs("define", "design") = 4
LCS(a, b) can be "traced back" through the table at no extra asymptotic cost
Data dependence in the table corresponds to the 2D grid dag
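The recurrence translates directly into the table computation (illustrative Python sketch):

```python
def lcs_length(a, b):
    """Dynamic-programming table: entry t[i][j] holds lcs(a[:i], b[:j]).
    The data dependences (left, top, top-left) form the 2D grid dag."""
    t = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, alpha in enumerate(a, 1):
        for j, beta in enumerate(b, 1):
            if alpha == beta:
                t[i][j] = t[i - 1][j - 1] + 1          # match: extend the LCS
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])  # drop one character
    return t[len(a)][len(b)]

assert lcs_length("define", "design") == 4   # LCS is "dein"
```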
182.
Basic parallel algorithms · Ordered grid
Parallel LCS computation
The 2D grid approach gives a BSP algorithm for the LCS problem (and many other problems solved by dynamic programming)
comp O(n^2/p)   comm O(n)   sync O(p)
It may seem that the grid dag algorithm for the LCS problem is the best possible. However, an asymptotically faster BSP algorithm can be obtained by divide-and-conquer, via a careful analysis of the resulting LCS subproblems on substrings.
The semi-local LCS algorithm (details omitted) [Krusche, Tiskin: 2007]
comp O(n^2/p)   comm O(n log p / p^{1/2})   sync O(log p)
184.
Basic parallel algorithms · Ordered grid
The ordered 3D grid dag: grid3(n)
nodes arranged in an n × n × n grid
edges directed top-to-bottom, left-to-right, front-to-back
≤ 3n^2 inputs (to front/left/top faces)
≤ 3n^2 outputs (from back/right/bottom faces)
size n^3   depth 3n − 2
Applications: Gauss–Seidel iteration; Gaussian elimination; dynamic programming; 2D cellular automata
Sequential work O(n^3)
186.
Basic parallel algorithms · Ordered grid
Parallel ordered 3D grid computation
grid3(n) consists of a p^{1/2} × p^{1/2} × p^{1/2} grid of blocks, each isomorphic to grid3(n/p^{1/2})
The blocks can be arranged into 3p^{1/2} − 2 layers orthogonal to the main diagonal, with ≤ p independent blocks in each layer
188.
Basic parallel algorithms · Ordered grid
Parallel ordered 3D grid computation (contd.)
The computation proceeds in 3p^{1/2} − 2 stages, each computing a layer of blocks. In a stage:
  every processor is either assigned a block or is idle
  a non-idle processor reads the 3n^2/p block inputs, computes the block, and writes back the 3n^2/p block outputs
comp: (3p^{1/2} − 2) · O((n/p^{1/2})^3) = O(p^{1/2} · n^3/p^{3/2}) = O(n^3/p)
comm: (3p^{1/2} − 2) · O((n/p^{1/2})^2) = O(p^{1/2} · n^2/p) = O(n^2/p^{1/2})
n ≥ p^{1/2}:   comp O(n^3/p)   comm O(n^2/p^{1/2})   sync O(p^{1/2})
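The layered block decomposition can be illustrated by enumerating the anti-diagonal layers of a k × k × k block grid, with k = p^{1/2}. The helper below is a hypothetical sketch, not from the slides: blocks in the same layer have equal coordinate sum, so they are mutually independent and can be computed in parallel within one superstep.

```python
def block_layers(k: int):
    # layer index = i + j + l; indices range over 0 .. 3(k-1),
    # giving exactly 3k - 2 layers, each with at most k^2 blocks
    layers = [[] for _ in range(3 * k - 2)]
    for i in range(k):
        for j in range(k):
            for l in range(k):
                layers[i + j + l].append((i, j, l))
    return layers
```

For k = 4 (i.e. p = 16) this yields 10 layers covering all 64 blocks, each layer containing at most 16 independent blocks.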
191.
Basic parallel algorithms · Discussion
Typically, realistic slackness requirements: n ≫ p
Costs comp, comm, sync: functions of n, p
The goals:
  comp = comp_opt = comp_seq / p
  comm scales down with increasing p
  sync is a function of p, independent of n
The challenges:
  efficient (optimal) algorithms
  good (sharp) lower bounds
193.
1 Computation by circuits
2 Parallel computation models
3 Basic parallel algorithms
4 Further parallel algorithms
5 Parallel matrix algorithms
6 Parallel graph algorithms
194.
Further parallel algorithms · List contraction and colouring
Linked list: n nodes, each contains data and a pointer to its successor
Let • be an associative operator, computable in time O(1)
Primitive list operation: pointer jumping: a → b becomes a•b, pointing past b
The original node data a, b and the pointer to b are kept, so that the pointer jumping operation can be reversed
196.
Further parallel algorithms · List contraction and colouring
Abstract view: node merging, which also allows e.g. for bidirectional links: a, b → a•b
The original a, b are kept implicitly, so that node merging can be reversed
The list contraction problem: reduce the list to a single node by successive merging (note the result is independent of the merging order)
The list expansion problem: restore the original list by reversing the contraction
198.
Further parallel algorithms · List contraction and colouring
Application: list ranking
The problem: for each node, find its rank (distance from the head) by list contraction
  0 → 1 → 2 → 3 → 4 → 5 → 6 → 7
Note the solution should be independent of the merging order!
199.
Further parallel algorithms · List contraction and colouring
Application: list ranking (contd.)
With each intermediate node during contraction/expansion, associate the corresponding contiguous sublist in the original list
Contraction phase: for each node, keep the length of its sublist
Initially, each node is assigned 1
Merging operation: k, l → k + l
In the fully contracted list, the node contains value n
200.
Further parallel algorithms · List contraction and colouring
Application: list ranking (contd.)
Expansion phase: for each node, keep
  the rank of the starting node of its sublist
  the length of its sublist
Initially, the node (fully contracted list) is assigned (0, n)
Un-merging operation: (s, k), (s + k, l) ← (s, k + l)
In the fully expanded list, a node with rank i contains (i, 1)
201.
Further parallel algorithms · List contraction and colouring
Application: list prefix sums
Initially, each node i contains value a_i:
  a_0 → a_1 → a_2 → a_3 → a_4 → a_5 → a_6 → a_7
Let • be an associative operator with identity
The problem: for each node i, find a_{0:i} = a_0 • a_1 • · · · • a_i by list contraction
  a_0 → a_{0:1} → a_{0:2} → a_{0:3} → a_{0:4} → a_{0:5} → a_{0:6} → a_{0:7}
Note the solution should be independent of the merging order!
203.
Further parallel algorithms · List contraction and colouring
Application: list prefix sums (contd.)
With each intermediate node during contraction/expansion, associate the corresponding contiguous sublist in the original list
Contraction phase: for each node, keep the •-sum of its sublist
Initially, each node is assigned a_i
Merging operation: u, v → u • v
In the fully contracted list, the node contains value b_{n−1} = a_{0:n−1}, the •-sum of the whole list
204.
Further parallel algorithms · List contraction and colouring
Application: list prefix sums (contd.)
Expansion phase: for each node, keep
  the •-sum of all nodes before its sublist
  the •-sum of its sublist
Initially, the node (fully contracted list) is assigned (ε, b_{n−1}), where ε is the identity of •
Un-merging operation: (t, u), (t • u, v) ← (t, u • v)
In the fully expanded list, a node with rank i contains (b_{i−1}, a_i)
We have b_i = b_{i−1} • a_i
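The contraction and expansion phases above can be sketched as a sequential simulation, taking • = + with identity 0. This is an illustrative sketch (names not from the slides); here the merge order is fixed at the head, though by order-independence any merge order would give the same result.

```python
def list_prefix_sums(values):
    # Contraction: repeatedly merge the two head nodes, keeping the
    # +-sum of each node's sublist and recording merges for expansion.
    nodes = list(values)              # node -> sum of its sublist
    merges = []
    while len(nodes) > 1:
        u, v = nodes[0], nodes[1]
        merges.append((u, v))
        nodes = [u + v] + nodes[2:]   # merging operation: u, v -> u + v
    # Expansion: each node carries (sum before its sublist, sum of sublist).
    pairs = [(0, nodes[0])]           # fully contracted list: (identity, b_{n-1})
    for u, v in reversed(merges):
        t, _ = pairs[0]
        pairs = [(t, u), (t + u, v)] + pairs[1:]   # un-merge (t, u + v)
    return [t + v for t, v in pairs]  # node i ends up holding b_i
```

For example, `list_prefix_sums([1, 2, 3, 4])` yields `[1, 3, 6, 10]`.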
205.
Further parallel algorithms · List contraction and colouring
From now on, we only consider pure list contraction (the expansion phase is obtained by symmetry)
Sequential work O(n) by always contracting at the list's head
Parallel list contraction must be based on local merging decisions: a node can be merged with either its successor or predecessor, but not with both simultaneously
Therefore, we need either node splitting, or efficient symmetry breaking
207.
Further parallel algorithms · List contraction and colouring
Wyllie's mating [Wyllie: 1979]
Split every node into a "forward" node and a "backward" node
Merge mating node pairs, obtaining two lists of size ≈ n/2
210.
Further parallel algorithms · List contraction and colouring
Parallel list contraction by Wyllie's mating
Initially, each processor reads a subset of n/p nodes
A node merge involves communication between the two corresponding processors; the merged node is placed arbitrarily on either processor
  reduce the original list to n fully contracted lists by log n rounds of Wyllie's mating; after each round, the current reduced lists are written back to external memory
  select one fully contracted list
Total work O(n log n), not optimal vs. sequential work O(n)
n ≥ p:   comp O(n log n / p)   comm O(n log n / p)   sync O(log n)
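The pointer-jumping primitive behind Wyllie's algorithm can be sketched as a single-machine simulation: in O(log n) simultaneous rounds, every node learns its distance to the tail of the list. Names are illustrative, not from the slides.

```python
def pointer_jump_distances(succ):
    # succ: dict node -> successor, with None at the tail
    dist = {v: (0 if succ[v] is None else 1) for v in succ}
    nxt = dict(succ)
    while any(nxt[v] is not None for v in nxt):
        # all nodes jump "simultaneously": read old copies, then write
        new_dist, new_nxt = dict(dist), dict(nxt)
        for v in nxt:
            if nxt[v] is not None:
                new_dist[v] = dist[v] + dist[nxt[v]]   # accumulate jumped-over distance
                new_nxt[v] = nxt[nxt[v]]               # pointer jumping
        dist, nxt = new_dist, new_nxt
    return dist
```

Each round halves the remaining pointer chains, hence O(log n) rounds; summing the per-round work over all nodes gives the non-optimal O(n log n) total work noted above.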
213.
Further parallel algorithms · List contraction and colouring
Random mating [Miller, Reif: 1985]
Label every node either "forward" or "backward"; for each node, the label is chosen independently with probability 1/2
A node mates with probability 1/2, hence on average n/2 nodes mate
Merge mating node pairs, obtaining a new list of expected size 3n/4
215.
Further parallel algorithms · List contraction and colouring
Parallel list contraction by random mating
Initially, each processor reads a subset of n/p nodes
  reduce list to expected size n/p by log_{4/3} p rounds of random mating
  collect the reduced list in a designated processor and contract sequentially
Total work O(n), optimal but randomised
The time bound holds with high probability (whp): ∃ C, d > 0 : Prob(total work ≤ C · n) ≥ 1 − 1/n^d
n ≥ p^2 · log p:   comp O(n/p) whp   comm O(n/p) whp   sync O(log p)
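Random mating can be sketched on a single machine, modelling the linked list as a Python list and a merge as summing two adjacent values (• = +). This is an illustrative simulation, not from the slides; a fixed seed makes the run reproducible.

```python
import random

def random_mating_round(nodes, rng):
    # label every node "F"/"B" independently; a forward node followed
    # by a backward node forms a mating pair and is merged
    labels = [rng.choice("FB") for _ in nodes]
    out, i = [], 0
    while i < len(nodes):
        if i + 1 < len(nodes) and labels[i] == "F" and labels[i + 1] == "B":
            out.append(nodes[i] + nodes[i + 1])   # the pair mates
            i += 2
        else:
            out.append(nodes[i])                  # unmated node survives
            i += 1
    return out

def contract(nodes, seed=0):
    rng, rounds = random.Random(seed), 0
    while len(nodes) > 1:
        nodes, rounds = random_mating_round(nodes, rng), rounds + 1
    return nodes[0], rounds
```

Since at best a round halves the list, contracting 64 nodes takes at least 6 rounds; in expectation each round shrinks the list to about 3/4 of its size.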
218.
Further parallel algorithms · List contraction and colouring
Block mating
Will mate nodes deterministically
Contract local chains (if any)
Build the distribution graph: a complete weighted digraph on p supernodes, with w(i, j) = |{u → v : u ∈ proc i, v ∈ proc j}|
Each processor holds a supernode's outgoing edges
219.
Further parallel algorithms · List contraction and colouring
Block mating (contd.)
Collect the distribution graph in a designated processor
Label every supernode "forward" F or "backward" B, so that Σ_{i∈F, j∈B} w(i, j) ≥ 1/4 · Σ_{i,j} w(i, j), by a sequential greedy algorithm
Scatter the supernode labels to the processors
By construction of the supernode labelling, at least n/2 nodes have mates
Merge mating node pairs, obtaining a new list of size at most 3n/4
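One way to realise the sequential greedy labelling is the method of conditional expectations: a uniformly random labelling puts a 1/4 fraction of the total weight on F→B edges in expectation, and fixing each supernode's label to the better of its two choices never decreases this conditional expectation, so the final labelling is guaranteed to cut ≥ 1/4 of the weight. This sketch is illustrative (names are not from the slides) and assumes no self-edges, i.e. local chains already contracted.

```python
def greedy_labelling(p, w):
    # w: dict (i, j) -> weight of supernode edge i -> j, i != j, 0 <= i, j < p
    prob_f = [0.5] * p                 # Pr[label = F] for undecided supernodes
    def expected_cut():
        # expected weight on F -> B edges under the current partial labelling
        return sum(wt * prob_f[i] * (1 - prob_f[j]) for (i, j), wt in w.items())
    labels = [None] * p
    for v in range(p):
        best = None
        for choice in (1.0, 0.0):      # try v in F, then v in B
            prob_f[v] = choice
            e = expected_cut()
            if best is None or e > best[0]:
                best = (e, choice)
        prob_f[v] = best[1]            # keep the better choice
        labels[v] = "F" if best[1] == 1.0 else "B"
    return labels
```

With integer weights all intermediate expectations are exact dyadic fractions, so the ≥ 1/4 guarantee survives the floating-point arithmetic.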
220.
Further parallel algorithms · List contraction and colouring
Parallel list contraction by block mating
Initially, each processor reads a subset of n/p nodes
  reduce list to size n/p by log_{4/3} p rounds of block mating
  collect the reduced list in a designated processor and contract sequentially
Total work O(n), optimal and deterministic
n ≥ p^3:   comp O(n/p)   comm O(n/p)   sync O(log p)
223.
Further parallel algorithms · List contraction and colouring
The list k-colouring problem: given a linked list and an integer k > 1, assign a colour from {0, . . . , k − 1} to every node, so that all pairs of adjacent nodes receive different colours
Using list contraction, k-colouring for any k can be done in
  comp O(n/p)   comm O(n/p)   sync O(log p)
For k = p, we can easily achieve (how?)
  comp O(n/p)   comm O(n/p)   sync O(1)
Can we achieve the same for all k ≤ p? For k = O(1)?
227.
Further parallel algorithms · List contraction and colouring
Deterministic coin tossing [Cole, Vishkin: 1986]
Given a k-colouring, k > 6; colours represented in binary
Consider every node v. We have col(v) ≠ col(next(v)).
If col(v) differs from col(next(v)) in the i-th bit, re-colour v in
  2i, if the i-th bit in col(v) is 0, and in col(next(v)) is 1
  2i + 1, if the i-th bit in col(v) is 1, and in col(next(v)) is 0
After re-colouring, we still have col(v) ≠ col(next(v))
Number of colours reduced from k to 2⌈log k⌉ < k
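One round of re-colouring can be sketched as follows, using the lowest bit position where a node's colour differs from its successor's. The treatment of the tail node (which has no successor) is a hypothetical convention, not specified on the slides: keeping only its lowest bit preserves the adjacent-distinct invariant.

```python
def coin_toss_round(colours, succ):
    # colours: dict node -> colour (adjacent nodes differ);
    # succ: dict node -> successor, or None at the tail
    new = {}
    for v, c in colours.items():
        if succ[v] is None:
            new[v] = c & 1                            # tail: colour 2*0 + bit_0
        else:
            diff = c ^ colours[succ[v]]               # nonzero: colours differ
            i = (diff & -diff).bit_length() - 1       # lowest differing bit
            new[v] = 2 * i + ((c >> i) & 1)           # 2i or 2i + 1
    return new
```

If two adjacent nodes received the same new colour, they would have used the same bit position i with the same bit value, contradicting the choice of i; so the colouring stays proper while the number of colours drops to at most 2⌈log k⌉.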
229.
Further parallel algorithms · List contraction and colouring
Parallel list 3-colouring by deterministic coin tossing:
  compute a p-colouring
  reduce the number of colours from p to 6 by deterministic coin tossing: O(log* p) rounds, where log* k = min{r : log . . . log k (r times) ≤ 1}
  select node v as a pivot, if col(prev(v)) < col(v) > col(next(v)). No two pivots are adjacent or further than 12 nodes apart
  from each pivot, re-colour the succeeding run of at most 12 non-pivots sequentially in 3 colours
comp O(n/p)   comm O(n/p)   sync O(log* p)
230.
Further parallel algorithms · Sorting and convex hull
a = [a_0, . . . , a_{n−1}]
The sorting problem: arrange the elements of a in increasing order
May assume all a_i are distinct (otherwise, attach unique tags)
Assume the comparison model: comparisons are the only primitives; no bitwise operations
Sequential work O(n log n), e.g. by mergesort
Parallel sorting based on an AKS sorting network:
comp O(n log n / p)   comm O(n log n / p)   sync O(log n)
232.
Further parallel algorithms · Sorting and convex hull
Parallel sorting by regular sampling [Shi, Schaeffer: 1992]
Every processor
  reads a subarray of size n/p and sorts it sequentially
  selects from its subarray p samples at regular intervals
A designated processor
  collects all p^2 samples and sorts them sequentially
  selects from the sorted samples p splitters at regular intervals
In each subarray of size n/p, the samples define p local blocks of size n/p^2
In the whole array of size n, the splitters define p global buckets
234.
Further parallel algorithms · Sorting and convex hull
Parallel sorting by regular sampling (contd.)
Claim: each bucket has size ≤ 2n/p
Proof. Relative to a fixed bucket B, a block b is low (respectively high), if the lower boundary of b is ≤ (respectively >) the lower boundary of B.
A bucket may intersect ≤ p low blocks and ≤ p high blocks.
Bucket size is at most (p + p) · n/p^2 = 2n/p.
237.
Further parallel algorithms · Sorting and convex hull
Parallel sorting by regular sampling (contd.)
The designated processor broadcasts the splitters
Every processor
  receives the splitters and is assigned a bucket
  scans its subarray and sends each element to the appropriate bucket
  receives the elements of its bucket and sorts them sequentially
  writes the sorted bucket back to external memory
n ≥ p^3:   comp O(n log n / p)   comm O(n/p)   sync O(1)
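The whole scheme can be simulated on a single machine: split into p sorted "processor" subarrays, take p regular samples from each, pick p − 1 regular splitters from the sorted samples, route every element to its bucket, and sort the buckets independently. An illustrative sketch (assumes distinct elements and p dividing n):

```python
import bisect

def regular_sample_sort(a, p):
    n = len(a)
    # one sorted subarray per "processor"
    parts = [sorted(a[i * n // p:(i + 1) * n // p]) for i in range(p)]
    # p regular samples per subarray; p^2 samples in total
    samples = sorted(part[k * len(part) // p]
                     for part in parts for k in range(p))
    # p - 1 regular splitters from the sorted samples
    splitters = [samples[k * p] for k in range(1, p)]
    # route every element to its bucket, then sort buckets independently
    buckets = [[] for _ in range(p)]
    for x in a:
        buckets[bisect.bisect_left(splitters, x)].append(x)
    return [x for bucket in buckets for x in sorted(bucket)]
```

The buckets are disjoint key ranges in increasing order, so concatenating the sorted buckets yields the sorted array; the claim above bounds each bucket by 2n/p.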
239.
Further parallel algorithms · Sorting and convex hull
Set S ⊆ R^d is convex, if for all x, y in S, every point between x and y is also in S
A ⊆ R^d: the convex hull conv A is the smallest convex set containing A
conv A is a polytope, defined by its vertices a_i ∈ A
Set A is in convex position, if every point of A is a vertex of conv A
240.
Further parallel algorithms · Sorting and convex hull
a = [a_0, . . . , a_{n−1}], a_i ∈ R^d
The (discrete) convex hull problem: find the vertices of conv a
Output must be ordered: every vertex must "know" its neighbours
d = 2: two neighbours per vertex; output size 2n
d = 3: on average, O(1) neighbours per vertex; output size O(n)
Sequential work O(n log n) by "gift wrapping"
d > 3: typically, many neighbours per vertex; output size Ω(n)
From now on, we will concentrate on d = 2, 3
243.
Further parallel algorithms · Sorting and convex hull
Claim: the convex hull problem in R^2 is at least as hard as sorting
Proof. Let x_0, . . . , x_{n−1} ∈ R. To sort [x_0, . . . , x_{n−1}]:
  compute conv {(x_i, x_i^2) ∈ R^2 : 0 ≤ i < n}
  follow the neighbour links to obtain the sorted output
Can the convex hull problem be solved at the same BSP cost as sorting?
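The reduction can be demonstrated concretely: lift each key x onto the parabola (x, x^2), so every lifted point is a hull vertex, and compute the hull by gift wrapping, which needs no pre-sorting. An illustrative sketch assuming distinct keys and at least two of them:

```python
def cross(o, a, b):
    # > 0 iff b lies to the left of the directed line o -> a
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def sort_via_hull(xs):
    pts = [(x, x * x) for x in xs]    # lift onto the parabola
    start = min(pts)                  # leftmost point is a hull vertex
    hull, p = [], start
    while True:                       # gift wrapping: no pre-sorting
        hull.append(p)
        q = pts[0] if pts[0] != p else pts[1]
        for r in pts:
            if r != p and cross(p, q, r) < 0:
                q = r                 # r lies to the right of p -> q
        p = q
        if p == start:
            break
    return [x for x, _ in hull]       # hull order = increasing key order
```

Since no three points on a parabola are collinear, the counter-clockwise walk from the leftmost point visits all points along the lower chain in increasing x, reading off the sorted keys.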
245.
Further parallel algorithms · Sorting and convex hull
A ⊆ R^d. Let 0 ≤ ε ≤ 1.
Set E ⊆ A is an ε-net for A, if any halfspace containing no points of E covers ≤ ε|A| points of A
Set E ⊆ A is an ε-approximation for A, if any halfspace with α|E| points of E covers (α ± ε)|A| points of A
Claim. An ε-approximation for A is an ε-net for A
Claim. A union of ε-approximations for A′, A″ is an ε-approximation for A′ ∪ A″
Claim. An ε-net for a δ-approximation for A is an (ε + δ)-net for A
Proofs: easy from the definitions.
250.
Further parallel algorithms · Sorting and convex hull
d = 2. A ⊆ R^2, |A| = n, ε = 1/r.
Claim. A 1/r-net for A of size r exists and can be computed in sequential work O(n log n).
Proof. Generalised "gift wrapping".
Claim. If A is in convex position, then a 1/r-approximation for A of size r exists and can be computed in sequential work O(n log n).
Proof. Take every (n/r)-th point on the convex hull of A.
254.
Further parallel algorithms · Sorting and convex hull
d = 3. A ⊆ R^3, |A| = n, ε = 1/r.
Claim. A 1/r-net for A of size O(r) exists and can be computed in sequential work O(rn log n).
Proof: [Brönnimann, Goodrich: 1995]
Claim. A 1/r-approximation for A of size O(r^3 (log r)^{O(1)}) exists and can be computed in sequential work O(n log r).
Proof: [Matoušek: 1992]
(Better approximations exist, but are slower to compute)
256.
Further parallel algorithms · Sorting and convex hull
Parallel convex hull computation by generalised regular sampling
a = [a_0, . . . , a_{n−1}], a_i ∈ R^3 (R^2 analogous but simpler)
Every processor
  reads a subset of n/p points and computes its convex hull
  selects a 1/p-approximation of O(p^3 (log p)^{O(1)}) points as samples
The global union of all samples is a 1/p-approximation for a
A designated processor
  collects all O(p^4 (log p)^{O(1)}) samples
  selects from the samples a 1/p-net of O(p) points as splitters
The set of splitters is a 1/p-net for a 1/p-approximation for a. Therefore, it is a 2/p-net for a.
258.
Further parallel algorithms · Sorting and convex hull
Parallel convex hull computation by generalised regular sampling (contd.)
The O(p) splitters can be assumed to be in convex position (like any ε-net), and therefore define a splitter polytope with O(p) edges
Each edge of the splitter polytope defines a bucket: the subset of a visible when sitting on this edge (assuming the polytope is opaque)
Each bucket can be covered by two half-planes not containing any splitters. Therefore, bucket size is at most 2 · (2/p) · n = 4n/p.
259.
Further parallel algorithms · Sorting and convex hull
Parallel convex hull computation by generalised regular sampling (contd.)
The designated processor broadcasts the splitters
Every processor
  receives the splitters and is assigned a bucket
  scans its hull and sends each point to the appropriate bucket
  receives the points of its bucket and computes their convex hull sequentially
  writes the bucket hull back to external memory
n ≫ p:   comp O(n log n / p)   comm O(n/p)   sync O(1)
260.
Further parallel algorithms · Selection
a = [a_0, . . . , a_{n−1}]
The selection problem: given k, find the k-th largest element of a
E.g. median selection: k = n/2
As before, assume the comparison model
Sequential work O(n log n) by naive sorting
Sequential work O(n) by successive elimination [Blum+: 1973]
261.
Further parallel algorithms · Selection
Standard approach to selection: eliminate elements in rounds
  partition the current array into subarrays of size 5
  select from each subarray its median
  select the median of all subarray medians by recursion
  eliminate all elements that are on the "wrong" side of the median-of-medians
Each round removes at least a fraction of 3/10 of the array elements
Proof. In half of all subarrays, the median is on the "wrong" side of the median-of-medians. In every such subarray, two off-median elements are also on the "wrong" side of the local median. Hence, in a round, at least a fraction of 1/2 · (1 + 2)/5 = 3/10 of the elements are eliminated.
Data reduction rate is exponential
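Selection by successive elimination can be sketched as follows. This version finds the k-th smallest element (the k-th largest is symmetric); the names are illustrative, not from the slides.

```python
def select(a, k):
    # return the k-th smallest element of a (k = 0 gives the minimum)
    if len(a) <= 5:
        return sorted(a)[k]
    # medians of subarrays of size 5, then their median by recursion
    medians = [sorted(a[i:i + 5])[len(a[i:i + 5]) // 2]
               for i in range(0, len(a), 5)]
    pivot = select(medians, len(medians) // 2)   # median of medians
    low = [x for x in a if x < pivot]            # "wrong" side candidates
    high = [x for x in a if x > pivot]
    if k < len(low):
        return select(low, k)
    if k >= len(a) - len(high):
        return select(high, k - (len(a) - len(high)))
    return pivot
```

Each recursive call discards at least a 3/10 fraction of the elements, which is what makes the total work linear.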
262.
Further parallel algorithms
Selection
More general approach to selection: regular sampling
In each round: partition current array into subarrays; select from each subarray a set of regular samples by recursion; select from all samples a subset of regular splitters; eliminate all elements outside the relevant bucket
Selecting samples by recursion helps to avoid subarray pre-sorting
For work-optimality, sufficient to use constant subarray size (e.g. 5) and constant sampling frequency (e.g. subarray medians)
Each round removes at least a constant fraction of the array elements
Data reduction rate is exponential
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 113 / 174
264.
Further parallel algorithms
Selection
Parallel selection by accelerated regular sampling
Main idea: variable sampling frequency in different rounds. As data become more and more sparse, can afford to increase sampling frequency.
Data reduction rate now superexponential
Selection can be completed in O(log log p) rounds: reduce the input array to size n/p by O(log log p) rounds of accelerated regular sampling (implicit load balancing); collect the reduced array in a designated processor and perform selection sequentially
n ≥ p: comp O(n/p) comm O(n/p) sync O(log log p)
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 115 / 174
265.
1 Computation by circuits2 Parallel computation models3 Basic parallel algorithms4 Further parallel algorithms5 Parallel matrix algorithms6 Parallel graph algorithms Alexander Tiskin (Warwick) Eﬃcient Parallel Algorithms 116 / 174
267.
Parallel matrix algorithms
Matrix-vector multiplication
Let A, b, c be a matrix and two vectors of size n
The matrix-vector multiplication problem: A · b = c, where ci = Σj Aij · bj, 0 ≤ i, j < n
Overall, n^2 elementary products Aij · bj
Sequential work O(n^2)
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 117 / 174
268.
Parallel matrix algorithms
Matrix-vector multiplication
The matrix-vector multiplication circuit: ci ← 0; ci ← ci + Aij · bj, 0 ≤ i, j < n
An ij-square of nodes, each representing an elementary product
size O(n^2), depth O(1)
[Figure: the ij-square, with input b along the top, input A on the left, output c at the bottom]
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 118 / 174
270.
Parallel matrix algorithms
Matrix-vector multiplication
Parallel matrix-vector multiplication
Assume A is predistributed across the processors as needed, does not count as input (motivation: iterative linear algebra methods)
Partition the ij-square into p regular square blocks
Matrix A gets partitioned into p square blocks Aij of size n/p^(1/2)
Vectors b, c each get partitioned into p^(1/2) linear blocks bj, ci of size n/p^(1/2)
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 119 / 174
271.
Parallel matrix algorithms
Matrix-vector multiplication
Parallel matrix-vector multiplication (contd.): ci ← 0; ci ← ci + Aij · bj, 0 ≤ i, j < p^(1/2), now over blocks
[Figure: the block version of the ij-square circuit]
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 120 / 174
274.
Parallel matrix algorithms
Matrix-vector multiplication
Parallel matrix-vector multiplication (contd.)
Vector c in external memory is initialised by zero values
Every processor (i, j) is assigned to compute a block product Aij · bj = ci^j: reads block bj and computes ci^j; updates ci in external memory by adding ci^j elementwise
Updates to ci add up (asynchronously) to its correct final value
n ≥ p: comp O(n^2/p) comm O(n/p^(1/2)) sync O(1)
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 121 / 174
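The whole block scheme can be simulated sequentially. A sketch: the double loop over block indices (i, j) stands in for the p processors, the list c for external memory; `block_matvec` is our name, not the slides'.

```python
import math

def block_matvec(A, b, p):
    """c = A·b via a p^(1/2) × p^(1/2) grid of block products."""
    n = len(A)
    q = math.isqrt(p)                 # blocks per side
    assert q * q == p and n % q == 0
    s = n // q                        # block size n/p^(1/2)
    c = [0] * n                       # external memory, zero-initialised
    for i in range(q):                # each (i, j) pair is one processor
        for j in range(q):
            # processor (i, j): read block b_j, compute partial c_i^j, add it in
            for r in range(s):
                acc = 0
                for t in range(s):
                    acc += A[i * s + r][j * s + t] * b[j * s + t]
                c[i * s + r] += acc   # updates add up to the final value
    return c
```

Running the (i, j) iterations in any order gives the same result, which mirrors the slide's point that the external-memory updates may arrive asynchronously.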
276.
Parallel matrix algorithms
Matrix multiplication
Let A, B, C be matrices of size n
The matrix multiplication problem: A · B = C, where Cik = Σj Aij · Bjk, 0 ≤ i, j, k < n
Overall, n^3 elementary products Aij · Bjk
Sequential work O(n^3)
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 122 / 174
277.
Parallel matrix algorithms
Matrix multiplication
The matrix multiplication circuit: Cik ← 0; Cik ← Cik + Aij · Bjk, 0 ≤ i, j, k < n
An ijk-cube of nodes, each representing an elementary product
size O(n^3), depth O(1)
[Figure: the ijk-cube, with faces B (top), A (side), C (bottom)]
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 123 / 174
278.
Parallel matrix algorithms
Matrix multiplication
Parallel matrix multiplication
Partition the ijk-cube into p regular cubic blocks
Matrices A, B, C each get partitioned into p^(2/3) square blocks Aij, Bjk, Cik of size n/p^(1/3)
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 124 / 174
279.
Parallel matrix algorithms
Matrix multiplication
Parallel matrix multiplication (contd.): Cik ← 0; Cik ← Cik + Aij · Bjk, 0 ≤ i, j, k < p^(1/3), now over blocks
[Figure: the block version of the ijk-cube circuit]
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 125 / 174
282.
Parallel matrix algorithms
Matrix multiplication
Parallel matrix multiplication (contd.)
Matrix C in external memory is initialised by zero values
Every processor (i, j, k) is assigned to compute a block product Aij · Bjk = Cik^j: reads blocks Aij, Bjk and computes Cik^j; updates Cik in external memory by adding Cik^j elementwise
Updates to Cik add up (asynchronously) to their correct final values
n ≥ p^(1/2): comp O(n^3/p) comm O(n^2/p^(2/3)) sync O(1)
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 126 / 174
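A sequential simulation of the cubic-block scheme. This is a sketch: the triple loop over (i, j, k) stands in for the p processors, and `block_matmul` is our name.

```python
def block_matmul(A, B, p):
    """C = A·B via p cubic blocks of the ijk-cube, p^(1/3) per side."""
    n = len(A)
    g = round(p ** (1 / 3))           # blocks per side
    assert g ** 3 == p and n % g == 0
    s = n // g                        # block size n/p^(1/3)
    C = [[0] * n for _ in range(n)]   # external memory, zero-initialised
    for i in range(g):                # each (i, j, k) triple is one processor
        for j in range(g):
            for k in range(g):
                # processor (i, j, k): read A_ij, B_jk, add A_ij·B_jk into C_ik
                for r in range(s):
                    for c in range(s):
                        acc = 0
                        for t in range(s):
                            acc += A[i * s + r][j * s + t] * B[j * s + t][k * s + c]
                        C[i * s + r][k * s + c] += acc
    return C
```

Each "processor" touches two input blocks and one output block of size (n/p^(1/3))^2, which is where the comm O(n^2/p^(2/3)) figure on the slide comes from.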
287.
Parallel matrix algorithms
Matrix multiplication
Theorem. Computing the matrix multiplication dag by regular block partitioning is asymptotically optimal
Proof. comp O(n^3/p), sync O(1) trivially optimal
Optimality of comm O(n^2/p^(2/3)): (discrete) volume vs surface area
Let V be the subset of the ijk-cube computed by a certain processor
For at least one processor: |V| ≥ n^3/p
Let A, B, C be the projections of V onto the coordinate planes
Arithmetic vs geometric mean: |A| + |B| + |C| ≥ 3(|A| · |B| · |C|)^(1/3)
Loomis–Whitney inequality: |A| · |B| · |C| ≥ |V|^2
We have comm ≥ |A| + |B| + |C| ≥ 3(|A| · |B| · |C|)^(1/3) ≥ 3|V|^(2/3) ≥ 3(n^3/p)^(2/3) = 3n^2/p^(2/3), hence comm = Ω(n^2/p^(2/3))
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 127 / 174
289.
Parallel matrix algorithms
Matrix multiplication
The optimality theorem only applies to matrix multiplication by the specific O(n^3)-node dag
Includes e.g.: numerical matrix multiplication using primitives +, ·; Boolean matrix multiplication using primitives ∨, ∧
Excludes e.g.: numerical matrix multiplication with extra primitive − (Strassen-type); Boolean matrix multiplication with extra primitive if/then
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 128 / 174
291.
Parallel matrix algorithms
Fast matrix multiplication
Let A, B, C be numerical matrices of size n
Strassen-type matrix multiplication: A · B = C
Primitives +, −, · on matrix elements
Main idea: for certain matrix sizes N, we can multiply N × N matrices using R < N^3 elementary products (·) and some linear operations (+, −): some linear operations on the elements of A; some linear operations on the elements of B; R elementary products of the resulting linear combinations; some more linear operations to obtain C
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 129 / 174
292.
Parallel matrix algorithmsFast matrix multiplicationMatrices of size n ≥ N are partitioned into an N × N grid of regularblocks, and multiplied recursively: some linear operations on the blocks of A some linear operations on the blocks of B by recursion, R block products of the resulting linear combinations some more linear operations to obtain the blocks of C Alexander Tiskin (Warwick) Eﬃcient Parallel Algorithms 130 / 174
293.
Parallel matrix algorithms
Fast matrix multiplication
Matrices of size n ≥ N are partitioned into an N × N grid of regular blocks, and multiplied recursively: some linear operations on the blocks of A; some linear operations on the blocks of B; by recursion, R block products of the resulting linear combinations; some more linear operations to obtain the blocks of C
The number of linear operations turns out to be irrelevant. The asymptotic work is determined by N and R
Let ω = logN R < logN N^3 = 3
Resulting dag has size O(n^ω), depth ≈ 2 log n
Sequential work O(n^ω)
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 130 / 174
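The smallest instance of this scheme is Strassen's: N = 2, R = 7 < 8 = N^3, so ω = log2 7 ≈ 2.81. A sketch for n a power of 2, with plain Python lists and no attention to constants:

```python
def strassen(A, B):
    """Strassen-type multiplication: N = 2, R = 7, recursing on 2 × 2 block grids."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    # split into an N × N = 2 × 2 grid of regular blocks
    def blk(M, r, c): return [row[c * h:(c + 1) * h] for row in M[r * h:(r + 1) * h]]
    def add(X, Y): return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def sub(X, Y): return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = blk(A, 0, 0), blk(A, 0, 1), blk(A, 1, 0), blk(A, 1, 1)
    B11, B12, B21, B22 = blk(B, 0, 0), blk(B, 0, 1), blk(B, 1, 0), blk(B, 1, 1)
    # R = 7 block products of linear combinations (note the extra primitive −)
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # some more linear operations to obtain the blocks of C
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    return [C11[i] + C12[i] for i in range(h)] + [C21[i] + C22[i] for i in range(h)]
```

The recursion makes 7 calls on half-sized blocks, so the work recurrence T(n) = 7 T(n/2) + O(n^2) solves to O(n^(log2 7)), matching the O(n^ω) bound on the slide.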
296.
Parallel matrix algorithms
Fast matrix multiplication
Parallel Strassen-type matrix multiplication
Assign matrix elements to processors by the cyclic distribution: partition A into regular blocks of size p^(1/2); distribute each block identically, one element per processor; partition B, C analogously (distribution identical across all blocks of the same matrix, but need not be identical across different matrices)
The cyclic distribution allows us to perform linear operations on matrix blocks of size at least p^(1/2) without communication
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 132 / 174
297.
Parallel matrix algorithms
Fast matrix multiplication
Parallel Strassen-type matrix multiplication (contd.)
At each level, the R recursive calls are independent, hence the recursion tree is unfolded breadth-first
At level logR p, a task fits in a single processor
level 0: 1 task of size n (parallel)
level 1: R tasks of size n/N
level 2: R^2 tasks of size n/N^2
…
level logR p: p tasks of size n/N^(logR p) = n/p^(1/ω) (sequential)
…
level logN n: R^(logN n) = n^ω tasks of size 1
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 133 / 174
301.
Parallel matrix algorithms
Fast matrix multiplication
Parallel Strassen-type matrix multiplication (contd.)
Each processor reads its assigned elements of A, B
Recursion levels 0 to logR p on the way down: comm-free linear operations on blocks of A, B
Recursion level logR p: we have p independent block multiplication tasks; assign each task to a different processor; a processor collects its task's two input blocks, performs the task sequentially, and redistributes the task's output block
Recursion levels logR p to 0 on the way up: comm-free linear operations on blocks of C
Each processor writes back its assigned elements of C
comp O(n^ω/p) comm O(n^2/p^(2/ω)): opt? sync O(1)
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 134 / 174
302.
Parallel matrix algorithms
Boolean matrix multiplication
Let A, B, C be Boolean matrices of size n
Boolean matrix multiplication: A ∧ B = C, where Cik = ∨j (Aij ∧ Bjk), 0 ≤ i, j, k < n
Primitives ∨, ∧, if/then on matrix elements
Overall, n^3 elementary products Aij ∧ Bjk
Sequential work O(n^3) bit operations
Sequential work O(n^ω) using a Strassen-type algorithm
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 135 / 174
305.
Parallel matrix algorithms
Boolean matrix multiplication
Parallel Boolean matrix multiplication
The following algorithm is impractical, but of theoretical interest, since it beats the generic Loomis–Whitney communication lower bound
Regularity Lemma: in a Boolean matrix, the rows and the columns can be partitioned into K (almost) equal-sized subsets, so that the K^2 resulting submatrices are random-like (of various densities) [Szemerédi: 1978]
K = K(ε), where ε is the "degree of random-likeness"
Function K(ε) grows enormously as ε → 0, but is independent of n
We shall call this the regular decomposition of a Boolean matrix
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 136 / 174
306.
Parallel matrix algorithms
Boolean matrix multiplication
Parallel Boolean matrix multiplication (contd.)
A ∧ B = C
If A, B, C random-like, then either A or B has few 1s, or C has few 0s
Equivalently: either A, B, or the complement of C has few 1s
By the Regularity Lemma, we have the three-way regular decomposition:
A1 ∧ B1 = C1, where A1 has few 1s
A2 ∧ B2 = C2, where B2 has few 1s
A3 ∧ B3 = C3, where C3 has few 0s
C = C1 ∨ C2 ∨ C3
Matrices A1, …, C3 can be "efficiently" computed from A, B, C
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 137 / 174
307.
Parallel matrix algorithms
Boolean matrix multiplication
Parallel Boolean matrix multiplication (contd.)
A ∧ B = C
Partition the ijk-cube into p^3 regular cubic blocks
Matrices A, B, C each get partitioned into p^2 square blocks Aij, Bjk, Cik of size n/p
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 138 / 174
308.
Parallel matrix algorithms
Boolean matrix multiplication
Parallel Boolean matrix multiplication (contd.)
Every processor j is assigned to compute a "slab" of p^2 cubic blocks: Aij ∧ Bjk = Cik^j for a fixed j and all i, k; reads blocks Aij, Bjk and computes Cik^j for all i, k; computes the three-way regular decomposition for each block product and determines the submatrices having few 1s
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 139 / 174
309.
Parallel matrix algorithms
Boolean matrix multiplication
Parallel Boolean matrix multiplication (contd.)
Recompute A ∧ B = C from block regular decompositions by a Strassen-type algorithm
Communication saved by only sending the positions of 1s
n ≥ p: comp O(n^ω/p) :-( comm O(n^2/p) ;-) sync O(1)
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 140 / 174
311.
Parallel matrix algorithms
Triangular system solution
Let L, b, c be a matrix and two vectors of size n
L is lower triangular: Lij = 0 for 0 ≤ i < j < n, arbitrary otherwise
L · b = c, where ci = Σj Lij · bj, 0 ≤ j ≤ i < n
The triangular system problem: given L, c, find b
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 141 / 174
316.
Parallel matrix algorithms
Triangular system solution
Parallel forward substitution by 2D grid dag
Assume L is predistributed as needed, does not count as input
[Figure: 2D grid dag on L, inputs c0, …, c4 entering at the top, outputs b0, …, b4 leaving at the bottom; diagonal nodes compute bi ← Lii^(−1) · (ci − s), off-diagonal nodes accumulate s ← s + Lij · bj]
comp O(n^2/p) comm O(n) sync O(p)
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 143 / 174
318.
Parallel matrix algorithms
Triangular system solution
Block forward substitution
L · b = c: [L00 0; L10 L11] · [b0; b1] = [c0; c1]
Recursion: two half-sized subproblems
L00 · b0 = c0 by recursion
L11 · b1 = c1 − L10 · b0 by recursion
Sequential work O(n^2)
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 144 / 174
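The two half-sized subproblems can be sketched directly, with plain lists and no parallelism; `forward_substitution` is our name, not the slides'.

```python
def forward_substitution(L, c):
    """Solve L·b = c for lower triangular L by two half-sized recursive calls."""
    n = len(L)
    if n == 1:
        return [c[0] / L[0][0]]
    h = n // 2
    # L00 · b0 = c0, by recursion on the top-left block
    b0 = forward_substitution([row[:h] for row in L[:h]], c[:h])
    # form c1 − L10 · b0, then L11 · b1 = c1 − L10 · b0 by recursion
    c1 = [c[h + i] - sum(L[h + i][j] * b0[j] for j in range(h))
          for i in range(n - h)]
    b1 = forward_substitution([row[h:] for row in L[h:]], c1)
    return b0 + b1
```

The update c1 − L10 · b0 is a matrix-vector product, which is exactly where the parallel version of the next slides plugs in parallel matrix-vector multiplication.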
320.
Parallel matrix algorithms
Triangular system solution
Parallel block forward substitution
Assume L is predistributed as needed, does not count as input
At each level, the two subproblems are dependent, hence the recursion tree is unfolded depth-first
At level log p, a task fits in a single processor
level 0: 1 task of size n (parallel)
level 1: 2 tasks of size n/2
level 2: 2^2 tasks of size n/2^2
…
level log p: p tasks of size n/p (sequential)
…
level log n: n tasks of size 1
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 145 / 174
324.
Parallel matrix algorithms
Triangular system solution
Parallel block forward substitution (contd.)
Recursion levels 0 to log p: block forward substitution using parallel matrix-vector multiplication
Recursion level log p: a designated processor reads the current task's input, performs the task sequentially, and writes back the task's output
comp = O(n^2/p) · (1 + 2 · (1/2)^2 + 2^2 · (1/2^2)^2 + ⋯) + O((n/p)^2 · p) = O(n^2/p) + O(n^2/p) = O(n^2/p)
comm = O(n/p^(1/2)) · (1 + 2 · 1/2 + 2^2 · 1/2^2 + ⋯) + O((n/p) · p) = O(n/p^(1/2)) · log p + O(n) = O(n)
comp O(n^2/p) comm O(n) sync O(p)
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 146 / 174
326.
Parallel matrix algorithms
Gaussian elimination
Let A, L, U be matrices of size n
L is unit lower triangular: Lij = 0 for 0 ≤ i < j < n, 1 for 0 ≤ i = j < n, arbitrary otherwise
U is upper triangular: Uij = 0 for 0 ≤ j < i < n, arbitrary otherwise
A = L · U, where Aik = Σj Lij · Ujk, 0 ≤ j ≤ min(i, k) < n
The LU decomposition problem: given A, find L, U
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 147 / 174
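As a sequential reference point for the problem, here is a sketch of the classic Doolittle elimination: no pivoting, and not the block-recursive parallel algorithm of the following slides; names are ours.

```python
def lu_decompose(A):
    """A = L·U with unit lower triangular L, upper triangular U (no pivoting)."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # row i of U: subtract the already-eliminated contributions
        for k in range(i, n):
            U[i][k] = A[i][k] - sum(L[i][j] * U[j][k] for j in range(i))
        # column i of L: divide by the pivot U[i][i]
        for k in range(i + 1, n):
            L[k][i] = (A[k][i] - sum(L[k][j] * U[j][i] for j in range(i))) / U[i][i]
    return L, U
```

The triple-nested structure performs the O(n^3) work quoted later; the block algorithm reorganises exactly this computation around parallel matrix multiplications.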
334.
Parallel matrix algorithms
Gaussian elimination
Parallel block generic Gaussian elimination
At each level, the two subproblems are dependent, hence the recursion tree is unfolded depth-first
At level α log p, α ≥ 1/2, a task fits in a single processor
level 0: 1 task of size n (parallel)
level 1: 2 tasks of size n/2
level 2: 2^2 tasks of size n/2^2
…
level α log p: p^α tasks of size n/p^α (sequential)
…
level log n: n tasks of size 1
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 151 / 174
338.
Parallel matrix algorithms
Gaussian elimination
Parallel block generic Gaussian elimination (contd.)
Recursion levels 0 to α log p: block generic LU decomposition using parallel matrix multiplication
Recursion level α log p: on each visit, a designated processor reads the current task's input, performs the task sequentially, and writes back the task's output
Threshold level controlled by parameter α: 1/2 ≤ α ≤ 2/3; α ≥ 1/2 needed for comp-optimality; α ≤ 2/3 ensures total comm of threshold tasks is not more than the comm of top-level matrix multiplication
comp O(n^3/p) comm O(n^2/p^α) sync O(p^α)
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 152 / 174
341.
1 Computation by circuits2 Parallel computation models3 Basic parallel algorithms4 Further parallel algorithms5 Parallel matrix algorithms6 Parallel graph algorithms Alexander Tiskin (Warwick) Eﬃcient Parallel Algorithms 154 / 174
343.
Parallel graph algorithms
Algebraic path problem
Semiring: a set S with addition ⊕ and multiplication ⊙
Addition commutative, associative, has identity 0: a ⊕ b = b ⊕ a; a ⊕ (b ⊕ c) = (a ⊕ b) ⊕ c; a ⊕ 0 = 0 ⊕ a = a
Multiplication associative, has annihilator 0 and identity 1: a ⊙ (b ⊙ c) = (a ⊙ b) ⊙ c; a ⊙ 0 = 0 ⊙ a = 0; a ⊙ 1 = 1 ⊙ a = a
Multiplication distributes over addition: a ⊙ (b ⊕ c) = a ⊙ b ⊕ a ⊙ c; (a ⊕ b) ⊙ c = a ⊙ c ⊕ b ⊙ c
In general, no subtraction or division!
Given a semiring S, square matrices of size n over S also form a semiring: ⊕ given by matrix addition, 0 by the zero matrix; ⊙ given by matrix multiplication, 1 by the unit matrix
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 155 / 174
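The specific semirings listed on the next slide, together with matrix multiplication over an arbitrary semiring, can be sketched as follows; the tuple packaging and names are ours.

```python
# Each semiring packaged as (add, zero, mul, one); 0 annihilates under mul
numerical = (lambda a, b: a + b, 0.0, lambda a, b: a * b, 1.0)
boolean = (lambda a, b: a or b, False, lambda a, b: a and b, True)
tropical = (min, float('inf'), lambda a, b: a + b, 0.0)

def semiring_matmul(S, A, B):
    """Matrix product over semiring S: C_ik = sum_j A_ij * B_jk in (add, mul)."""
    add, zero, mul, _one = S
    n = len(A)
    C = [[zero] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            acc = zero
            for j in range(n):
                acc = add(acc, mul(A[i][j], B[j][k]))
            C[i][k] = acc
    return C
```

With `boolean` this is one step of reachability; with `tropical` it is one relaxation step of all-pairs shortest paths, which is the graph interpretation developed on the following slides.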
346.
Parallel graph algorithms
Algebraic path problem
Some specific semirings:
numerical: S = R, ⊕ = +, 0 = 0, ⊙ = ·, 1 = 1
Boolean: S = {0, 1}, ⊕ = ∨, 0 = 0, ⊙ = ∧, 1 = 1
tropical: S = R≥0 ∪ {+∞}, ⊕ = min, 0 = +∞, ⊙ = +, 1 = 0
We will occasionally write ab for a ⊙ b, a^2 for a ⊙ a, etc.
The closure of a: a* = 1 ⊕ a ⊕ a^2 ⊕ a^3 ⊕ ⋯
Numerical closure: a* = 1 + a + a^2 + a^3 + ⋯ = 1/(1 − a) if |a| < 1, undefined otherwise
Boolean closure: a* = 1 ∨ a ∨ a ∨ a ∨ ⋯ = 1
Tropical closure: a* = min(0, a, 2a, 3a, …) = 0
In matrix semirings, closures are more interesting
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 156 / 174
351.
Parallel graph algorithms
Algebraic path problem
A semiring is closed, if an infinite sum a1 ⊕ a2 ⊕ a3 ⊕ ⋯ (e.g. a closure) is always defined; such infinite sums are commutative, associative and distributive
In a closed semiring, every element and every square matrix have a closure
The numerical semiring is not closed: an infinite sum can be divergent
The Boolean semiring is closed: an infinite ∨ is 1, iff at least one term is 1
The tropical semiring is closed: an infinite min is the greatest lower bound
Where defined, these infinite sums are commutative, associative and distributive
Alexander Tiskin (Warwick) Efficient Parallel Algorithms 157 / 174
355.
Parallel graph algorithmsAlgebraic path problemLet A be a matrix of size n over a semiringThe algebraic path problem: compute A∗ = I ⊕ A ⊕ A2 ⊕ A3 ⊕ · · ·Numerical algebraic path problem: equivalent to matrix inversionA∗ = I + A + A2 + · · · = (I − A)−1 , if deﬁnedThe algebraic path problem in a closed semiring: interpreted via aweighted digraph on n nodes with adjacency matrix AAij = length of the edge i → jBoolean A∗ : the graph’s transitive closureTropical A∗ : the graph’s all-pairs shortest paths Alexander Tiskin (Warwick) Eﬃcient Parallel Algorithms 158 / 174
358.
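The Boolean case can be made concrete: the closure A* over the Boolean semiring is the transitive closure of the digraph, computable by Warshall-style elimination. A minimal Python sketch (function name and matrix representation are illustrative, not from the slides):

```python
def transitive_closure(adj):
    """Boolean-semiring closure A* = I ⊕ A ⊕ A^2 ⊕ ···, where ⊕ is OR
    and ⊗ is AND: reachability in the digraph with adjacency matrix adj.

    adj: n x n 0/1 matrix; returns an n x n matrix of booleans.
    """
    n = len(adj)
    # start from I ⊕ A: every node trivially reaches itself
    c = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # elimination step: i reaches j directly, or via k
                c[i][j] = c[i][j] or (c[i][k] and c[k][j])
    return c
```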
Algebraic path problem
Floyd–Warshall algorithm [Floyd, Warshall: 1962]
Works for any closed semiring; we assume tropical, with all 0s on the main diagonal.
Weights may be negative; assume no negative cycles.
Partition A into blocks, with the scalar pivot A_00 = 0 in the top-left corner:
A = ( A_00  A_01 )
    ( A_10  A_11 )
First step of elimination: pivot on A_00. Replace each weight A_ij, i, j ≠ 0, with A_i0 ⊗ A_0j if that gives a shortcut from i to j:
A_11 ← A_11 ⊕ A_10 ⊗ A_01 = min(A_11, A_10 + A_01)
Continue elimination on the reduced matrix A_11.
Generic Gaussian elimination in disguise.
Sequential work O(n^3)
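For reference, the sequential tropical Floyd–Warshall described above can be sketched in Python (representation assumed: an n × n list of lists, with float('inf') for a missing edge):

```python
INF = float("inf")

def floyd_warshall(a):
    """All-pairs shortest paths by Floyd-Warshall over the tropical semiring.

    a: n x n matrix with a[i][j] = edge length i -> j (INF if absent)
       and a[i][i] = 0.  Negative edge lengths are allowed, but there
       must be no negative cycles.  Returns the closure A*.
    """
    n = len(a)
    d = [row[:] for row in a]          # work on a copy
    for k in range(n):                 # eliminate pivot k
        for i in range(n):
            for j in range(n):
                # replace d[i][j] if going through k gives a shortcut
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```

The triple loop does O(n^3) work, matching the sequential bound on the slide.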
Algebraic path problem
Parallel algebraic path computation
Similar to LU decomposition by block generic Gaussian elimination.
The recursion tree is unfolded depth-first.
Recursion levels 0 to α log p: block Floyd–Warshall using parallel matrix multiplication.
Recursion level α log p: on each visit, a designated processor reads the current task's input, performs the task sequentially, and writes back the task's output.
Threshold level controlled by parameter α, 1/2 ≤ α ≤ 2/3:
comp O(n^3/p)   comm O(n^2/p^α)   sync O(p^α)
Parallel graph algorithms
All-pairs shortest paths
The all-pairs shortest paths problem: the algebraic path problem over the tropical semiring
           S              ⊕    0    ⊗   1
tropical   R≥0 ∪ {+∞}     min  +∞   +   0
We continue to use the generic notation: ⊕ for min, ⊗ for +.
To improve on the generic algebraic path algorithm, we must exploit the tropical semiring's idempotence: a ⊕ a = min(a, a) = a.
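Under this notation, one matrix ⊗-multiplication over the tropical semiring is a min-plus product. A small self-contained Python sketch (illustrative, with float('inf') standing for the semiring zero +∞):

```python
INF = float("inf")   # the tropical semiring zero: identity for min

def minplus_mul(a, b):
    """Tropical matrix product: (a ⊗ b)[i][j] = min over k of a[i][k] + b[k][j].

    The generic ⊕ is min, the generic ⊗ is +; the semiring one is 0.
    """
    n = len(a)
    return [[min(a[i][k] + b[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]
```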
All-pairs shortest paths
Let A be a matrix of size n over the tropical semiring, defining a weighted directed graph:
A_ij = length of the edge i → j
A_ij ≥ 0, A_ii = 1 = 0, for 0 ≤ i, j < n
Path length: sum (⊗-product) of all its edge lengths.
Path size: its total number of edges (by definition, ≤ n).
(A^k)_ij = length of the shortest path i ⇝ j of size ≤ k
(A*)_ij = length of the shortest path i ⇝ j (of any size)
The all-pairs shortest paths problem:
A* = I ⊕ A ⊕ A^2 ⊕ · · · = I ⊕ A ⊕ A^2 ⊕ · · · ⊕ A^n = (I ⊕ A)^n = A^n
All-pairs shortest paths
Dijkstra's algorithm [Dijkstra: 1959]
Computes single-source shortest paths from a fixed source (say, node 0).
Ranks all nodes by distance from node 0: nearest, second nearest, etc.
Every time a node i has been ranked: replace each weight A_0j, j unranked, with A_0i + A_ij if that gives a shortcut from 0 to j.
Assign the next rank to the unranked node closest to node 0 and repeat.
It is essential that the edge lengths are nonnegative.
Sequential work O(n^2)
All-pairs shortest paths: multi-Dijkstra, i.e. running Dijkstra's algorithm independently from every node as a source.
Sequential work O(n^3)
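A sketch of the O(n^2) array-based Dijkstra and of multi-Dijkstra in Python (names and representation are illustrative; edge lengths must be nonnegative):

```python
INF = float("inf")

def dijkstra(a, s):
    """Single-source shortest paths from node s: O(n^2) array version.

    a: n x n tropical adjacency matrix, nonnegative lengths, a[i][i] = 0.
    Ranks nodes by distance from s; each time a node i is ranked,
    every unranked j is relaxed via the shortcut dist[i] + a[i][j].
    """
    n = len(a)
    dist = a[s][:]
    dist[s] = 0
    ranked = [False] * n
    for _ in range(n):
        # assign the next rank to the unranked node closest to s
        i = min((v for v in range(n) if not ranked[v]),
                key=lambda v: dist[v])
        ranked[i] = True
        for j in range(n):
            if not ranked[j] and dist[i] + a[i][j] < dist[j]:
                dist[j] = dist[i] + a[i][j]
    return dist

def multi_dijkstra(a):
    """All-pairs shortest paths: an independent Dijkstra from every source.

    n independent instances of O(n^2) work each: O(n^3) in total.
    """
    return [dijkstra(a, s) for s in range(len(a))]
```

The n instances are independent, which is exactly what the parallel multi-Dijkstra on the next slide exploits.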
All-pairs shortest paths
Parallel all-pairs shortest paths by multi-Dijkstra
Every processor
reads matrix A and is assigned a subset of n/p nodes
runs n/p independent instances of Dijkstra's algorithm from its assigned nodes
writes back the resulting n^2/p shortest distances
comp O(n^3/p)   comm O(n^2)   sync O(1)
All-pairs shortest paths
Parallel all-pairs shortest paths: summary so far (all with comp O(n^3/p))
Floyd–Warshall, α = 2/3:   comm O(n^2/p^{2/3})   sync O(p^{2/3})
Floyd–Warshall, α = 1/2:   comm O(n^2/p^{1/2})   sync O(p^{1/2})
Multi-Dijkstra:            comm O(n^2)           sync O(1)
Coming next:               comm O(n^2/p^{2/3})   sync O(log p)
All-pairs shortest paths
Path doubling
Compute A, A^2, A^4 = (A^2)^2, A^8 = (A^4)^2, . . . , A^n = A*
Overall, log n rounds of matrix ⊗-multiplication: looks promising. . .
Sequential work O(n^3 log n): not work-optimal!
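Path doubling by repeated squaring can be sketched directly (a sequential illustration only, assuming A_ii = 0 and no negative cycles; names are illustrative):

```python
INF = float("inf")

def minplus_mul(a, b):
    """Tropical matrix product: (a ⊗ b)[i][j] = min over k of a[i][k] + b[k][j]."""
    n = len(a)
    return [[min(a[i][k] + b[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def apsp_path_doubling(a):
    """A* = A^n by repeated squaring: A, A^2, A^4, ...

    Since a[i][i] == 0, the power A^k already covers all paths of
    size <= k, so about log2(n) squarings suffice.  Each squaring
    costs O(n^3): O(n^3 log n) in total, i.e. not work-optimal.
    """
    n = len(a)
    d = [row[:] for row in a]
    k = 1
    while k < n:               # after the loop, d = A^m with m >= n
        d = minplus_mul(d, d)
        k *= 2
    return d
```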
All-pairs shortest paths
Selective path doubling
Idea: remove the redundancy in path doubling by keeping track of path sizes.
Assume we already have A^k. The next round is as follows.
Let (A^{≤k})_ij = length of the shortest path i ⇝ j of size ≤ k
Let (A^{=k})_ij = length of the shortest path i ⇝ j of size exactly k
We have A^k = A^{≤k} = A^{=0} ⊕ · · · ⊕ A^{=k}.
Consider A^{=k/2}, . . . , A^{=k}. The total number of non-0 elements in these k/2 matrices is at most n^2, on average 2n^2/k per matrix. Hence, for some l ≤ k/2, matrix A^{=k/2+l} has at most 2n^2/k non-0 elements.
Compute (I ⊕ A^{=k/2+l}) ⊗ A^{≤k} = A^{≤3k/2+l}. This is a sparse-by-dense matrix product, requiring at most (2n^2/k) · n = 2n^3/k elementary multiplications.
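The size-tracked matrices A^{=k} can be illustrated by taking min-plus powers of the raw edge matrix, whose diagonal is set to +∞ so that each power covers walks of exactly that size (a hedged sketch with illustrative names; counting finite entries shows which layers are sparse, as in the averaging argument above):

```python
INF = float("inf")

def minplus_mul(a, b):
    """Tropical matrix product: (a ⊗ b)[i][j] = min over k of a[i][k] + b[k][j]."""
    n = len(a)
    return [[min(a[i][k] + b[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def exact_size_layers(e, kmax):
    """Layers A^{=0}, A^{=1}, ..., A^{=kmax}.

    e: edge matrix with e[i][i] = INF (no free self-loops), so the
    k-th min-plus power of e gives shortest walks of size exactly k.
    A^{=0} is the tropical identity I.
    """
    n = len(e)
    identity = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    layers = [identity]
    cur = identity
    for _ in range(kmax):
        cur = minplus_mul(cur, e)
        layers.append(cur)
    return layers

def nonzero_count(m):
    """Number of non-0 (i.e. finite) entries of a tropical matrix."""
    return sum(x != INF for row in m for x in row)
```

Summing nonzero_count over a band of layers and picking the sparsest one is the selection step that makes the sparse-by-dense product cheap.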
All-pairs shortest paths
Parallel all-pairs shortest paths by selective path doubling
All processors compute A, A^{≤3/2+···}, A^{≤(3/2)^2+···}, . . . , A^{≤p+···} by ≤ log_{3/2} p rounds of parallel sparse-by-dense matrix ⊗-multiplication.
Consider A^{=0}, . . . , A^{=p}. The total number of non-0 elements in these matrices is at most n^2, on average n^2/p per matrix. Hence, for some q ≤ p/2, matrices A^{=q} and A^{=p−q} have together at most 2n^2/p non-0 elements.
Every processor reads A^{=q} and A^{=p−q} and computes A^{=q} ⊗ A^{=p−q} = A^{=p}.
All processors compute (A^{=p})* by parallel multi-Dijkstra, and then (A^{=p})* ⊗ A^{≤p+···} = A* by parallel matrix ⊗-multiplication.
Use of multi-Dijkstra requires that all edge lengths in A are nonnegative.
comp O(n^3/p)   comm O(n^2/p^{2/3})   sync O(log p)
All-pairs shortest paths
Parallel all-pairs shortest paths by selective path doubling (contd.)
Now let A have arbitrary (nonnegative or negative) edge lengths. We still assume there are no negative-length cycles.
All processors compute A, A^{≤3/2+···}, A^{≤(3/2)^2+···}, . . . , A^{≤p^2+···} by ≤ 2 log_{3/2} p rounds of parallel sparse-by-dense matrix ⊗-multiplication.
Let A^{=(p)} = A^{=p} ⊕ A^{=2p} ⊕ · · · ⊕ A^{=p^2}
Let A^{=(p)−q} = A^{=p−q} ⊕ A^{=2p−q} ⊕ · · · ⊕ A^{=p^2−q}
Consider A^{=0}, . . . , A^{=p/2} and A^{=(p)−p/2}, . . . , A^{=(p)}. The total number of non-0 elements in these matrices is at most n^2, on average n^2/p per matrix. Hence, for some q ≤ p/2, matrices A^{=q} and A^{=(p)−q} have together at most 2n^2/p non-0 elements.
All-pairs shortest paths
Parallel all-pairs shortest paths by selective path doubling (contd.)
Every processor
reads A^{=q} and A^{=(p)−q} and computes A^{=q} ⊗ A^{=(p)−q} = A^{=(p)}
computes (A^{=(p)})* = (A^{=p})* by sequential selective path doubling
All processors compute (A^{=p})* ⊗ A^{≤p} = A* by parallel matrix ⊗-multiplication.
comp O(n^3/p)   comm O(n^2/p^{2/3})   sync O(log p)