DFA minimization algorithms in map reduce

Explaining the implementation and analysis of two well-known DFA minimization algorithms, namely Moore and Hopcroft, in Map-Reduce using Hadoop. Cost analysis and complexity are described.
Please follow this link: http://spectrum.library.concordia.ca/980838/


  1. 1. DFA Minimization Algorithms in Map-Reduce Iraj Hedayati Somarin Master's Thesis Defense – January 2016 Computer Science and Software Engineering Faculty of Engineering and Computer Science Concordia University Supervisor: Gösta K. Grahne Examiner: Brigitte Jaumard Examiner: Hovhannes A. Harutyunyan Chair: Rajagopalan Jayakumar
  2. 2. Outline • Introduction • DFA Minimization in Map-Reduce • Cost Analysis • Experimental Results • Conclusion 1
  3. 3. INTRODUCTION An introduction to the problem and the related work done so far 2
  4. 4. DFA, Big-Data and our Motivation • Finite Automata • Deterministic Finite Automata 𝐴 = 〈𝑄, Σ, 𝛿, 𝑠, 𝐹〉 • DFA Minimization is the process of: • Removing unreachable states • Merging non-distinguishable states • What is Big-Data? (e.g. peta, equal to 2^50 or 10^15) • Insufficient study of DFA minimization for data-intensive applications and parallel environments 3
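To make the 5-tuple and the first minimization step concrete, here is a minimal Python sketch (illustrative only, not code from the thesis; the example automaton and the helper name are made up): the DFA 𝐴 = 〈𝑄, Σ, 𝛿, 𝑠, 𝐹〉 is held in plain containers and unreachable states are found by a breadth-first traversal from the start state.

```python
from collections import deque

# A DFA A = <Q, Sigma, delta, s, F>, represented with plain Python containers.
Q     = {0, 1, 2, 3}                 # states
Sigma = {"a", "b"}                   # alphabet
delta = {(0, "a"): 1, (0, "b"): 2,   # transition function delta: Q x Sigma -> Q
         (1, "a"): 1, (1, "b"): 2,
         (2, "a"): 1, (2, "b"): 2,
         (3, "a"): 0, (3, "b"): 3}   # state 3 is unreachable from the start state
s = 0                                # start state
F = {2}                              # accepting states

def reachable_states(Q, Sigma, delta, s):
    """Step 1 of minimization: keep only states reachable from s (BFS)."""
    seen, queue = {s}, deque([s])
    while queue:
        p = queue.popleft()
        for a in Sigma:
            q = delta[(p, a)]
            if q not in seen:
                seen.add(q)
                queue.append(q)
    return seen

print(reachable_states(Q, Sigma, delta, s))   # {0, 1, 2}; state 3 is dropped
```

Merging non-distinguishable states is the job of Moore's and Hopcroft's algorithms on the following slides.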
  5. 5. DFA Minimization Methods (Watson, 1993) (figure: taxonomy of methods – Equivalence of States (≡), Equivalence Relation, Bottom-Up, Top-Down, Layer-wise, Unordered, State Pairs, Point-Wise, Brzozowski) Denote 𝜋 = {𝐵1, 𝐵2, …, 𝐵𝑚} as a partition of 𝑄; then 𝑝 ≡𝜋 𝑞 ↔ ∀𝑤 ∈ Σ*, ∃𝑖: 𝛿(𝑝, 𝑤) ∈ 𝐵𝑖 ∧ 𝛿(𝑞, 𝑤) ∈ 𝐵𝑖 4
  6. 6. Moore’s Algorithm (Moore, 1956) • Input is DFA 𝐴 = 〈𝑄, Σ, 𝛿, 𝑠, 𝐹〉 where 𝑘 = |Σ| and 𝑛 = |𝑄| • Initialize partition 𝜋 = {0, 1} over 𝑄 where ∀𝑝 ∈ 𝑄: 𝜋(𝑝) = 0 if 𝑝 ∈ 𝑄 ∖ 𝐹 and 𝜋(𝑝) = 1 if 𝑝 ∈ 𝐹 • Iteratively refine the partition using the equivalence relation of iteration 𝑖 (≡𝑖): 𝑝 ≡𝑖 𝑞 ↔ 𝑝 ≡𝑖−1 𝑞 ∧ ∀𝑎 ∈ Σ, 𝛿(𝑝, 𝑎) ≡𝑖−1 𝛿(𝑞, 𝑎) • The initial partition is ≡0 • Complexity 𝑂(𝑘𝑛^2) 5
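As a reference point for the Map-Reduce versions later, here is a compact serial sketch of Moore's refinement in Python (illustrative, not the thesis implementation): 𝜋 is a dict from state to block number, and each round refines blocks by the signature (𝜋(𝑝), 𝜋(𝛿(𝑝, 𝑎1)), …, 𝜋(𝛿(𝑝, 𝑎𝑘))).

```python
def moore_minimize_blocks(Q, Sigma, delta, F):
    """Return the final partition pi: state -> block number (Moore's refinement)."""
    # pi_0: block 1 for accepting states, block 0 for all other states.
    pi = {p: (1 if p in F else 0) for p in Q}
    while True:
        # Signature of p in round i: its old block plus the old blocks of its successors.
        sig = {p: (pi[p],) + tuple(pi[delta[(p, a)]] for a in sorted(Sigma)) for p in Q}
        # Rename the distinct signatures to consecutive integers (new block numbers).
        renumber = {t: i for i, t in enumerate(sorted(set(sig.values())))}
        new_pi = {p: renumber[sig[p]] for p in Q}
        if len(set(new_pi.values())) == len(set(pi.values())):
            return new_pi              # no block was split, so the partition is stable
        pi = new_pi
```

Each round touches all 𝑘𝑛 transitions and at most 𝑂(𝑛) rounds are needed, which is where the 𝑂(𝑘𝑛^2) bound comes from (the renaming step adds a logarithmic factor in this naive sketch).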
  7. 7. Hopcroft’s Algorithm (Hopcroft, 1971) • The idea is to avoid some unnecessary operations • Input is DFA 𝐴 = 〈𝑄, Σ, 𝛿, 𝑠, 𝐹〉 where 𝑘 = |Σ| and 𝑛 = |𝑄| • Initialize partition 𝜋 = {0, 1} over 𝑄 where ∀𝑝 ∈ 𝑄: 𝜋(𝑝) = 0 if 𝑝 ∈ 𝑄 ∖ 𝐹 and 𝜋(𝑝) = 1 if 𝑝 ∈ 𝐹 • Keep a list of splitters • Iteratively divide blocks of the partition using splitter 〈𝑃, 𝑎〉: 𝐵 ÷ 〈𝑃, 𝑎〉 = {𝐵1, 𝐵2} where 𝐵1 = {𝑞 ∈ 𝐵 ∶ 𝛿(𝑞, 𝑎) ∈ 𝑃} and 𝐵2 = {𝑞 ∈ 𝐵 ∶ 𝛿(𝑞, 𝑎) ∉ 𝑃} • Update the list of splitters • Complexity = 𝑂(𝑘𝑛 log 𝑛); Number of Iterations = 𝑂(𝑘𝑛) 6
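A serial sketch of the splitter-based refinement in Python (illustrative; it follows the 〈𝑃, 𝑎〉 splitter discipline above, including the "keep the smaller half" rule, but uses plain set operations rather than the bookkeeping needed to actually reach 𝑂(𝑘𝑛 log 𝑛)):

```python
def hopcroft_minimize_blocks(Q, Sigma, delta, F):
    """Partition refinement driven by a worklist QUE of splitters <P, a>."""
    blocks = {frozenset(F), frozenset(set(Q) - set(F))} - {frozenset()}
    que = {(min(blocks, key=len), a) for a in Sigma}   # seed QUE with the smaller block
    while que:
        P, a = que.pop()
        X = {q for q in Q if delta[(q, a)] in P}       # states that enter P on symbol a
        for B in list(blocks):
            B1, B2 = B & X, B - X                      # B / <P, a> = {B1, B2}
            if B1 and B2:                              # the splitter really splits B
                blocks.remove(B)
                blocks |= {B1, B2}
                for b in Sigma:                        # update the list of splitters
                    if (B, b) in que:
                        que.remove((B, b))
                        que |= {(B1, b), (B2, b)}
                    else:
                        que.add((min((B1, B2), key=len), b))
    return {p: i for i, B in enumerate(blocks) for p in B}
```

Adding only the smaller of 𝐵1, 𝐵2 as a new splitter when the parent block is not waiting in QUE is the trick behind the 𝑂(𝑘𝑛 log 𝑛) bound: a state can land in the smaller half only 𝑂(log 𝑛) times per symbol.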
  8. 8. Hopcroft’s Algorithm (Example) (figure: a block 𝐵 is split by a splitter into 𝐵1 and 𝐵2; before the split 𝑄𝑈𝐸 = {〈𝑃, 𝑎〉, 〈𝑃1, 𝑎〉, 〈𝑃2, 𝑎〉}, after the split 𝑄𝑈𝐸 = 𝑄𝑈𝐸 ∪ {〈𝐵1, 𝑎〉}) 7
  9. 9. Map-Reduce Model (figure: input data is read from the DFS, processed by mappers, shuffled to reducers, and the results are written back to the DFS) Replication Rate ℛ = Mapped Data / Original Data 8
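A toy illustration of the replication rate (hypothetical record layout, not the thesis schema): each record is routed to the reducer of its own key, and some records are additionally copied to a second reducer, so ℛ is just the number of emitted key-value pairs divided by the number of input records.

```python
def toy_mapper(record, num_reducers=4):
    """Emit (reducer, record) pairs; some records are replicated to a second reducer."""
    key, payload, also_needed_by = record
    yield hash(key) % num_reducers, record                 # every record goes to its own reducer
    if also_needed_by is not None:
        yield hash(also_needed_by) % num_reducers, record  # ...and some are copied elsewhere

records = [("p", 1, None), ("q", 2, "p"), ("r", 3, "p"), ("s", 4, None)]
mapped = [pair for rec in records for pair in toy_mapper(rec)]
print(len(mapped) / len(records))   # Replication Rate R = mapped / original = 6/4 = 1.5
```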
  10. 10. Related Works in Parallel DFA Minimization 1) Employing the EREW-PRAM model (Moore’s method), (𝑂(𝑘𝑛), 𝑂(𝑛)) (Ravikumar and Xiong 1996) 2) Employing the CRCW-PRAM model (Moore’s method), (𝑂(𝑘𝑛 log 𝑛), 𝑂(𝑛 log 𝑛)) (Tewari et al. 2002) 3) Employing the Map-Reduce model (Moore’s method) [Moore-MR], ℛ = 3/2 (Harrafi 2015) • The challenge is how to store block numbers: 1) Parallel in-block sorting, renaming blocks serially 2) Parallel Perfect Hashing Function and partial sum 3) No action is taken 9
  11. 11. Cost Model • Communication Complexity (Yao 1979 & Kushilevitz 1997) • The Lower Bound Recipe for Replication Rate (Afrati et al. 2013) • Computational Complexity of Map-Reduce (Turan 2015) 10
  12. 12. Cost Model – Communication Complexity • Yao’s two-party model: Alice holds 𝑥 ∈ {0,1}^𝑛, Bob holds 𝑦 ∈ {0,1}^𝑛, and they want to compute 𝑓: {0,1}^𝑛 × {0,1}^𝑛 → {0,1}. How much communication 𝒟(𝑓) is required? • Upper Bound (Worst Case): 𝒟(𝑓) ≤ 𝑛 + 1 • Lower Bound: 𝒟(𝑓) ≥ log 𝒞(𝑓), where 𝒞(𝑓) is the number of 𝑓-monochromatic rectangles 𝐴 × 𝐵 (𝐴 ⊂ {0,1}^𝑛, 𝐵 ⊂ {0,1}^𝑛) needed to cover the inputs • The fooling-set technique is a well-known method for lower-bounding the number of 𝑓-monochromatic rectangles 11
  13. 13. Cost Model – Lower Bound Recipe (Afrati et al. 2013) (figure: the input 𝐼 is spread over reducers 1..𝑛 with loads 𝜌1, 𝜌2, …, 𝜌𝑛, each bounded by the reducer capacity 𝜌; reducer 𝑖 can cover at most 𝑔(𝜌𝑖) of the output 𝑂) ℛ = (∑𝑖=1..𝑛 𝜌𝑖) / |𝐼|; ∑𝑖=1..𝑛 𝑔(𝜌𝑖) ≥ |𝑂| → ∑𝑖=1..𝑛 𝜌𝑖 ⋅ 𝑔(𝜌𝑖)/𝜌𝑖 ≥ |𝑂| ⟹ (𝑔(𝜌)/𝜌) ⋅ ∑𝑖=1..𝑛 𝜌𝑖 ≥ |𝑂| → ℛ ≥ 𝜌|𝑂| / (𝑔(𝜌)|𝐼|) 12
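Written out step by step (the same derivation as the slide; the assumption that 𝑔(𝑞)/𝑞 is non-decreasing is the standard one in the recipe of Afrati et al. 2013 and is what justifies the middle implication):

```latex
% Reducer i receives \rho_i \le \rho inputs and can cover at most g(\rho_i) outputs;
% together the reducers must cover the whole output O.
\begin{align*}
  \mathcal{R} = \frac{\sum_{i=1}^{n}\rho_i}{|I|},
  \qquad
  \sum_{i=1}^{n} g(\rho_i) \ge |O| .
\end{align*}
% Assuming g(q)/q is non-decreasing, g(\rho_i)/\rho_i \le g(\rho)/\rho for every i, hence:
\begin{align*}
  \sum_{i=1}^{n} \rho_i \,\frac{g(\rho_i)}{\rho_i} \ge |O|
  \;\Longrightarrow\;
  \frac{g(\rho)}{\rho} \sum_{i=1}^{n} \rho_i \ge |O|
  \;\Longrightarrow\;
  \mathcal{R} = \frac{\sum_{i=1}^{n}\rho_i}{|I|} \ge \frac{\rho\,|O|}{g(\rho)\,|I|} .
\end{align*}
```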
  14. 14. Cost Model – Computational Complexity (Turan 2015) • Let us denote a Turing machine 𝑀 = (𝑚, 𝑟, 𝑛, 𝜌) where: • 𝑚 indicates whether it is a mapping task (𝑚 = 1) or a reducer task (𝑚 = 0) • 𝑟 indicates the round number • 𝑛 indicates the input size • 𝜌 indicates the reducer size • 𝑀𝑅𝐶[𝑓(𝑛), 𝑔(𝑛)]: ∃𝑐, 0 < 𝑐 < 1, such that there is an 𝑂(𝑛^𝑐)-space and 𝑂(𝑔(𝑛))-time Turing machine 𝑀 = (𝑚, 𝑟, 𝑛, 𝜌), and the number of rounds is 𝑅 = 𝑂(𝑓(𝑛)) 13
  15. 15. DFA MINIMIZATION IN MAP-REDUCE Proposed algorithms for minimizing a DFA in the Map-Reduce model 14
  16. 16. Enhancement to Moore-MR • Moore-MR (Harrafi 2015): • Input 𝐴 = 〈𝑄, Σ, 𝛿, 𝑠, 𝐹〉 • Pre-Processing: generate Δ from 𝛿, with records 〈𝑝, 𝑎, 𝑞, 𝜋(𝑝) ∈ {0,1}, 𝐷 ∈ {+, −}〉 • Mapping Schema: map every transition record of Δ based on 𝑝 if 𝐷 = + and based on both 𝑝 and 𝑞 if 𝐷 = −, using ℎ: 𝑄 → {1, 2, …, 𝑛} • Reducer Task: compute the new block number using Moore’s method • Note that, in order to accomplish the reducer task in reducer 𝑝, it requires 𝜋(𝑞) for every state it has a transition to; transitions with 𝐷 = − are responsible for carrying this data • The challenge is that new block numbers are a concatenation of the old block number and 𝑘 other block numbers, so after round 𝑟 the size of each is (𝑘 + 1)^𝑟 15
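A sketch of the mapping schema just described, in Python (the record layout follows the slide; the hash ℎ and the generator interface are illustrative, not the thesis code): '+' records are routed only by their source state, while '−' records are routed by both end points, which is how the block numbers a reducer needs get carried to it, per the slide's note.

```python
def moore_mr_map(record, h):
    """Moore-MR mapping schema for one transition record <p, a, q, pi_p, D>."""
    p, a, q, pi_p, D = record
    yield h(p), record          # D = '+': mapped based on p only
    if D == "-":
        yield h(q), record      # D = '-': mapped based on both p and q
```

If Δ holds one '+' and one '−' record per transition, half the records are emitted once and half twice, which matches the ℛ = 3/2 reported for Moore-MR.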
  17. 17. Enhancement to Moore-MR: PPHF-MR • Given 𝑆 ⊂ 𝑆′ and a range 𝑅 with |𝑅| ≪ |𝑆′|, a 𝑃𝐻𝐹: 𝑆 → 𝑅 is a one-to-one function • Mapping: map every record 〈𝑝, 𝑎, 𝑞, 𝜋(𝑝), 𝐷〉 to ℎ(𝜋(𝑝)) • Reducer Task: assign new block numbers from the range [𝑗 ⋅ 𝑛, (𝑗 + 1) ⋅ 𝑛 − 1], where 𝑗 is the reducer number • Moore-MR-PPHF is obtained by applying PPHF-MR after each iteration of Moore-MR 16
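A minimal sketch of PPHF-MR in Python (record layout as above; the modular hash and the generator interface are assumptions): the mapper groups records by their current, possibly very long, block identifier, and reducer 𝑗 renames each distinct identifier it sees to a fresh small integer from its private range [𝑗 ⋅ 𝑛, (𝑗 + 1) ⋅ 𝑛 − 1], so the new numbers are globally unique without any coordination between reducers.

```python
def pphf_map(record, h, num_reducers):
    """PPHF-MR mapper: all records of one (old) block meet at the same reducer."""
    p, a, q, pi_p, D = record
    yield h(pi_p) % num_reducers, record

def pphf_reduce(j, records, n):
    """PPHF-MR reducer number j: rename old block ids into [j*n, (j+1)*n - 1]."""
    rename = {}                                    # old block id -> new small block number
    for p, a, q, pi_p, D in records:
        if pi_p not in rename:
            rename[pi_p] = j * n + len(rename)     # next unused number in this reducer's range
        yield p, a, q, rename[pi_p], D
```

Running this after every Moore-MR round keeps block numbers small instead of letting them grow by a factor of 𝑘 + 1 per round.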
  18. 18. Hopcroft-MR (figure: job pipeline) Pre-Processing (mapper/reducer, keyed by ℎ(𝑞)) turns 𝛿 into Δ with transition tuples 〈𝑝, 𝑎, 𝑞, 𝜋(𝑝), 𝜋(𝑞)〉; then, iterating while QUE is not empty: PartitionDetect (mapper/reducer, keyed by ℎ(𝑞)) produces blocks[a, Bi] with block tuples 〈𝑎, 𝑞, 𝜋(𝑞)〉; BlockUpdate (mapper/reducer, keyed by ℎ(𝑝)) reads Δ and blocks[a, Bi] and produces update tuples 〈𝑝, 𝜋(𝑝), 𝜋(𝑝)𝑛𝑒𝑤〉; PPHF-MR (mapper/reducer, keyed by ℎ(𝜋(𝑝))) produces the new Δ; finally the minimal DFA is constructed 17
  19. 19. Hopcroft-MR vs. Hopcroft-MR-PAR • In Hopcroft-MR we pick one splitter at a time, while in Hopcroft-MR-PAR we pick all the splitters from QUE • In Hopcroft-MR, 𝜋(𝑝)𝑛𝑒𝑤 = 𝜋(𝑝) + |𝜋| • In Hopcroft-MR-PAR, 𝜋(𝑝)𝑛𝑒𝑤 = |𝜋| × A(𝑝) + 𝜋(𝑝), where A(𝑝) is a bit vector with 〈𝑃, 𝑎〉 ∈ 𝑄𝑈𝐸 ∧ 𝑞 ∈ 𝑃 ∧ 𝛿(𝑝, 𝑎) = 𝑞 → A(𝑝)[𝑎] = 1 18
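A small Python sketch of the two update rules (my reading of the formulas above; treating the bit vector A(𝑝) as an integer, and the exact bit indexing, are assumptions):

```python
def new_block_hopcroft_mr(pi_p, num_blocks, moved):
    """Hopcroft-MR: a single splitter <P, a> per round. A state whose a-transition
    enters P ('moved') gets a fresh block id offset by |pi|; others keep pi(p)."""
    return pi_p + num_blocks if moved else pi_p

def new_block_hopcroft_mr_par(pi_p, num_blocks, A_p):
    """Hopcroft-MR-PAR: all splitters in QUE applied in one round. A_p is the bit
    vector A(p) read as an integer; combining it with the old id keeps states that
    behave differently on the splitters in different blocks."""
    return num_blocks * A_p + pi_p
```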
  20. 20. COST ANALYSIS Analyzing cost measures for the proposed algorithms, as well as finding lower and upper bounds on each 19
  21. 21. Communication Cost Bounds • Upper bound for the DFA minimization problem in parallel environments: 𝐷𝐶𝐶(DFA minimization) ≤ 𝑂(𝑘𝑛^3 log 𝑛), where 𝑘 = |Σ| and 𝑛 = |𝑄| • Lower bound for the DFA minimization problem in parallel environments: 𝐷𝐶𝐶(DFA minimization) = Ω(𝑘𝑛 log 𝑛) 20
  22. 22. Lower Bound on Replication Rate • ℛ ≥ (𝜌 × |𝑂|) / (𝑔(𝜌) × |𝐼|) • 𝑔(𝜌): for every input record (transition) a reducer produces exactly one record of output, hence 𝑔(𝜌) = 𝜌 • The output is exactly the size of the input, containing the updated transitions, hence |𝑂| = |𝐼| • ℛ ≥ (𝜌 × |𝑂|) / (𝑔(𝜌) × |𝐼|) = (𝜌 × |𝐼|) / (𝜌 × |𝐼|) = 1 21
  23. 23. Moore-MR-PPHF • ℛ = 3/2 • Communication Cost = 𝑟(ℛ|𝐼| + |𝑂|), where 𝑟 is the number of Map-Reduce rounds • Size of each record = 𝑂(log 𝑛 + log 𝑘) • |𝑂| ∼ |𝐼| = 𝑘𝑛(log 𝑛 + log 𝑘) • 𝑟 = 𝑂(𝑛) • 𝐶𝐶 = 𝑂(𝑛 ⋅ 𝑘𝑛(log 𝑛 + log 𝑘)) = 𝑂(𝑘𝑛^2 (log 𝑛 + log 𝑘)) 22
  24. 24. Hopcroft-MR • ℛ = 1 • Communication Cost = 𝐶𝐶_Detection + 𝐶𝐶_Update + 𝐶𝐶_PPHF • 𝐶𝐶_Detection = 𝑂(𝑛 log 𝑛 (log 𝑛 + log 𝑘) + 𝑛 log 𝑛 (log 𝑛 + log 𝑘) + 𝑛 log 𝑛 (log 𝑛 + log 𝑘)) = 𝑂(𝑛 log 𝑛 (log 𝑛 + log 𝑘)) • 𝐶𝐶_Update = 𝐶𝐶_UpdateMapper + 𝐶𝐶_UpdateReducer = 𝐶𝐶_UpdateMapper • 𝐶𝐶_UpdateMapper = 𝑂(𝑘𝑛 ⋅ (𝑘𝑛 ⋅ (log 𝑘 + log 𝑛) + 𝑛 ⋅ (log 𝑘 + log 𝑛))) + 𝑂(𝑛 log 𝑛 ⋅ log 𝑛) = 𝑂((𝑘𝑛)^2 (log 𝑛 + log 𝑘)) • 𝐶𝐶_PPHF = 𝑂(𝑘𝑛^2 (log 𝑛 + log 𝑘)) • Communication Cost = 𝑂((𝑘𝑛)^2 (log 𝑛 + log 𝑘)) 23
  25. 25. Hopcroft-MR-PAR • ℛ = 1 • Communication Cost = 𝐶𝐶_Detection + 𝐶𝐶_Update + 𝐶𝐶_PPHF • 𝐶𝐶_Update = 𝑂(𝑘𝑛^2 (log 𝑛 + log 𝑘)) • Communication Cost = 𝑂(𝑘𝑛^2 (log 𝑛 + log 𝑘)) 24
  26. 26. Comparison of Complexity Measures (Replication Rate | Communication Cost | Sensitive to Skewness) • Lower Bound: 1 | 𝑂(𝑘𝑛 log 𝑛) | - • Moore-MR (Harrafi 2015): 3/2 | 𝑂(𝑛𝑘^𝑛) | No • Moore-MR-PPHF: 3/2 | 𝑂(𝑘𝑛^2 (log 𝑛 + log 𝑘)) | No • Hopcroft-MR: 1 | 𝑂((𝑘𝑛)^2 (log 𝑛 + log 𝑘)) | Yes • Hopcroft-MR-PAR: 1 | 𝑂(𝑘𝑛^2 (log 𝑛 + log 𝑘)) | Yes 25
  27. 27. EXPERIMENTAL RESULTS Plotting the results gathered from running the proposed algorithms on different data sets 26
  28. 28. Data Generator – Circular (figures: input DFA and minimized DFA) 27
  29. 29. Data Generator – Duplicated Random (figures: input DFA and minimized DFA) 28
  30. 30. Data Generator – Linear 29
  31. 31. Moore-MR vs. Moore-MR-PPHF 30
  32. 32. Circular DFA 31
  33. 33. Replicated Random DFA 32
  34. 34. Number of Rounds 33
  35. 35. CONCLUSION Concluding the work done in this thesis and suggesting future work and further questions 34
  36. 36. Conclusion • In this work we studied DFA minimization algorithms in Map-Reduce and PRAM • Proposed an enhancement to a DFA minimization algorithm in Map-Reduce by introducing PPHF in Map-Reduce • Proposed a new algorithm in Map-Reduce based on Hopcroft’s method • Found lower bounds on the Replication Rate in Map-Reduce and the Communication Cost in parallel environments for the DFA minimization problem • Studied different measures of Map-Reduce algorithms • Found that two critical measures are missing: Sensitivity to Skewness and Horizontal growth of data 35
  37. 37. Future Works • Reducer Capacity vs. Number of Rounds trade-off • Investigating other methods of minimization • Extending the complexity model and class • Is it possible to compare Map-Reduce algorithms with algorithms in other models (PRAM, serial, etc.)? 36
  38. 38. Thank you Questions & Answers 37
