Successfully reported this slideshow.
Your SlideShare is downloading. ×

Learn from LL(1) to PEG parser the hard way

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Upcoming SlideShare
Session2
Session2
Loading in …3
×

Check these out next

1 of 71 Ad

Learn from LL(1) to PEG parser the hard way

Download to read offline

Presented in EuroPython2021.

Special thanks for Jeremiah Paige (@ucodery) to point out the misused word memorization.

Presented in EuroPython2021.

Special thanks for Jeremiah Paige (@ucodery) to point out the misused word memorization.

Advertisement
Advertisement

More Related Content

Similar to Learn from LL(1) to PEG parser the hard way (20)

Advertisement

More from Kir Chou (20)

Recently uploaded (20)

Advertisement

Learn from LL(1) to PEG parser the hard way

  1. 1. Standing on the shoulders of giants: Learn from LL(1) to PEG parser the hard way Kir Chou 1
  2. 2. 2 https://www.youtube.com/watch?v=DZTLgVBxET4
  3. 3. About me Presented at PyCon TW/JP since 2017 https://note35.github.io/about/ https://github.com/note35/Parser-Learning 3
  4. 4. Agenda ● Motivation ● What is parser in CPython? ● Parser 101 - CFG ● Parser 101 - Traditional parser (LL(1) / LR(0)) ● Parser 102 - PEG and PEG parser ● Parser 102 - Packrat parser ● CPython’s PEG parser ● Take away 4
  5. 5. Motivation 5
  6. 6. Motivation What’s New In Python 3.9? PEP 617, CPython now uses a new parser based on PEG; “IIRC, I took a Compiler class in school…” 6
  7. 7. Motivation (Cont.) School taught us the brief concept of the Compiler’s frontend and backend. School’s parser assignment used Bison + YACC. And... 7
  8. 8. My motivation = Talk objectives What is PEG parser? Why did python use LL(1) parser before? Why did Guido choose PEG parser? What other parsers do we have? What’s the difference between those parsers? How to implement those parsers? 8
  9. 9. What is parser in CPython? CPython DevGuide - Design of CPython’s Compiler 9
  10. 10. Compilation Steps 10 Source Code Tokens Abstract Syntax Tree (AST) Bytecode Result Lexer Parser Compiler VM Import
  11. 11. 11 https://docs.python.org/3/library/tokenize.html#examples Lexer
  12. 12. 12 https://docs.python.org/3/library/ast.html Parser
  13. 13. 13 https://docs.python.org/3/library/dis.html#dis.disassemble Compiler = print(2*3+4)
  14. 14. 14 Source Code Tokens Abstract Syntax Tree (AST) Bytecode Result Lexer Parser Compiler VM Import Talk’s focus!
  15. 15. Parser 101 - CFG Uncode - GATE Computer Science - Compiler Design Lecture 15
  16. 16. Grammar Context Free Grammar (CFG) 16 Interpretation of this Grammar “Both B and a can be derived from A” Derivation *some paper write <- Non-terminal AND *support ambigious syntax A -> B | a Terminal rule
  17. 17. What is “Context Free”? Left-hand side in all the rules only contains 1 non-terminal. Valid CFG Example: Invalid CFG Example: 17 S -> aSb xSy -> axSyb
  18. 18. Semantic Analysis: Parse Tree Concret Syntax Tree (CST) An ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. Abstract Syntax Tree (AST) A tree representation of the abstract syntactic structure of source code written in a programming language. 18
  19. 19. CFG Simplification 1. Ambiguous -> Unambiguous 2. Nondeterministic -> Deterministic 3. Left recursion -> No left recursion 19
  20. 20. Ambiguious Definition A grammar contains rules that can generate more than one tree. 20 E -> E + E | E * E | Num N N N E E E + E * E N E E N N E E E * +
  21. 21. Ambiguious -> Unambiguous 21 N E E N N T F T * + E -> E + T | T T -> T * F | F F -> Num E -> E + E | E * E | Num Step1 Rewrite Grammar Step2 Make sure the grammar only generate one tree T F F
  22. 22. Non-deterministic -> Deterministic A grammar contains rules that have common prefix. 22 A -> ab | ac A -> aA’ A’ -> b | c Rewrite Grammar *A non-deterministic grammar can be rewritten into more than one deterministic grammar.
  23. 23. Left recursion -> No left recursion A grammar contains direct or indirect left recursion. 23 E -> E + T | T T -> T * F | F F -> Num E -> TE’ E’ -> +TE’ | None T -> FT’ T’ -> *FT’ | None F -> Num Rewrite Grammar E in first E + T will recursively derives to second E + T, E in second E + T will repeat it to third E + T, and so on recursively.
  24. 24. Recap: CFG Simplification 24 Before After Ambiguous Non-deterministic Left Recursion
  25. 25. Parser 101 - Traditional parser Uncode - GATE Computer Science - Compiler Design Lecture 25
  26. 26. Parser classification 26 N E E N N E E E * + Top-down Type Bottom-up Type N E N E N E + N E E N E E N E E + N E E N N E E E * +
  27. 27. LL / LR Parser LL(k) = Left-to-right, Leftmost derivation, k-token lookahead (k>0) LR(k) = Left-to-right, Rightmost derivation, k-token lookahead (k>=0) 27 *Both LL/LR parser scan input string from left to right Input String: 2 + 3 * 4
  28. 28. LL / LR Parser LL(k) = Left-to-right, Leftmost derivation, k-token lookahead (k>0) LR(k) = Left-to-right, Rightmost derivation, k-token lookahead (k>=0) 28 *The derivation time of LL/LR parser is different. N E E N N E E E * + N E E N N E E E * + + → * * → +
  29. 29. LL / LR Parser LL(k) = Left-to-right, Leftmost derivation, k-token lookahead (k>0) LR(k) = Left-to-right, Rightmost derivation, k-token lookahead (k>=0) 29 Input String: 2 + 3 * 4 I am "a token of number". If I perform 1-token lookahead and meet "a token of +", what to do next?
  30. 30. Top Down - Recursive descent parser 30
  31. 31. LL(k) - Implementation 31 2 + 3 * 4 parse_E() E -> TE’ E’ -> +TE’ | None T -> FT’ T’ -> *FT’ | None F -> Num parse_Tp(parse_F()) parse_Ep( ) Step3 *recursively parse the input string started from first rule parse_E() Step2 *parse from left to right *perform k-lookahead parse_T() Step1 write function for each non-terminal
  32. 32. 32 Grammar E -> TE’ E’ -> +TE’ | None T -> FT’ T’ -> *FT’ | None F -> Num *perform 1-lookahead LL(1) - Example code Derivation x x
  33. 33. Top Down - Non recursive descent parser 33
  34. 34. LL(1) - Parsing table 34 Step1 Build first/follow table for each non-terminal Note: $ means endmark Step2 Build parsing table based on first/follow table
  35. 35. LL(1) - Implementation 35 Step3 Implement with stack (take shift/reduce action based on parsing table) N E E N N E E E * +
  36. 36. LL(1) - Example code 36 Grammar E -> TE’ E’ -> +TE’ | None T -> FT’ T’ -> *FT’ | None F -> Num Non-terminal stack Reduce (Derivation) Shift Reduce (Derivation)
  37. 37. Bottom Up - LR(0) parser 37
  38. 38. LR(0) - Deterministic finite automaton 38 E’ -> .E --- (1) E -> .E + T --- (2) E -> .T --- (3) T -> .T * Num --- (4) T -> .Num --- (5) Step1 Build Deterministic Finite Automaton(DFA) E’ -> E. E -> E. + T E -> T. T -> T. * Num T -> Num. E -> E + .T T -> .T * Num T -> .Num T -> T * .Num E -> E + T. T -> T. * Num T -> T * Num. E T Num * + T Num * Num S1 S2 S3 S4 S5 S6 S7 S8 Left recursion support
  39. 39. LR(0) - Parsing table 39 Step2 Build parsing table (For parser like SLR(1), it requires first/follow table) Shift acc Reduce (Derivation) acc
  40. 40. LR(0) - Implementation 40 Step3 Implement with stack (take shift/reduce action based on parsing table) N E E N N E E E * +
  41. 41. LR(0) - Example code 41 Grammar E -> E + T | T T -> T * F | F F -> Num Shift Reduce (Derivation)
  42. 42. Parser 102 - PEG and PEG parser 42
  43. 43. Grammar Parsing Expression Grammar (PEG) 43 *Difference from traditional CFG A will try A -> B first. Only after it fails at A -> B, A will only try A -> a. Derivation *some paper write <- Non-Terminal OR (if / elif / ...) *disallow ambigious syntax A -> B | a Terminal *Introduced in 2002 (Packrat Parsing: Simple, Powerful, Lazy, Linear Time) rule *support Regular Expression (EBNF grammar) in another paper
  44. 44. Example of difference 44 Grammar1: A -> a b | a Grammar2: A -> a | a b ● LL/LR parser will fail to complete when the input grammar is ambiguous. ● PEG parser only tries the first PEG rule. The latter rule will never succeed. “A PEG parser generator will resolve unintended ambiguities earliest-match-first, which may be arbitrary and lead to surprising parses.” (source)
  45. 45. PEG Parser PEG parser means “parser generated based on PEG”. PEG parser can be a Packrat parser, or other traditional parser with k-lookahead limitation. Mostly, PEG parser means Packrat parser. 45 CFG EBNF grammar PEG Packrat parser Traditional parser PEG Parser
  46. 46. Parser 102 - Packrat parser 46
  47. 47. Type of Packrat parser 47 Top-down Type N E E N E E N E E + N E E N N E E E * + Packrat parser is top-down type.
  48. 48. Packrat Parsing - Implementation 48 2 + 3 * 4 parse_E() E -> E + T | T T -> T * F | F F -> Num parse_T() and parse_F() parse_E() and parse_T() Step2 *parse from left to right *perform infinite lookahead + memoization Step1 *write function for each non-terminal (PEG rule) *Idea of memoization was Introduced in 1970 Step3 *recursively parse the input string started from first rule parse_E() Left recursion support
  49. 49. Packrat Parsing - Example code 49 Grammar E -> E + T | T T -> T * F | F F -> Num Derivation Memoization
  50. 50. Packrat - what is memoization? 50 509. Fibonacci Number 4 3 2 2 1 fib(0) = 0 fib(1) = 1 fib(2) = fib(1) + fib(0) = 1 fib(3) = fib(2) + fib(1) = fib(1) + fib(0) + fib(1) = 2 ... 1 0 1 0 if n = 4, we calculate fib(2), fib(0) twice, fib(1) thrice, fib(4), fib(3) once TIme Complexity: O(2^n)
  51. 51. Packrat - what is memoization? (Cont.) 51 509. Fibonacci Number if n = 4, we… calculate fib(4), fib(3), fib(2), fib(1), fib(0) once Time Complexity: O(2^n) => O(n) Space Complexity: O(1) => O(n)
  52. 52. Left recursion in Packrat parser 52 Approach 1 if (count of operator) < (count function call): return False Approach 2 reverse the call stack (adopted in CPython!) Source: Guido's Medium (Left-recursive PEG Grammars)
  53. 53. 53 Normal Memoization
  54. 54. 54 Left-recursion Memoization *perform infinite-lookahead
  55. 55. Traditional parser V.S Packrat parser 55
  56. 56. Traditional parser vs Packrat parser 56 Packrat Traditional Scan Left-to-right (*Right-to-left memo) Left-to-right Left Recursion Support (*Not support in first paper) LL needs to rewrite the grammar Ambigious Disallowed (determinism) Allowed Space Complexity O(Code Size) (space consumption) O(Depth of Parse Tree) Worst Time Complexity Super linear time (statelessness) *Because of feature like typedef in C Expotenial time Capability Basically covers all traditional cases (infinite lookahead) No left-recursion/ambigious for LL Has k lookup limitations for both (e.g. dangling else) Red text: 3 highlighted characteristics of Packrat parser.
  57. 57. 57 Parenthesized context managers PEP 622/634/635/636 - Structural Pattern Matching New rule in Python 3.10 based on PEG
  58. 58. CPython’s PEG parser 58
  59. 59. CPython Parser - Before/After CPython3.8 and before use LL(1) parser written by Guido 30 years ago The parser requires steps to generate CST and convert CST to AST. CPython3.9 uses PEG (Packrat) parser (Infinite lookahead) PEG rule supports left-recursion No more CST to AST step - source CPython3.10 drops LL(1) parser support 59 This answers “Why PEG?”
  60. 60. CPython Parser - Workflow 60 Meta Grammar Tools/peg_generator/ pegen/metagrammar.gram Grammar Grammar/python.gram Token Grammar/Tokens my_parser.py my_parser.c pegen (PEG Parser) Tools/peg_generator/ *CPython contains a peg parser generator written in python3.8+ (because of warlus operator)
  61. 61. Input: Meta Grammar Example Syntax Directed Translation (SDT) 61 rule non-Terminal return type PEG rule divider PEG rule action (python code) Parser header (python code)
  62. 62. Output: Generated PEG Parser (Partial code) 62
  63. 63. Recap: Benefit / Performance Benefit Grammar is more flexible: from LL(1) to LL(∞) (infinite lookahead) Hardware supports Packrat’s memory consumption now Skip intermediate parse tree (CST) construction Performance Within 10% of LL(1) parser both in speed and memory consumption (PEP 617) 63
  64. 64. Take away 64
  65. 65. Recap ● Parser 101 (Compiler class in school) ○ CFG ○ Traditional Parser ■ Top-down: LL(1) ■ Bottom-up: LR(0) ● Parser 102 ○ PEG ○ Packrat Parser ● CPython ○ Parser in CPython ○ CPython’s PEG parser 65
  66. 66. 66 Need Answer? note35/Parser-Learning You can implement traditional parser like LL(1) and LR(0) parser, and Packrat parser from scratch! Leetcode: 227. Basic Calculator II Q. How to verify my understanding? A. Get your hands dirty!
  67. 67. Q & A 67
  68. 68. Appendix 68
  69. 69. Related Articles Guido van Rossum PEG Parsing Series Overview Bryan Ford Packrat Parsing: Simple, Powerful, Lazy, Linear Time Parsing Expression Grammars: A Recognition-Based Syntactic Foundation 69
  70. 70. Related Talks Guido van Rossum @ North Bay Python 2019 Writing a PEG parser for fun and profit Pablo Galindo and Lysandros Nikolaou @ Podcast.__init__ The Journey To Replace Python's Parser And What It Means For The Future Emily Morehouse-Valcarcel @ PyCon 2018 The AST and Me Alex Gaynor @ PyCon 2013 So you want to write an interpreter? 70
  71. 71. Thanks for your listening! 71

×