Lexing and parsing

A beginner's guide to lexing and parsing for PHP developers - given at ZendCon 2014

Published in: Software

  1. 1. LEXING AND PARSING THE BEGINNER’S GUIDE
  2. 2. WHY ARE WE DOING THIS? • bbcode • html • xml • programming language
  3. 3. BUT I CAN JUST REGEX • sometimes you can • sometimes you can’t • is your html well formed? (view source some time) • it depends!!
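As a rough illustration of where a regex is enough and where it is not, here is a minimal Python sketch using bbcode-style tags. The function names `bold_to_html` and `is_balanced` are made up for this example and do not come from the talk. A flat, non-nested tag can be handled with one pattern, while checking that nested tags close in the right order needs a stack.

```python
import re

# Flat case: one regular expression is enough for a single, non-nested tag.
BOLD = re.compile(r"\[b\](.*?)\[/b\]")

def bold_to_html(text):
    """Replace non-nested [b]...[/b] pairs with <b>...</b>."""
    return BOLD.sub(r"<b>\1</b>", text)

# Nested case: checking that tags are balanced needs a stack, which a
# classic regular expression cannot model.
TAG = re.compile(r"\[(/?)(\w+)\]")

def is_balanced(text):
    """Return True if every [tag] is closed by a matching [/tag] in order."""
    stack = []
    for match in TAG.finditer(text):
        closing, name = match.group(1), match.group(2)
        if not closing:
            stack.append(name)          # opening tag: remember it
        elif not stack or stack.pop() != name:
            return False                # closing tag with no matching opener
    return not stack                    # leftover openers mean unbalanced input
```

For example, `is_balanced("[b][i]x[/i][/b]")` is true while `is_balanced("[b][i]x[/b][/i]")` is false, which is exactly the kind of check that gets painful with pattern matching alone.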
  4. 4. CHOMSKY HIERARCHY
  5. 5. COMPUTER SCIENCE WE LIKE ACRONYMS AND WEIRD WORDS
  6. 6. ENGLISH IS HARD! • tokenizer • scanner • lexer • parser • lexical analyzer • syntactic analyzer • formal grammar
  7. 7. LEXICAL ANALYSIS BREAK DOWN INPUT INTO A SEQUENCE OF TOKENS LEXING
  8. 8. SCANNING • Finite State Machine • Finds Lexemes • Might backtrack
  9. 9. FINITE STATE MACHINE
  10. 10. EVALUATOR • looks at lexeme to get value • lexeme + value = token
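The scanner and evaluator steps above can be sketched as a tiny hand-written state machine. This is an illustration in Python under simplifying assumptions: the state names, the accepted symbols and the `(kind, lexeme, value)` token shape are invented for the example, and this scanner never needs to backtrack.

```python
DIGITS = "0123456789"

def scan(source):
    """Finite state machine that splits source into (kind, lexeme) pairs."""
    lexemes = []
    state, current = "start", ""
    for ch in source + "\0":  # the sentinel flushes the final lexeme
        # Stay in the current state while the character still fits it.
        if state == "number" and ch in DIGITS:
            current += ch
            continue
        if state == "name" and (ch.isalnum() or ch == "_"):
            current += ch
            continue
        # Otherwise the lexeme is finished: emit it and go back to start.
        if state != "start":
            lexemes.append((state, current))
            state, current = "start", ""
        if ch in DIGITS:
            state, current = "number", ch
        elif ch.isalpha() or ch == "_":
            state, current = "name", ch
        elif ch in "=+;":
            lexemes.append(("symbol", ch))
        elif ch.isspace() or ch == "\0":
            pass  # whitespace and the sentinel produce nothing
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return lexemes

def evaluate(kind, lexeme):
    """Evaluator: attach a value to a lexeme to produce a token."""
    value = int(lexeme) if kind == "number" else lexeme
    return (kind, lexeme, value)

def lex(source):
    """Scanner plus evaluator: lexeme + value = token."""
    return [evaluate(kind, lexeme) for kind, lexeme in scan(source)]
```

Calling `lex("y = 5;")` yields one token each for `y`, `=`, `5` and `;`, with the number token carrying the integer value 5 rather than the string.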
  11. 11. LEXING PHP - $y = 5; • $y • array(309, '$y', 1) • = • = • 5 • array(305, '5', 1) • 309 == T_VARIABLE • 305 == T_LNUMBER (numeric token IDs vary between PHP versions)
  12. 12. LEXER GENERATORS DO NOT WRITE THIS BY HAND Famous • lex • flex • re2c • ANTLR • DFASTAR • jflex • jlex • quex PHP generators • https://github.com/oliverheins/PHPSimpleLexYacc • lex syntax • https://github.com/pear/PHP_LexerGenerator • re2c syntax • https://github.com/wez/JLexPHP • jlex syntax • token_get_all (see php-parser) • parse_ini_file/string (combined with parser)
  13. 13. RE2C
  14. 14. IN PHP LAND
  15. 15. SYNTACTIC ANALYSIS CONSTRUCTING SOMETHING BASED ON A GRAMMAR PARSING
  16. 16. THE PARSING PROCESS • Tokens come in • Magic • Data structure comes out • parse tree • AST
  17. 17. GRAMMAR (FORMAL OF COURSE) • “Brave men run in my family.” • I can't recommend this book too highly. • Prostitutes Appeal to Pope • I had had my car for four years before I ever learned to drive it.
  18. 18. TYPES OF PARSERS • Top Down • Recursive Descent • LL (left to right, leftmost derivation) • Earley parser • Bottom Up • Precedence parser • Operator-precedence parser • Simple precedence parser • BC (bounded context) parsing • LR parser (Left-to-right, Rightmost derivation) • Simple LR (SLR) parser • LALR parser • Canonical LR (LR(1)) parser • GLR parser • CYK parser • Recursive ascent parser
  19. 19. SENTENCE DIAGRAMMING • People who live in glass houses shouldn't throw stones.
  20. 20. PARSE TREE
  21. 21. TOP DOWN VS. BOTTOM UP PARSING
  22. 22. PARSE TREES • Constituency-based parse trees • Dependency-based parse trees
  23. 23. AST • Not everything appears • additional information may be applied • can “improve” tree nodes • PHP is getting one!
  24. 24. LALR(K) • Look ahead prevents “ambiguous” parsing • I have one token, what token comes next?
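The "I have one token, what token comes next?" question can be shown with a very small Python sketch. This is not an LALR parser. It only illustrates how a single token of lookahead picks between two rules that start the same way. The function name and the two rule names are invented for the example.

```python
def classify_statement(tokens):
    """Decide which rule applies by peeking one token ahead."""
    if not tokens:
        return "empty"
    current = tokens[0]
    # The lookahead token is inspected but not consumed.
    lookahead = tokens[1] if len(tokens) > 1 else None
    if current.isidentifier() and lookahead == "=":
        return "assignment"   # e.g. x = 1
    return "expression"       # e.g. x + 1
```

Both `["x", "=", "1"]` and `["x", "+", "1"]` start with the same token, so without looking at the second token the two readings cannot be told apart.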
  25. 25. PARSER GENERATORS Famous • bison • bison • bison • bison • yacc • lemon • ANTLR PHP versions • https://github.com/wez/lemon-php • https://github.com/pear/PHP_ParserGenerator • lemon • https://github.com/scato/phpeg • peg (peg.js) • https://github.com/jakubkulhan/pacc • yacc
  26. 26. BISON • Generates LALR (or GLR) parsers • Code in C, C++ or Java • reentrant with %define api.pure set • used by ALL THE THINGS • PHP • Ruby • PostgreSQL • Go
  27. 27. BISON IN C
  28. 28. LEMON • Generates LALR(1) parser • reentrant AND thread safe • non-terminal destructor (leak avoidance) • pull parsing • sqlite
  29. 29. PHP LEMON
  30. 30. REENTRANT VS THREAD SAFE • Process • Thread • Locking • Scope • Reentrant
  31. 31. COMPILE IT • transform programming language to computer language
  32. 32. INTERPRET IT • directly executes programming language
  33. 33. PROFIT
  34. 34. UNDER THE HOOD WHAT USES THIS STUFF?
  35. 35. PHP RE2C + Bison + these crazy opcodes….
  36. 36. LALR(1) WRITTEN BY HAND How - pythonic
  37. 37. HHVM Flex and Bison and JIT – OH MY!
  38. 38. SQLITE Lemon is tasty!
  39. 39. WRITING PARSERS AND LEXERS THEORIES OF CODING
  40. 40. STEP 1: THINK SMALL • Writing a general purpose parser is hard – that’s why you use PHP • Writing a single purpose parser is much easier • markup text (markdown) • configuration or definition files (behat/gherkin syntax) • complex validation (addresses in multiple formats)
  41. 41. STEP 2: SEPARATE AND UNOPTIMIZED • premature optimization yada yada • combine after it’s ready to be used (or not at all if you’ll need to change it later) • lexer and parser each have unique, well defined goals • the ability to potentially switch parser styles later will help you!
  42. 42. STEP 3: LEXER • the lexer's job is to recognize tokens • it can do this via a giant switch statement of doom • or maybe a giant loop • or maybe a list of goto statements • or maybe a complex class with methods • …. or you can just use a generator
  43. 43. LET’S BREAK THAT DOWN 1. Define a token format 2. Define grammar format (what are we looking for?) 3. Go over the input data (usually a string) and make matches 1. compare or regex or ctype_* or however it makes sense 4. Keep track of your current state 5. Have an output format – AST, tree, whatever
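The five steps above can be sketched as a small regex-driven lexer. This is a minimal illustration in Python, not the approach of any particular generator. The `Token` fields, the token type names and the pattern list are assumptions made for the example.

```python
import re
from collections import namedtuple

# 1. Token format: a type, the matched text, and where it was found.
Token = namedtuple("Token", ["type", "text", "pos"])

# 2. What we are looking for: one named pattern per token type.
PATTERNS = [
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/=()]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in PATTERNS))

def tokenize(text):
    """Walk the input left to right and return a list of Tokens."""
    tokens = []
    pos = 0  # 4. current state: how far into the input we are
    while pos < len(text):
        match = MASTER.match(text, pos)  # 3. try every pattern at this position
        if match is None:
            raise SyntaxError(f"unexpected {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":    # whitespace is matched but not emitted
            tokens.append(Token(match.lastgroup, match.group(), pos))
        pos = match.end()
    return tokens  # 5. output format: a flat list for a parser to consume
```

Keeping the patterns in a plain list makes it easy to add a token type later without touching the loop, which fits the advice to keep the lexer separate and unoptimized at first.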
  44. 44. STEP 4: PARSER • Loop over our tokens • Look at the values and decide what to do
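One way to "loop over the tokens and decide what to do" is a hand-written recursive descent parser, sketched here in Python. Note that this is a top-down technique, not the LALR approach that bison and lemon generate. The grammar (arithmetic with `+ - * /` and parentheses), the class name and the tuple-shaped tree nodes are assumptions chosen for the example. Tokens are plain strings to keep it short.

```python
class Parser:
    """Recursive descent parser for + - * / and parentheses."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Look at the current token without consuming it."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        """Consume and return the current token (None at end of input)."""
        token = self.peek()
        self.pos += 1
        return token

    def parse(self):
        tree = self.expr()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return tree

    def expr(self):
        # expr := term (('+' | '-') term)*
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = (op, node, self.term())
        return node

    def term(self):
        # term := factor (('*' | '/') factor)*
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = (op, node, self.factor())
        return node

    def factor(self):
        # factor := NUMBER | '(' expr ')'
        token = self.advance()
        if token == "(":
            node = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token is not None and token.isdigit():
            return ("num", int(token))
        raise SyntaxError(f"unexpected token {token!r}")
```

Parsing `["1", "+", "2", "*", "3"]` gives a tree with `+` at the root and the multiplication nested below it, so operator precedence falls out of the way the rules call each other.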
  45. 45. STEP 5: DO SOMETHING WITH IT! 1. Compile – write out to something that can be run (html) 2. Interpret – run through another program to get output (templates to html) 3. Analyze – run through to analyze the data inside (code analysis/sniffer tools) 4. Validate – check for proper “spelling and grammar” 5. ??? 6. PROFIT
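To make the difference between the first two options concrete, here is a small Python sketch that takes one tree and either interprets it directly or compiles it to text. The tree shape (`("num", n)` leaves and `(op, left, right)` nodes) and both function names are assumptions made for this example.

```python
import operator

OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

def interpret(node):
    """Interpret: walk the tree and compute the result directly."""
    if node[0] == "num":
        return node[1]
    op, left, right = node
    return OPS[op](interpret(left), interpret(right))

def compile_to_source(node):
    """Compile: write the tree out as text that something else can run."""
    if node[0] == "num":
        return str(node[1])
    op, left, right = node
    return f"({compile_to_source(left)} {op} {compile_to_source(right)})"

# Hypothetical tree for 1 + 2 * 3
tree = ("+", ("num", 1), ("*", ("num", 2), ("num", 3)))
```

With this tree, `interpret(tree)` returns 7 and `compile_to_source(tree)` returns the string `(1 + (2 * 3))`. Both functions walk the same structure, which is why keeping the tree separate from what you do with it pays off.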
  46. 46. “If you’re not sure how to do a job – ask!” - silly poster on my laundry room wall
  47. 47. RESOURCES • http://savage.net.au/Ron/html/graphviz2.marpa/Lexing.and.Parsing.Overview.html • http://nikic.github.io/2011/10/23/Improving-lexing-performance-in-PHP.html • https://github.com/hafriedlander/php-peg • https://github.com/nikic/PHP-Parser/ • http://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html • http://wikipedia.org
  48. 48. CONTACT ME • auroraeosrose@gmail.com • auroraeosrose – freenode.net #phpmentoring #phpwomen • Twitter - @auroraeosrose • http://github.com/auroraeosrose
