Lexing and parsing

  2. WHY ARE WE DOING THIS? • bbcode • html • xml • programming language
  3. BUT I CAN JUST REGEX • sometimes you can • sometimes you can’t • is your html well formed? (view source some time) • it depends!!
  6. ENGLISH IS HARD! • tokenizer • scanner • lexer • parser • lexical analyzer • syntactic analyzer • formal grammar
  8. SCANNING • Finite State Machine • Finds Lexemes • Might backtrack
  10. EVALUATOR • looks at lexeme to get value • lexeme + value = token
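The scanner and evaluator stages above can be sketched as a hand-written finite state machine. This is a minimal illustration, not how PHP or any generator does it: the token kinds (`variable`, `number`, `punct`) and the function names `scan` and `evaluate` are made up for the example.

```python
def scan(source):
    """Scanner: a small finite state machine that yields (kind, lexeme) pairs."""
    state, start, i = "start", 0, 0
    while i <= len(source):
        ch = source[i] if i < len(source) else ""  # "" marks end of input
        if state == "start":
            start = i
            if ch == "":
                break
            if ch.isdigit():
                state = "number"
            elif ch == "$":
                state = "variable"
            elif ch in "=;":
                yield ("punct", ch)
            elif not ch.isspace():
                raise ValueError(f"unexpected character {ch!r} at {i}")
            i += 1
        elif state == "number":
            if ch.isdigit():
                i += 1
            else:
                yield ("number", source[start:i])
                state = "start"  # re-examine ch from the start state
        elif state == "variable":
            if ch.isalnum() or ch == "_":
                i += 1
            else:
                yield ("variable", source[start:i])
                state = "start"


def evaluate(kind, lexeme):
    """Evaluator: lexeme + value = token."""
    if kind == "number":
        return (kind, lexeme, int(lexeme))
    if kind == "variable":
        return (kind, lexeme, lexeme[1:])  # value is the name without '$'
    return (kind, lexeme, None)            # punctuation carries no value


tokens = [evaluate(kind, lexeme) for kind, lexeme in scan("$y = 5;")]
```

Keeping `scan` and `evaluate` separate mirrors the two stages on the slides; a real lexer usually fuses them.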
  11. LEXING PHP - $Y = 5; • $y • array[309, ‘$y’, 1], • = • = • 5 • array[305, 5, 1] • 309 == T_VARIABLE • 305 == T_LNUMBER
  12. LEXER GENERATORS DO NOT WRITE THIS BY HAND Famous • lex • flex • re2c • ANTLR • DFASTAR • jflex • jlex • quex PHP generators • • lex syntax • • re2c syntax • • jlex syntax • token_get_all (see php-parser) • parse_ini_file/string (combined with parser)
  13. RE2C
  16. THE PARSING PROCESS • Tokens come in • Magic • Data structure comes out • parse tree • AST
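To take some of the magic out of "tokens come in, data structure comes out", here is a small recursive descent sketch. The grammar (integers, `+`, `*`, parentheses) and the nested-tuple AST shape are assumptions chosen for the example, and the tokens are assumed to be already lexed into a flat list.

```python
def parse(tokens):
    """Turn a flat token list into a nested-tuple AST."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():    # expr := term ('+' term)*
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    def term():    # term := factor ('*' factor)*
        node = factor()
        while peek() == "*":
            advance()
            node = ("*", node, factor())
        return node

    def factor():  # factor := NUMBER | '(' expr ')'
        tok = peek()
        if isinstance(tok, int):
            return advance()
        if tok == "(":
            advance()
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected trailing token {peek()!r}")
    return tree


ast = parse([1, "+", 2, "*", "(", 3, "+", 4, ")"])
```

The parentheses in the input do not appear in the result; the grouping is carried by the shape of the tree.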
  17. GRAMMAR (FORMAL OF COURSE) • "Brave men run in my family.” • I can't recommend this book too highly. • Prostitutes Appeal to Pope • I had had my car for four years before I ever learned to drive it.
  18. TYPES OF PARSERS • Top Down • Recursive Descent • LL (left to right, leftmost derivation) • Earley parser • Bottom Up • Precedence parser • Operator-precedence parser • Simple precedence parser • BC (bounded context) parsing • LR parser (Left-to-right, Rightmost derivation) • Simple LR (SLR) parser • LALR parser • Canonical LR (LR(1)) parser • GLR parser • CYK parser • Recursive ascent parser
  19. SENTENCE DIAGRAMMING • People who live in glass houses shouldn't throw stones.
  22. PARSE TREES • Constituency-based parse trees • Dependency-based parse trees
  23. AST • Not everything appears • additional information may be applied • can “improve” tree nodes • PHP is getting one!
  24. LALR(K) • Look ahead prevents “ambiguous” parsing • I have one token, what token comes next?
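The question "I have one token, what token comes next?" can be illustrated without a generator. The sketch below is not an LALR parser (those are table-driven and generated); it only shows how one token of lookahead separates rules that begin with the same token. The token shapes and rule names are hypothetical.

```python
def classify_statement(tokens):
    """Pick a rule by looking at the current token and one token of lookahead."""
    if not tokens:
        raise ValueError("no tokens to classify")
    current = tokens[0]
    lookahead = tokens[1] if len(tokens) > 1 else None
    if current[0] == "NAME" and lookahead == ("PUNCT", "="):
        return "assignment"   # NAME '=' ...
    if current[0] == "NAME" and lookahead == ("PUNCT", "("):
        return "call"         # NAME '(' ...
    return "expression"       # anything else


assignment = classify_statement([("NAME", "y"), ("PUNCT", "="), ("NUMBER", "5")])
call = classify_statement([("NAME", "f"), ("PUNCT", "("), ("PUNCT", ")")])
expression = classify_statement([("NAME", "y")])
```

Without the lookahead, all three inputs look the same after the first token.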
  25. PARSER GENERATORS Famous • bison • bison • bison • bison • yacc • lemon • ANTLR PHP versions • • • lemon • • peg (peg.js) • • yacc
  26. BISON • Generates LALR (or GLR) parsers • Code in C, C++ or Java • reentrant with %define api.pure set • used by ALL THE THINGS • PHP • Ruby • Postgresql • Go
  27. BISON IN C
  28. LEMON • Generates LALR(1) parser • reentrant AND thread safe • non-terminal destructor (leak avoidance) • pull parsing • sqlite
  30. REENTRANT VS THREAD SAFE • Process • Thread • Locking • Scope • Reentrant
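A rough sketch of the difference, using hypothetical names. Python cannot show a true mid-function interrupt easily, so interleaving two "parses" stands in for re-entry here: state kept in a global breaks, state owned by the caller does not. The lock-guarded counter is thread-safe, but because `threading.Lock` is not a reentrant lock, calling `bump` from inside `bump` would deadlock.

```python
import threading

_position = 0  # shared module-level state


def next_token_global(tokens):
    """Not reentrant: progress lives in a global shared by every caller."""
    global _position
    tok = tokens[_position]
    _position += 1
    return tok


class Cursor:
    """Reentrant style: all state lives in an object the caller owns."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.position = 0

    def next_token(self):
        tok = self.tokens[self.position]
        self.position += 1
        return tok


class LockedCounter:
    """Thread-safe: the shared count is only touched while holding the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0

    def bump(self):
        with self._lock:
            self.count += 1


# Interleave two parses: the global version skips "b1".
global_a = next_token_global(["a1", "a2"])
global_b = next_token_global(["b1", "b2"])

# The caller-owned cursors do not interfere with each other.
cursor_a, cursor_b = Cursor(["a1", "a2"]), Cursor(["b1", "b2"])
own_a, own_b = cursor_a.next_token(), cursor_b.next_token()

counter = LockedCounter()
threads = [
    threading.Thread(target=lambda: [counter.bump() for _ in range(1000)])
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```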
  31. COMPILE IT • transform programming language to computer language
  32. INTERPRET IT • directly executes programming language
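The two options can be put side by side on the same tree. This is a toy sketch: the nested-tuple AST, the `PUSH`/`ADD`/`MUL` instruction set, and the little stack machine are all invented for the example.

```python
def interpret(node):
    """Interpreter: walk the tree and compute the result directly."""
    if isinstance(node, int):
        return node
    op, left, right = node
    a, b = interpret(left), interpret(right)
    return a + b if op == "+" else a * b


def compile_to_stack_code(node):
    """Compiler: translate the tree into instructions for a stack machine."""
    if isinstance(node, int):
        return [("PUSH", node)]
    op, left, right = node
    return (
        compile_to_stack_code(left)
        + compile_to_stack_code(right)
        + [("ADD",) if op == "+" else ("MUL",)]
    )


def run(code):
    """The 'computer' that executes the compiled instructions."""
    stack = []
    for instr in code:
        if instr[0] == "PUSH":
            stack.append(instr[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if instr[0] == "ADD" else a * b)
    return stack.pop()


tree = ("+", 1, ("*", 2, 3))
interpreted = interpret(tree)
code = compile_to_stack_code(tree)
executed = run(code)
```

Both paths give the same answer; the compiled form can be saved and run later without the tree.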
  33. PROFIT
  35. PHP RE2C + Bison + these crazy opcodes….
  36. LALR(1) WRITTEN BY HAND How - pythonic
  37. HHVM Flex and Bison and JIT – OH MY!
  38. SQLITE Lemon is tasty!
  40. STEP 1: THINK SMALL • Writing a general purpose parser is hard – that’s why you use PHP • Writing a single purpose parser is much easier • markup text (markdown) • configuration or definition files (behat/gherkin syntax) • complex validation (addresses in multiple formats)
  41. STEP 2: SEPARATE AND UNOPTIMIZED • premature optimization yada yada • combine after it’s ready to be used (or not at all if you’ll need to change it later) • lexer and parser each have unique, well defined goals • the ability to potentially switch parser styles later will help you!
  42. STEP 3: LEXER • the lexer's job is to recognize tokens • it can do this via a giant switch statement of doom • or maybe a giant loop • or maybe a list of goto statements • or maybe a complex class with methods • …. or you can just use a generator
  43. LET’S BREAK THAT DOWN 1. Define a token format 2. Define grammar format (what are we looking for?) 3. Go over the input data (usually a string) and make matches 1. compare or regex or ctype_* or however it makes sense 4. Keep track of your current state 5. Have an output format – AST, tree, whatever
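The five steps above can be sketched with a regex-driven lexer. The input format (a tiny `key = value` file with `#` comments), the rule names, and the `Token` shape are assumptions made for the example; the output here is a flat token list rather than a tree.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):  # step 1: the token format
    kind: str
    text: str
    line: int


# Step 2: what are we looking for? Order matters: earlier rules win.
RULES = [
    ("COMMENT", r"#[^\n]*"),
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("EQUALS", r"="),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))


def lex(text):
    tokens, line, pos = [], 1, 0
    while pos < len(text):                  # step 3: go over the input
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"line {line}: unexpected character {text[pos]!r}")
        kind = match.lastgroup
        if kind == "NEWLINE":
            line += 1                       # step 4: keep track of state
        elif kind not in ("SKIP", "COMMENT"):
            tokens.append(Token(kind, match.group(), line))
        pos = match.end()
    return tokens                           # step 5: the output format


lexed = lex("port = 8080  # default\nhost = localhost\n")
```

Comments and whitespace are matched but suppressed, so the parser never has to see them.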
  44. STEP 4: PARSER • Loop over our tokens • Look at the values and decide what to do
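A single-purpose parser can be little more than that loop. The sketch below assumes tokens arrive as `(kind, text)` pairs for a hypothetical `key = value` settings format and builds a plain dict; the function name and token kinds are invented for the example.

```python
def parse_settings(tokens):
    """Loop over the tokens three at a time: NAME EQUALS value."""
    settings, i = {}, 0
    while i < len(tokens):
        if i + 2 >= len(tokens):
            raise SyntaxError("incomplete setting at end of input")
        (k1, name), (k2, _), (k3, value) = tokens[i:i + 3]
        if k1 != "NAME" or k2 != "EQUALS":
            raise SyntaxError(f"expected NAME EQUALS, got {k1} {k2}")
        if k3 == "NUMBER":
            settings[name] = int(value)   # numbers become integers
        elif k3 == "NAME":
            settings[name] = value        # bare words stay strings
        else:
            raise SyntaxError(f"unexpected value token {k3}")
        i += 3
    return settings


settings = parse_settings([
    ("NAME", "port"), ("EQUALS", "="), ("NUMBER", "8080"),
    ("NAME", "host"), ("EQUALS", "="), ("NAME", "localhost"),
])
```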
  45. STEP 5: DO SOMETHING WITH IT! 1. Compile – write out to something that can be run (html) 2. Interpret – run through another program to get output (templates to html) 3. Analyze – run through to analyze the data inside (code analysis/sniffer tools) 4. Validate – check for proper “spelling and grammar” 5. ??? 6. PROFIT
  46. “If you’re not sure how to do a job – ask!” - silly poster on my laundry room wall
  47. RESOURCES • • • • • •
  48. CONTACT ME • • auroraeosrose – #phpmentoring #phpwomen • Twitter - @auroraeosrose •

Editor's Notes

  1. Why I got started with this: I’ve never taken a computer class. I wanted to understand why PHP worked the way it does, because I’d been pondering putting some eventing/async magic inside, and I ended up down this deep computer science pit where compilers are at the bottom.
  2. Lexers are used to recognize the "words" that make up language elements, because the structure of such words is generally simple. Regular expressions are extremely good at handling this simpler structure, and there are very high-performance regular-expression matching engines used to implement lexers. Parsers are used to recognize the "structure" of a language's phrases. Such structure is generally far beyond what "regular expressions" can recognize, so one needs "context sensitive" parsers to extract such structure. Context-sensitive parsers are hard to build, so the engineering compromise is to use "context-free" grammars and add hacks to the parsers ("symbol tables", etc.) to handle the context-sensitive part.
  3. Regular expressions can only match regular languages, but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics, and those will not work in every case. For any regular expression, it should be possible to construct an HTML file that it matches wrongly.
  4. A formal grammar defines (or generates) a formal language, which is a (usually infinite) set of finite-length sequences of symbols (i.e. strings) that may be constructed by applying production rules to another sequence of symbols which initially contains just the start symbol. Type-0 grammars (unrestricted grammars) include all formal grammars. Type-1 grammars (context-sensitive grammars) generate the context-sensitive languages. Type-2 grammars (context-free grammars) generate the context-free languages. Type-3 grammars (regular grammars) generate the regular languages.
  5. So computer science is a really weird discipline. Quite a bit of what computer science is and does comes from, well, math, and the other part (the “language” aspects and even the concepts of grammar and meaning) comes from “English”, or “language arts” as my kids’ school calls it. The only “science” part that I think really applies is that we test theories and apply logic. At its core, remember, computers are algorithms (rules) and information (data), but “computer science” has grown to encompass LOTS of things. What we’re going to talk about is a small but fundamental window, lexing and parsing, so let’s start with words.
  6. Ask for people seeing these terms. Ask if anyone knows a definition of these terms, even a non-computer science definition. Almost all of these terms have different meanings depending on their context; the computer science definitions are what we’re going to be using. We’re also going to mention that some terms get thrown around a bit (parser and scanner are the two worst), but I’m also going to attempt to help you build your own internal rules so you don’t confuse yourself and others, by always using them in the “computer science dictionary” manner.
  7. Scanner == first stage of lexer. Strictly speaking, a lexer is itself a kind of parser, but we won’t EVER call it a parser because CONFUSION. The syntax of some programming languages is divided into two pieces: the lexical syntax (token structure), which is processed by the lexer; and the phrase syntax, which is processed by the parser. The lexical syntax is usually a regular language, whose alphabet consists of the individual characters of the source code text. The phrase syntax is usually a context-free language, whose alphabet consists of the tokens produced by the lexer. While this is a common separation, alternatively, a lexer can be combined with the parser in scannerless parsing. I would say though: DO NOT DO THIS. It may seem easier in the short term, but when you have to start changing stuff you will have PAIN.
  8. Finite state machine: we have a finite (bounded) list of states, and the machine can be in only one state at any one time. Because a finite state machine can represent any history and a reaction, by regarding the change of state as a response to the history, it has been argued that it is a sufficient model of human behaviour, i.e. humans are finite state machines. lexeme == the characters that have been matched by our state machine; it needs to be translated to a value.
  9. States – happy, sad, angry inputs – money, food, kick in pants outputs – smile, frown, punch back set up example of state machine for people
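The people example can be written down as a transition table. The states, inputs, and outputs come from the note above; which input moves which state where is invented purely for illustration.

```python
# Hypothetical transitions: (state, input) -> (next state, output)
TRANSITIONS = {
    ("happy", "money"): ("happy", "smile"),
    ("happy", "food"): ("happy", "smile"),
    ("happy", "kick in pants"): ("angry", "punch back"),
    ("sad", "money"): ("happy", "smile"),
    ("sad", "food"): ("happy", "smile"),
    ("sad", "kick in pants"): ("angry", "punch back"),
    ("angry", "money"): ("sad", "frown"),
    ("angry", "food"): ("sad", "frown"),
    ("angry", "kick in pants"): ("angry", "punch back"),
}


def run_machine(state, inputs):
    """Feed inputs one at a time; the machine is in exactly one state at a time."""
    outputs = []
    for event in inputs:
        state, output = TRANSITIONS[(state, event)]
        outputs.append(output)
    return state, outputs


final_state, outputs = run_machine("happy", ["kick in pants", "food", "money"])
```

A scanner works the same way, with characters as the inputs and lexemes as the outputs.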
  10. Sometimes there isn’t a value (parentheses in a programming language, for example). Sometimes a lexeme is suppressed (comments anyone?). Sometimes a lexeme or token is even ADDED by the lexer: line continuation (C code), semi-colon insertion (lazy bad javascript! and go? really!), off-side rule – blocks with indents (oh python) or braces (php and C and friends). Context sensitivity: good lexers are NOT context-sensitive the more look ahead, look back, and backtracking
  11. So discuss a little bit about PHP: its lexer is exposed with token_get_all. It’ll “parse”/“tokenize” (lex is the correct term) the PHP fed to it. This is why there are many parsers written in PHP but not really any lexers; it’s in there. This is GENERALLY the easy part! What is the 1? The line number.
  12. ANTLR - Can generate lexical analyzers and parsers. DFASTAR - Generates DFA matrix table-driven lexers in C++. Flex - Alternative variant of the classic "lex" (C/C++). JFlex - A rewrite of JLex. Ragel - A state machine and lexer generator with output in C, C++, C#, Objective-C, D, Java, Go and Ruby. The following lexical analysers can handle Unicode: JavaCC - JavaCC generates lexical analyzers written in Java. JLex - A lexical analyzer generator for Java. Quex - A fast universal lexical analyzer generator for C and C++. SO if you’re generating
  13. rules, named definitions and in-place configurations.
  14. Ah, the overloading of the word parsing. Syntactic analysis and grammar: it looks at the data sent and builds a model of it, usually some kind of data structure or tree. Just like in English, we use grammar to define ideas.
  15. A parser is a software component that takes input data (frequently text) and builds a data structure – often some kind of parse tree, abstract syntax tree or other hierarchical structure – giving a structural representation of the input, checking for correct syntax in the process. You can do scannerless (again with the silly overloading of words), a “non lexed” parser, but – sigh
  16. A formal grammar is a set of rules for rewriting strings, along with a "start symbol" from which rewriting starts. Parsing is the process of recognizing an utterance (a string in natural languages) by breaking it down to a set of symbols and analyzing each one against the grammar of the language. This is why what comes before and after can be important when parsing. Your brain is a very good parser.
  17. One first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar. Top-down parsers can be small, powerful, and readable, although they can be slower. A top-down parser with a direct path is going to beat a more complex path; a bottom-up parser can be faster, but you need to match the type of parser with what you’re doing.
  18. so let’s take a theoretical piece of code that’s been lexed into these values into a “parse tree” – we’ll get into that in a moment
  19. The opposite of this are top-down parsing methods, in which the input's overall structure is decided (or guessed at) first, before dealing with mid-level parts, leaving the lowest-level small details to last. A top-down parser discovers and processes the hierarchical tree starting from the top, and incrementally works its way downwards and rightwards. Top-down parsing eagerly decides what a construct is much earlier, when it has only scanned the leftmost symbol of that construct and has not yet parsed any of its parts. Left corner parsing is a hybrid method which works bottom-up along the left edges of each subtree, and top-down on the rest of the parse tree. If a language grammar has multiple rules that may start with the same leftmost symbols but have different endings, then that grammar can be efficiently handled by a deterministic bottom-up parse but cannot be handled top-down without guesswork and backtracking. So bottom-up parsers handle a somewhat larger range of computer language grammars than do deterministic top-down parsers. Bottom-up parsing is sometimes done by backtracking. But much more commonly, bottom-up parsing is done by a shift-reduce parser such as a LALR parser.
  20. An ordered, rooted tree that represents the syntactic structure of a string. Their structure and elements more concretely reflect the syntax of the input language. Constituency-based trees break a sentence down into smaller pieces: sentence, noun phrase, verb phrase. Dependency-based trees work with the parts directly: noun, verb, adverb. Dependency-based trees are simpler on average than constituency-based parse trees because they contain many fewer nodes.
  21. Abstract syntax tree: the syntax is "abstract" in not representing every detail appearing in the real syntax. Grouping parentheses are implicit in the tree structure, and a syntactic construct like an if-condition-then expression may be denoted by means of a single node with three branches.
  22. LALR – look ahead left to right rightmost derivation – the look ahead can be different depending on the parser type – but bison and friends are all LALR(1) generators
  23. bison is re-entrant but NOT thread safe
  24. Bison reads a specification of a context-free language, warns about any parsing ambiguities, and generates a parser (either in C, C++, or Java) which reads sequences of tokens and decides whether the sequence conforms to the syntax specified by the grammar. Note that bison is re-entrant – it’s not by default thread safe (these are two different things).
  25. Lemon requires you to write more rules than Bison because of its simplified syntax: no repetitions or optionals, one action per rule, etc. It has the complete set of LALR(1) parser limitations. Only the C language.
  26. Code is reentrant if it can be interrupted in the middle of its execution and then safely called again ("re-entered") before its previous invocations complete execution. A reentrant subroutine can achieve thread-safety, but being reentrant alone might not be sufficient to be thread-safe in all situations. Conversely, thread-safe code does not necessarily have to be reentrant. A piece of code is thread-safe if it only manipulates shared data structures in a manner that guarantees safe execution by multiple threads at the same time.
  27. compilers generally write out to assembly or machine code but technically anything can be compiled down to something to be run (plug reckit)
  28. An interpreter is a computer program that directly executes, i.e. performs, instructions written in a programming or scripting language, without previously compiling them into a machine language program.
  29. PHP bison file PHP bison C output
  30. hand written lexer and lemon parser
  31. A parser is a program which processes an input and "understands" it. A lexer is a program which splits something into tokens and assigns each a value. There are steps you can take to make doing this easier and make you feel less “OMG I’m WRITING A PARSER”, or you can cheat and just use a generator.
  32. So when you first get started think of something small
  33. Each of these types of lexers is going to have its advantages and disadvantages. The trick here is to not let the lexer do more than it’s supposed to. It should be context free or you’ll hate yourself later; if you absolutely positively have to look ahead or look behind, you’ll hate yourself later. Put as much information into your token definition as you want.