THE PARSING PROCESS
• Tokens come in
• Magic
• Data structure comes out
• parse tree
• AST
GRAMMAR (FORMAL OF COURSE)
• "Brave men run in my family.”
• I can't recommend this book too highly.
• Prostitutes Appeal to Pope
• I had had my car for four years before I ever learned to drive it.
TYPES OF PARSERS
• Top Down
• Recursive Descent
• LL (left to right, leftmost derivation)
• Earley parser
• Bottom Up
• Precedence parser
• Operator-precedence parser
• Simple precedence parser
• BC (bounded context) parsing
• LR parser (Left-to-right, Rightmost derivation)
• Simple LR (SLR) parser
• LALR parser
• Canonical LR (LR(1)) parser
• GLR parser
• CYK parser
• Recursive ascent parser
BISON
• Generates LALR (or GLR) parsers
• Code in C, C++ or Java
• reentrant with %define api.pure set
• used by ALL THE THINGS
• PHP
• Ruby
• PostgreSQL
• Go
STEP 1: THINK SMALL
• Writing a general purpose parser is hard – that’s why you use PHP
• Writing a single purpose parser is much easier
• markup text (markdown)
• configuration or definition files (behat/gherkin syntax)
• complex validation (addresses in multiple formats)
STEP 2: SEPARATE AND UNOPTIMIZED
• premature optimization yada yada
• combine after it’s ready to be used (or not at all, if you’ll need to change it later)
• lexer and parser each have unique, well defined goals
• the ability to potentially switch parser styles later will help you!
STEP 3: LEXER
• the lexer's job is to recognize tokens
• it can do this via a giant switch statement of doom
• or maybe a giant loop
• or maybe a list of goto statements
• or maybe a complex class with methods
• …. or you can just use a generator
LET’S BREAK THAT DOWN
1. Define a token format
2. Define grammar format (what are we looking for?)
3. Go over the input data (usually a string) and make matches
1. compare or regex or ctype_* or however it makes sense
4. Keep track of your current state
5. Have an output format – AST, tree, whatever (a sketch of all five steps follows below)
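Here is a minimal sketch of those five steps in PHP, using a generator as teased on the previous slide. The T_* token names and the toy language (integers, +, *, parentheses) are invented for this example:

    <?php
    // Token format: [type, lexeme, line]. The token names are made up for this sketch.
    function lex(string $input): \Generator
    {
        $patterns = [
            'T_NUMBER' => '/\G\d+/',
            'T_PLUS'   => '/\G\+/',
            'T_STAR'   => '/\G\*/',
            'T_LPAREN' => '/\G\(/',
            'T_RPAREN' => '/\G\)/',
            'T_SPACE'  => '/\G\s+/',
        ];
        $pos  = 0;   // current state: where we are in the string
        $line = 1;   // current state: which line we are on
        $len  = strlen($input);
        while ($pos < $len) {
            foreach ($patterns as $type => $regex) {
                if (preg_match($regex, $input, $m, 0, $pos)) {
                    if ($type !== 'T_SPACE') {           // suppress whitespace lexemes
                        yield [$type, $m[0], $line];
                    }
                    $line += substr_count($m[0], "\n");
                    $pos  += strlen($m[0]);
                    continue 2;                          // on to the next lexeme
                }
            }
            throw new RuntimeException("Unexpected character at position $pos");
        }
    }

    // usage: foreach (lex('1 + 2 * 3') as [$type, $value, $line]) { ... }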
STEP 4: PARSER
• Loop over our tokens
• Look at the values and decide what to do (see the sketch below)
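A minimal recursive descent sketch that consumes the tokens from the lexer sketch above; the grammar and the array-based node shapes are invented for illustration:

    <?php
    // Grammar for this sketch:
    //   expr   -> term (T_PLUS term)*
    //   term   -> factor (T_STAR factor)*
    //   factor -> T_NUMBER | T_LPAREN expr T_RPAREN
    final class Parser
    {
        private array $tokens;
        private int $pos = 0;

        public function __construct(iterable $tokens)
        {
            $this->tokens = is_array($tokens) ? $tokens : iterator_to_array($tokens);
        }

        public function parse(): array
        {
            $node = $this->expr();
            if ($this->pos < count($this->tokens)) {
                throw new RuntimeException('Trailing tokens after expression');
            }
            return $node;
        }

        private function expr(): array
        {
            $node = $this->term();
            while ($this->peek() === 'T_PLUS') {          // loop over our tokens...
                $this->pos++;
                $node = ['+', $node, $this->term()];      // ...and decide what to do
            }
            return $node;
        }

        private function term(): array
        {
            $node = $this->factor();
            while ($this->peek() === 'T_STAR') {
                $this->pos++;
                $node = ['*', $node, $this->factor()];
            }
            return $node;
        }

        private function factor(): array
        {
            $token = $this->tokens[$this->pos] ?? null;
            if ($token === null) {
                throw new RuntimeException('Unexpected end of input');
            }
            if ($token[0] === 'T_NUMBER') {
                $this->pos++;
                return ['num', (int) $token[1]];
            }
            if ($token[0] === 'T_LPAREN') {               // parens shape the tree...
                $this->pos++;
                $node = $this->expr();
                if ($this->peek() !== 'T_RPAREN') {
                    throw new RuntimeException('Expected )');
                }
                $this->pos++;
                return $node;                             // ...but leave no node behind
            }
            throw new RuntimeException("Unexpected token {$token[0]}");
        }

        private function peek(): ?string
        {
            return $this->tokens[$this->pos][0] ?? null;
        }
    }

    // usage: (new Parser(lex('1 + 2 * 3')))->parse()
    //        => ['+', ['num', 1], ['*', ['num', 2], ['num', 3]]]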
STEP 5: DO SOMETHING WITH IT!
1. Compile – write out to something that can be run (html)
2. Interpret – run through another program to get output (templates to html) – see the sketch after this list
3. Analyze – run through to analyze the data inside (code analysis/sniffer tools)
4. Validate – check for proper “spelling and grammar”
5. ???
6. PROFIT
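As a tiny example of option 2, this sketch “interprets” the tree produced by the parser sketch above by walking it recursively (again, the node shapes are this example’s own invention):

    <?php
    // Walk the tree: leaves return their value, inner nodes combine their children.
    function evaluate(array $node): int
    {
        return match ($node[0]) {
            'num' => $node[1],
            '+'   => evaluate($node[1]) + evaluate($node[2]),
            '*'   => evaluate($node[1]) * evaluate($node[2]),
        };
    }

    // usage: evaluate((new Parser(lex('1 + 2 * 3')))->parse()) === 7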
“If you’re not sure how to do a job – ask!”
- silly poster on my laundry room wall
Why I got started with this
I’ve never taken a computer class
I wanted to understand why PHP worked the way it does because I’d been pondering putting some eventing/async magic inside
and I ended up down this deep computer science pit where compilers are at the bottom
Lexers are used to recognize "words" that make up language elements, because the structure of such words is generally simple. Regular expressions are extremely good at handling this simpler structure, and there are very high-performance regular-expression matching engines used to implement lexers.
Parsers are used to recognize the "structure" of a language's phrases. Such structure is generally far beyond what "regular expressions" can recognize, so one needs "context sensitive" parsers to extract such structure. Context-sensitive parsers are hard to build, so the engineering compromise is to use "context-free" grammars and add hacks to the parsers ("symbol tables", etc.) to handle the context-sensitive part.
Regular expressions can only match regular languages, but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics, and that will not work in every case. It should be possible to present an HTML file that will be matched wrongly by any given regular expression.
A formal grammar defines (or generates) a formal language, which is a (usually infinite) set of finite-length sequences of symbols (i.e. strings) that may be constructed by applying production rules to another sequence of symbols which initially contains just the start symbol. (A tiny example follows below.)
Type-0 grammars (unrestricted grammars) include all formal grammars.
Type-1 grammars (context-sensitive grammars) generate the context-sensitive languages.
Type-2 grammars (context-free grammars) generate the context-free languages.
Type-3 grammars (regular grammars) generate the regular languages.
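A classic example to make that concrete – a type-2 (context-free) grammar with start symbol S, terminals a and b, and ε for the empty string:

    S → a S b
    S → ε

This generates the language { aⁿbⁿ : n ≥ 0 }, i.e. some a's followed by exactly the same number of b's. No type-3 (regular) grammar – and therefore no regular expression – can match it, because doing so requires counting unbounded nesting. Arbitrarily nested HTML tags have the same shape, which is exactly the point made above about regexps and HTML.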
So computer science is a really weird discipline
quite a bit of what computer science is and does comes from – well – math
and the other part – the “language” aspects and even concepts of grammar and meaning are from “English” or “language arts” as my kids school calls it
the only “science” part that I think really applies is that we test theories and apply logic
at its core, remember, computers are algorithms (rules) and information (data), but “computer science” has grown to encompass LOTS of things
What we’re going to talk about is a small but fundamental window – lexing and parsing – so let’s start with words
Ask who has seen these terms before
Ask if anyone knows a definition of these terms, even a non-computer science definition
so almost all of these terms have different meanings depending on their context
in computer science definitions are what we’re going to be using
we’re also going to mention that some terms get thrown around a bit (parser and scanner are the two worst)
but I’m also going to attempt to help you build your own internal rules – always using these terms in the “computer science dictionary” manner – so you don’t confuse yourself and others
Scanner == first stage of lexer
Strictly speaking, a lexer is itself a kind of parser but we won’t EVER call it a parser cause CONFUSION
the syntax of some programming languages is divided into two pieces: the lexical syntax (token structure), which is processed by the lexer; and the phrase syntax, which is processed by the parser
The lexical syntax is usually a regular language, whose alphabet consists of the individual characters of the source code text.
The phrase syntax is usually a context-free language, whose alphabet consists of the tokens produced by the lexer.
While this is a common separation, alternatively, a lexer can be combined with the parser in scannerless parsing. I would say though – DO NOT DO THIS
it may seem easier in the short term but when you have to start changing stuff you will have PAIN
Finite state machine – we have a finite (bounded) list of states and the machine can be in one state at any one time
Because a finite state machine can represent any history and a reaction (by regarding the change of state as a response to the history),
it has been argued that it is a sufficient model of human behaviour, i.e. humans are finite state machines.
lexeme == characters that have been matched by our state machine
needs to be translated to a value
States – happy, sad, angry
inputs – money, food, kick in pants
outputs – smile, frown, punch back
set up example of a state machine for people (sketch below)
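A minimal sketch of that people-machine in PHP; the transition table is invented for this example – current state plus input gives a new state and an output:

    <?php
    // transitions[state][input] => [new state, output]
    $transitions = [
        'sad'   => ['money'         => ['happy', 'smile'],
                    'kick in pants' => ['angry', 'frown']],
        'happy' => ['food'          => ['happy', 'smile'],
                    'kick in pants' => ['angry', 'frown']],
        'angry' => ['food'          => ['sad',   'frown'],
                    'kick in pants' => ['angry', 'punch back']],
    ];

    $state = 'sad'; // the machine is in exactly one state at any one time
    foreach (['money', 'kick in pants', 'food'] as $input) {
        [$state, $output] = $transitions[$state][$input];
        echo "input: $input -> state: $state, output: $output\n";
    }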
sometimes there isn’t a value (parentheses in a programming language, for example)
sometimes a lexeme is suppressed (comments anyone?)
sometimes even a lexeme or token is ADDED by the lexer
line continuation (C code)
semi-colon insertion (lazy bad JavaScript! and Go? really!)
off-side rule – blocks with indents (oh python) or braces (php and C and friends)
context sensitivity
good lexers are NOT context-sensitive
the more lookahead, lookbehind, and backtracking a lexer needs, the more context-sensitive (and slower, and more fragile) it becomes
so discuss a little bit about PHP
its lexer is exposed with token_get_all
it’ll “parse”/“tokenize” (lex is the correct term) the PHP fed to it
this is why there are many parsers written in PHP but not really any lexers – it’s already in there
This is GENERALLY the easy part!
what is the 1? – line numbers (the third element of each token array; example below)
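For example (the input string here is just an illustration):

    <?php
    // token_get_all() is PHP's own lexer, exposed to userland
    foreach (token_get_all('<?php echo 1;') as $token) {
        if (is_array($token)) {
            // [0] => token id, [1] => the lexeme, [2] => the line number (the "1")
            echo token_name($token[0]), ' ', var_export($token[1], true),
                 ' (line ', $token[2], ")\n";
        } else {
            echo "literal: $token\n"; // one-character tokens come back as plain strings
        }
    }

This prints T_OPEN_TAG, T_ECHO, T_WHITESPACE and T_LNUMBER entries, followed by the literal ";".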
ANTLR - Can generate lexical analyzers and parsers.
DFASTAR - Generates DFA matrix table-driven lexers in C++.
Flex - Alternative variant of the classic "lex" (C/C++).
JFlex - A rewrite of JLex.
Ragel - A state machine and lexer generator with output in C, C++, C#, Objective-C, D, Java, Go and Ruby.
The following lexical analysers can handle Unicode:
JavaCC - JavaCC generates lexical analyzers written in Java.
JLex - A lexical analyzer generator for Java.
Quex - A fast universal lexical analyzer generator for C and C++.
So if you’re generating a lexer, most generators take the same ingredients: rules, named definitions and in-place configurations.
ah, the overloading of the word parsing
syntactic analysis and grammar
looks at the data sent in and builds a model of what that data looks like – usually some kind of data structure or tree
just like in English we take grammar to define ideas
A parser is a software component that takes input data (frequently text) and builds a data structure – often some kind of parse tree, abstract syntax tree or other hierarchical structure – giving a structural representation of the input, checking for correct syntax in the process
you can do scannerless (again with the silly overloading of words) – a “non lexed” parser but – sigh
A formal grammar is a set of rules for rewriting strings, along with a "start symbol" from which rewriting starts
Parsing is the process of recognizing an utterance (a string in natural languages) by breaking it down to a set of symbols and analyzing each one against the grammar of the language
why what comes before and after can be important when parsing
your brain is a very good parser
one first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar.
top down parsers can be small and powerful and readable, although they can be slower
a top down parser with a direct path through the grammar is going to beat one that has to take a more complex path
a bottom up parser can be faster, but you need to match the type of parser with what you’re doing
so let’s take a theoretical piece of code that’s been lexed into these values and turn it into a “parse tree” – we’ll get into that in a moment
The opposite of this are top-down parsing methods, in which the input's overall structure is decided (or guessed at) first, before dealing with mid-level parts, leaving the lowest-level small details to last. A top-down parser discovers and processes the hierarchical tree starting from the top, and incrementally works its way downwards and rightwards. Top-down parsing eagerly decides what a construct is much earlier, when it has only scanned the leftmost symbol of that construct and has not yet parsed any of its parts. Left corner parsing is a hybrid method which works bottom-up along the left edges of each subtree, and top-down on the rest of the parse tree.
If a language grammar has multiple rules that may start with the same leftmost symbols but have different endings, then that grammar can be efficiently handled by a deterministic bottom-up parse but cannot be handled top-down without guesswork and backtracking. So bottom-up parsers handle a somewhat larger range of computer language grammars than do deterministic top-down parsers.
Bottom-up parsing is sometimes done by backtracking. But much more commonly, bottom-up parsing is done by a shift-reduce parser such as a LALR parser.
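To make shift-reduce concrete, here is a worked trace for the classic toy grammar E → E + T | T, T → T * F | F, F → num on the input 1 + 2 * 3 (the stack is on the left of the bar):

    stack     | remaining | action
              | 1 + 2 * 3 | shift
    1         | + 2 * 3   | reduce F -> num
    F         | + 2 * 3   | reduce T -> F
    T         | + 2 * 3   | reduce E -> T
    E         | + 2 * 3   | shift
    E +       | 2 * 3     | shift
    E + 2     | * 3       | reduce F -> num
    E + F     | * 3       | reduce T -> F
    E + T     | * 3       | shift (lookahead * says: don't reduce E -> E + T yet)
    E + T *   | 3         | shift
    E + T * 3 |           | reduce F -> num
    E + T * F |           | reduce T -> T * F
    E + T     |           | reduce E -> E + T
    E         |           | accept

That single token of lookahead is the “(1)” in LALR(1).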
ordered, rooted tree that represents the syntactic structure of a string
their structure and elements more concretely reflect the syntax of the input language
dependency based – labels the words themselves – noun, verb, adverb
dependency-based parse trees are simpler on average than constituency-based parse trees because they contain many fewer nodes
constituency based – breaks the sentence into nested phrases –
sentence, noun phrase, verb phrase – and breaks those down into smaller pieces
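To make that concrete, here are both kinds of tree for “Brave men run in my family” (a rough hand analysis, not the output of a real parser). Constituency based:

    (S
      (NP (Adj Brave) (N men))
      (VP (V run)
          (PP (P in) (NP (Det my) (N family)))))

Dependency based, where each word hangs off the word it depends on:

    run
    ├── men
    │   └── Brave
    └── in
        └── family
            └── my

The dependency tree has one node per word; the constituency tree has all of those plus the phrase nodes.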
abstract syntax tree
The syntax is "abstract" in not representing every detail appearing in the real syntax.
grouping parentheses are implicit in the tree structure, and a syntactic construct like an if-condition-then expression may be denoted by means of a single node with three branches.
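A sketch of what such a node might look like in PHP (the class and property names are invented for illustration):

    <?php
    // One node, three branches: no parentheses, braces, or keywords are stored,
    // because the tree structure itself already implies them.
    final class IfNode
    {
        public function __construct(
            public object $condition,   // an expression node
            public object $then,        // a statement (or block) node
            public ?object $else = null // optional else branch
        ) {}
    }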
LALR – look-ahead, left-to-right, rightmost derivation – the amount of look-ahead can be different depending on the parser type – but bison and friends are all LALR(1) generators
a bison parser can be made re-entrant, but that does NOT make it thread safe
Bison reads a specification of a context-free language, warns about any parsing ambiguities, and generates a parser (either in C, C++, or Java) which reads sequences of tokens and decides whether the sequence conforms to the syntax specified by the grammar
note that a re-entrant bison parser is still not thread safe by default (these are two different things)
Lemon requires you to write more rules in comparison with Bison because of its simplified syntax: no repetitions and optionals, one action per rule, etc.
It has the complete set of LALR(1) parser limitations.
It outputs only the C language.
reentrant if it can be interrupted in the middle of its execution and then safely called again ("re-entered") before its previous invocations complete execution
A reentrant subroutine can achieve thread-safety, but being reentrant alone might not be sufficient to be thread-safe in all situations. Conversely, thread-safe code does not necessarily have to be reentrant.
A piece of code is thread-safe if it only manipulates shared data structures in a manner that guarantees safe execution by multiple threads at the same time
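A small sketch of the reentrancy half of that distinction in PHP (the function and class names are invented):

    <?php
    // NOT reentrant: the cursor hides in a static variable, so a second,
    // interleaved use of this function clobbers the first one's state.
    function nextTokenStatic(string $src): ?string
    {
        static $pos = 0;
        return $src[$pos++] ?? null;
    }

    // Reentrant: all state travels in a caller-owned object, so any number of
    // in-flight invocations can coexist. This alone says nothing about thread
    // safety, which is about how *shared* data is manipulated.
    final class Cursor { public int $pos = 0; }

    function nextToken(string $src, Cursor $cur): ?string
    {
        return $src[$cur->pos++] ?? null;
    }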
compilers generally write out to assembly or machine code
but technically anything can be compiled down to something to be run (plug reckit)
interpreter is a computer program that directly executes, i.e. performs, instructions written in a programming or scripting language, without previously compiling them into a machine language program
PHP bison file
PHP bison C output
hand written lexer and lemon parser
A parser is a program which processes an input and "understands" it
a lexer is a program which splits something into tokens and assigns it a value
There are steps you can take to make doing this easier and make you feel less “OMG I’m WRITING A PARSER”
or you can cheat and just use a generator
So when you first get started think of something small
Each of these types of lexers is going to have its advantages and disadvantages
The trick here is not to let the lexer do more than it’s supposed to
it should be context free or you’ll hate yourself later
if you absolutely positively have to lookahead or lookbehind you’ll hate yourself later
put as much information into your token definition as you want