Lexing and parsing

LEXING AND PARSING
THE BEGINNER’S GUIDE

WHY ARE WE DOING THIS?
• bbcode
• html
• xml
• programming language

BUT I CAN JUST REGEX
• sometimes you can
• sometimes you can’t
• is your html well formed? (view source some time)
• it depends!!
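
A small sketch of the "it depends" point, written in Python rather than PHP and using a made-up bbcode-style snippet. A regex copes with flat tags but mispairs nested tags of the same name; a simple counter (the smallest possible bit of parser state) does not. The pattern `PAIR` and the helper `max_depth` are illustrative names, not part of any library.

```python
import re

# Flat input: a non-greedy regex pairs each open tag with its close tag.
flat = "[b]one[/b] and [b]two[/b]"
PAIR = re.compile(r"\[b\](.*?)\[/b\]")
print(PAIR.findall(flat))      # ['one', 'two']

# Nested input: the same regex pairs the outer [b] with the inner [/b].
nested = "[b]outer [b]inner[/b] tail[/b]"
print(PAIR.findall(nested))    # ['outer [b]inner']

def max_depth(text):
    """Track nesting with a counter, which a single regex match cannot do."""
    depth = deepest = 0
    for tag in re.findall(r"\[/?b\]", text):
        depth += 1 if tag == "[b]" else -1
        if depth < 0:
            raise ValueError("close tag without an open tag")
        deepest = max(deepest, depth)
    if depth != 0:
        raise ValueError("unclosed tag")
    return deepest

print(max_depth(nested))       # 2
```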

COMPUTER SCIENCE
WE LIKE ACRONYMS AND WEIRD WORDS

ENGLISH IS HARD!
• tokenizer
• scanner
• lexer
• parser
• lexical analyzer
• syntactic analyzer
• formal grammar

LEXICAL ANALYSIS
BREAK DOWN INPUT INTO A SEQUENCE OF TOKENS
LEXING

SCANNING
• Finite State Machine
• Finds Lexemes
• Might backtrack

EVALUATOR
• looks at lexeme to get value
• lexeme + value = token
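
The scanner/evaluator split might look roughly like this. It is a hand-written sketch in Python, not how PHP does it: `scan` is a tiny state machine that finds lexemes, and `evaluate` attaches a kind and a value to each one. The token kinds (`VARIABLE`, `NUMBER`, `SYMBOL`) are invented for the example, and this scanner never needs to backtrack.

```python
def scan(source):
    """Scanner: a small state machine that yields raw lexemes."""
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch.isspace():                      # skip whitespace
            i += 1
        elif ch == "$":                       # state: inside a variable name
            j = i + 1
            while j < n and (source[j].isalnum() or source[j] == "_"):
                j += 1
            yield source[i:j]
            i = j
        elif ch.isdigit():                    # state: inside an integer
            j = i
            while j < n and source[j].isdigit():
                j += 1
            yield source[i:j]
            i = j
        elif ch in "=;":                      # single-character lexemes
            yield ch
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")

def evaluate(lexeme):
    """Evaluator: lexeme + value = token."""
    if lexeme.startswith("$"):
        return ("VARIABLE", lexeme[1:])
    if lexeme.isdigit():
        return ("NUMBER", int(lexeme))
    return ("SYMBOL", lexeme)

tokens = [evaluate(lexeme) for lexeme in scan("$y = 5;")]
print(tokens)
# [('VARIABLE', 'y'), ('SYMBOL', '='), ('NUMBER', 5), ('SYMBOL', ';')]
```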

LEXING PHP - $y = 5;
• $y → array(309, '$y', 1)
• = → '='
• 5 → array(305, '5', 1)
• 309 == T_VARIABLE
• 305 == T_LNUMBER
• the numeric IDs differ between PHP versions; token_name() maps them to names
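
For comparison, Python ships a similar tool in its standard library. This sketch runs the `tokenize` module over the Python equivalent of the statement above (`y = 5`, an assumption, since Python has no `$` sigil). Each token carries an integer type ID, the lexeme, and a position, much like the PHP arrays.

```python
import io
import tokenize

source = "y = 5\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    # tok.type is an integer ID, like 309 or 305 in the PHP output;
    # tokenize.tok_name maps it back to a readable name.
    print(tok.type, tokenize.tok_name[tok.type], repr(tok.string), tok.start[0])
```

The first three tokens come out as NAME, OP and NUMBER, followed by bookkeeping tokens for the newline and the end of input.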

LEXER GENERATORS
DO NOT WRITE THIS BY HAND
Famous
• lex
• flex
• re2c
• ANTLR
• DFASTAR
• jflex
• jlex
• quex
PHP generators
• https://github.com/oliverheins/PHPSimpleLexYacc
  • lex syntax
• https://github.com/pear/PHP_LexerGenerator
  • re2c syntax
• https://github.com/wez/JLexPHP
  • jlex syntax
• token_get_all (see php-parser)
• parse_ini_file/string (combined with parser)

SYNTACTIC ANALYSIS
CONSTRUCTING SOMETHING BASED ON A GRAMMAR
PARSING

THE PARSING PROCESS
• Tokens come in
• Magic
• Data structure comes out
  • parse tree
  • AST

GRAMMAR (FORMAL OF COURSE)
• Brave men run in my family.
• I can't recommend this book too highly.
• Prostitutes Appeal to Pope
• I had had my car for four years before I ever learned to drive it.

TYPES OF PARSERS
• Top Down
  • Recursive descent
  • LL (Left-to-right, Leftmost derivation)
  • Earley parser
• Bottom Up
  • Precedence parser
    • Operator-precedence parser
    • Simple precedence parser
  • BC (bounded context) parsing
  • LR parser (Left-to-right, Rightmost derivation)
    • Simple LR (SLR) parser
    • LALR parser
    • Canonical LR (LR(1)) parser
    • GLR parser
  • CYK parser
  • Recursive ascent parser
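
Of the styles listed, recursive descent is the easiest to write by hand: one function per grammar rule, each peeking at the next token to decide what to do. Below is a minimal Python sketch for arithmetic expressions. The grammar, the tuple-shaped tree, and the names `tokenize`, `Parser` and `parse` are all assumptions made for illustration, and the toy `tokenize` silently ignores characters it does not know.

```python
import re

def tokenize(text):
    """Split into numbers and single-character operators (toy lexer)."""
    return re.findall(r"\d+|[-+*/()]", text)

class Parser:
    """Grammar assumed for this sketch:
         expr   -> term (('+' | '-') term)*
         term   -> factor (('*' | '/') factor)*
         factor -> NUMBER | '(' expr ')'
    """
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Look at the next token without consuming it."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        if token is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return token

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            node = (op, node, self.factor())
        return node

    def factor(self):
        token = self.take()
        if token == "(":
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")

def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree

print(parse("1 + 2 * (3 - 4)"))
# ('+', 1, ('*', 2, ('-', 3, 4)))
```

Operator precedence falls out of the rule nesting: `term` binds tighter than `expr` because `expr` calls it.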

SENTENCE DIAGRAMMING
• People who live in glass houses shouldn't throw stones.

TOP DOWN VS. BOTTOM UP PARSING

PARSE TREES
• Constituency-based parse trees
• Dependency-based parse trees

AST
• Not everything appears
• additional information may be applied
• can “improve” tree nodes
• PHP is getting one!
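
"Not everything appears" is easy to see with Python's own `ast` module, used here only as a convenient stand-in since it exposes a real AST. Parentheses steer the parser but leave no node behind, and each node carries extra information such as its source line.

```python
import ast

# Parentheses and spacing guide the parser but do not survive into the AST.
with_parens = ast.parse("(1 + 2)", mode="eval")
without_parens = ast.parse("1+2", mode="eval")
print(ast.dump(with_parens) == ast.dump(without_parens))   # True

# Grouping is recorded in the shape of the tree instead.
grouped = ast.parse("(1 + 2) * 3", mode="eval").body
print(type(grouped).__name__, type(grouped.op).__name__)            # BinOp Mult
print(type(grouped.left).__name__, type(grouped.left.op).__name__)  # BinOp Add

# Additional information attached to nodes: source position.
print(grouped.lineno)   # 1
```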

LALR(K)
• Look ahead prevents “ambiguous” parsing
• I have one token, what token comes next?

PARSER GENERATORS
Famous
• bison
• bison
• bison
• bison
• yacc
• lemon
• ANTLR
PHP versions
• https://github.com/wez/lemon-php
• https://github.com/pear/PHP_ParserGenerator
  • lemon
• https://github.com/scato/phpeg
  • peg (peg.js)
• https://github.com/jakubkulhan/pacc
  • yacc

BISON
• Generates LALR (or GLR) parsers
• Code in C, C++ or Java
• reentrant with %define api.pure set
• used by ALL THE THINGS
  • PHP
  • Ruby
  • PostgreSQL
  • Go

LEMON
• Generates LALR(1) parser
• reentrant AND thread safe
• non-terminal destructor (leak avoidance)
• push parsing (the tokenizer calls the parser)
• used by SQLite

REENTRANT VS THREAD SAFE
• Process
• Thread
• Locking
• Scope
• Reentrant
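
A rough Python sketch of the difference, using a made-up character scanner. The first version keeps its position in module-level state, so a second scan started before the first finishes clobbers it. The second version keeps state per instance, which makes it reentrant. Reentrant is not the same as thread safe: one `Scanner` instance shared between threads would still need locking.

```python
# Non-reentrant: position lives in module-level state shared by every caller.
_text = ""
_pos = 0

def start(text):
    global _text, _pos
    _text, _pos = text, 0

def next_char():
    global _pos
    if _pos >= len(_text):
        return None
    ch = _text[_pos]
    _pos += 1
    return ch

start("ab")
first = next_char()      # 'a'
start("xy")              # a second scan begins before the first one finished
second = next_char()     # 'x'
third = next_char()      # 'y', not 'b': the first scan lost its place
print(first, second, third)   # a x y

class Scanner:
    """Reentrant: each scan carries its own state."""
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def next_char(self):
        if self.pos >= len(self.text):
            return None
        ch = self.text[self.pos]
        self.pos += 1
        return ch

outer, inner = Scanner("ab"), Scanner("xy")
print(outer.next_char(), inner.next_char(), outer.next_char())   # a x b
```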

COMPILE IT
• transforms a programming language into a language the computer can run (machine code, bytecode)

INTERPRET IT
• directly executes programming language
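
The two approaches can be put side by side on the same tree. In this Python sketch the AST is built by hand as nested tuples, `interpret` walks it and computes the answer directly, and `compile_tree` translates it into instructions for a toy stack machine that `run` then executes. The instruction names `PUSH` and `APPLY` are invented for the example.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

# Hand-built AST for 1 + 2 * 3; tuples are (operator, left, right).
tree = ("+", 1, ("*", 2, 3))

def interpret(node):
    """Interpret: walk the tree and compute the result directly."""
    if isinstance(node, int):
        return node
    op, left, right = node
    return OPS[op](interpret(left), interpret(right))

def compile_tree(node, out=None):
    """Compile: translate the tree into instructions for a stack machine."""
    out = [] if out is None else out
    if isinstance(node, int):
        out.append(("PUSH", node))
    else:
        op, left, right = node
        compile_tree(left, out)
        compile_tree(right, out)
        out.append(("APPLY", op))
    return out

def run(program):
    """A tiny stack machine standing in for 'the computer'."""
    stack = []
    for instruction, arg in program:
        if instruction == "PUSH":
            stack.append(arg)
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(OPS[arg](left, right))
    return stack.pop()

program = compile_tree(tree)
print(interpret(tree))   # 7
print(program)
# [('PUSH', 1), ('PUSH', 2), ('PUSH', 3), ('APPLY', '*'), ('APPLY', '+')]
print(run(program))      # 7
```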

UNDER THE HOOD
WHAT USES THIS STUFF?

PHP
re2c + Bison + these crazy opcodes…

LALR(1) WRITTEN BY HAND
How - pythonic

HHVM
Flex and Bison and JIT – OH MY!

WRITING PARSERS AND LEXERS
THEORIES OF CODING

STEP 1: THINK SMALL
• Writing a general purpose parser is hard – that’s why you use PHP
• Writing a single purpose parser is much easier
• markup text (markdown)
• configuration or definition files (behat/gherkin syntax)
• complex validation (addresses in multiple formats)

STEP 2: SEPARATE AND UNOPTIMIZED
• premature optimization yada yada
• combine after it’s ready to be used (or not at all if you’ll need to change it later)
• lexer and parser each have unique, well defined goals
• the ability to potentially switch parser styles later will help you!

STEP 3: LEXER
• the lexer's job is to recognize tokens
• it can do this via a giant switch statement of doom
• or maybe a giant loop
• or maybe a list of goto statements
• or maybe a complex class with methods
• … or you can just use a generator

LET’S BREAK THAT DOWN
1. Define a token format
2. Define a grammar format (what are we looking for?)
3. Go over the input data (usually a string) and make matches
   • compare or regex or ctype_* or however it makes sense
4. Keep track of your current state
5. Have an output format – AST, tree, whatever
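
Those five steps can be sketched as a table-driven lexer. This one is in Python and targets a made-up INI-like format; the token kinds, the patterns in `TOKEN_SPEC`, and the `lex` function are assumptions for illustration. The numbered comments point back to the steps above.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):          # 1. token format
    kind: str
    text: str
    line: int

TOKEN_SPEC = [                    # 2. what we are looking for
    ("COMMENT", r";[^\n]*"),
    ("SECTION", r"\[[^\]\n]+\]"),
    ("EQUALS",  r"="),
    ("NEWLINE", r"\n"),
    ("SKIP",    r"[ \t]+"),
    ("WORD",    r"[^\s=\[\];]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(text):
    line, pos = 1, 0              # 4. current state
    tokens = []
    while pos < len(text):        # 3. go over the input and make matches
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"line {line}: unexpected {text[pos]!r}")
        kind = match.lastgroup
        if kind == "NEWLINE":
            line += 1
        elif kind not in ("SKIP", "COMMENT"):
            tokens.append(Token(kind, match.group(), line))
        pos = match.end()
    return tokens                 # 5. output format: a flat list of tokens

SAMPLE = "[db]\nhost = localhost ; primary\nport=3306\n"
for token in lex(SAMPLE):
    print(token)
```

Keeping the patterns in one table means adding a token kind is a one-line change, which is most of what a lexer generator automates.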

STEP 4: PARSER
• Loop over our tokens
• Look at the values and decide what to do
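
A minimal sketch of that loop in Python, for a made-up INI-like token stream. The `(kind, text)` token format and the `parse` function are assumptions for illustration; the tokens are written out by hand so the example stands alone.

```python
def parse(tokens):
    """Turn a flat token list into {section: {key: value}}."""
    config = {}
    section = None
    i = 0
    while i < len(tokens):                    # loop over our tokens
        kind, text = tokens[i]
        if kind == "SECTION":                 # decide what to do by kind
            section = config.setdefault(text.strip("[]"), {})
            i += 1
        elif kind == "WORD":
            # Expect the shape WORD EQUALS WORD.
            if section is None:
                raise SyntaxError(f"{text!r} appears before any section")
            if (i + 2 >= len(tokens)
                    or tokens[i + 1][0] != "EQUALS"
                    or tokens[i + 2][0] != "WORD"):
                raise SyntaxError(f"expected '= value' after {text!r}")
            section[text] = tokens[i + 2][1]
            i += 3
        else:
            raise SyntaxError(f"unexpected {kind} token {text!r}")
    return config

tokens = [
    ("SECTION", "[db]"),
    ("WORD", "host"), ("EQUALS", "="), ("WORD", "localhost"),
    ("WORD", "port"), ("EQUALS", "="), ("WORD", "3306"),
]
print(parse(tokens))
# {'db': {'host': 'localhost', 'port': '3306'}}
```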

STEP 5: DO SOMETHING WITH IT!
1. Compile – write out to something that can be run (html)
2. Interpret – run through another program to get output (templates to html)
3. Analyze – run through to analyze the data inside (code analysis/sniffer tools)
4. Validate – check for proper “spelling and grammar”
5. ???
6. PROFIT

“If you’re not sure how to do a job – ask!”
- silly poster on my laundry room wall

RESOURCES
• http://savage.net.au/Ron/html/graphviz2.marpa/Lexing.and.Parsing.Overview.html
• http://nikic.github.io/2011/10/23/Improving-lexing-performance-in-PHP.html
• https://github.com/hafriedlander/php-peg
• https://github.com/nikic/PHP-Parser/
• http://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html
• http://wikipedia.org

CONTACT ME
• auroraeosrose@gmail.com
• auroraeosrose – freenode.net #phpmentoring #phpwomen
• Twitter - @auroraeosrose
• http://github.com/auroraeosrose
