How to Parse a File (DDD North 2017)

Matt Ellis

@citizenmatt
How to parse a file

Why would we write a parser?
• Speed, efficiency
• Reduce dependencies
• Custom or simple formats
• Things that aren’t files - DSLs 
Command line options, HTTP headers, stdout, natural language commands 
E.g. YouTrack queries
• When we’re just as interested in the structure of a file 
as its contents

Matt Ellis
Developer advocate 
JetBrains 
@citizenmatt

JetBrains IDE architecture (kinda)
[Diagram: layers labelled PSI, Features, Project Model and Base Platform]

Unity and ShaderLab

What are we trying to build?

How to parse a file for an IDE

Hand rolled parser
var c = ReadChar();
switch (c) {
    case 's':
        c = ReadChar();
        switch (c) {
            case 'h':
                // Parse rest of "Shader", then sub-elements, …
                // Create syntax tree node(s) …
                break;
            default:
                SyntaxError();
                break;
        }
        break;
    case 'p':
        // Parse rest of "Properties", then sub-elements, …
        // Create syntax tree node(s) …
        break;
}

Compiler pipeline
Front end: Lexical analysis → Syntactic analysis → Semantic analysis
Back end: Code optimisation → Code generation

IDE pipeline
Lexical analysis → Syntactic analysis → Semantic analysis

IDE pipeline
Lexer → Parser → Program structure

What is a lexer (aka scanner)?
• Performs lexical analysis 
Lexical - relating to the words or vocabulary of a language
• Converts a string into a stream of tokens 
Identifier, comment, string literal, braces, parentheses, whitespace, etc.
• Tokens are lightweight - typically integer values 
(ReSharper uses singleton object instances)
• Parser pattern matches over tokens 
Integer or object reference comparisons
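A minimal sketch of the idea in Python, assuming a handful of hypothetical ShaderLab-like rules. The rule set, the `Token` type and the `BAD_CHARACTER` fallback are illustrative assumptions, not ReSharper's lexer. Python's `re` alternation takes the first rule that matches rather than the longest, so keywords are listed before identifiers.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    offset: int
    type: str
    text: str


# Hypothetical rules for illustration. Order matters: first match wins.
RULES = [
    ("END_OF_LINE_COMMENT", r"//[^\r\n]*"),
    ("NEW_LINE", r"\r\n|\n|\r"),
    ("WHITESPACE", r"[ \t]+"),
    ("SHADER_KEYWORD", r"Shader\b"),
    ("STRING_LITERAL", r'"[^"\r\n]*"'),
    ("LBRACE", r"\{"),
    ("RBRACE", r"\}"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))


def lex(text: str) -> Iterator[Token]:
    """Convert a string into a stream of (offset, type, text) tokens."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            # Unknown character: emit it as its own token and keep going.
            yield Token(pos, "BAD_CHARACTER", text[pos])
            pos += 1
            continue
        yield Token(pos, m.lastgroup, m.group())
        pos = m.end()
```

The parser would then compare `token.type` values only, never the raw text.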

Lexer output
// Colored vertex lighting
Shader "MyShader"
{
  // a single color property
  Properties {
    _Color ("Main Color", Color) = (1, .5,.5,1)
  }
  // define one subshader
  SubShader
  {
    // a single pass in our subshader
    Pass
    {
      Material
      {
        Diffuse [_Color]
      }
      Lighting On
    }
  }
}
0000: END_OF_LINE_COMMENT '// Colored vertex lighting'
0026: NEW_LINE '\r\n'
0028: SHADER_KEYWORD 'Shader'
0034: WHITESPACE ' '
0035: STRING_LITERAL '"MyShader"'
0045: NEW_LINE '\r\n'
0047: LBRACE '{'
0048: NEW_LINE '\r\n'
0052: END_OF_LINE_COMMENT '// a single color property'
0078: NEW_LINE '\r\n'
0082: PROPERTIES_KEYWORD 'Properties'
0093: LBRACE '{'
0094: NEW_LINE '\r\n'
0100: IDENTIFIER '_Color'
0107: LPAREN '('
0108: STRING_LITERAL '"Main Color"'
0120: COMMA ','
0122: COLOR_KEYWORD 'Color'
0127: RPAREN ')'
0129: EQUALS '='
0131: LPAREN '('
…

Lexers are a solved problem
Use a lexer generator 
lex (1975), flex, CsLex, FsLex, JFlex, etc.

Anatomy of a lexer input file
User code (e.g. using directives)
%%
directives 
set up namespaces, class names, interfaces 
declare regex macros 
declare states
%%
rules and actions 
<state> rule { action }

ShaderLab lexer
Demo

How does it work?
• Lexer generates source code
• Rules (regexes) converted into single Finite State Machine 
All regexes combined, matched at same time
• Encoded in state transition tables
• Lookup based on state and input char
• Very fast
• Not very maintainable 
Seriously

a(b|c)d*e+
Pete Jinks - http://www.cs.man.ac.uk/~pjj/cs211/ho/node6.html

Rule: a(b|c)d*e+
State  ‘a’   ‘b’   ‘c’   ‘d’   ‘e’   other
0      m(1)  E     E     E     E     E
1      E     m(2)  m(2)  E     E     E
2      E     E     E     m(2)  m(3)  E
3      a     a     a     a     m(3)  a
m(x) - match, move to state x
a - accept
E - error
Pete Jinks - http://www.cs.man.ac.uk/~pjj/cs211/ho/node6.html
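A rough Python sketch of how generated code might drive such a table, assuming the single rule a(b|c)d*e+ and treating end of input as the "other" column. The names `TABLE`, `COLUMNS` and `match_length` are made up for illustration. Real generators emit compressed integer arrays, but the lookup is the same idea: current state plus input character gives the next action.

```python
# Column index for each input character; anything else is "other".
COLUMNS = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
OTHER = 5
ERROR, ACCEPT = "E", "a"

# One row per state; an int means "match and move to that state".
TABLE = [
    [1, ERROR, ERROR, ERROR, ERROR, ERROR],       # state 0
    [ERROR, 2, 2, ERROR, ERROR, ERROR],           # state 1
    [ERROR, ERROR, ERROR, 2, 3, ERROR],           # state 2
    [ACCEPT, ACCEPT, ACCEPT, ACCEPT, 3, ACCEPT],  # state 3
]


def match_length(text, start=0):
    """Return the length of the token matched at start, or None on error."""
    state = 0
    pos = start
    while True:
        # End of input behaves like the "other" column.
        column = COLUMNS.get(text[pos], OTHER) if pos < len(text) else OTHER
        action = TABLE[state][column]
        if action == ACCEPT:
            return pos - start
        if action == ERROR:
            return None
        state = action
        pos += 1
```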

Rules: a(b|c)d*e+ and [0-9]+
State  ‘a’   ‘b’   ‘c’   ‘d’   ‘e’   [0-9]  other
0      m(1)  E     E     E     E     m(4)   E
1      E     m(2)  m(2)  E     E     E      E
2      E     E     E     m(2)  m(3)  E      E
3      a     a     a     a     m(3)  a      a
4      a     a     a     a     a     m(4)   a

What is a parser?
• Performs syntactic analysis 
Verifies and matches syntax of a file
• Pattern matching on stream of tokens from lexer 
Can look at token offsets and text, too
• Syntax is described by a grammar
• Grammar is represented as a recursive hierarchy of rules 
Top level is the whole file, composing down to structures and tokens

Example grammar
shaderFile:
    SHADER_KEYWORD
    STRING_LITERAL
    LBRACE
    propertiesBlock?
    tagsBlock?
    …
    RBRACE
;
propertiesBlock:
    PROPERTIES_KEYWORD
    LBRACE
    property*
    RBRACE
;
tagsBlock:
    TAGS_KEYWORD
    LBRACE
    tag*
    RBRACE
;
Shader "MyShader"
{
  Properties { … }
  Tags { … }
  …
}

Parsing is NOT a solved problem
Well, it is, kinda. There are just lots of solutions

Types of parsers
• Top down/recursive descent 
Match the root of the tree, recursively split up into child elements
• Bottom up/recursive ascent 
Start with matching the leaves of the tree, combine into larger
constructs as you go

Top down parser
parseShaderLabFile() 
parseShaderCommand() 
match(SHADER_KEYWORD) 
parseShaderValue() 
parseShaderValueName() 
match(STRING_LITERAL) 
match(LBRACE) 
if (tokenType == PROPERTIES_KEYWORD) 
parsePropertiesCommand() 
… 
match(RBRACE)
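The same shape as a small runnable Python sketch: one method per grammar rule, each calling `match` for tokens and other methods for sub-rules. The `Parser` class, the tuple-based tree and the cut-down grammar (property rules omitted) are assumptions for illustration only.

```python
class ParseError(Exception):
    pass


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens  # list of (token_type, text), already filtered
        self.pos = 0

    @property
    def token_type(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

    def match(self, expected):
        """Consume the current token if it has the expected type."""
        if self.token_type != expected:
            raise ParseError(f"expected {expected}, got {self.token_type}")
        text = self.tokens[self.pos][1]
        self.pos += 1
        return (expected, text)

    def parse_shader_file(self):
        children = [
            self.match("SHADER_KEYWORD"),
            self.match("STRING_LITERAL"),
            self.match("LBRACE"),
        ]
        # Optional rule: only descend if the lookahead token starts it.
        if self.token_type == "PROPERTIES_KEYWORD":
            children.append(self.parse_properties_block())
        children.append(self.match("RBRACE"))
        return ("shaderFile", children)

    def parse_properties_block(self):
        children = [self.match("PROPERTIES_KEYWORD"), self.match("LBRACE")]
        # The individual property rules are omitted in this sketch.
        children.append(self.match("RBRACE"))
        return ("propertiesBlock", children)
```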

Bottom up parser
Shift/Reduce algorithm
Match token 
Shift token onto stack (e.g. INTEGER, OP_PLUS, INTEGER) 
Reduce larger construct (e.g. INTEGER + INTEGER becomes EXPRESSION)
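A toy Python sketch of shift/reduce, assuming just two hypothetical rules for additions. Real bottom up parsers decide when to shift or reduce from generated tables and a lookahead token; this version simply reduces whenever the top of the stack matches a rule.

```python
# Hypothetical grammar: (pattern on top of the stack, construct it reduces to).
RULES = [
    (("INTEGER", "OP_PLUS", "INTEGER"), "EXPRESSION"),
    (("EXPRESSION", "OP_PLUS", "INTEGER"), "EXPRESSION"),
]


def shift_reduce(token_types):
    stack = []
    trace = []
    for token_type in token_types:
        stack.append(token_type)  # shift
        trace.append(("shift", tuple(stack)))
        reduced = True
        while reduced:  # reduce for as long as a rule matches
            reduced = False
            for pattern, result in RULES:
                if tuple(stack[-len(pattern):]) == pattern:
                    del stack[-len(pattern):]
                    stack.append(result)
                    trace.append(("reduce", tuple(stack)))
                    reduced = True
                    break
    return stack, trace
```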

Building a parser
• Hand rolled 
Mechanical process to build. Easy to understand 
Usually top down/recursive descent 
Can use grammar to build syntax tree classes
• Parser generators 
yacc/bison, ANTLR, etc. 
Usually bottom up. Can be hard to debug - table driven
• ReSharper mostly uses top-down procedural parsers 
Generated and hand rolled 
Mainly historical. Easier to maintain, easier error recovery, etc.

Parser combinators
• Build a parser by combining other, simpler parsers
• Monads! 
Think LINQ - similar idea, similar ease of use, similar cost

FParsec for F#
// pstring - parse a string
// pfloat - parse a float
// spaces1 - parse one or more whitespace chars

let pforward = (pstring "fd" <|> pstring "forward") >>. spaces1 >>. pfloat
               |>> fun n -> Forward(int n)
let pleft = (pstring "left" <|> pstring "lt") >>. spaces1 >>. pfloat
            |>> fun x -> Left(int -x)
let pright = (pstring "right" <|> pstring "rt") >>. spaces1 >>. pfloat
             |>> fun x -> Right(int x)
let pcommand = pforward <|> pleft <|> pright
Phil Trelford - http://trelford.com/blog/post/FParsec.aspx

Sprache for C#
Parser<string> identifier =
from leading in Parse.WhiteSpace.Many()
from first in Parse.Letter.Once()
from rest in Parse.LetterOrDigit.Many()
from trailing in Parse.WhiteSpace.Many()
select new string(first.Concat(rest).ToArray());
var id = identifier.Parse(" abc123 ");
Assert.AreEqual("abc123", id);

Problem #1
Whitespace and comments

We’d expect this to work:
shaderBlock: 
SHADER_KEYWORD 
STRING_LITERAL 
LBRACE 
… 
RBRACE 
;
Shader "MyShader" 
{ 
… 
}

But this is the actual input…
Shader\r\n
··········"MyShader"\r\n
·······\n
/* Cool shader! */\n
{···…········}\r\n

Which lexes as…
SHADER_KEYWORD 
NEW_LINE 
WHITESPACE 
STRING_LITERAL 
NEW_LINE 
WHITESPACE 
NEW_LINE 
COMMENT 
NEW_LINE 
LBRACE 
WHITESPACE 
… 
WHITESPACE 
RBRACE
Shader\r\n
··········"MyShader"\r\n
·······\n
/* Cool shader! */\n
{···…········}\r\n

Which doesn’t match the grammar
shaderBlock: 
SHADER_KEYWORD 
STRING_LITERAL 
LBRACE 
… 
RBRACE 
;
SHADER_KEYWORD 
NEW_LINE 
WHITESPACE 
STRING_LITERAL 
NEW_LINE 
WHITESPACE 
NEW_LINE 
COMMENT 
NEW_LINE 
LBRACE 
WHITESPACE 
… 
WHITESPACE 
RBRACE

Filtering lexers
• Filter whitespace and comments from the stream of tokens
ReSharper’s tokens have IsFiltered property
• Decorator pattern
Wrap original lexer, swallow filtered tokens
[Diagram: Lexer → Filtering lexer → Parser → Program structure]
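A small Python sketch of the decorator idea. `ListLexer` is a stand-in for a generated lexer, and the `token`/`advance` interface and the `FILTERED` set are assumptions for illustration, not ReSharper's actual API.

```python
FILTERED = {"WHITESPACE", "NEW_LINE", "COMMENT", "END_OF_LINE_COMMENT"}


class ListLexer:
    """Stand-in for a generated lexer: walks a list of (type, text) tokens."""

    def __init__(self, tokens):
        self._tokens = tokens
        self._pos = 0

    @property
    def token(self):
        return self._tokens[self._pos] if self._pos < len(self._tokens) else None

    def advance(self):
        self._pos += 1


class FilteringLexer:
    """Decorator with the same interface that swallows filtered tokens."""

    def __init__(self, inner):
        self._inner = inner
        self._skip()

    def _skip(self):
        while self._inner.token is not None and self._inner.token[0] in FILTERED:
            self._inner.advance()

    @property
    def token(self):
        return self._inner.token

    def advance(self):
        self._inner.advance()
        self._skip()
```

Because the wrapper exposes the same interface, the parser does not need to know whether it is reading the raw or the filtered stream.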

What are we building?
Is it safe to lose the whitespace?

IDE requirements, Part 1
• Code editor features 
Syntax highlighting, code folding, etc.
• Syntax error highlighting
• Inspections
• Refactoring
• Formatting
• Etc.

IDE requirements, Part 1
• Need to work with the contents and structure of a file
• Contents give us semantic information
• Structure allows us to report inspections, refactor, etc. 
Map the semantics back to the file
• Need to represent the structure of the file
• Syntax tree is obvious choice 
Inspections walk the tree, refactorings rewrite the tree

Abstract Syntax Trees
[Figure: abstract syntax trees for simple addition expressions]

Concrete Parse Trees
[Figure: concrete parse tree for an addition expression, with whitespace (WS), new line (NL) and comment (// …) nodes kept in the tree]

Side problem #1
No guidance for designing parse trees!

Back to Filtering Lexers
• If we filter tokens out, we have to add them back again
• We need a Missing Tokens Inserter to add whitespace and comments back into parse tree
[Diagram: pipeline of Lexer, Filtering lexer, Parser and Missing tokens inserter, producing a concrete parse tree]

Missing Tokens Inserter
• Walk leaf elements of tree 
Tokens
• Advances (cached) lexer for each leaf element
• Check current lexer token has same offset as leaf
element
• If not, create leaf element and insert into tree
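The steps above could look roughly like this in Python. The `Leaf` and `Node` classes and the choice of where a re-inserted token is attached (directly before the leaf, under the same parent) are assumptions for this sketch; a real inserter has rules about which node should own whitespace and comments.

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    type: str
    offset: int
    text: str


@dataclass
class Node:
    type: str
    children: list = field(default_factory=list)


def insert_missing_tokens(root, all_tokens):
    """all_tokens: every Leaf the unfiltered lexer produced, in offset order."""
    index = 0

    def visit(node):
        nonlocal index
        i = 0
        while i < len(node.children):
            child = node.children[i]
            if isinstance(child, Node):
                visit(child)
            else:
                # Lexer tokens before this leaf were filtered out: re-insert them.
                while all_tokens[index].offset != child.offset:
                    node.children.insert(i, all_tokens[index])
                    index += 1
                    i += 1
                index += 1  # step past the token that matches this leaf
            i += 1

    visit(root)
    root.children.extend(all_tokens[index:])  # trailing whitespace/comments
    return root


def text_of(node):
    """Concatenate leaf text; equals the source once nothing is missing."""
    if isinstance(node, Leaf):
        return node.text
    return "".join(text_of(child) for child in node.children)
```

A handy sanity check is that `text_of(root)` reproduces the original file text after insertion.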

Problem #2
What about significant whitespace?

How do we parse this?
There are no end of scope markers!
And we’ve filtered out the whitespace!
let ArraySample() =
    let numLetters = 26
    let results = Array.create numLetters 0
    let data = "The quick brown fox"
    for i = 0 to data.Length - 1 do
        let c = data.Chars(i)
        let c = Char.ToUpper(c)
        if c >= 'A' && c <= 'Z' then
            let i = Char.code c - Char.code 'A'
            results.[i] <- results.[i] + 1
    printf "done!\n"

Insert zero-width tokens
• Another lexer decorator
• Keeps track of whitespace before it’s filtered
• Inserts “invisible” tokens into token stream 
indicating indent/outdent or block start/end 
Possibly also token to indicate invalid indentation
• Token is zero-width. Doesn’t affect parse tree
• Parser can match these invisible tokens in grammar
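A simplified Python sketch of such a decorator, assuming tokens arrive as (type, text) pairs and using hypothetical `BLOCK_START`/`BLOCK_END` token names. It measures indentation by character count and does not report invalid indentation, which a real implementation would need to handle.

```python
def with_block_tokens(tokens):
    """tokens: (type, text) pairs including NEW_LINE and WHITESPACE."""
    indents = [0]  # stack of currently open indentation widths
    at_line_start = True
    width = 0
    out = []
    for type_, text in tokens:
        if type_ == "NEW_LINE":
            at_line_start, width = True, 0
        elif type_ == "WHITESPACE":
            if at_line_start:
                width = len(text)  # remember indent before it is filtered
        else:
            if at_line_start:
                if width > indents[-1]:
                    indents.append(width)
                    out.append(("BLOCK_START", ""))  # zero-width token
                while width < indents[-1]:
                    indents.pop()
                    out.append(("BLOCK_END", ""))
                at_line_start = False
            out.append((type_, text))
    # Close any blocks still open at end of file.
    out.extend(("BLOCK_END", "") for _ in indents[1:])
    return out
```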

Lexer flexibility
It’s just nice to say

Altering tokens
• F# example: 2. and [2..0] ambiguous
• Original lexer matches 2. as FLOAT  
and 2.. as INT_DOT_DOT
• Another lexer decorator 
Augment generated rules with custom code
• Decorator recognises INT_DOT_DOT  
Splits into two tokens for parser
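As a sketch, the splitting decorator might look like this in Python. The (type, offset, text) token shape and the `INT` and `DOT_DOT` output names are assumptions; only `INT_DOT_DOT` comes from the slide.

```python
def split_int_dot_dot(tokens):
    """tokens: (type, offset, text) triples from the underlying lexer."""
    for type_, offset, text in tokens:
        if type_ == "INT_DOT_DOT":
            digits = text[:-2]  # everything before the trailing ".."
            yield ("INT", offset, digits)
            yield ("DOT_DOT", offset + len(digits), "..")
        else:
            yield (type_, offset, text)
```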

When regexes aren’t enough
• ShaderLab nested comments
• Not possible to match with regex 
Don’t even try
• Rule to match start of comment - /* 
Finish lexing by hand, counting start and end comment chars 
Ignore START_COMMENT and return different token - COMMENT
• It doesn’t have to be completely machine generated
/* This /* is */ valid */
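The hand written part can be as small as a depth counter. A Python sketch, assuming the generated rule has already matched the opening /* and that an unterminated comment simply runs to the end of the file:

```python
def lex_nested_comment(text, start):
    """Called once a rule has matched "/*" at start; returns the end offset."""
    assert text.startswith("/*", start)
    depth = 0
    pos = start
    while pos < len(text):
        if text.startswith("/*", pos):
            depth += 1
            pos += 2
        elif text.startswith("*/", pos):
            depth -= 1
            pos += 2
            if depth == 0:
                return pos  # matched the outermost comment end
        else:
            pos += 1
    return len(text)  # unterminated comment: consume to end of file
```

The caller would then return a single COMMENT token spanning `start` to the returned offset.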

Problem #3
Pre-processor tokens

Pre-processor tokens
• Pre-processor tokens can appear anywhere
• How do you add them to the grammar/parser?
• ShaderLab has CGPROGRAM and CGINCLUDE which are essentially pre-processor tokens
• (Also nested language - Cg)

Parsing pre-processor tokens
• Two pass parsing
• First pass parses pre-processor tokens
• Filtering lexer strips pre-processor tokens
• Parse normally
• Parsed pre-processor tree nodes inserted as missing
tokens

Parsing pre-processor tokens
[Diagram: pipeline of Lexer, two Filtering lexers, Pre-processor parser, Parser and Missing tokens inserter (including pre-processor tokens), producing a concrete parse tree]
Problem #4
IDEs impose constraints

IDE Requirements, Part 2
• Error highlighting 
The code is broken every time you type
• Incremental lexing + parsing 
Performance
• Version tolerance 
E.g. multiple versions of C#
• Nested/composable languages

Problem #5
Error handling

Error handling is more of an art than a science

What happens when there’s an error?
• The parser adds an error element into the tree
• Error element spans whatever has been parsed so far 
Might just be unexpected token, or incorrect element construct
• Highlighting the error in the editor is trivial 
Inspection simply looks for error element, adds highlight

How do we find an error?
• Error start is obvious
mismatched rule, unexpected token
• Where does the error stop?
Off by one token could affect rest of file
• IDE must try to recover
How?

Error recovery
• Panic mode
Eat tokens until it finds a “follows” token
• Token insertion/removal/substitution
• Error rules in grammar

Panic mode
Shader "MyShader" {
  Properties {
    _RealProperty1("Real1", Color) = (1,1,1,1)
    _PropName SyntaxErrorPanicMode = (1,1,1,1)
    _Recovered("Real2", Color) = (1, 1, 1, 1)
  }
}
  Properties {
    _RealProperty1("Real1", Color) = (1,1,1,1)
    _PropName SyntaxError _AttemptedRecovery = (1,1,1,1)
    _Recovered("Real2", Color) = (1, 1, 1, 1)
  }
}
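A rough Python sketch of panic mode, assuming a cut-down property rule (IDENTIFIER LPAREN STRING_LITERAL RPAREN) and a hypothetical "follows" set. It works on token types only and records each element as (kind, start index, end index). Because IDENTIFIER is in the follows set, a recovery attempt can land on a token that is itself part of the error, which then produces a second error element before the parser syncs up.

```python
FOLLOWS = {"IDENTIFIER", "RBRACE", "EOF"}  # tokens that may follow a property
PROPERTY = ("IDENTIFIER", "LPAREN", "STRING_LITERAL", "RPAREN")


def parse_properties(types):
    """types: filtered token types after the opening LBRACE, ending in "EOF"."""
    pos, elements = 0, []
    while types[pos] not in ("RBRACE", "EOF"):
        start = pos
        for expected in PROPERTY:
            if types[pos] != expected:
                break
            pos += 1
        else:
            elements.append(("property", start, pos))
            continue
        # Panic mode: make sure we consume something, then eat tokens
        # until the current token is one that can follow a property.
        if pos == start:
            pos += 1
        while types[pos] not in FOLLOWS:
            pos += 1
        elements.append(("error", start, pos))
    return elements
```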

Token insertion
• Expected RPAREN got EQUALS
Assume RPAREN missing (insert it), EQUALS matches, continue
  Properties {
    _RealProperty1("Real1", Color = (1,1,1,1)
  }
}

Token removal
• Token insertion fails
Inserting EQUALS doesn’t sync back up
• Expected EQUALS got extra RPAREN
Skip RPAREN (remove it), EQUALS matches, continue
  Properties {
    _RealProperty1("Real1", Color)) = (1,1,1,1)
  }
}

Error production rules
• Create a rule that anticipates an error
• E.g. consume any tokens that shouldn’t be there
emptyBlock:
    LBRACE
    errorElementWithoutRBrace*
    RBRACE
;

Problem #6
Incremental lexing and parsing

What’s the problem?
• Don’t parse entire file on every change
• Only reparse smallest subtree that encloses change
Block nodes (method bodies, classes, etc.), not if, for, etc.
• Avoid re-lexing the entire file, too

Incremental lexing
• Requires a cache of the original token stream 
Token type, offsets and state of lexer (int)
• Copy cached tokens up to change position
• Restart lexer at change position with known state from
cache
• Lex until we can match tail of cached tokens
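A simplified Python sketch of these steps. It assumes a stateless stand-in lexer (`lex_from`, three made-up token types), so the cached lexer state from the slide is not modelled; with a stateful lexer the restart would also restore the state stored with the cached token. Tokens are (type, start, end) triples.

```python
import re

TOKEN = re.compile(r"(?P<WORD>\w+)|(?P<WHITESPACE>\s+)|(?P<OTHER>.)", re.S)


def lex_from(text, pos):
    """Yield (type, start, end) tokens from pos; stand-in for a generated lexer."""
    while pos < len(text):
        m = TOKEN.match(text, pos)
        yield (m.lastgroup, pos, m.end())
        pos = m.end()


def relex(cached, new_text, change_start, old_len, new_len):
    delta = new_len - old_len
    old_change_end = change_start + old_len
    # 1. Copy cached tokens that end before the change.
    result = [t for t in cached if t[2] < change_start]
    restart = result[-1][2] if result else 0
    # Cached tokens after the change, keyed by their shifted start offset.
    tail = {start + delta: i
            for i, (_, start, _) in enumerate(cached) if start > old_change_end}
    # 2. Restart the lexer, 3. lex until we line up with the cached tail.
    relexed = 0
    for token in lex_from(new_text, restart):
        i = tail.get(token[1])
        if i is not None:
            result.extend((t, s + delta, e + delta) for t, s, e in cached[i:])
            break
        result.append(token)
        relexed += 1
    return result, relexed
```

In this sketch, matching the tail on start offset alone is enough because the lexer has no state and the text after the change is unchanged.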

Incremental parsing
• Walk up syntax tree, find nearest element that can reparse and that encompasses change
E.g. method/class body
• Find start of block
E.g. opening LBRACE ‘{’
• Use updated cached lexer to find end of block
E.g. closing RBRACE ‘}’
• Parse block, add new element into tree
Uses custom entry point into parser

Problem #7
Composable languages

Three types
• Injected languages 
E.g. self-contained islands in a string literal (regex)
• Inherited languages 
E.g. TypeScript is a superset of JavaScript
• Nested languages 
E.g. JavaScript/CSS nested inside HTML. Razor and C#

Injected languages
• Build a parse tree for the contents of another node 
E.g. ShaderLab CG_PROGRAM, regular expressions, …
• Provides syntax highlighting, code completion, etc.
• Attaches a new parse tree to the node of another tree
• Changes to injected tree persisted to string and pushed
as change to the owning tree
• Changes to owning tree cause full reparse of injected
language

Inherited languages
• E.g. TypeScript is a superset of JavaScript
• TypeScriptParser derives from JavaScriptParser
Share a lexer
• Custom hand rolled parsers
Recursive descent
• Easier to inherit and override key methods
Gang of Four Template Method pattern
• Also XamlParser, MSBuildParser, WebConfigParser
Custom XML parsers

Nested languages
• E.g. .aspx, .cshtml - HTML superset, with C# “islands”
• ReSharper parses .aspx/.cshtml file
Builds parse tree for ASPX/Razor syntax
• HTML superset requires lexer superset
• HtmlCompoundLexer lexes “outer” language’s tokens 
When encounters HTML, switches to standard HTML lexer
• How to handle C# islands?

Secondary documents
• ASPX/Razor - C# islands
• Create secondary in-memory C# file 
Mirrors what gets generated when .aspx file is compiled
• Maps C# islands in .aspx to in-memory C# file
• Inspections, code completion, etc. work through the
mapping

How do you parse a file?

Links
https://github.com/JetBrains/resharper-unity
Generating Fast, Error Recovering Parsers
http://www.dtic.mil/dtic/tr/fulltext/u2/a196581.pdf
Effective and Comfortable Error Recovery in Recursive Descent Parsers
http://www.cocolab.com/products/cocktail/doc.pdf/ell.pdf
The Definitive ANTLR 4 Reference - Terence Parr
