How to Parse a File (DDD North 2017)

Yes, we're going to look at file parsing. Sounds a bit boring, right? Wrong.

In this talk, just for fun, we'll find out how to parse a file. We'll look at simple, hand-crafted parsers. We'll finally figure out just how lex and yacc work. And we'll pick apart structured parsers that build abstract syntax trees as you type - ReSharper style. How is an IDE's parser different to a compiler's? How do you handle sensible error recovery? What about significant whitespace?

Everything you always wanted to know about parsing a file, but were too afraid to ask.

Published in: Technology

How to Parse a File (DDD North 2017)

  1. 1. Matt Ellis @citizenmatt How to parse a file
  2. 2. DON’T
  3. 3. @citizenmatt Why would we write a parser? • Speed, efficiency • Reduce dependencies • Custom or simple formats • Things that aren’t files - DSLs
 Command line options, HTTP headers, stdout, natural language commands
 E.g. YouTrack queries • When we’re just as interested in the structure of a file
 as its contents
  4. 4. Matt Ellis Developer advocate
 JetBrains
 @citizenmatt
  5. 5. @citizenmatt
  6. 6. @citizenmatt JetBrains IDE architecture (kinda) (diagram layers: PSI, Features, Project Model, Base Platform)
  7. 7. @citizenmatt Unity and ShaderLab
  8. 8. @citizenmatt What are we trying to build?
  9. 9. @citizenmatt How to parse a file for an IDE
  10. 10. @citizenmatt Hand rolled parser
 var c = ReadChar();
 switch (c) {
   case 's':
     c = ReadChar();
     switch (c) {
       case 'h':
         // Parse rest of "Shader", then sub-elements, …
         // Create syntax tree node(s) …
         break;
       default:
         SyntaxError();
         break;
     }
     break;
   case 'p':
     // Parse rest of "Properties", then sub-elements, …
     // Create syntax tree node(s) …
     break;
 }
  11. 11. @citizenmatt Compiler pipeline Front end: Lexical analysis → Syntactic analysis → Semantic analysis. Back end: Code optimisation → Code generation
  12. 12. @citizenmatt IDE pipeline Lexical analysis Syntactic analysis Semantic analysis
  13. 13. @citizenmatt IDE pipeline Lexer → Parser → Program structure
  14. 14. @citizenmatt Lexers
  15. 15. @citizenmatt What is a lexer (aka scanner)? • Performs lexical analysis
 Lexical - relating to the words or vocabulary of a language • Converts a string into a stream of tokens
 Identifier, comment, string literal, braces, parentheses, whitespace, etc. • Tokens are lightweight - typically integer values
 (ReSharper uses singleton object instances) • Parser pattern matches over tokens
 Integer or object reference comparisons
  16. 16. @citizenmatt Lexer output
 // Colored vertex lighting
 Shader "MyShader"
 {
   // a single color property
   Properties {
     _Color ("Main Color", Color) = (1, .5,.5,1)
   }
   // define one subshader
   SubShader {
     // a single pass in our subshader
     Pass {
       Material {
         Diffuse [_Color]
       }
       Lighting On
     }
   }
 }
 0000: END_OF_LINE_COMMENT '// Colored vertex lighting'
 0026: NEW_LINE '\r\n'
 0028: SHADER_KEYWORD 'Shader'
 0034: WHITESPACE ' '
 0035: STRING_LITERAL '"MyShader"'
 0045: NEW_LINE '\r\n'
 0047: LBRACE '{'
 0048: NEW_LINE '\r\n'
 0050: WHITESPACE '  '
 0052: END_OF_LINE_COMMENT '// a single color property'
 0078: NEW_LINE '\r\n'
 0080: WHITESPACE '  '
 0082: PROPERTIES_KEYWORD 'Properties'
 0092: WHITESPACE ' '
 0093: LBRACE '{'
 0094: NEW_LINE '\r\n'
 0096: WHITESPACE '    '
 0100: IDENTIFIER '_Color'
 0106: WHITESPACE ' '
 0107: LPAREN '('
 0108: STRING_LITERAL '"Main Color"'
 0120: COMMA ','
 0121: WHITESPACE ' '
 0122: COLOR_KEYWORD 'Color'
 0127: RPAREN ')'
 0128: WHITESPACE ' '
 0129: EQUALS '='
 0130: WHITESPACE ' '
 0131: LPAREN '('
 …
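A token stream like the one above can be approximated with a few regular expressions. The following is a minimal sketch, not the ReSharper lexer: the `Token` tuple, the `TOKEN_SPEC` table and the `lex` function are names made up for illustration, and only a handful of the ShaderLab token kinds are covered. Note one difference from lex-style generators: Python's `re` alternation takes the first alternative that matches rather than the longest, so keyword rules are listed before the identifier rule.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    offset: int
    kind: str
    text: str


# Hypothetical rule table; order matters because the first matching rule wins.
TOKEN_SPEC = [
    ("END_OF_LINE_COMMENT", r"//[^\r\n]*"),
    ("NEW_LINE", r"\r\n|\n|\r"),
    ("WHITESPACE", r"[ \t]+"),
    ("SHADER_KEYWORD", r"Shader\b"),
    ("PROPERTIES_KEYWORD", r"Properties\b"),
    ("STRING_LITERAL", r'"[^"\r\n]*"'),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("LBRACE", r"\{"),
    ("RBRACE", r"\}"),
    ("BAD_CHARACTER", r"."),  # catch-all so every character ends up in some token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def lex(text: str) -> Iterator[Token]:
    """Yield one token per match; whitespace and comments are kept, as in the slide."""
    for m in MASTER.finditer(text):
        yield Token(m.start(), m.lastgroup, m.group())


for token in lex('Shader "MyShader"\r\n{'):
    print(f"{token.offset:04d}: {token.kind} {token.text!r}")
```

Because nothing is thrown away, concatenating the token texts gives back the original string, which matters later when the IDE needs whitespace and comments in the tree.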
  17. 17. @citizenmatt Lexers are a solved problem Use a lexer generator
 lex (1975), flex, CsLex, FsLex, JFLex, etc.
  18. 18. @citizenmatt Anatomy of a lexer input file
 User code (e.g. using directives)
 %%
 directives
   set up namespaces, class names, interfaces
   declare regex macros
   declare states
 %%
 rules and actions
   <state> rule { action }
  19. 19. @citizenmatt ShaderLab lexer Demo
  20. 20. @citizenmatt How does it work? • Lexer generates source code • Rules (regexes) converted into single Finite State Machine
 All regexes combined, matched at same time • Encoded in state transition tables • Lookup based on state and input char • Very fast • Not very maintainable
 Seriously
  21. 21. @citizenmatt a(b|c)d*e+ Pete Jinks - http://www.cs.man.ac.uk/~pjj/cs211/ho/node6.html
  22. 22. @citizenmatt Rule: a(b|c)d*e+
 State  ‘a’   ‘b’   ‘c’   ‘d’   ‘e’   other
 0      m(1)  E     E     E     E     E
 1      E     m(2)  m(2)  E     E     E
 2      E     E     E     m(2)  m(3)  E
 3      a     a     a     a     m(3)  a
 m(x) - match, move to state x; a - accept; E - error
 Pete Jinks - http://www.cs.man.ac.uk/~pjj/cs211/ho/node6.html
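The table above can be executed directly. Here is a minimal sketch that encodes it as a dictionary and walks it one character at a time. The names `TRANSITIONS`, `ACCEPTING` and `match_prefix` are invented for this example; a generated lexer would normally use packed integer arrays rather than dictionaries.

```python
from typing import Optional

# One entry per state; a missing character means "E" (error) in states 0-2
# and "a" (accept what has been matched so far) in state 3.
TRANSITIONS = {
    0: {"a": 1},
    1: {"b": 2, "c": 2},
    2: {"d": 2, "e": 3},
    3: {"e": 3},
}
ACCEPTING = {3}


def match_prefix(text: str, start: int = 0) -> Optional[int]:
    """Return the end offset of the match for a(b|c)d*e+ at start, or None."""
    state = 0
    pos = start
    while pos < len(text):
        next_state = TRANSITIONS[state].get(text[pos])
        if next_state is None:
            break  # no transition: stop and see whether we are in an accepting state
        state = next_state
        pos += 1
    return pos if state in ACCEPTING else None


print(match_prefix("abddeex"))  # 6: "abddee" matched, "x" left for the next token
print(match_prefix("abd"))      # None: ran out of input before any "e"
```

The lookup is a single table access per input character, which is where the speed comes from.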
  35. 35. @citizenmatt It gets better
  36. 36. Rules: a(b|c)d*e+ and [0-9]+
 (Diagram: the state machine gains state 4, entered from state 0 on [0-9] and looping on [0-9])
 State  ‘a’   ‘b’   ‘c’   ‘d’   ‘e’   [0-9]  other
 0      m(1)  E     E     E     E     m(4)   E
 1      E     m(2)  m(2)  E     E     E      E
 2      E     E     E     m(2)  m(3)  E      E
 3      a     a     a     a     m(3)  a      a
 4      a     a     a     a     a     m(4)   a
  43. 43. @citizenmatt Parsing
  44. 44. @citizenmatt What is a parser? • Performs syntactic analysis
 Verifies and matches syntax of a file • Pattern matching on stream of tokens from lexer
 Can look at token offsets and text, too • Syntax is described by a grammar • Grammar is represented as a recursive hierarchy of rules
 Top level is the whole file, composing down to structures and tokens
  45. 45. @citizenmatt Example grammar
 shaderFile:
   SHADER_KEYWORD
   STRING_LITERAL
   LBRACE
   propertiesBlock?
   tagsBlock?
   …
   RBRACE
 ;
 propertiesBlock:
   PROPERTIES_KEYWORD
   LBRACE
   property*
   RBRACE
 ;
 tagsBlock:
   TAGS_KEYWORD
   LBRACE
   tag*
   RBRACE
 ;
 Matches input such as:
 Shader "MyShader"
 {
   Properties { … }
   Tags { … }
   …
 }
  46. 46. @citizenmatt Parsing is NOT a solved problem Well, it is, kinda. There are just lots of solutions
  47. 47. @citizenmatt Types of parsers • Top down/recursive descent
 Match the root of the tree, recursively split up into child elements • Bottom up/recursive ascent
 Start with matching the leaves of the tree, combine into larger constructs as you go
  48. 48. @citizenmatt Top down parser parseShaderLabFile()
 parseShaderCommand()
 match(SHADER_KEYWORD)
 parseShaderValue()
 parseShaderValueName()
 match(STRING_LITERAL)
 match(LBRACE)
 if (tokenType == PROPERTIES_KEYWORD)
 parsePropertiesCommand()
 …
 match(RBRACE)
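The call sequence above maps naturally onto one method per grammar rule. The following is a minimal recursive descent sketch for the `shaderFile` grammar shown earlier, assuming whitespace has already been filtered out. The `Parser` class, the tuple-shaped tree nodes, and the simplification that a block body is just a run of IDENTIFIER tokens are assumptions made for illustration, since the slides elide the `property` and `tag` rules.

```python
class ParseError(Exception):
    pass


class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens)  # (kind, text) pairs, whitespace already filtered
        self.pos = 0

    @property
    def token_type(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

    def match(self, kind):
        """Consume the current token if it has the expected kind, else fail."""
        if self.token_type != kind:
            raise ParseError(f"expected {kind}, got {self.token_type} at token {self.pos}")
        text = self.tokens[self.pos][1]
        self.pos += 1
        return text

    def parse_shader_file(self):
        self.match("SHADER_KEYWORD")
        name = self.match("STRING_LITERAL")
        self.match("LBRACE")
        children = []
        if self.token_type == "PROPERTIES_KEYWORD":  # propertiesBlock?
            children.append(self.parse_block("PROPERTIES_KEYWORD", "propertiesBlock"))
        if self.token_type == "TAGS_KEYWORD":        # tagsBlock?
            children.append(self.parse_block("TAGS_KEYWORD", "tagsBlock"))
        self.match("RBRACE")
        return ("shaderFile", name, children)

    def parse_block(self, keyword, node_name):
        self.match(keyword)
        self.match("LBRACE")
        items = []
        while self.token_type == "IDENTIFIER":  # stand-in for property* / tag*
            items.append(self.match("IDENTIFIER"))
        self.match("RBRACE")
        return (node_name, items)


tokens = [
    ("SHADER_KEYWORD", "Shader"), ("STRING_LITERAL", '"MyShader"'), ("LBRACE", "{"),
    ("PROPERTIES_KEYWORD", "Properties"), ("LBRACE", "{"),
    ("IDENTIFIER", "_Color"), ("RBRACE", "}"),
    ("RBRACE", "}"),
]
print(Parser(tokens).parse_shader_file())
```

Each optional rule becomes an `if` on the current token type and each repetition becomes a `while`, which is why this style is mechanical to write and easy to step through in a debugger.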
  49. 49. @citizenmatt Bottom up parser Shift/Reduce algorithm Match token
 Shift token onto stack (e.g. INTEGER, OP_PLUS, INTEGER)
 Reduce larger construct (e.g. INTEGER + INTEGER becomes EXPRESSION)
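The shift/reduce loop can be sketched in a few lines. This is a deliberately naive version that tries to reduce after every shift by comparing the top of the stack against each rule; real generators such as yacc precompute tables to decide when to shift and when to reduce. The `RULES` list and `shift_reduce` function are hypothetical names, and the two rules are just enough to handle chains of additions.

```python
# (right-hand side, left-hand side): what is on top of the stack, what replaces it
RULES = [
    (("INTEGER", "OP_PLUS", "INTEGER"), "EXPRESSION"),
    (("EXPRESSION", "OP_PLUS", "INTEGER"), "EXPRESSION"),
]


def shift_reduce(token_kinds):
    """Return the stack left over after shifting every token and reducing greedily."""
    stack = []
    for kind in token_kinds:
        stack.append(kind)  # shift
        reduced = True
        while reduced:
            reduced = False
            for rhs, lhs in RULES:
                if tuple(stack[-len(rhs):]) == rhs:
                    del stack[-len(rhs):]  # reduce: pop the matched symbols...
                    stack.append(lhs)      # ...and push the larger construct
                    reduced = True
                    break
    return stack


print(shift_reduce(["INTEGER", "OP_PLUS", "INTEGER", "OP_PLUS", "INTEGER"]))
```

A stack holding a single EXPRESSION means the whole input was recognised; anything else left on the stack is input that could not be combined.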
  50. 50. @citizenmatt Building a parser • Hand rolled
 Mechanical process to build. Easy to understand
 Usually top down/recursive descent
 Can use grammar to build syntax tree classes • Parser generators
 yacc/bison, ANTLR, etc.
 Usually bottom up. Can be hard to debug - table driven • ReSharper mostly uses top-down procedural parsers
 Generated and hand rolled
 Mainly historical. Easier to maintain, easier error recovery, etc.
  51. 51. @citizenmatt Parser combinators • Build a parser by combining other, simpler parsers • Monads!
 Think linq - similar idea, similar ease of use, similar cost
  52. 52. @citizenmatt FParsec for F#
 // pstring - parse a string
 // pfloat - parse a float
 // spaces1 - parse one or more whitespace chars
 
 let pforward = (pstring "fd" <|> pstring "forward") >>. spaces1 >>. pfloat |>> fun n -> Forward(int n)
 let pleft = (pstring "left" <|> pstring "lt") >>. spaces1 >>. pfloat |>> fun x -> Left(int -x)
 let pright = (pstring "right" <|> pstring "rt") >>. spaces1 >>. pfloat |>> fun x -> Right(int x)
 let pcommand = pforward <|> pleft <|> pright
 Phil Trelford - http://trelford.com/blog/post/FParsec.aspx
  53. 53. @citizenmatt Sprache for C#
 Parser<string> identifier =
   from leading in Parse.WhiteSpace.Many()
   from first in Parse.Letter.Once()
   from rest in Parse.LetterOrDigit.Many()
   from trailing in Parse.WhiteSpace.Many()
   select new string(first.Concat(rest).ToArray());
 var id = identifier.Parse(" abc123 ");
 Assert.AreEqual("abc123", id);
  54. 54. @citizenmatt Problem #1 Whitespace and comments
  55. 55. @citizenmatt We’d expect this to work:
 shaderBlock:
   SHADER_KEYWORD
   STRING_LITERAL
   LBRACE
   …
   RBRACE
 ;
 Shader "MyShader"
 {
   …
 }
  56. 56. @citizenmatt But this is the actual input…
 Shader\r\n
 ··········"MyShader"\r\n
 ·······\n
 /* Cool shader! */\n
 {···…········}\r\n
  57. 57. @citizenmatt Which lexes as…
 SHADER_KEYWORD
 NEW_LINE
 WHITESPACE
 STRING_LITERAL
 NEW_LINE
 WHITESPACE
 NEW_LINE
 COMMENT
 NEW_LINE
 LBRACE
 WHITESPACE
 …
 WHITESPACE
 RBRACE
 Input:
 Shader\r\n
 ··········"MyShader"\r\n
 ·······\n
 /* Cool shader! */\n
 {···…········}\r\n
  58. 58. @citizenmatt Which doesn’t match the grammar
 shaderBlock:
   SHADER_KEYWORD
   STRING_LITERAL
   LBRACE
   …
   RBRACE
 ;
 Token stream:
 SHADER_KEYWORD
 NEW_LINE
 WHITESPACE
 STRING_LITERAL
 NEW_LINE
 WHITESPACE
 NEW_LINE
 COMMENT
 NEW_LINE
 LBRACE
 WHITESPACE
 …
 WHITESPACE
 RBRACE
  59. 59. @citizenmatt Filtering lexers • Filter whitespace and comments from the stream of tokens
 ReSharper’s tokens have IsFiltered property • Decorator pattern
 Wrap original lexer, swallow filtered tokens
 Lexer → Filtering lexer → Parser → Program structure
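The decorator idea is small enough to show in full. This is a minimal sketch in which a generator wraps any iterable of tokens and swallows the filtered kinds. The `Token` tuple and the `FILTERED_KINDS` set are illustrative stand-ins for ReSharper's token types and its IsFiltered property, not its actual API.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    offset: int
    kind: str
    text: str


# Stand-in for an IsFiltered flag on each token type.
FILTERED_KINDS = {"WHITESPACE", "NEW_LINE", "COMMENT"}


def filtering_lexer(tokens: Iterable[Token]) -> Iterator[Token]:
    """Wrap another lexer's output and drop tokens the parser should never see."""
    for token in tokens:
        if token.kind not in FILTERED_KINDS:
            yield token


raw = [
    Token(0, "SHADER_KEYWORD", "Shader"),
    Token(6, "NEW_LINE", "\r\n"),
    Token(8, "WHITESPACE", "  "),
    Token(10, "STRING_LITERAL", '"MyShader"'),
    Token(20, "NEW_LINE", "\n"),
    Token(21, "COMMENT", "/* Cool shader! */"),
    Token(39, "NEW_LINE", "\n"),
    Token(40, "LBRACE", "{"),
    Token(41, "RBRACE", "}"),
]
print([t.kind for t in filtering_lexer(raw)])
```

The surviving tokens keep their original offsets, which is what later makes it possible to work out where the dropped tokens belong.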
  60. 60. @citizenmatt What are we building? Is it safe to lose the whitespace?
  61. 61. @citizenmatt IDE requirements, Part 1 • Code editor features
 Syntax highlighting, code folding, etc. • Syntax error highlighting • Inspections • Refactoring • Formatting • Etc.
  62. 62. @citizenmatt IDE requirements, Part 1 • Need to work with the contents and structure of a file • Contents give us semantic information • Structure allows us to report inspections, refactor, etc.
 Map the semantics back to the file • Need to represent the structure of the file • Syntax tree is obvious choice
 Inspections walk the tree, refactorings rewrite the tree
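  63. 63. @citizenmatt Abstract Syntax Trees (figure: small abstract syntax trees for arithmetic expressions such as 1 + 2 and 1 + 5)
  64. 64. @citizenmatt Concrete Parse Trees (figure: parse tree for an expression such as 2 + 3 that also keeps whitespace (WS), new line (NL) and comment tokens as leaves)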
  65. 65. @citizenmatt Side problem #1 No guidance for designing parse trees!
  66. 66. @citizenmatt Back to Filtering Lexers • If we filter tokens out, we have to add them back again • We need a Missing Tokens Inserter to add whitespace and comments back into parse tree
 Lexer → Filtering lexer → Parser → Missing tokens inserter → Concrete parse tree
  67. 67. @citizenmatt Missing Tokens Inserter • Walk leaf elements of tree
 Tokens • Advances (cached) lexer for each leaf element • Check current lexer token has same offset as leaf element • If not, create leaf element and insert into tree
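The walk described above can be sketched as follows. The `Leaf` and `Node` classes, and the choice to attach a missing token to whichever node owns the next leaf, are assumptions made for illustration; the slides do not say where ReSharper attaches re-inserted tokens.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Leaf:
    offset: int
    kind: str
    text: str


@dataclass
class Node:
    kind: str
    children: List[Union["Node", Leaf]] = field(default_factory=list)


def insert_missing_tokens(root: Node, cached_tokens: List[Leaf]) -> None:
    """Walk the leaves in order, re-inserting cached tokens the parser never saw."""
    pos = 0  # position in the cached, unfiltered token list

    def visit(node: Node) -> None:
        nonlocal pos
        rebuilt = []
        for child in node.children:
            if isinstance(child, Node):
                visit(child)
            else:
                # Cached tokens that start before this leaf were filtered out.
                while pos < len(cached_tokens) and cached_tokens[pos].offset < child.offset:
                    rebuilt.append(cached_tokens[pos])
                    pos += 1
                pos += 1  # skip the cached token that corresponds to this leaf
            rebuilt.append(child)
        node.children = rebuilt

    visit(root)
    root.children.extend(cached_tokens[pos:])  # trailing whitespace and comments


def text_of(node) -> str:
    if isinstance(node, Leaf):
        return node.text
    return "".join(text_of(child) for child in node.children)


cached = [
    Leaf(0, "SHADER_KEYWORD", "Shader"), Leaf(6, "WHITESPACE", " "),
    Leaf(7, "STRING_LITERAL", '"X"'), Leaf(10, "WHITESPACE", " "),
    Leaf(11, "LBRACE", "{"), Leaf(12, "RBRACE", "}"), Leaf(13, "WHITESPACE", " "),
]
tree = Node("shaderFile", [cached[0], cached[2], Node("block", [cached[4], cached[5]])])
insert_missing_tokens(tree, cached)
print(repr(text_of(tree)))
```

After the pass, concatenating the leaf texts reproduces the source exactly, which is the property an IDE needs before it can rewrite the tree for refactorings or formatting.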
  68. 68. @citizenmatt Problem #2 What about significant whitespace?
  69. 69. @citizenmatt How do we parse this? There are no end of scope markers!
 And we’ve filtered out the whitespace!
 let ArraySample() =
     let numLetters = 26
     let results = Array.create numLetters 0
     let data = "The quick brown fox"
     for i = 0 to data.Length - 1 do
         let c = data.Chars(i)
         let c = Char.ToUpper(c)
         if c >= 'A' && c <= 'Z' then
             let i = Char.code c - Char.code 'A'
             results.[i] <- results.[i] + 1
     printf "done!\n"
  70. 70. @citizenmatt Insert zero-width tokens • Another lexer decorator • Keeps track of whitespace before it’s filtered • Inserts “invisible” tokens into token stream
 indicating indent/outdent or block start/end
 Possibly also token to indicate invalid indentation • Token is zero-width. Doesn’t affect parse tree • Parser can match these invisible tokens in grammar
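Here is a minimal sketch of the zero-width token idea. To stay short it works on raw lines rather than decorating a real lexer, and it lumps the rest of each line into a single LINE token; the token names INDENT, DEDENT and BAD_INDENT and the function `indentation_tokens` are assumptions for this example. Only spaces are counted as indentation.

```python
def indentation_tokens(source: str):
    """Yield (offset, kind, text); INDENT/DEDENT/BAD_INDENT have empty text (zero width)."""
    levels = [0]  # stack of indentation widths currently open
    offset = 0
    for line in source.splitlines(keepends=True):
        stripped = line.lstrip(" ")
        if stripped.strip():  # blank lines do not open or close blocks
            width = len(line) - len(stripped)
            start = offset + width
            if width > levels[-1]:
                levels.append(width)
                yield (start, "INDENT", "")
            else:
                while width < levels[-1]:
                    levels.pop()
                    yield (start, "DEDENT", "")
                if width != levels[-1]:
                    yield (start, "BAD_INDENT", "")  # dedent to a width never opened
            yield (start, "LINE", stripped.rstrip("\r\n"))
        offset += len(line)
    for _ in levels[1:]:
        yield (offset, "DEDENT", "")  # close blocks still open at end of file


source = "let f() =\n    let x = 1\n    for i in xs do\n        g i\n    h x\n"
for token in indentation_tokens(source):
    print(token)
```

With INDENT and DEDENT in the stream, the grammar can treat them exactly like LBRACE and RBRACE, and because they have no text they add nothing to the file's contents.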
  71. 71. @citizenmatt Lexer flexibility It’s just nice to say
  72. 72. @citizenmatt Altering tokens • F# example: 2. and [2..0] ambiguous • Original lexer matches 2. as FLOAT 
 and 2.. as INT_DOT_DOT • Another lexer decorator
 Augment generated rules with custom code • Decorator recognises INT_DOT_DOT 
 Splits into two tokens for parser
  73. 73. @citizenmatt When regexes aren’t enough • ShaderLab nested comments
 /* This /* is */ valid */ • Not possible to match with regex
 Don’t even try • Rule to match start of comment - /*
 Finish lexing by hand, counting start and end comment chars
 Ignore START_COMMENT and return different token - COMMENT • It doesn’t have to be completely machine generated
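The hand-written part is just a depth counter. This is a minimal sketch of what the action attached to the `/*` rule might do; the function name `lex_nested_comment` is made up, and treating an unterminated comment as running to the end of the file is an assumption rather than something the slides state.

```python
def lex_nested_comment(text: str, start: int) -> int:
    """Given text[start:] beginning with '/*', return the end offset of the whole comment.

    Nested '/* ... */' pairs are counted; an unterminated comment runs to end of text.
    """
    assert text.startswith("/*", start)
    depth = 0
    pos = start
    while pos < len(text):
        if text.startswith("/*", pos):
            depth += 1
            pos += 2
        elif text.startswith("*/", pos):
            depth -= 1
            pos += 2
            if depth == 0:
                return pos  # matched the outermost closing marker
        else:
            pos += 1
    return len(text)


source = "/* This /* is */ valid */ Shader"
end = lex_nested_comment(source, 0)
print(repr(source[:end]))
```

The lexer would then emit one COMMENT token covering `source[:end]` and resume its generated rules at `end`.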
  74. 74. @citizenmatt Problem #3 Pre-processor tokens
  75. 75. @citizenmatt Pre-processor tokens • Pre-processor tokens can appear anywhere • How do you add them to the grammar/parser? • ShaderLab has CGPROGRAM and CGINCLUDE which are essentially pre-processor tokens • (Also nested language - Cg)
  76. 76. @citizenmatt Parsing pre-processor tokens • Two pass parsing • First pass parses pre-processor tokens • Filtering lexer strips pre-processor tokens • Parse normally • Parsed pre-processor tree nodes inserted as missing tokens
  77. 77. Parsing pre-processor tokens (diagram of the two pass pipeline: Lexer, Pre-processor parser, two Filtering lexers, Parser, Missing tokens inserter (including pre-processor tokens), Concrete parse tree)
  78. 78. @citizenmatt Problem #4 IDEs impose constraints
  79. 79. @citizenmatt IDE Requirements, Part 2 • Error highlighting
 The code is broken every time you type • Incremental lexing + parsing
 Performance • Version tolerance
 E.g. multiple versions of C# • Nested/composable languages
  80. 80. @citizenmatt Problem #5 Error handling
  81. 81. @citizenmatt Error handling
  82. 82. @citizenmatt Error handling is more of an art than a science
  83. 83. @citizenmatt What happens when there’s an error? • The parser adds an error element into the tree • Error element spans whatever has been parsed so far
 Might just be unexpected token, or incorrect element construct • Highlighting the error in the editor is trivial
 Inspection simply looks for error element, adds highlight
  84. 84. @citizenmatt How do we find an error? • Error start is obvious
 mismatched rule, unexpected token • Where does the error stop?
 Off by one token could affect rest of file • IDE must try to recover
 How?
  85. 85. @citizenmatt Error recovery • Panic mode
 Eat tokens until it finds a “follows” token • Token insertion/removal/substitution • Error rules in grammar
  86. 86. @citizenmatt Panic mode
 Shader "MyShader" {
   Properties {
     _RealProperty1("Real1", Color) = (1,1,1,1)
     _PropName SyntaxErrorPanicMode = (1,1,1,1)
     _Recovered("Real2", Color) = (1, 1, 1, 1)
   }
 }

 Shader "MyShader" {
   Properties {
     _RealProperty1("Real1", Color) = (1,1,1,1)
     _PropName SyntaxError _AttemptedRecovery = (1,1,1,1)
     _Recovered("Real2", Color) = (1, 1, 1, 1)
   }
 }
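Panic mode can be sketched over the property list in the first example. Two simplifications are assumed here and are not taken from the slides: the value `(1,1,1,1)` is treated as a single VALUE token, and VALUE is used as the synchronising token, so after an error the parser eats tokens up to and including the next VALUE. The names `PROPERTY` and `parse_properties` are hypothetical.

```python
# Expected token kinds for one property, e.g. _Color("Main Color", Color) = (1,1,1,1)
PROPERTY = [
    "IDENTIFIER", "LPAREN", "STRING_LITERAL", "COMMA",
    "IDENTIFIER", "RPAREN", "EQUALS", "VALUE",
]


def parse_properties(kinds):
    """Return ("property" | "error", start, end) nodes over a list of token kinds."""
    nodes = []
    pos = 0
    while pos < len(kinds):
        start = pos
        for expected in PROPERTY:
            if pos < len(kinds) and kinds[pos] == expected:
                pos += 1
            else:
                break
        else:
            nodes.append(("property", start, pos))  # every expected token matched
            continue
        # Panic mode: skip to the synchronising token, then consume it too.
        while pos < len(kinds) and kinds[pos] != "VALUE":
            pos += 1
        pos = min(pos + 1, len(kinds))
        nodes.append(("error", start, pos))  # error element spans what was skipped
    return nodes


good = PROPERTY
broken = ["IDENTIFIER", "IDENTIFIER", "EQUALS", "VALUE"]  # _PropName SyntaxErrorPanicMode = (...)
print(parse_properties(good + broken + good))
```

The error element swallows only the broken property, so the property after it still parses normally, which is the behaviour the first example is illustrating.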
  87. 87. @citizenmatt Token insertion • Expected RPAREN got EQUALS
 Assume RPAREN missing (insert it), EQUALS matches, continue
 Shader "MyShader" {
   Properties {
     _RealProperty1("Real1", Color = (1,1,1,1)
   }
 }
  88. 88. @citizenmatt Token removal • Token insertion fails
 Inserting EQUALS doesn’t sync back up • Expected EQUALS got extra RPAREN
 Skip RPAREN (remove it), EQUALS matches, continue
 Shader "MyShader" {
   Properties {
     _RealProperty1("Real1", Color)) = (1,1,1,1)
   }
 }
  89. 89. @citizenmatt Error production rules • Create a rule that anticipates an error • E.g. consume any tokens that shouldn’t be there
 emptyBlock:
   LBRACE
   errorElementWithoutRBrace*
   RBRACE
 ;
  90. 90. @citizenmatt Problem #6 Incremental lexing and parsing
  91. 91. @citizenmatt What’s the problem? • Don’t parse entire file on every change • Only reparse smallest subtree that encloses change
 Block nodes (method bodies, classes, etc. Not if, for, etc.) • Avoid re-lexing the entire file, too
  92. 92. @citizenmatt Incremental lexing • Requires a cache of the original token stream
 Token type, offsets and state of lexer (int) • Copy cached tokens up to change position • Restart lexer at change position with known state from cache • Lex until we can match tail of cached tokens
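The four steps above can be sketched against a toy lexer. This sketch assumes a stateless lexer, so the cached lexer state the slide mentions is not modelled, and the names `lex` and `relex` and the tuple token shape are invented for illustration. The edit is described by three offsets: the old text in `[change_start, old_end)` was replaced by the new text in `[change_start, new_end)`.

```python
import re

TOKEN_RE = re.compile(
    r"(?P<IDENTIFIER>[A-Za-z_]\w*)|(?P<NUMBER>\d+)|(?P<WHITESPACE>\s+)|(?P<OTHER>.)"
)


def lex(text, start=0):
    """Toy stateless lexer yielding (offset, kind, text) from start onwards."""
    pos = start
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        yield (m.start(), m.lastgroup, m.group())
        pos = m.end()


def relex(old_tokens, new_text, change_start, old_end, new_end):
    """Reuse cached tokens before and after the change, lexing only the middle."""
    delta = new_end - old_end
    # 1. Copy cached tokens that end strictly before the change; a token that
    #    touches the change position might merge with the inserted text.
    head = [t for t in old_tokens if t[0] + len(t[2]) < change_start]
    restart = head[-1][0] + len(head[-1][2]) if head else 0
    old_index = {t[0]: i for i, t in enumerate(old_tokens)}
    result = list(head)
    # 2. Restart the lexer just after the reused head.
    for token in lex(new_text, restart):
        i = old_index.get(token[0] - delta)
        # 3. Past the change, stop once a new token lines up with a cached one.
        if token[0] >= new_end and i is not None and old_tokens[i][1:] == token[1:]:
            result.extend((o + delta, kind, text) for o, kind, text in old_tokens[i:])
            return result
        result.append(token)
    return result


old_text = "foo = 12 + bar"
new_text = "foo = 345 + bar"  # "12" at offsets 6..8 replaced by "345"
cached = list(lex(old_text))
print(relex(cached, new_text, change_start=6, old_end=8, new_end=9))
```

Only the tokens around the edit are produced by the lexer; the tail is reused with its offsets shifted by the length difference. A lexer with states would also need to compare the cached state before declaring the streams back in sync.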
  93. 93. @citizenmatt Incremental parsing • Walk up syntax tree, find nearest element that can reparse and that encompasses change
 E.g. method/class body • Find start of block
 E.g. opening LBRACE ‘{‘ • Use updated cached lexer to find end of block
 E.g. closing RBRACE ‘}’ • Parse block, add new element into tree
 Uses custom entry point into parser
  94. 94. @citizenmatt Problem #7 Composable languages
  95. 95. @citizenmatt Three types • Injected languages
 E.g. self-contained islands in a string literal (regex) • Inherited languages
 E.g. TypeScript is a superset of JavaScript • Nested languages
 E.g. JavaScript/CSS nested inside HTML. Razor and C#
  96. 96. @citizenmatt Injected languages • Build a parse tree for the contents of another node
 E.g. ShaderLab CG_PROGRAM, regular expressions, … • Provides syntax highlighting, code completion, etc. • Attaches a new parse tree to the node of another tree • Changes to injected tree persisted to string and pushed as change to the owning tree • Changes to owning tree cause full reparse of injected language
  97. 97. @citizenmatt Inherited languages • E.g. TypeScript is a superset of JavaScript • TypeScriptParser derives from JavaScriptParser
 Share a lexer • Custom hand rolled parsers
 Recursive descent • Easier to inherit and override key methods
 Gang of Four Template pattern • Also XamlParser, MSBuildParser, WebConfigParser
 Custom XML parsers
  98. 98. @citizenmatt Nested languages • E.g. .aspx, .cshtml - HTML superset, with C# “islands” • ReSharper parses .aspx/.cshtml file
 Builds parse tree for ASPX/Razor syntax • HTML superset requires lexer superset • HtmlCompoundLexer lexes “outer” language’s tokens
 When encounters HTML, switches to standard HTML lexer • How to handle C# islands?
  99. 99. @citizenmatt Secondary documents • ASPX/Razor - C# islands • Create secondary in-memory C# file
 Mirrors what gets generated when .aspx file is compiled • Maps C# islands in .aspx to in-memory C# file • Inspections, code completion, etc. work through the mapping
  100. 100. @citizenmatt How do you parse a file?
  101. 101. @citizenmatt DON’T
  102. 102. @citizenmatt Links
 https://github.com/JetBrains/resharper-unity
 Generating Fast, Error Recovering Parsers - http://www.dtic.mil/dtic/tr/fulltext/u2/a196581.pdf
 Effective and Comfortable Error Recovery in Recursive Descent Parsers - http://www.cocolab.com/products/cocktail/doc.pdf/ell.pdf
 The Definitive ANTLR 4 Reference - Terence Parr
