Monadic parser combinators in C#
Speaker: Alexey Golub @Tyrrrz
name: Alexey Golub
primary_occupation: Open Source Developer
pays_the_bills:
  position: Senior Software Developer
  company: Svitla Systems
tech_stack:
  - C#
  - .NET Core
  - Azure/AWS
links:
  - https://github.com/tyrrrz
  - https://twitter.com/tyrrrz
  - https://tyrrrz.me
Agenda
• What is a parser and what does it do?
• Formal theory of language and grammar
• Structural representation of context-free grammars
• Different ways to build a parser
• The concept of “parser combinators”
• Live-coding session (writing a JSON parser)
What is a parser?
What we see:
“123 456,93”
What we understand:
• 123 and 456: numeric literals
• space: thousands separator
• comma: decimal separator
• 93: numeric literal
What the computer sees:
byte[10] { 49, 50, 51, 32, 52, 53, 54, 44, 57, 51 }
What we want the computer to understand:
new SyntacticComponents[] {
    new NumericLiteral(123),
    new ThousandsSeparator(),
    new NumericLiteral(456),
    new DecimalSeparator(),
    new NumericLiteral(93)
}
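A minimal sketch of that transformation could look like the following. The type names follow the slide, but the `SyntacticComponent` base record and the `NumberTokenizer` class are assumptions made for illustration, as is the choice of a space for the thousands separator and a comma for the decimal separator.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical component types, named after the slide.
public abstract record SyntacticComponent;
public record NumericLiteral(int Value) : SyntacticComponent;
public record ThousandsSeparator : SyntacticComponent;
public record DecimalSeparator : SyntacticComponent;

public static class NumberTokenizer
{
    private static bool IsAsciiDigit(char c) => c >= '0' && c <= '9';

    // Splits text such as "123 456,93" into syntactic components.
    public static List<SyntacticComponent> Tokenize(string input)
    {
        var result = new List<SyntacticComponent>();
        var i = 0;
        while (i < input.Length)
        {
            var c = input[i];
            if (IsAsciiDigit(c))
            {
                // Consume the whole run of digits as one numeric literal.
                var start = i;
                while (i < input.Length && IsAsciiDigit(input[i])) i++;
                result.Add(new NumericLiteral(int.Parse(input.Substring(start, i - start))));
            }
            else if (c == ' ')
            {
                result.Add(new ThousandsSeparator());
                i++;
            }
            else if (c == ',')
            {
                result.Add(new DecimalSeparator());
                i++;
            }
            else
            {
                throw new FormatException($"Unexpected character '{c}' at position {i}");
            }
        }
        return result;
    }
}
```

Calling `NumberTokenizer.Tokenize("123 456,93")` yields five components in the same order as on the slide.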
What does a parser do?
Input:
• “<foo><bar/></foo>”
• “<foo></bar>”
• “hello world”
Parser: grammar + context
Output for invalid input (rejected):
• Unexpected token “</bar>” expected “</foo>”
• Unexpected token “hello world”
Output for valid input (domain objects):
new XElement("foo",
    new XElement("bar"))
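The same three inputs can be tried against a real parser. The sketch below uses `XElement.Parse` from `System.Xml.Linq`, which throws `XmlException` on malformed input; the `XmlParsingDemo` class and its `Describe` method are hypothetical names, and the exact error messages produced by .NET differ from the wording on the slide.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class XmlParsingDemo
{
    // Returns a short description of what the parser did with the input.
    public static string Describe(string input)
    {
        try
        {
            // Valid input becomes a tree of domain objects.
            XElement root = XElement.Parse(input);
            return $"Parsed: root element <{root.Name}> with {root.Elements().Count()} child element(s)";
        }
        catch (XmlException ex)
        {
            // Invalid input is rejected with an error message.
            return $"Rejected: {ex.Message}";
        }
    }
}
```

`Describe("<foo><bar/></foo>")` reports a parsed tree, while `Describe("<foo></bar>")` and `Describe("hello world")` both report a rejection.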
What are parsers used for?
• Data deserialization (JSON, XML, YAML)
• Static code analysis (ReSharper, TSLint)
• Syntax highlighting (VS Code, Highlight.js)
• Compilers, transpilers, interpreters (Roslyn, Markdig, Babel, SQL)
• Template engines (Razor, Liquid, Scriban)
• Natural language processing (Spellchecking, Translation)
Formal language theory
Language
• Alphabet: set of allowed characters
• Words: set of valid combinations of characters or other words
• Grammar: set of rules that define how words are generated
Formal grammar
Regular grammar
• A → a, where A is a non-terminal and a is a terminal
• A → aB, where A and B are non-terminals and a is a terminal
Context-free grammar
• A → α, where A is a non-terminal and α is a string of terminals and/or non-terminals
Rule of thumb
Does the grammar contain recursive (nested) rules?
• Yes: context-free
• No: regular
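To make the distinction concrete, here is a small sketch with two recognizers. The first rule (a run of digits) has no nesting, so a regular expression is enough. The second (balanced parentheses, S → ( S ) S | ε) refers to itself, so the recognizer is written recursively. The `GrammarKinds` class and its method names are assumptions for illustration.

```csharp
using System.Text.RegularExpressions;

public static class GrammarKinds
{
    // Regular: one or more digits, no nesting involved.
    public static bool IsInteger(string s) => Regex.IsMatch(s, @"^[0-9]+\z");

    // Context-free: balanced parentheses, S → ( S ) S | ε.
    public static bool IsBalanced(string s) => MatchS(s, 0) == s.Length;

    // Tries to match S starting at pos.
    // Returns the position after the match, or -1 if the input is malformed.
    private static int MatchS(string s, int pos)
    {
        // S → ε: nothing to open here, match the empty string.
        if (pos >= s.Length || s[pos] != '(') return pos;

        // S → ( S ) S: the rule calls itself for the nested part.
        var afterInner = MatchS(s, pos + 1);
        if (afterInner < 0 || afterInner >= s.Length || s[afterInner] != ')') return -1;
        return MatchS(s, afterInner + 1);
    }
}
```

`IsBalanced("(()())")` succeeds and `IsBalanced("(()")` fails; the nesting depth is unbounded, which is the part a regular grammar cannot describe.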
Syntax trees
• Context-free languages are structurally represented using syntax trees
• Syntax trees are used to make sense of the input text
[Diagram: a syntax tree with a root, one non-terminal node, and terminal (leaf) nodes]
Example AST produced by C-like code
while (a != 0)
{
    if (a > b)
    {
        a = a - b;
    }
    else
    {
        b = b - a;
    }
}
return a;
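One way to picture the resulting tree is to build it by hand. The node types below are hypothetical and far simpler than what a real compiler such as Roslyn uses; the sketch only shows how each construct in the snippet becomes a node with child nodes.

```csharp
using System.Collections.Generic;

// Hypothetical AST node types for illustration.
public abstract record Node;
public record Identifier(string Name) : Node;
public record Constant(int Value) : Node;
public record BinaryOp(string Op, Node Left, Node Right) : Node;
public record Assignment(Identifier Target, Node Value) : Node;
public record IfStatement(Node Condition, Node Then, Node Else) : Node;
public record WhileLoop(Node Condition, Node Body) : Node;
public record ReturnStatement(Node Value) : Node;
public record Block(IReadOnlyList<Node> Statements) : Node;

public static class ExampleAst
{
    // Builds the tree for the snippet above by hand.
    public static Block Build()
    {
        var a = new Identifier("a");
        var b = new Identifier("b");

        return new Block(new Node[]
        {
            new WhileLoop(
                new BinaryOp("!=", a, new Constant(0)),          // a != 0
                new IfStatement(
                    new BinaryOp(">", a, b),                     // a > b
                    new Assignment(a, new BinaryOp("-", a, b)),  // a = a - b
                    new Assignment(b, new BinaryOp("-", b, a)))),// b = b - a
            new ReturnStatement(a)                               // return a
        });
    }
}
```

Identifiers and constants are the terminal nodes here; every other node type is a non-terminal that groups its children.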
Loop/stack-based manual parsers
• Loop through all characters in the input
• Maintain context on a stack
Pros:
• Performance
• Fine-tuning
• Debugging
Cons:
• Hard to write/read/maintain
• Code is not expressive
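As a small illustration of this style, the sketch below checks that brackets are properly nested: it loops over every character and keeps the currently open brackets on a stack. The `BracketChecker` name is an assumption, and a real hand-written parser would track far more context than this.

```csharp
using System.Collections.Generic;

public static class BracketChecker
{
    // Loops through all characters and maintains the open brackets on a stack.
    public static bool IsWellFormed(string input)
    {
        var open = new Stack<char>();
        foreach (var c in input)
        {
            switch (c)
            {
                case '(':
                case '[':
                case '{':
                    // Entering a nested context.
                    open.Push(c);
                    break;
                case ')':
                case ']':
                case '}':
                    // Leaving a context: it must match the most recent opener.
                    if (open.Count == 0) return false;
                    var expected = open.Pop() switch { '(' => ')', '[' => ']', _ => '}' };
                    if (c != expected) return false;
                    break;
            }
        }
        // Anything still on the stack was never closed.
        return open.Count == 0;
    }
}
```

Even for a rule this simple, the grammar is implicit in the control flow rather than stated anywhere, which is the readability cost listed above.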
Parser generators
• Define grammar in a specialized language
• Generate consuming code in one of the supported languages
Pros:
• Expressive
• Language-agnostic
Cons:
• Overhead of an extra language
• Can’t leverage the power of C# to write grammar
Parser combinators
• Define grammar using higher-order functions
• Build complex parsers by combining simpler ones
Pros:
• Expressive
• Easy to write/read/maintain
• Everything is in C#
Cons:
• Performance
• Debugging
Parsers vs combinators
Parser<T>:
(success, result, length) = f(input, offset=0)
Examples: Char('a'), String("foo"), Digit
Combinator<T>:
Parser<T> = f(parser1, parser2)
Examples: Or(p1, p2), Many(p), DelimitedBy(p1, p2)
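The two signatures can be sketched directly in C#. This is a simplified model written for illustration, not the API of Sprache or any other library: `Parser<T>` is a delegate that mirrors the slide's (success, result, length) shape, and `Char`, `Or`, and `Many` are hand-rolled versions of the examples named above.

```csharp
using System.Collections.Generic;

// A parser maps (input, offset) to (success, result, length).
public delegate (bool Success, T Result, int Length) Parser<T>(string input, int offset);

public static class Parsers
{
    // Primitive parser: matches exactly one given character.
    public static Parser<char> Char(char expected) => (input, offset) =>
        offset < input.Length && input[offset] == expected
            ? (true, expected, 1)
            : (false, default(char), 0);

    // Combinator: tries the first parser and falls back to the second.
    public static Parser<T> Or<T>(Parser<T> first, Parser<T> second) => (input, offset) =>
    {
        var r = first(input, offset);
        return r.Success ? r : second(input, offset);
    };

    // Combinator: applies a parser zero or more times and collects the results.
    public static Parser<List<T>> Many<T>(Parser<T> parser) => (input, offset) =>
    {
        var items = new List<T>();
        var length = 0;
        while (true)
        {
            var r = parser(input, offset + length);
            if (!r.Success || r.Length == 0) break; // stop on failure or no progress
            items.Add(r.Result);
            length += r.Length;
        }
        return (true, items, length);
    };
}
```

For example, `Parsers.Many(Parsers.Or(Parsers.Char('a'), Parsers.Char('b')))` applied to "abba!" at offset 0 succeeds with four characters consumed. Note that a combinator never touches the input itself; it only decides how to call the parsers it was given.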
Parser combinators illustrated
Input: 10 + 5
Parser:
Number: -> “10”
    AtLeastOne(Digit) -> ‘1’, ‘0’
THEN
Sign: -> “ + ”
    Many(WhiteSpace) -> “ ”
    Or(‘+’, ‘-’, ‘*’, ‘/’) -> ‘+’
    Many(WhiteSpace) -> “ ”
THEN
Number: -> “5”
    AtLeastOne(Digit) -> ‘5’
Result: PlusOperator with operands Number (10) and Number (5)
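The "THEN" steps are where the monadic part comes in: each parser runs where the previous one stopped, and the results are combined at the end. A self-contained sketch of this follows, using C# query syntax via hand-written `Select` and `SelectMany` methods. All names here (`P<T>`, `Mini`, `Arithmetic`) are hypothetical, and the delegate returns the next position rather than a length to keep chaining short. Libraries such as Sprache offer a similar query-syntax style, but this is not their implementation.

```csharp
using System;

// Returns success, the parsed value, and the position after the match.
public delegate (bool Success, T Value, int Next) P<T>(string input, int pos);

public static class Mini
{
    // Matches one character that satisfies the predicate.
    public static P<char> Satisfy(Func<char, bool> pred) => (s, i) =>
        i < s.Length && pred(s[i]) ? (true, s[i], i + 1) : (false, default(char), i);

    // Zero or more repetitions, returned as a string.
    public static P<string> Many(this P<char> p) => (s, i) =>
    {
        var start = i;
        while (true)
        {
            var r = p(s, i);
            if (!r.Success) break;
            i = r.Next;
        }
        return (true, s.Substring(start, i - start), i);
    };

    // One or more repetitions.
    public static P<string> AtLeastOne(this P<char> p) => (s, i) =>
    {
        var r = p.Many()(s, i);
        if (r.Value.Length == 0) return (false, r.Value, i);
        return r;
    };

    // Maps the result of a successful parse.
    public static P<U> Select<T, U>(this P<T> p, Func<T, U> f) => (s, i) =>
    {
        var r = p(s, i);
        return r.Success ? (true, f(r.Value), r.Next) : (false, default(U), i);
    };

    // Monadic bind ("THEN"): run p, then run the next parser from where p stopped.
    public static P<V> SelectMany<T, U, V>(
        this P<T> p, Func<T, P<U>> bind, Func<T, U, V> project) => (s, i) =>
    {
        var a = p(s, i);
        if (!a.Success) return (false, default(V), i);
        var b = bind(a.Value)(s, a.Next);
        if (!b.Success) return (false, default(V), i);
        return (true, project(a.Value, b.Value), b.Next);
    };
}

public static class Arithmetic
{
    static readonly P<string> WhiteSpace = Mini.Satisfy(char.IsWhiteSpace).Many();

    static readonly P<int> Number =
        Mini.Satisfy(c => c >= '0' && c <= '9').AtLeastOne().Select(text => int.Parse(text));

    static readonly P<char> Sign =
        from before in WhiteSpace
        from op in Mini.Satisfy(c => "+-*/".IndexOf(c) >= 0)
        from after in WhiteSpace
        select op;

    // Number THEN Sign THEN Number, as in the illustration above.
    public static readonly P<(int Left, char Op, int Right)> Expression =
        from left in Number
        from op in Sign
        from right in Number
        select (Left: left, Op: op, Right: right);
}
```

Calling `Arithmetic.Expression("10 + 5", 0)` succeeds with the value (10, '+', 5); turning that tuple into a `PlusOperator` node would be one more `Select` step.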
Live-coding time
Let’s develop a basic JSON parser using Sprache in C#
Links
• JSON parser from earlier – https://github.com/Tyrrrz/DotNetFest2019
• Sprache – https://github.com/sprache/Sprache
• Parsing in C# by Federico Tomassetti – https://tomassetti.me/parsing-in-csharp
• Formal grammar on Wikipedia – https://en.wikipedia.org/wiki/Formal_grammar
Other .NET parser-combinator libraries:
Superpower (C#), Pidgin (C#), FParsec (F#)
Thank you!

.NET Fest 2019. Alexey Golub. Monadic parser combinators in C# (a simple way to write parsers for complex languages)
