Achieving Parsing
 Sanity in Erlang
       with Neotoma



         Sean Cribbs

       Web Consultant
    Ruby and Erlang...
Quick Review
context-free
 grammars
Chomsky et al,
 natural langs
lots of massaging
inherent ambiguity
if A then if B then C else D

if A then if B then C else D

if A then if B then C else D
focused on
generating
parsing expression
    grammars
top-down parsing
 language (70’s)
direct
representation of
parsing functions
Brian Ford 2002
focused on
recognizing
computer
languages
parsing
expressions
e1 e2
e1 / e2
e+
e*
&e
!e
e?
“string”
 [A-Z]
    .
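The operators listed above map naturally onto small functions. A minimal sketch of the two predicates and the any-character expression follows, using the parser-function convention that appears later in the deck (a fun taking `Input` and `Index` and returning `{fail, Reason}` or `{AST, Remaining, NewIndex}`). The module name, the function names, and the failure reasons are assumptions made for illustration, not Neotoma's own code.

```erlang
-module(peg_predicate_sketch).
-export([p_and/1, p_not/1, p_anything/0]).

%% &e : succeeds when P matches, but consumes no input.
p_and(P) ->
    fun(Input, Index) ->
        case P(Input, Index) of
            {fail, _} = Failure -> Failure;
            _Success -> {[], Input, Index}
        end
    end.

%% !e : succeeds only when P fails, and consumes no input.
p_not(P) ->
    fun(Input, Index) ->
        case P(Input, Index) of
            {fail, _} -> {[], Input, Index};
            %% The reason term is a made-up shape for this sketch.
            _Success -> {fail, {unexpected_match, Index}}
        end
    end.

%% . : matches any single character, fails at end of input.
p_anything() ->
    fun([], Index) -> {fail, {expected, any_character, Index}};
       ([H | T], Index) -> {H, T, Index + 1}
    end.
```

Because neither predicate advances `Index`, they can look arbitrarily far ahead without affecting what the next expression sees.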
PEG > regexps
combined
lex+parse
choice is ordered
no ambiguity
dangling else
  obviated
greedy repetition
unlimited
  lookahead
with predicates
no left-recursion!
       (use *,+)
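As an illustration of trading left recursion for repetition, a rule such as `sum <- sum '-' digit / digit` can be rewritten as `sum <- digit ('-' digit)*`. The hand-written sketch below assumes single-digit operands and invented names (`peg_leftrec_sketch`, `sum/2`, `digit/2`, `tail/3`); it folds over the repeated tail so that subtraction stays left-associative.

```erlang
-module(peg_leftrec_sketch).
-export([sum/2]).

%% digit <- [0-9]
digit([C | Rest], Index) when C >= $0, C =< $9 ->
    {C - $0, Rest, Index + 1};
digit(_Input, Index) ->
    {fail, {expected, digit, Index}}.

%% ('-' digit)* : greedy repetition, stops at the first non-match
%% and leaves the unmatched input untouched.
tail([$- | Rest], Index, Acc) ->
    case digit(Rest, Index + 1) of
        {fail, _} -> {lists:reverse(Acc), [$- | Rest], Index};
        {N, Rest2, Index2} -> tail(Rest2, Index2, [N | Acc])
    end;
tail(Input, Index, Acc) ->
    {lists:reverse(Acc), Input, Index}.

%% sum <- digit ('-' digit)*
sum(Input, Index) ->
    case digit(Input, Index) of
        {fail, _} = Failure ->
            Failure;
        {First, Rest, Index2} ->
            {Operands, Rest2, Index3} = tail(Rest, Index2, []),
            %% Left fold: "9-3-2" is (9 - 3) - 2, not 9 - (3 - 2).
            Value = lists:foldl(fun(N, Acc) -> Acc - N end, First, Operands),
            {Value, Rest2, Index3}
    end.
```

The grammar itself no longer mentions `sum` on its own left edge, so a top-down parser cannot loop on it; associativity is recovered afterwards in the fold.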
Parsing Techniques
Tabular
test every rule
Recursive-descent
 call & consume
Predictive
yacc/yecc
Packrat
r.d. with memo
sacrifice memory
    for speed
  linear with input length ~ 400x
supports PEGs and
   some CFGs
Treetop
 Pappy
Neotoma
Neotoma
Behind the Code™
can:has(cukes) ->
      false.
Cucumber uses
   Treetop
PEG → leex/yecc
     FAIL
Definitions.

D = [0-9]
IDENT = [a-z|A-Z|0-9|_|-]

Rules.

_         : {token, {underscore, TokenLine, TokenChars}}.
-     ...
Rootsymbol template_stmt.

template_stmt      ->   doctype : '$1'.
template_stmt      ->   var_ref : '$1'.
template_stmt  ...
parsec → eParSec
Higher Order
 Functions
functions as data
currying +
composition
HOF protocol

% A parser function
fun(Input, Index) ->
     {fail, Reason} |
     {AST, Remaining, NewIndex}.
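To make the protocol concrete, here is a minimal sketch of one parser function that follows it: a matcher for a literal string. The module name and the `p_string/1` helper are illustrative assumptions rather than code from the slides, and the shape of the failure reason is invented for the example.

```erlang
-module(peg_protocol_sketch).
-export([p_string/1]).

%% Hypothetical helper: builds a parser function that matches the
%% literal string S at the front of Input.
p_string(S) ->
    Len = length(S),
    fun(Input, Index) ->
        case lists:prefix(S, Input) of
            true ->
                %% Success: {AST, Remaining, NewIndex}
                {S, lists:nthtail(Len, Input), Index + Len};
            false ->
                %% Failure: {fail, Reason}
                {fail, {expected, {string, S}, Index}}
        end
    end.
```

Calling `(peg_protocol_sketch:p_string("if"))("if A", 0)` yields a success tuple carrying the matched text, the unconsumed input, and the advanced index, which is all a combinator needs in order to chain parsers together.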
% Implements "?" PEG operator
p_optional(P) ->
 fun(Input, Index) ->
   case P(Input, Index) of
    {fail, _} -> {[], Inpu...
% PEG
optional_space <- space?;


% Erlang
optional_space(Input,Index) ->
 (p_optional(fun space/2))(Input, Index).
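For a version that can be compiled and tried on its own, the sketch below pairs the `p_optional/1` combinator with a stand-in `space/2` rule. The one-character `space/2` and the module name are assumptions made for brevity; the `space` rule in the grammar shown later in the deck also covers comments and other whitespace.

```erlang
-module(peg_optional_sketch).
-export([optional_space/2]).

%% Same shape as the p_optional combinator in the deck:
%% a failure of P becomes an empty, non-consuming success.
p_optional(P) ->
    fun(Input, Index) ->
        case P(Input, Index) of
            {fail, _} -> {[], Input, Index};
            {_, _, _} = Success -> Success
        end
    end.

%% Assumed stand-in for the `space` rule: exactly one blank character.
space([$\s | Rest], Index) -> {" ", Rest, Index + 1};
space(_Input, Index) -> {fail, {expected, space, Index}}.

%% optional_space <- space?;
optional_space(Input, Index) ->
    (p_optional(fun space/2))(Input, Index).
```

The generated function is just the combinator applied to a reference to another rule, which is why each PEG operator needs only one small higher-order function.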
Yay! RD!
make it memo
ets
Erlang Term
   Storage
{key, value}
key = Index
value = dict
 dict is an opaque hashtable
% Memoization wrapper
p(Inp, StartIndex, Name, ParseFun, TransformFun) ->
 % Grab the memo table from ets
 Memo = get_memo...
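The wrapper relies on helpers that fetch and store the per-index dictionary (`get_memo` and `memoize` in the deck's code). One way they could look is sketched here, assuming a single named ets table owned by the parsing process and rows of the form `{Index, Dict}` as described on the preceding slides. The table name, `setup_memo/0`, and `release_memo/0` are hypothetical and not taken from Neotoma.

```erlang
-module(peg_memo_sketch).
-export([setup_memo/0, release_memo/0, get_memo/1, memoize/2]).

%% Assumption: one named table per parsing process.
-define(TABLE, peg_memo_sketch_table).

%% Create the memo table before a parse starts.
setup_memo() ->
    ets:new(?TABLE, [named_table, set]),
    ok.

%% Drop the table (and all memoized results) after the parse.
release_memo() ->
    ets:delete(?TABLE),
    ok.

%% key = Index, value = dict of rule name -> parse result at that index.
get_memo(Index) ->
    case ets:lookup(?TABLE, Index) of
        [] -> dict:new();
        [{Index, Dict}] -> Dict
    end.

%% Store the updated dict for this input position.
memoize(Index, Dict) ->
    true = ets:insert(?TABLE, {Index, Dict}),
    ok.
```

Keying on the input position and keeping one dict per position means a rule is evaluated at most once per index, which is where the packrat trade of memory for time comes from.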
self-hosting
rules <- space? declaration_sequence space?;
declaration_sequence <- head:declaration tail:(space declaration)*;
declarati...
parse_transform
   f(AST) -> NewAST.
standalone,
code-generation
Future directions
inline code in PEG
atomic <- terminal / nonterminal / parenthesized_expression
"""
% Params: Node, Idx
case Node of
    {'nonterminal', Symbo...
'atomic'(Input, Index) ->
  p(Input, Index, 'atomic',
 fun(I,D) ->
  (p_choose([
      fun 'terminal'/2,
      fun 'nonter...
process dictionary
      BAD
Reia
retem
sedate
Demo
http://github.com/seancribbs/neotoma
questions?
Achieving Parsing Sanity In Erlang

Most developers will be familiar with lex, flex, yacc, bison, ANTLR, and other tools to generate parsers for use inside their own code. Erlang, the concurrent functional programming language, has its own pair, leex and yecc, for accomplishing most complicated text-processing tasks. This talk is about how the seemingly simple prospect of parsing text turned into a new parser toolkit for Erlang, and why functional programming makes parsing fun and awesome.

Published in: Technology

  1. Achieving Parsing Sanity in Erlang with Neotoma. Sean Cribbs, Web Consultant, Ruby and Erlang Hacker
  2. Quick Review
  3. context-free grammars
  4. Chomsky et al, natural langs
  5. lots of massaging
  6. inherent ambiguity
  7. if A then if B then C else D
     if A then if B then C else D
     if A then if B then C else D
  8. focused on generating
  9. parsing expression grammars
  10. top-down parsing language (70’s)
  11. direct representation of parsing functions
  12. Brian Ford 2002
  13. focused on recognizing
  14. computer languages
  15. parsing expressions
  16. e1 e2
  17. e1 / e2
  18. e+
  19. e*
  20. &e
  21. !e
  22. e?
  23. “string” [A-Z] .
  24. PEG > regexps
  25. combined lex+parse
  26. choice is ordered
  27. no ambiguity
  28. dangling else obviated
  29. greedy repetition
  30. unlimited lookahead with predicates
  31. no left-recursion! (use *,+)
  32. Parsing Techniques
  33. Tabular: test every rule
  34. Recursive-descent: call & consume
  35. Predictive: yacc/yecc
  36. Packrat: r.d. with memo
  37. sacrifice memory for speed; linear with input length ~ 400x
  38. supports PEGs and some CFGs
  39. Treetop, Pappy, Neotoma
  40. Neotoma: Behind the Code™
  41. can:has(cukes) -> false.
  42. Cucumber uses Treetop
  43. PEG → leex/yecc: FAIL
  44. Definitions.

      D = [0-9]
      IDENT = [a-z|A-Z|0-9|_|-]

      Rules.

      _ : {token, {underscore, TokenLine, TokenChars}}.
      - : {token, {dash, TokenLine, TokenChars}}.
      % : {token, {tag_start, TokenLine, TokenChars}}.
      . : {token, {class_start, TokenLine, TokenChars}}.
      # : {token, {id_start, TokenLine, TokenChars}}.
      {D}+ : {token, {number, TokenLine, list_to_integer(TokenChars)}}.
      '(\\\^.|\\.|[^'])*' : S = lists:sublist(TokenChars, 2, TokenLen - 2), {token, {string, TokenLine, S}}.
      {IDENT}+ : {token, {chr, TokenLine, TokenChars}}.
      { : {token, {lcurly, TokenLine, TokenChars}}.
      } : {token, {rcurly, TokenLine, TokenChars}}.
      [ : {token, {lbrace, TokenLine, TokenChars}}.
      ] : {token, {rbrace, TokenLine, TokenChars}}.
      @ : {token, {at, TokenLine, TokenChars}}.
      , : {token, {comma, TokenLine, TokenChars}}.
      ' : {token, {quote, TokenLine, TokenChars}}.
      : : {token, {colon, TokenLine, TokenChars}}.
      / : {token, {slash, TokenLine, TokenChars}}.
      ! : {token, {bang, TokenLine, TokenChars}}.
      ( : {token, {lparen, TokenLine, TokenChars}}.
      ) : {token, {rparen, TokenLine, TokenChars}}.
      | : {token, {pipe, TokenLine, TokenChars}}.
      < : {token, {lt, TokenLine, TokenChars}}.
      > : {token, {gt, TokenLine, TokenChars}}.
      \s+ : {token, {space, TokenLine, TokenChars}}.

      Erlang code.
  45. Rootsymbol template_stmt.

      template_stmt -> doctype : '$1'.
      template_stmt -> var_ref : '$1'.
      template_stmt -> iter : '$1'.
      template_stmt -> fun_call : '$1'.
      template_stmt -> tag_decl : '$1'.

      %% doctype selector
      doctype -> bang bang bang : {doctype, "Transitional", []}.
      doctype -> bang bang bang space : {doctype, "Transitional", []}.
      doctype -> bang bang bang space doctype_name : {doctype, '$5', []}.
      doctype -> bang bang bang space doctype_name space doctype_name : {doctype, '$5', '$7'}.

      doctype_name -> doctype_name_elem doctype_name : '$1' ++ '$2'.
      doctype_name -> doctype_name_elem : '$1'.

      doctype_name_elem -> chr : unwrap('$1').
      doctype_name_elem -> dash : "-".
      doctype_name_elem -> class_start : ".".
      doctype_name_elem -> number : number_to_list('$1').

      %% Variable reference for emitting, iterating, and passing to funcalls
      var_ref -> at name : {var_ref, unwrap('$2')}.
      var_ref -> at name lbrace number rbrace : {var_ref, unwrap('$2'), unwrap('$4')}.

      %% Iterator
      iter -> dash space list_open iter_item list_close space lt dash space var_ref : {iter, '$4', '$10'}.

      iter_list -> iter_item : ['$1'].
      iter_list -> iter_item list_sep iter_list : ['$1'|'$3'].

      iter_item -> underscore : ignore.
      iter_item -> var_ref : '$1'.
      iter_item -> tuple_open iter_list tuple_close: {tuple, '$2'}.
      iter_item -> list_open iter_list list_close: {list, '$2'}.

      %% Function calls
      fun_call -> at name colon name params_open params_close : {fun_call, name_to_atom('$2'), name_to_atom('$4'), []}.
      fun_call -> at name colon name params_open param_list params_close : {fun_call, name_to_atom('$2'), name_to_atom('$4'), '$6'}.
      fun_call -> at name colon name : {fun_call, name_to_atom('$2'), name_to_atom('$4'), []}.
      fun_call -> at at name colon name params_open params_close : {fun_call_env, name_to_atom('$3'), name_to_atom('$5'), []}.
      fun_call -> at at name colon name params_open param_list params_close : {fun_call_env, name_to_atom('$3'), name_to_atom('$5'), '$7'}.
      fun_call -> at at name colon name : {fun_call_env, name_to_atom('$3'), name_to_atom('$5'), []}.

      param_list -> param : ['$1'].
  46. parsec → eParSec
  47. Higher Order Functions
  48. functions as data
  49. currying + composition
  50. HOF protocol
      % A parser function
      fun(Input, Index) ->
        {fail, Reason} |
        {AST, Remaining, NewIndex}.
  51. % Implements "?" PEG operator
      p_optional(P) ->
        fun(Input, Index) ->
          case P(Input, Index) of
            {fail, _} -> {[], Input, Index};
            {_,_,_} = Success -> Success % {Parsed, RemainingInput, NewIndex}
          end
        end.
  52. % PEG
      optional_space <- space?;

      % Erlang
      optional_space(Input,Index) ->
        (p_optional(fun space/2))(Input, Index).
  53. Yay! RD! Make it memo
  54. ets: Erlang Term Storage
  55. {key, value}
  56. key = Index
  57. value = dict (dict is an opaque hashtable)
  58. % Memoization wrapper
      p(Inp, StartIndex, Name, ParseFun, TransformFun) ->
        % Grab the memo table from ets
        Memo = get_memo(StartIndex),
        % See if the current reduction is memoized
        case dict:find(Name, Memo) of
          % If it is, return the result
          {ok, Result} -> Result;
          % If not, attempt to parse
          _ ->
            case ParseFun(Inp, StartIndex) of
              % If it fails, memoize the failure
              {fail,_} = Failure ->
                memoize(StartIndex, dict:store(Name, Failure, Memo)),
                Failure;
              % If it passes, transform and memoize the result.
              {Result, InpRem, NewIndex} ->
                Transformed = TransformFun(Result, StartIndex),
                memoize(StartIndex, dict:store(Name, {Transformed, InpRem, NewIndex}, Memo)),
                {Transformed, InpRem, NewIndex}
            end
        end.
  59. self-hosting
  60. rules <- space? declaration_sequence space?;
      declaration_sequence <- head:declaration tail:(space declaration)*;
      declaration <- nonterminal space '<-' space parsing_expression space? ';';
      parsing_expression <- choice / sequence / primary;
      choice <- head:alternative tail:(space '/' space alternative)+;
      alternative <- sequence / primary;
      primary <- prefix atomic / atomic suffix / atomic;
      sequence <- head:labeled_sequence_primary tail:(space labeled_sequence_primary)+;
      labeled_sequence_primary <- label? primary;
      label <- alpha_char alphanumeric_char* ':';
      suffix <- repetition_suffix / optional_suffix;
      optional_suffix <- '?';
      repetition_suffix <- '+' / '*';
      prefix <- '&' / '!';
      atomic <- terminal / nonterminal / parenthesized_expression;
      parenthesized_expression <- '(' space? parsing_expression space? ')';
      nonterminal <- alpha_char alphanumeric_char*;
      terminal <- quoted_string / character_class / anything_symbol;
      quoted_string <- single_quoted_string / double_quoted_string;
      double_quoted_string <- '"' string:(!'"' ("\\\\" / '\\"' / .))* '"';
      single_quoted_string <- "'" string:(!"'" ("\\\\" / "\\'" / .))* "'";
      character_class <- '[' characters:(!']' ('\\\\' . / !'\\\\' .))+ ']';
      anything_symbol <- '.';
      alpha_char <- [a-z_];
      alphanumeric_char <- alpha_char / [0-9];
      space <- (white / comment_to_eol)+;
      comment_to_eol <- '%' (!"\n" .)*;
      white <- [ \t\n\r];
  61. parse_transform
      f(AST) -> NewAST.
  62. standalone, code-generation
  63. Future directions
  64. inline code in PEG
  65. atomic <- terminal / nonterminal / parenthesized_expression
      """
      % Params: Node, Idx
      case Node of
        {'nonterminal', Symbol} ->
          add_nt(Symbol, Idx),
          "fun '" ++ Symbol ++ "'/2";
        Any -> Any
      end
      """
  66. 'atomic'(Input, Index) ->
        p(Input, Index, 'atomic',
          fun(I,D) ->
            (p_choose([
              fun 'terminal'/2,
              fun 'nonterminal'/2,
              fun 'parenthesized_expression'/2]))(I,D)
          end,
          fun(Node, Idx) ->
            case Node of
              {'nonterminal', Symbol} ->
                add_nt(Symbol, Idx),
                "fun '" ++ Symbol ++ "'/2";
              Any -> Any
            end
          end).
  67. process dictionary: BAD
  68. Reia, retem, sedate
  69. Demo: http://github.com/seancribbs/neotoma
  70. questions?
