Achieving Parsing Sanity In Erlang

Most developers will be familiar with lex, flex, yacc, bison, ANTLR, and other tools to generate parsers for use inside their own code. Erlang, the concurrent functional programming language, has its own pair, leex and yecc, for accomplishing most complicated text-processing tasks. This talk is about how the seemingly simple prospect of parsing text turned into a new parser toolkit for Erlang, and why functional programming makes parsing fun and awesome.

Published in: Technology

  1. 1. Achieving Parsing Sanity in Erlang with Neotoma. Sean Cribbs: Web Consultant, Ruby and Erlang Hacker
  2. 2. Quick Review
  3. 3. context-free grammars
  4. 4. Chomsky et al, natural langs
  5. 5. lots of massaging
  6. 6. inherent ambiguity
  7. 7. if A then if B then C else D: parses as either if A then (if B then C else D) or if A then (if B then C) else D
  8. 8. focused on generating
  9. 9. parsing expression grammars
  10. 10. top-down parsing language (70’s)
  11. 11. direct representation of parsing functions
  12. 12. Brian Ford 2002
  13. 13. focused on recognizing
  14. 14. computer languages
  15. 15. parsing expressions
  16. 16. e1 e2
  17. 17. e1 / e2
  18. 18. e+
  19. 19. e*
  20. 20. &e
  21. 21. !e
  22. 22. e?
  23. 23. “string” [A-Z] .
  24. 24. PEG > regexps
  25. 25. combined lex+parse
  26. 26. choice is ordered
  27. 27. no ambiguity
  28. 28. dangling else obviated
  29. 29. greedy repetition
  30. 30. unlimited lookahead with predicates
  31. 31. no left-recursion! (use *,+)
  32. 32. Parsing Techniques
  33. 33. Tabular: test every rule
  34. 34. Recursive-descent: call & consume
  35. 35. Predictive: yacc/yecc
  36. 36. Packrat: r.d. with memo
  37. 37. sacrifice memory for speed; linear with input length; ~400x
  38. 38. supports PEGs and some CFGs
  39. 39. Treetop, Pappy, Neotoma
  40. 40. Neotoma: Behind the Code™
  41. 41. can:has(cukes) -> false.
  42. 42. Cucumber uses Treetop
  43. 43. PEG → leex/yecc FAIL
  44. 44. Definitions.

          D = [0-9]
          IDENT = [a-z|A-Z|0-9|_|-]

          Rules.

          _ : {token, {underscore, TokenLine, TokenChars}}.
          - : {token, {dash, TokenLine, TokenChars}}.
          % : {token, {tag_start, TokenLine, TokenChars}}.
          \. : {token, {class_start, TokenLine, TokenChars}}.
          # : {token, {id_start, TokenLine, TokenChars}}.
          {D}+ : {token, {number, TokenLine, list_to_integer(TokenChars)}}.
          '(\\\^.|\\.|[^'])*' : S = lists:sublist(TokenChars, 2, TokenLen - 2), {token, {string, TokenLine, S}}.
          {IDENT}+ : {token, {chr, TokenLine, TokenChars}}.
          \{ : {token, {lcurly, TokenLine, TokenChars}}.
          \} : {token, {rcurly, TokenLine, TokenChars}}.
          \[ : {token, {lbrace, TokenLine, TokenChars}}.
          \] : {token, {rbrace, TokenLine, TokenChars}}.
          @ : {token, {at, TokenLine, TokenChars}}.
          , : {token, {comma, TokenLine, TokenChars}}.
          ' : {token, {quote, TokenLine, TokenChars}}.
          : : {token, {colon, TokenLine, TokenChars}}.
          / : {token, {slash, TokenLine, TokenChars}}.
          ! : {token, {bang, TokenLine, TokenChars}}.
          \( : {token, {lparen, TokenLine, TokenChars}}.
          \) : {token, {rparen, TokenLine, TokenChars}}.
          \| : {token, {pipe, TokenLine, TokenChars}}.
          < : {token, {lt, TokenLine, TokenChars}}.
          > : {token, {gt, TokenLine, TokenChars}}.
          \s+ : {token, {space, TokenLine, TokenChars}}.

          Erlang code.
  45. 45. Rootsymbol template_stmt.

          template_stmt -> doctype : '$1'.
          template_stmt -> var_ref : '$1'.
          template_stmt -> iter : '$1'.
          template_stmt -> fun_call : '$1'.
          template_stmt -> tag_decl : '$1'.

          %% doctype selector
          doctype -> bang bang bang : {doctype, "Transitional", []}.
          doctype -> bang bang bang space : {doctype, "Transitional", []}.
          doctype -> bang bang bang space doctype_name : {doctype, '$5', []}.
          doctype -> bang bang bang space doctype_name space doctype_name : {doctype, '$5', '$7'}.
          doctype_name -> doctype_name_elem doctype_name : '$1' ++ '$2'.
          doctype_name -> doctype_name_elem : '$1'.
          doctype_name_elem -> chr : unwrap('$1').
          doctype_name_elem -> dash : "-".
          doctype_name_elem -> class_start : ".".
          doctype_name_elem -> number : number_to_list('$1').

          %% Variable reference for emitting, iterating, and passing to funcalls
          var_ref -> at name : {var_ref, unwrap('$2')}.
          var_ref -> at name lbrace number rbrace : {var_ref, unwrap('$2'), unwrap('$4')}.

          %% Iterator
          iter -> dash space list_open iter_item list_close space lt dash space var_ref : {iter, '$4', '$10'}.
          iter_list -> iter_item : ['$1'].
          iter_list -> iter_item list_sep iter_list : ['$1'|'$3'].
          iter_item -> underscore : ignore.
          iter_item -> var_ref : '$1'.
          iter_item -> tuple_open iter_list tuple_close: {tuple, '$2'}.
          iter_item -> list_open iter_list list_close: {list, '$2'}.

          %% Function calls
          fun_call -> at name colon name params_open params_close : {fun_call, name_to_atom('$2'), name_to_atom('$4'), []}.
          fun_call -> at name colon name params_open param_list params_close : {fun_call, name_to_atom('$2'), name_to_atom('$4'), '$6'}.
          fun_call -> at name colon name : {fun_call, name_to_atom('$2'), name_to_atom('$4'), []}.
          fun_call -> at at name colon name params_open params_close : {fun_call_env, name_to_atom('$3'), name_to_atom('$5'), []}.
          fun_call -> at at name colon name params_open param_list params_close : {fun_call_env, name_to_atom('$3'), name_to_atom('$5'), '$7'}.
          fun_call -> at at name colon name : {fun_call_env, name_to_atom('$3'), name_to_atom('$5'), []}.
          param_list -> param : ['$1'].
  46. 46. parsec → eParSec
  47. 47. Higher Order Functions
  48. 48. functions as data
  49. 49. currying + composition
  50. 50. HOF protocol
          % A parser function
          fun(Input, Index) ->
              {fail, Reason} | {AST, Remaining, NewIndex}.
  51. 51. % Implements "?" PEG operator
          p_optional(P) ->
            fun(Input, Index) ->
              case P(Input, Index) of
                {fail, _} -> {[], Input, Index};
                {_,_,_} = Success -> Success % {Parsed, RemainingInput, NewIndex}
              end
            end.
  52. 52. % PEG
          optional_space <- space?;
          % Erlang
          optional_space(Input,Index) ->
            (p_optional(fun space/2))(Input, Index).
  53. 53. Yay! RD! make it memo
  54. 54. ets Erlang Term Storage
  55. 55. {key, value}
  56. 56. key = Index
  57. 57. value = dict (dict is an opaque hashtable)
  58. 58. % Memoization wrapper
          p(Inp, StartIndex, Name, ParseFun, TransformFun) ->
            % Grab the memo table from ets
            Memo = get_memo(StartIndex),
            % See if the current reduction is memoized
            case dict:find(Name, Memo) of
              % If it is, return the result
              {ok, Result} -> Result;
              % If not, attempt to parse
              _ ->
                case ParseFun(Inp, StartIndex) of
                  % If it fails, memoize the failure
                  {fail,_} = Failure ->
                    memoize(StartIndex, dict:store(Name, Failure, Memo)),
                    Failure;
                  % If it passes, transform and memoize the result.
                  {Result, InpRem, NewIndex} ->
                    Transformed = TransformFun(Result, StartIndex),
                    memoize(StartIndex, dict:store(Name, {Transformed, InpRem, NewIndex}, Memo)),
                    {Transformed, InpRem, NewIndex}
                end
            end.
  59. 59. self-hosting
  60. 60. rules <- space? declaration_sequence space?;
          declaration_sequence <- head:declaration tail:(space declaration)*;
          declaration <- nonterminal space '<-' space parsing_expression space? ';';
          parsing_expression <- choice / sequence / primary;
          choice <- head:alternative tail:(space '/' space alternative)+;
          alternative <- sequence / primary;
          primary <- prefix atomic / atomic suffix / atomic;
          sequence <- head:labeled_sequence_primary tail:(space labeled_sequence_primary)+;
          labeled_sequence_primary <- label? primary;
          label <- alpha_char alphanumeric_char* ':';
          suffix <- repetition_suffix / optional_suffix;
          optional_suffix <- '?';
          repetition_suffix <- '+' / '*';
          prefix <- '&' / '!';
          atomic <- terminal / nonterminal / parenthesized_expression;
          parenthesized_expression <- '(' space? parsing_expression space? ')';
          nonterminal <- alpha_char alphanumeric_char*;
          terminal <- quoted_string / character_class / anything_symbol;
          quoted_string <- single_quoted_string / double_quoted_string;
          double_quoted_string <- '"' string:(!'"' ("\\\\" / '\\"' / .))* '"';
          single_quoted_string <- "'" string:(!"'" ("\\\\" / "\\'" / .))* "'";
          character_class <- '[' characters:(!']' ('\\\\' . / !'\\\\' .))+ ']';
          anything_symbol <- '.';
          alpha_char <- [a-z_];
          alphanumeric_char <- alpha_char / [0-9];
          space <- (white / comment_to_eol)+;
          comment_to_eol <- '%' (!"\n" .)*;
          white <- [ \t\n\r];
  61. 61. parse_transform: f(AST) -> NewAST.
  62. 62. standalone, code-generation
  63. 63. Future directions
  64. 64. inline code in PEG
  65. 65. atomic <- terminal / nonterminal / parenthesized_expression
          """
          % Params: Node, Idx
          case Node of
            {'nonterminal', Symbol} ->
              add_nt(Symbol, Idx),
              "fun '" ++ Symbol ++ "'/2";
            Any -> Any
          end
          """
  66. 66. 'atomic'(Input, Index) ->
            p(Input, Index, 'atomic',
              fun(I,D) ->
                (p_choose([fun 'terminal'/2,
                           fun 'nonterminal'/2,
                           fun 'parenthesized_expression'/2]))(I,D)
              end,
              fun(Node, Idx) ->
                case Node of
                  {'nonterminal', Symbol} ->
                    add_nt(Symbol, Idx),
                    "fun '" ++ Symbol ++ "'/2";
                  Any -> Any
                end
              end).
  67. 67. process dictionary BAD
  68. 68. Reia, retem, sedate
  69. 69. Demo http://github.com/seancribbs/neotoma
  70. 70. questions?
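
The PEG operators from slides 16 to 21 can be written as higher-order functions that follow the parser-function protocol of slide 50. What follows is a minimal sketch, not Neotoma's actual source. It assumes the input is an Erlang string (a list of characters) and that Index is a plain integer offset, which the slides do not specify. The module name peg_sketch and the helpers p_string, p_seq, p_zero_or_more and p_not are hypothetical names chosen for illustration; p_choose borrows its name from slide 66, but its body here is a guess at the behaviour, not the original.

```erlang
-module(peg_sketch).
-export([p_string/1, p_seq/1, p_choose/1, p_zero_or_more/1, p_not/1]).

%% Every parser is fun(Input, Index) ->
%%     {fail, Reason} | {AST, Remaining, NewIndex}   (slide 50).
%% Assumption: Index is an integer character offset.

%% Terminal: match the literal string S at the front of the input.
p_string(S) ->
    Len = length(S),
    fun(Input, Index) ->
        case lists:prefix(S, Input) of
            true  -> {S, lists:nthtail(Len, Input), Index + Len};
            false -> {fail, {expected, S, Index}}
        end
    end.

%% Sequence "e1 e2": every parser must succeed, in order.
p_seq(Parsers) ->
    fun(Input, Index) -> seq(Parsers, Input, Index, []) end.

seq([], Input, Index, Acc) ->
    {lists:reverse(Acc), Input, Index};
seq([P | Rest], Input, Index, Acc) ->
    case P(Input, Index) of
        {fail, _} = Failure -> Failure;
        {Node, Remaining, NewIndex} ->
            seq(Rest, Remaining, NewIndex, [Node | Acc])
    end.

%% Ordered choice "e1 / e2": the first alternative that succeeds wins,
%% so later alternatives are never tried and there is no ambiguity.
p_choose(Parsers) ->
    fun(Input, Index) -> choose(Parsers, Input, Index) end.

choose([], _Input, Index) ->
    {fail, {no_alternative, Index}};
choose([P | Rest], Input, Index) ->
    case P(Input, Index) of
        {fail, _} -> choose(Rest, Input, Index);
        Success   -> Success
    end.

%% Greedy repetition "e*": consume as many matches as possible.
p_zero_or_more(P) ->
    fun(Input, Index) -> many(P, Input, Index, []) end.

many(P, Input, Index, Acc) ->
    case P(Input, Index) of
        {fail, _} ->
            {lists:reverse(Acc), Input, Index};
        {_Node, _Remaining, Index} ->
            %% P succeeded without consuming input; stop to avoid looping.
            {lists:reverse(Acc), Input, Index};
        {Node, Remaining, NewIndex} ->
            many(P, Remaining, NewIndex, [Node | Acc])
    end.

%% Not-predicate "!e": succeed only if P fails, and consume nothing.
p_not(P) ->
    fun(Input, Index) ->
        case P(Input, Index) of
            {fail, _} -> {[], Input, Index};
            _Success  -> {fail, {unexpected, Index}}
        end
    end.
```

For example, (peg_sketch:p_choose([peg_sketch:p_string("ab"), peg_sketch:p_string("a")]))("abc", 0) tries "ab" first and returns {"ab", "c", 2}. Reversing the two alternatives would match only "a", which is the practical meaning of "choice is ordered" on slide 26.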
