Your SlideShare is downloading. ×

Hacking parse.y (RubyConf 2009)

3,785

Published on

Nov 19, 2009

Nov 19, 2009

Published in: Technology, Design
0 Comments
6 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
3,785
On Slideshare
0
From Embeds
0
Number of Embeds
10
Actions
Shares
0
Downloads
29
Comments
0
Likes
6
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Hacking parse.y Tatsuhiro Ujihisa ujihisa@gmail.com http://ujihisa.blogspot.com/ @ujm
  • 2. hi •I'm from Japan
  • 3. DISCLAIMER •This presentation is not for super rubyists or ruby committers, but for ordinary programmers.
  • 4. Hacking parse.y •Ruby's syntax
  • 5. Hacking parse.y Fixing ruby parser to understand ruby •Introducing new syntax • {:key :-) "value"} • 'symbol • ++i • def A#b(c) • {|x| x * 2 }
  • 6. MRI Inside •MRI (Matz Ruby Implementation) •$ ruby -v ruby 1.9.2dev (2009-11-19 trunk 25862) [i386-darwin9.8.0] •Written in C •array.c, vm.c, gc.c, etc...
  • 7. ruby 1.8 vs 1.9 •~1.8 • Parser: parse.y • Evaluator: eval.c •1.9~ • Parser: parse.y • Evaluator:YARV (vm*.c)
  • 8. Matz said •Ugly: eval.c and parse.y RubyConf2006 •Now the original evaluator was all replaced withYARV
  • 9. MRI Parser •MRI uses yacc (parser generator for C) •parse.y bison -d -o y.tab.c parse.y sed -f ./tool/ytab.sed -e "/^#/s!y.tab.c! parse.c!" y.tab.c > parse.c.new ...
  • 10. parse.y •One of the darkest side •$ wc -l *{c,h,y} | sort -n ... 9961 io.c 10474 parse.y 16367 parse.c # (automatically generated) 188656 total
  • 11. (Broad) Parser •Lexer (yylex) •Bytes → Symbols •Parser (yyparse) •Symbols → Syntax Tree
  • 12. Tokens in Lexer %token tUPLUS /* unary+ */ %token tUMINUS /* unary- */ %token tPOW /* ** */ %token tCMP /* <=> */ %token tEQ /* == */ %token tEQQ /* === */ %token tNEQ /* != */ %token tGEQ /* >= */ %token tLEQ /* <= */ %token tANDOP tOROP /* && and || */ %token tMATCH tNMATCH/* =~ and !~ */ %token tDOT2 tDOT3 /* .. and ... */ %token tAREF tASET /* [] and []= */ %token tLSHFT tRSHFT /* << and >> */ %token tCOLON2 /* :: */ %token tCOLON3 /* :: at EXPR_BEG */ %token <id> tOP_ASGN /* +=, -= et %token tASSOC /* => */ %token tLPAREN /* ( */ %token tLPAREN_ARG /* ( */ %token tRPAREN /* ) */ %token tLBRACK /* [ */ %token tLBRACE /* { */ %token tLBRACE_ARG /* { */ %token tSTAR /* * */ %token tAMPER /* & */ %token tLAMBDA /* -> */ %token tSYMBEG tSTRING_BEG tXSTRING_ tWORDS_BEG tQWORDS_BEG %token tSTRING_DBEG tSTRING_DVAR tST
  • 13. (detour) n MRI: parse.y (10474 lines) n JRuby: src/org/jruby/{parser, lexer}/* (24983 lines) n parser/DefaultRubyParser.y (1880 lines) parser/Ruby19Parser.y (2076 lines) n Rubinius: lib/ext/melbourne/grammer.y (5891 lines) and others
  • 14. Case 1: :-) •Hash literal {:key => 'value'} {:key :-) 'value'} •:-) is just an alias of =>
  • 15. Mastering “Colon”
  • 16. Colons in Ruby •A::B, ::C •:symbol, :"sy-m-bol" •a ? b : c •{a: b} •when 1: something (in 1.8)
  • 17. static int parser_yylex(struct parser_params *parser) { ... switch (c = nextc()) { ... case '#': /* it's a comment */ ... case ':': c = nextc(); if (c == ':') { if (IS_BEG() ||... ... } ... (about 1300 lines)
  • 18. How does parser deal with colon? •:: → tCOLON2 or tCOLON3 •tCOLON2 Net::URI •tCOLON3 ::Kernel
  • 19. enum lex_state_e { EXPR_BEG, /* ignore newline, +/- is a sign. */ EXPR_END, /* newline significant, +/- is an operator. * EXPR_ENDARG, /* ditto, and unbound braces. */ EXPR_ARG, /* newline significant, +/- is an operator. * EXPR_CMDARG, /* newline significant, +/- is an operator. * EXPR_MID, /* newline significant, +/- is an operator. * EXPR_FNAME, /* ignore newline, no reserved words. */ EXPR_DOT, /* right after `.' or `::', no reserved words EXPR_CLASS, /* immediate after `class', no here document. EXPR_VALUE /* alike EXPR_BEG but label is disallowed. */ }; lex_state
  • 20. case ':': c = nextc(); if (c == ':') { if (IS_BEG() || lex_state == EXPR_CLASS || (IS_ARG() && space_seen)) { lex_state = EXPR_BEG; return tCOLON3; } lex_state = EXPR_DOT; return tCOLON2; }
  • 21. ... if (lex_state == EXPR_END || lex_state == EXPR_ENDARG || (c != -1 && ISSPACE(c))) { pushback(c); lex_state = EXPR_BEG; return ':'; } switch (c) { case ''': lex_strterm = NEW_STRTERM(str_ssym, c, 0); break; case '"': lex_strterm = NEW_STRTERM(str_dsym, c, 0); break; default: pushback(c); break; } lex_state = EXPR_FNAME; return tSYMBEG;
  • 22. How does parser deal with colon? (summary) •:: → tCOLON2 or tCOLON3 •EXPR_END or →: (else) •otherwise → tSYMBEG •:' → str_ssym •:" → str_dsym
  • 23. So, •:-) → tASSOC •:: → tCOLON2 or tCOLON3 •EXPR_END or →: (else) •otherwise → tSYMBEG •:' → str_ssym •:" → str_dsym
  • 24. :-)
  • 25. DISCLAIMER •This presentation is not for super rubyists or ruby committers, but for ordinary programmers.
  • 26. Case 2: Lisp Like Symbol •Symbol Literal :vancouver 'vancouver •Ad-hoc p :a, :b p 'a, 'b
  • 27. Single Quote (in parser_yylex) ... case ''': lex_strterm = NEW_STRTERM(str_squote, ''', 0); return tSTRING_BEG; ...
  • 28. Single Quote (in parser_yylex) ... case ''': if (??? condition ???) { lex_state = EXPR_FNAME; return tSYMBEG; } lex_strterm = NEW_STRTERM(str_squote, ''', 0); return tSTRING_BEG; ...
  • 29. (loop (lambda (p 'good)))
  • 30. Case3: Pre Incremental Operator •++i •i = i.succ (NOT i = i + 1)
  • 31. Lexer @@ -685,6 +685,7 @@ static void token_info_pop(struct parser_params*, const char *token); %type <val> program reswords then do dot_or_colon %*/ %token tUPLUS /* unary+ */ +%token tINCR /* ++var */ %token tUMINUS /* unary- */ %token tPOW /* ** */ %token tCMP /* <=> */ (Actually there are more trivial fixes)
  • 32. regenerate id.h •id.h is automatically generated by parse.y in make •$ rm id.h $ make
  • 33. parser example variable : tIDENTIFIER | tIVAR | tGVAR | tCONSTANT | tCVAR | keyword_nil {ifndef_ripper($$ = keyword_nil);} | keyword_self {ifndef_ripper($$ = keyword_self);} | keyword_true {ifndef_ripper($$ = keyword_true);} | keyword_false {ifndef_ripper($$ = keyword_false);} | keyword__FILE__ {ifndef_ripper($$ = keyword__FILE__);} | keyword__LINE__ {ifndef_ripper($$ = keyword__LINE__);} | keyword__ENCODING__ {ifndef_ripper($$ = keyword__ENCODING_ ;
  • 34. lhs : variable { /*%%%*/ if (!($$ = assignable($1, 0))) $$ = NEW_BEGIN(0); /*% $$ = dispatch1(var_field, $1); %*/ } | primary_value '[' opt_call_args rbracket { /*%%%*/ $$ = aryset($1, $3); /*% $$ = dispatch2(aref_field, $1, escape_Qundef($3)); %*/ } ...
  • 35. BNF (part) program : compstmt compstmt : stmts opt_terms stmts : none | stmt | stmts terms stmt stmt : kALIAS fitem fitem | kALIAS tGVAR tGVAR : : | expr expr : kRETURN call_args | kBREAK call_args : : | '!' command_call | arg arg : lhs '=' arg | var_lhs tOP_ASGN arg | primary_value '[' aref_args ']' tOP : : | arg '?' arg ':' arg | primary primary : literal | strings : : | tLPAREN_ARG expr ')' | tLPAREN compstmt ')' : : | kREDO | kRETRY
  • 36. Assign stmt : ... | mlhs '=' command_call { /*%%%*/ value_expr($3); $1->nd_value = $3; $$ = $1; /*% $$ = dispatch2(massign, $1, $3); %*/ }
  • 37. mlhs mlhs: mlhs_basic | ... mlhs_basic: mlhs_head | ... mlhs_head: mlhs_item ',' | ... mlhs_item: mlhs_node | ... mlhs_node: variable { $$ = assignable($1, 0); }
  • 38. Method call block_command : block_call | block_call '.' operation2 command_args { /*%%%*/ $$ = NEW_CALL($1, $3, $4); /*% $$ = dispatch3(call, $1, ripper_id2sym('.'), $$ = method_arg($$, $4); %*/ }
  • 39. Mix! var_ref: ... | tINCR variable { /*%%%*/ $$ = assignable($2, 0); $$->nd_value = NEW_CALL(gettable($$->nd_vid), rb_intern("succ"), 0); /*% $$ = dispatch2(unary, ripper_intern("++@"), $2); %*/ }
  • 40. ++ruby
  • 41. Case 4: def A#b •A#b instance method b of class A •A.b class method b of class A
  • 42. A#b class A def b ... end end def A.b ... end
  • 43. A#b def A#b ... end def A.b ... end
  • 44. # (in parser_yylex) case '#': /* it's a comment */ /* no magic_comment in shebang line */ if (!parser_magic_comment(parser, lex_p, lex_pend - lex_p)) { if (comment_at_top(parser)) { set_file_encoding(parser, lex_p, lex_pend); } } lex_p = lex_pend;
  • 45. # (in parser_yylex) case '#': /* it's a comment */ c = nextc(); pushback(c); if(lex_state == EXPR_END && ISALNUM(c)) return '#'; /* no magic_comment in shebang line */ if (!parser_magic_comment(parser, lex_p, lex_pend - lex_p)) { if (comment_at_top(parser)) { set_file_encoding(parser, lex_p, lex_pend);
  • 46. Primary primary: literal | ... | k_def singleton dot_or_colon {lex_state = EXPR_FNAME;} fname { in_single++; lex_state = EXPR_END; /* force for args */ /*%%%*/ local_push(0); /*% %*/ } f_arglist bodystmt k_end { /*%%%*/ NODE *body = remove_begin($8); reduce_nodes(&body); $$ = NEW_DEFS($2, $5, $7, body); fixpos($$, $2); local_pop(); /*% $$ = dispatch5(defs, $2, $3, $5, $7, $8); %*/ in_single--; }
  • 47. | k_def cname '#' {lex_state = EXPR_FNAME;} fname { $<id>$ = cur_mid; cur_mid = $5; in_def++; /*%%%*/ local_push(0); /*% %*/ } f_arglist bodystmt k_end { /*%%%*/ NODE *body = remove_begin($8); reduce_nodes(&body); $$ = NEW_DEFN($5, $7, body, NOEX_PRIVATE); fixpos($$, $7); fixpos($$->nd_defn, $7); $$ = NEW_CLASS(NEW_COLON3($2), $$, 0); nd_set_line($$, $<num>6); local_pop(); /*% $$ = dispatch4(defi, $2, $5, $7, $8); %*/ in_def--; cur_mid = $<id>6; }
  • 48. Reference Rubyソースコード完全解説 青木峰郎 著、まつもとゆ きひろ 監修 Minero AOKI,Yukihiro MATSUMOTO "Ruby Hacking Guide" HTMLVersion is available
  • 49. Reference •My blog http://ujihisa.blogspot.com •All patches I showed are there
  • 50. end
  • 51. Appendix: Imaginary Numbers •Matz wrote a patch in [ruby-dev:38843] •translation: [ruby-core:24730] •It won't be accepted
  • 52. Appendix: Imaginary Numbers > 3i => (0 + 3i) > 3i.class => Complex
  • 53. Appendix •{you <3 ruby} •f(x, y) = z (like f[x, y] = z as f.[]=(x, y, z)) •Annotations!

×