Hacking parse.y
   Tatsuhiro UJIHISA
Me

• Ruby experience: 4 years
 • Rails application
 • Data mining tool
• Learning English here: 5 months
• Looking for a jo...
Me


• Presentations in Japan
 • Kansai Ruby Workshop
 • RubyKaigi2008, 2009
This is my first English
    presentation.
Hacking parse.y
Fixing ruby parser to understand ruby
 • Introducing new syntax
  • {:key :-) "value"}
  • 'symbol
  • ++i...
MRI Inside

• MRI (Matz Ruby Implementation)
• $ ruby -v
  ruby 1.9.2dev (2009-08-05 trunk 24397) [i386-darwin9.7.0]


• W...
ruby 1.8 vs 1.9

• ~1.8
 • Parser: parse.y
 • Evaluator: eval.c
• 1.9~
 • Parser: parse.y
 • Evaluator: YARV (vm*.c)
Matz said

• Ugly: eval.c and parse.y
 RubyConf2006
• The original evaluator has now
 been entirely replaced by YARV
MRI Parser

• MRI uses yacc
  (parser generator for C)
• parse.y
  bison -d -o y.tab.c parse.y
  sed -f ./tool/ytab.sed -e ...
parse.y

• One of the darkest sides
• $ wc -l *{c,h,y} | sort -n
  ...
 9261 io.c
 10350 parse.y
 16352 parse.c # (automati...
(Broad) Parser

• Lexer (yylex)
 • Bytes → Symbols
• Parser (yyparse)
 • Symbols → Syntax Tree
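
Both stages can be observed from Ruby itself through Ripper, the standard-library binding built from the same parse.y (Ruby 1.9 and later). A minimal sketch; the helper names token_types and syntax_tree are made up for illustration:

```ruby
require 'ripper'

# Lexer stage: source bytes -> token stream.
# Each Ripper.lex entry starts with [[line, col], type, text].
def token_types(src)
  Ripper.lex(src).map { |tok| tok[1] }
end

# Parser stage: token stream -> syntax tree (nested arrays).
def syntax_tree(src)
  Ripper.sexp(src)
end

p token_types("a = 1")
p syntax_tree("a = 1")
```

The first call lists token types only; the second shows how the parser groups those tokens into an assignment node under the program root.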
Tokens in Lexer
                                           %token <id> tOP_ASGN /* +=, -= et
%token   tUPLUS     /* unary+...
(detour)

•   MRI: parse.y (10350 lines)

•   JRuby: src/org/jruby/parser/{DefaultRubyParser.y,
    Ruby19Parser.y}
    (1...
Case 1:
             :-)

• Hash literal
  {:key => 'value'}
  {:key :-) 'value'}
• :-) is just an alias of =>
Mastering “Colon”
Colons in Ruby

• A::B, ::C
• :symbol, :"sy-m-bol"
• a ? b : c
• {a: b}
• when 1: something (in 1.8)
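
The colon forms listed above, written as runnable Ruby for reference. The 1.8-only `when 1:` form is left out because current Ruby rejects it; Outer, Inner, TOP and colon_examples are illustrative names only:

```ruby
module Outer
  Inner = "nested constant"
end
TOP = "top-level constant"

def colon_examples(flag)
  {
    scope:     Outer::Inner,          # A::B
    top_scope: ::TOP,                 # ::C
    symbol:    :symbol,               # :symbol
    quoted:    :"sy-m-bol",           # quoted symbol
    ternary:   (flag ? :yes : :no),   # a ? b : c
    label:     { a: 1 }               # {a: b}, same as {:a => 1}
  }
end

p colon_examples(true)
```

Every one of these begins with the same ':' byte, so the lexer has to pick a token from the characters that follow and from its current state.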
static int
parser_yylex(struct parser_params *parser) {
    ...
    switch (c = nextc()) {
      ...
      case '#': /* it...
How does the parser deal
    with colons?

• :: → tCOLON2 or tCOLON3
 • tCOLON2 Net::URI
 • tCOLON3 ::Kernel
lex_state
enum lex_state_e {
    EXPR_BEG,        /* ignore newline, +/- is a sign. */
    EXPR_END,        /* newline sig...
case ':':
  c = nextc();
  if (c == ':') {
      if (IS_BEG() ||
          lex_state == EXPR_CLASS ||
          (IS_ARG() ...
...
  if (lex_state == EXPR_END ||
      lex_state == EXPR_ENDARG ||
      (c != -1 && ISSPACE(c))) {
      pushback(c);
 ...
How does the parser deal
with colons? (summary)
• :: → tCOLON2 or tCOLON3
• EXPR_END, EXPR_ENDARG, or followed by whitespace → ':' (plain colon)
• otherwise → tSYMBEG
 • :' →...
So,
• :-) → tASSOC
• :: → tCOLON2 or tCOLON3
• EXPR_END, EXPR_ENDARG, or followed by whitespace → ':' (plain colon)
• otherwise → tSYMBEG
 • :' → str_ssym
 • :" → str_ds...
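
The decision table above can be modelled in a few lines of Ruby. This is a toy sketch, not MRI's code: lex_colon, beg_state? and arg_state? are hypothetical names, IS_BEG() and IS_ARG() are simplified, and both the position of the :-) rule and the state it leaves behind are assumptions (the actual patch is on the author's blog).

```ruby
# Toy model of the ':' branch of parser_yylex; NOT MRI's actual code.
# Simplifying assumptions: IS_BEG() is reduced to EXPR_BEG and
# IS_ARG() to EXPR_ARG / EXPR_CMDARG.
def beg_state?(state)
  state == :EXPR_BEG
end

def arg_state?(state)
  %i[EXPR_ARG EXPR_CMDARG].include?(state)
end

# state: lexer state before ':'; rest: the input following ':'.
# Returns [token, next_state].
def lex_colon(state, rest, space_seen: false)
  c = rest[0] # nil at end of input

  if c == ":"
    if beg_state?(state) || state == :EXPR_CLASS ||
       (arg_state?(state) && space_seen)
      return [:tCOLON3, :EXPR_BEG]
    end
    return [:tCOLON2, :EXPR_DOT]
  end

  # Assumed placement of the new rule: it has to run before the
  # EXPR_END test, because a hash key leaves the lexer in EXPR_END.
  return [:tASSOC, :EXPR_BEG] if rest.start_with?("-)")

  if state == :EXPR_END || state == :EXPR_ENDARG || c&.match?(/\s/)
    return [":", :EXPR_BEG]
  end

  [:tSYMBEG, :EXPR_FNAME]
end

p lex_colon(:EXPR_END, "-) 'value'")
```

Ordering is the interesting part: in {:key :-) 'value'} the lexer is in EXPR_END after :key, so a :-) check placed after the EXPR_END branch would never be reached.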
:-)
Case 2:
    Lisp Like Symbol
• Symbol Literal
 :vancouver
 'vancouver
• Ad-hoc
 p :a, :b
 p 'a, 'b
Single Quote
(in parser_yylex)
...
case '\'':
      lex_strterm = NEW_STRTERM(str_squote, '\'', 0);
      return tSTRING_BEG...
Single Quote
(in parser_yylex)
...
case '\'':
      if (??? condition ???) {
           lex_state = EXPR_FNAME;
           ...
(loop
  (lambda (p 'good)))
Case 3: Pre-increment Operator

• ++i
• i = i.succ
  (NOT i = i + 1)
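
A short sketch of the difference, runnable on an unpatched Ruby. The method names are illustrative; patched_pre_increment spells out by hand the expansion the slide describes for ++i.

```ruby
# Stock Ruby: `++i` is two unary pluses, so the value is unchanged.
def stock_pre_increment(i)
  ++i
end

# The expansion described on the slide: i = i.succ
def patched_pre_increment(i)
  i = i.succ
  i
end

p stock_pre_increment(1)       # unary plus twice
p patched_pre_increment(1)     # Integer#succ
p patched_pre_increment("az")  # String#succ, where `+ 1` would raise
```

Using succ rather than + 1 means the operator works on any object that defines succ, such as strings, not only on numbers.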
Lexer
@@ -685,6 +685,7 @@ static void
token_info_pop(struct parser_params*, const
char *token);
 %type <val> program reswo...
regenerate id.h

• id.h is automatically
 generated from parse.y during make
• $ rm id.h
 $ make
parser example
variable     : tIDENTIFIER
        |   tIVAR
        |   tGVAR
        |   tCONSTANT
        |   tCVAR
    ...
lhs     : variable
         {
         /*%%%*/
        if (!($$ = assignable($1, 0))) $$ = NEW_BEGIN(0);
         /*%
    ...
BNF (part)
program    : compstmt             arg       : lhs '=' arg
                                            | var_lhs...
Assign
stmt : ...
 | mlhs '=' command_call
     {
     /*%%%*/
         value_expr($3);
         $1->nd_value = $3;
      ...
mlhs
mlhs: mlhs_basic | ...
mlhs_basic: mlhs_head | ...
mlhs_head: mlhs_item ',' | ...
mlhs_item: mlhs_node | ...
mlhs_nod...
Method call
block_command        : block_call
| block_call '.' operation2 command_args
    {
    /*%%%*/
        $$ = NEW_...
Mix!
var_ref: ...
| tINCR variable
    {
    /*%%%*/
        $$ = assignable($2, 0);
        $$->nd_value = NEW_CALL(getta...
++ruby
Case 4:
         def A#b

• A#b
 instance method b of class A
• A.b
 class method b of class A
A#b
class A
  def b
    ...
  end
end

def A.b
  ...
end
A#b
def A#b
  ...
end

def A.b
  ...
end
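
For comparison, here are the two kinds of method in stock Ruby, where def A#b is not available because # starts a comment. Greeter is an illustrative class name.

```ruby
class Greeter
  # Instance method, written Greeter#hello in documentation.
  def hello
    "instance"
  end
end

# Singleton (class) method, written Greeter.hello.
def Greeter.hello
  "class"
end

p Greeter.new.hello
p Greeter.hello
p Greeter.instance_method(:hello).owner
p Greeter.method(:hello).owner
```

The proposed def A#b form would let the instance method be defined from outside the class body, mirroring how def A.b already works for singleton methods.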
#
(in parser_yylex)
case '#':                 /* it's a comment */
 /* no magic_comment in shebang line */
 if (!parser_ma...
#
(in parser_yylex)
case '#':                 /* it's a comment */
 c = nextc();
 pushback(c);
 if(lex_state == EXPR_END &...
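
The added condition can be read as a small decision function. A toy Ruby model of the patched branch, not MRI code; lex_hash is a hypothetical name and :comment stands in for the original comment-skipping path.

```ruby
# Toy model of the patched '#' branch in parser_yylex.
# state: lexer state before '#'; next_char: the character after '#'.
def lex_hash(state, next_char)
  # ISALNUM in the C code is approximated with an ASCII class here.
  alnum = next_char.to_s.match?(/\A[A-Za-z0-9]\z/)
  if state == :EXPR_END && alnum
    "#"        # handed to the grammar, as in `def A#b`
  else
    :comment   # original behaviour: skip to end of line
  end
end

p lex_hash(:EXPR_END, "b")  # after a constant name, as in A#b
p lex_hash(:EXPR_BEG, "b")  # comment at the start of an expression
p lex_hash(:EXPR_END, " ")  # `x = 1 # note` stays a comment
```

One consequence, if the condition is read literally: a comment written directly after an expression with no space, such as x = 1#note, would no longer be lexed as a comment.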
Primary
primary: literal | ...
       | k_def singleton dot_or_colon {lex_state = EXPR_FNAME;} fname
           {
        ...
| k_def cname '#' {lex_state = EXPR_FNAME;} fname
    {
        $<id>$ = cur_mid;
        cur_mid = $5;
        in_def++;
...
Reference
Ruby




Minero AOKI, Yukihiro
MATSUMOTO
"Ruby Hacking Guide"


HTML Version is available
Reference


• My blog
 http://ujihisa.blogspot.com
• All patches I showed are there
end
Appendix:
Imaginary Numbers
• Matz wrote a patch in
 [ruby-dev:38843]
• translation:
 [ruby-core:24730]
• It won't be acce...
Appendix:
 Imaginary Numbers

> 3i
=> (0 + 3i)
> 3i.class
=> Complex
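
The same value can be built on an unpatched Ruby with Kernel#Complex. A minimal sketch; imaginary is an illustrative helper name. As a side note, Ruby 2.1 and later do accept an i suffix on numeric literals, which the last line relies on.

```ruby
# Build a purely imaginary number without special literal syntax.
def imaginary(n)
  Complex(0, n)
end

p imaginary(3)
p imaginary(3).class
p imaginary(3) == 3i  # literal form, Ruby 2.1 and later
```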
Appendix 2:
  I'm looking for a job!
• ujihisa at gmail com
• Ruby, Rails, Merb, Sinatra, etc
• C, JavaScript, Vim script,
 H...
Hacking Parse.y with ujihisa

  1. 1. Hacking parse.y Tatsuhiro UJIHISA
  2. 2. Me • Ruby experience: 4 years • Rails application • Data mining tool • Learning English here: 5 months • Looking for a job!
  3. 3. Me • Presentations in Japan • Kansai Ruby Workshop • RubyKaigi2008, 2009
  4. 4. This is my first English presentation.
  5. 5. Hacking parse.y Fixing ruby parser to understand ruby • Introducing new syntax • {:key :-) "value"} • 'symbol • ++i • def A#b(c)
  6. 6. MRI Inside • MRI (Matz Ruby Implementation) • $ ruby -v ruby 1.9.2dev (2009-08-05 trunk 24397) [i386-darwin9.7.0] • Written in C • array.c, vm.c, gc.c, etc...
  7. 7. ruby 1.8 vs 1.9 • ~1.8 • Parser: parse.y • Evaluator: eval.c • 1.9~ • Parser: parse.y • Evaluator: YARV (vm*.c)
  8. 8. Matz said • Ugly: eval.c and parse.y RubyConf2006 • The original evaluator has now been entirely replaced by YARV
  9. 9. MRI Parser • MRI uses yacc (parser generator for C) • parse.y bison -d -o y.tab.c parse.y sed -f ./tool/ytab.sed -e "/^#/s!y.tab.c! parse.c!" y.tab.c > parse.c.new ...
  10. 10. parse.y • One of the darkest sides • $ wc -l *{c,h,y} | sort -n ... 9261 io.c 10350 parse.y 16352 parse.c # (automatically generated) 183370 total
  11. 11. (Broad) Parser • Lexer (yylex) • Bytes → Symbols • Parser (yyparse) • Symbols → Syntax Tree
  12. 12. Tokens in Lexer %token <id> tOP_ASGN /* +=, -= et %token tUPLUS /* unary+ */ %token tASSOC /* => */ %token tUMINUS /* unary- */ %token tLPAREN /* ( */ %token tPOW /* ** */ %token tLPAREN_ARG /* ( */ %token tCMP /* <=> */ %token tRPAREN /* ) */ %token tEQ /* == */ %token tLBRACK /* [ */ %token tEQQ /* === */ %token tLBRACE /* { */ %token tNEQ /* != */ %token tLBRACE_ARG /* { */ %token tGEQ /* >= */ %token tSTAR /* * */ %token tLEQ /* <= */ %token tAMPER /* & */ %token tANDOP tOROP /* && and || */ %token tLAMBDA /* -> */ %token tMATCH tNMATCH/* =~ and !~ */ %token tSYMBEG tSTRING_BEG tXSTRING_ %token tDOT2 tDOT3 /* .. and ... */ tWORDS_BEG tQWORDS_BEG %token tAREF tASET /* [] and []= */ %token tSTRING_DBEG tSTRING_DVAR tST %token tLSHFT tRSHFT /* << and >> */ %token tCOLON2 /* :: */ %token tCOLON3 /* :: at EXPR_BEG */
  13. 13. (detour) • MRI: parse.y (10350 lines) • JRuby: src/org/jruby/parser/{DefaultRubyParser.y, Ruby19Parser.y} (1886, 2076 lines) • Rubinius: lib/ruby_parser.y (1795 lines)
  14. 14. Case 1: :-) • Hash literal {:key => 'value'} {:key :-) 'value'} • :-) is just an alias of =>
  15. 15. Mastering “Colon”
  16. 16. Colons in Ruby • A::B, ::C • :symbol, :"sy-m-bol" • a ? b : c • {a: b} • when 1: something (in 1.8)
  17. 17. static int parser_yylex(struct parser_params *parser) { ... switch (c = nextc()) { ... case '#': /* it's a comment */ ... case ':': c = nextc(); if (c == ':') { if (IS_BEG() ||... ... } ... (about 1300 lines)
  18. 18. How does the parser deal with colons? • :: → tCOLON2 or tCOLON3 • tCOLON2 Net::URI • tCOLON3 ::Kernel
  19. 19. lex_state enum lex_state_e { EXPR_BEG, /* ignore newline, +/- is a sign. */ EXPR_END, /* newline significant, +/- is an operator. * EXPR_ENDARG, /* ditto, and unbound braces. */ EXPR_ARG, /* newline significant, +/- is an operator. * EXPR_CMDARG, /* newline significant, +/- is an operator. * EXPR_MID, /* newline significant, +/- is an operator. * EXPR_FNAME, /* ignore newline, no reserved words. */ EXPR_DOT, /* right after `.' or `::', no reserved words EXPR_CLASS, /* immediate after `class', no here document. EXPR_VALUE /* alike EXPR_BEG but label is disallowed. */ };
  20. 20. case ':': c = nextc(); if (c == ':') { if (IS_BEG() || lex_state == EXPR_CLASS || (IS_ARG() && space_seen)) { lex_state = EXPR_BEG; return tCOLON3; } lex_state = EXPR_DOT; return tCOLON2; }
  21. 21. ... if (lex_state == EXPR_END || lex_state == EXPR_ENDARG || (c != -1 && ISSPACE(c))) { pushback(c); lex_state = EXPR_BEG; return ':'; } switch (c) { case '\'': lex_strterm = NEW_STRTERM(str_ssym, c, 0); break; case '"': lex_strterm = NEW_STRTERM(str_dsym, c, 0); break; default: pushback(c); break; } lex_state = EXPR_FNAME; return tSYMBEG;
  22. 22. How does the parser deal with colons? (summary) • :: → tCOLON2 or tCOLON3 • EXPR_END, EXPR_ENDARG, or followed by whitespace → ':' (plain colon) • otherwise → tSYMBEG • :' → str_ssym • :" → str_dsym
  23. 23. So, • :-) → tASSOC • :: → tCOLON2 or tCOLON3 • EXPR_END, EXPR_ENDARG, or followed by whitespace → ':' (plain colon) • otherwise → tSYMBEG • :' → str_ssym • :" → str_dsym
  24. 24. :-)
  25. 25. Case 2: Lisp Like Symbol • Symbol Literal :vancouver 'vancouver • Ad-hoc p :a, :b p 'a, 'b
  26. 26. Single Quote (in parser_yylex) ... case '\'': lex_strterm = NEW_STRTERM(str_squote, '\'', 0); return tSTRING_BEG; ...
  27. 27. Single Quote (in parser_yylex) ... case '\'': if (??? condition ???) { lex_state = EXPR_FNAME; return tSYMBEG; } lex_strterm = NEW_STRTERM(str_squote, '\'', 0); return tSTRING_BEG; ...
  28. 28. (loop (lambda (p 'good)))
  29. 29. Case 3: Pre-increment Operator • ++i • i = i.succ (NOT i = i + 1)
  30. 30. Lexer @@ -685,6 +685,7 @@ static void token_info_pop(struct parser_params*, const char *token); %type <val> program reswords then do dot_or_colon %*/ %token tUPLUS /* unary+ */ +%token tINCR /* ++var */ %token tUMINUS /* unary- */ %token tPOW /* ** */ %token tCMP /* <=> */ (Actually there are more trivial fixes)
  31. 31. regenerate id.h • id.h is automatically generated from parse.y during make • $ rm id.h $ make
  32. 32. parser example variable : tIDENTIFIER | tIVAR | tGVAR | tCONSTANT | tCVAR | keyword_nil {ifndef_ripper($$ = keyword_nil);} | keyword_self {ifndef_ripper($$ = keyword_self);} | keyword_true {ifndef_ripper($$ = keyword_true);} | keyword_false {ifndef_ripper($$ = keyword_false);} | keyword__FILE__ {ifndef_ripper($$ = keyword__FILE__);} | keyword__LINE__ {ifndef_ripper($$ = keyword__LINE__);} | keyword__ENCODING__ {ifndef_ripper($$ = keyword__ENCODING_ ;
  33. 33. lhs : variable { /*%%%*/ if (!($$ = assignable($1, 0))) $$ = NEW_BEGIN(0); /*% $$ = dispatch1(var_field, $1); %*/ } | primary_value '[' opt_call_args rbracket { /*%%%*/ $$ = aryset($1, $3); /*% $$ = dispatch2(aref_field, $1, escape_Qundef($3)); %*/ } ...
  34. 34. BNF (part) program : compstmt arg : lhs '=' arg | var_lhs tOP_ASGN arg compstmt : stmts opt_terms | primary_value '[' aref_args ']' tOP stmts : none | stmt | arg '?' arg ':' arg | stmts terms stmt | primary stmt : kALIAS fitem fitem primary : literal | kALIAS tGVAR tGVAR | strings | expr | tLPAREN_ARG expr ')' | tLPAREN compstmt ')' expr : kRETURN call_args | kBREAK call_args | kREDO | kRETRY | '!' command_call | arg
  35. 35. Assign stmt : ... | mlhs '=' command_call { /*%%%*/ value_expr($3); $1->nd_value = $3; $$ = $1; /*% $$ = dispatch2(massign, $1, $3); %*/ }
  36. 36. mlhs mlhs: mlhs_basic | ... mlhs_basic: mlhs_head | ... mlhs_head: mlhs_item ',' | ... mlhs_item: mlhs_node | ... mlhs_node: variable { $$ = assignable($1, 0); }
  37. 37. Method call block_command : block_call | block_call '.' operation2 command_args { /*%%%*/ $$ = NEW_CALL($1, $3, $4); /*% $$ = dispatch3(call, $1, ripper_id2sym('.'), $$ = method_arg($$, $4); %*/ }
  38. 38. Mix! var_ref: ... | tINCR variable { /*%%%*/ $$ = assignable($2, 0); $$->nd_value = NEW_CALL(gettable($$->nd_vid), rb_intern("succ"), 0); /*% $$ = dispatch2(unary, ripper_intern("++@"), $2); %*/ }
  39. 39. ++ruby
  40. 40. Case 4: def A#b • A#b instance method b of class A • A.b class method b of class A
  41. 41. A#b class A def b ... end end def A.b ... end
  42. 42. A#b def A#b ... end def A.b ... end
  43. 43. # (in parser_yylex) case '#': /* it's a comment */ /* no magic_comment in shebang line */ if (!parser_magic_comment(parser, lex_p, lex_pend - lex_p)) { if (comment_at_top(parser)) { set_file_encoding(parser, lex_p, lex_pend); } } lex_p = lex_pend;
  44. 44. # (in parser_yylex) case '#': /* it's a comment */ c = nextc(); pushback(c); if(lex_state == EXPR_END && ISALNUM(c)) return '#'; /* no magic_comment in shebang line */ if (!parser_magic_comment(parser, lex_p, lex_pend - lex_p)) { if (comment_at_top(parser)) { set_file_encoding(parser, lex_p, lex_pend);
  45. 45. Primary primary: literal | ... | k_def singleton dot_or_colon {lex_state = EXPR_FNAME;} fname { in_single++; lex_state = EXPR_END; /* force for args */ /*%%%*/ local_push(0); /*% %*/ } f_arglist bodystmt k_end { /*%%%*/ NODE *body = remove_begin($8); reduce_nodes(&body); $$ = NEW_DEFS($2, $5, $7, body); fixpos($$, $2); local_pop(); /*% $$ = dispatch5(defs, $2, $3, $5, $7, $8); %*/ in_single--; }
  46. 46. | k_def cname '#' {lex_state = EXPR_FNAME;} fname { $<id>$ = cur_mid; cur_mid = $5; in_def++; /*%%%*/ local_push(0); /*% %*/ } f_arglist bodystmt k_end { /*%%%*/ NODE *body = remove_begin($8); reduce_nodes(&body); $$ = NEW_DEFN($5, $7, body, NOEX_PRIVATE); fixpos($$, $7); fixpos($$->nd_defn, $7); $$ = NEW_CLASS(NEW_COLON3($2), $$, 0); nd_set_line($$, $<num>6); local_pop(); /*% $$ = dispatch4(defi, $2, $5, $7, $8); %*/ in_def--; cur_mid = $<id>6; }
  47. 47. Reference Ruby Minero AOKI, Yukihiro MATSUMOTO "Ruby Hacking Guide" HTML Version is available
  48. 48. Reference • My blog http://ujihisa.blogspot.com • All patches I showed are there
  49. 49. end
  50. 50. Appendix: Imaginary Numbers • Matz wrote a patch in [ruby-dev:38843] • translation: [ruby-core:24730] • It won't be accepted
  51. 51. Appendix: Imaginary Numbers > 3i => (0 + 3i) > 3i.class => Complex
  52. 52. Appendix 2: I'm looking for a job! • ujihisa at gmail com • Ruby, Rails, Merb, Sinatra, etc • C, JavaScript, Vim script, HTML, Java, Haskell, Scheme • Fluent in Japanese