Inside PHP [OSCON 2012]

My slides from "Inside PHP", a talk about how to change the syntax of the PHP programming language.

Modified PHP 5.4.4 source code (with the "until" keyword added during this presentation) is available here:

http://github.com/thomaslee/oscon2012-inside-php

My slides from "Inside PHP", a talk about how to change the syntax of the PHP programming language.

Modified PHP 5.4.4 source code (with the "until" keyword added during this presentation) is available here:

http://github.com/thomaslee/oscon2012-inside-php

Advertisement
Advertisement

More Related Content

Advertisement

Similar to Inside PHP [OSCON 2012] (20)

Recently uploaded (20)

Advertisement

Inside PHP [OSCON 2012]

  1. 1. Inside PHP Tom Lee @tglee OSCON 2012 19th July, 2012
  2. 2. Overview • About me! • New Relic’s PHP Agent escapee. • Now on New Projects, doing unspeakably un-PHP things. • Wannabe compiler nerd. • Terminology & brief intro to compilers: • Grammars, Scanners & Parsers • General architecture of a bytecode compiler • Hands on: Modifying the PHP language • PHP/Zend compiler architecture & summary • Case study in adding a new keyword
  3. 3. “Zend” vs. “Zend Engine” vs. “PHP” • I will use all of these interchangeably throughout this talk. • Referring to the bytecode compiler in the “Zend Engine 2” in most cases. • The distinction doesn’t really matter here.
  4. 4. Compilers 101: Scanners • Or lexical analyzers, or tokenizers • Input: raw source code • Output: a stream of tokens • Example: while ($x == $y) becomes T_WHILE '(' T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") ')'
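
To see a real token stream, stock PHP can be asked directly. The sketch below assumes the bundled tokenizer extension is enabled; `token_get_all()` and `token_name()` are its real functions, while `describe_tokens()` is a hypothetical helper written just for this illustration.

```php
<?php
// Print the token stream that PHP's own scanner produces for a snippet.
// describe_tokens() is a hypothetical helper, not part of PHP.
function describe_tokens($source)
{
    $out = array();
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Named tokens arrive as array(id, text, line number).
            if ($token[0] === T_WHITESPACE || $token[0] === T_OPEN_TAG) {
                continue; // skip whitespace and the opening tag for readability
            }
            $out[] = token_name($token[0]) . '(' . $token[1] . ')';
        } else {
            // Single-character tokens arrive as plain strings.
            $out[] = "'" . $token . "'";
        }
    }
    return $out;
}

echo implode(' ', describe_tokens('<?php while ($x == $y) {}')), "\n";
```

This prints `T_WHILE(while) '(' T_VARIABLE($x) T_IS_EQUAL(==) T_VARIABLE($y) ')' '{' '}'`. Note that the output is flat: nothing in it says the parentheses enclose a condition.
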
  5. 5. Compilers 101: Parsers • Input: a stream of tokens from the scanner • Output is implementation dependent • Often an intermediate, in-memory representation of the program in tree form. • e.g. Parse Tree or Abstract Syntax Tree • Or directly generate bytecode. • Goal of a parser is to structure the token stream. • Parsers are frequently generated from a DSL • See parser generators like Yacc/Bison, ANTLR, etc. or e.g. parser combinators in Haskell, Scala, ML. • Example: T_WHILE '(' T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") ')' becomes 0: ZEND_IS_EQUAL ~0 !0 !1 1: ZEND_JMPZ ~0 ->3 2: … 3: …
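
To make "structuring the token stream" concrete, here is a minimal hand-written sketch of a parser for just the head of a while statement. It is not how PHP does it (PHP's parser is generated by Bison). The `array(type, text)` token shape and the tree layout are assumptions made up for this example.

```php
<?php
// Toy recursive-descent parser for: T_WHILE '(' T_VARIABLE T_IS_EQUAL T_VARIABLE ')'
// Tokens are hypothetical array(type, text) pairs, not the real scanner's output.
function parse_while(array $tokens)
{
    $pos = 0;
    // Consume one token of the given type, or fail loudly.
    $expect = function ($type) use ($tokens, &$pos) {
        if (!isset($tokens[$pos]) || $tokens[$pos][0] !== $type) {
            throw new Exception("expected $type at token $pos");
        }
        return $tokens[$pos++][1];
    };

    $expect('T_WHILE');
    $expect('(');
    $left = $expect('T_VARIABLE');
    $expect('T_IS_EQUAL');
    $right = $expect('T_VARIABLE');
    $expect(')');

    // The flat token stream now has structure: a loop node owning a condition node.
    return array(
        'type' => 'while',
        'cond' => array('type' => '==', 'left' => $left, 'right' => $right),
    );
}

$tree = parse_while(array(
    array('T_WHILE', 'while'), array('(', '('), array('T_VARIABLE', 'x'),
    array('T_IS_EQUAL', '=='), array('T_VARIABLE', 'y'), array(')', ')'),
));
print_r($tree);
```

A tree like this is the "intermediate representation" route. PHP 5.4 takes the other route and emits bytecode from inside the parser actions.
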
  6. 6. Compilers 101: Context-free grammars • Or simply “grammar” • A grammar describes the complete syntax of a (programming) language. • Usually expressed in Extended Backus-Naur Form (EBNF) • Or some variant thereof. • Variants of EBNF used for a lot of DSL-based parser generators • e.g. Yacc/Bison, ANTLR, etc.
  7. 7. Generalized Compiler Architecture* • Source files → [source code] → Scanner → [token stream] → Parser → [Abstract Syntax Tree] → Code Generator → [bytecode] → Bytecode Interpreter • * Actually a generalized *bytecode* compiler architecture
  8. 8. Generalized *PHP* Compiler Architecture • Source files → [source code] → Scanner (Zend/zend_language_scanner.l) → [token stream] → Parser (Zend/zend_language_parser.y) → Code Generator (Zend/zend_compile.c) → [bytecode] → Bytecode Interpreter (Zend/zend_execute.c) • PHP compiles directly to bytecode!
  9. 9. Case Study: The “until” statement • It’s basically while (!...) ... • <?php $x = 5; until ($x == 0) { $x--; echo "Oh hi, Mark [$x]\n"; } • -- output -- Oh hi, Mark [4] Oh hi, Mark [3] Oh hi, Mark [2] Oh hi, Mark [1] Oh hi, Mark [0]
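
Since until is "basically while (!...)", the same loop can be spelled in stock PHP for comparison. This is only a sketch of the intended semantics. `count_down()` is a hypothetical wrapper so the result can be inspected.

```php
<?php
// "until (cond) body" is meant to behave like "while (!(cond)) body".
// count_down() is a hypothetical wrapper used only for this illustration.
function count_down($start)
{
    $lines = array();
    $x = $start;
    while (!($x == 0)) { // stock-PHP spelling of: until ($x == 0)
        $x--;
        $lines[] = "Oh hi, Mark [$x]";
    }
    return $lines;
}

echo implode("\n", count_down(5)), "\n";
```

Run on an unmodified PHP, this prints the same five lines as the slide, from `Oh hi, Mark [4]` down to `Oh hi, Mark [0]`.
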
  10. 10. How to add “until” to the PHP language 1. Tell the scanner how to tokenize new keyword(s) 2. Describe the syntax of the new construct 3. Emit bytecode
  11. 11. Before you start... • You’ll need the usual gcc toolchain, GNU Bison, etc. • Debian/Ubuntu: apt-get install build-essential • OSX: Xcode command line tools should give you most of what you need. • Also ensure that you have re2c • Debian/Ubuntu: apt-get install re2c • OSX (Homebrew): brew install re2c • Used to generate the scanner • Silently ignored if not found by the configure script! • And, of course, source code for some recent version of PHP 5. • I’m working with PHP 5.4.4
  12. 12. 1. Tell the scanner how to tokenize “until” • Zend/zend_language_scanner.l • Input for re2c, which will generate the Zend language scanner. • Describes how raw source code should be converted into tokens. • Note that no structure is implied here: that’s the parser’s job. • Tell the scanner that the word “until” is special. • The parser also needs to know about new tokens! • How is this done for the while keyword? • Example: until ($x == $y) becomes T_UNTIL '(' T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") ')'
  13. 13. 2. Describe the syntax of “until” • Zend/zend_language_parser.y • Essentially serves as the grammar for the Zend language. • Also describes actions to perform during parsing. • Input for the parser generator (Bison) used to generate the PHP parser. • Tell PHP how until statements are structured syntactically. • How was it done for a while statement? • Syntax: T_UNTIL '(' expr ')' statement
  14. 14. 3. Emit bytecode • Add actions to Zend/zend_language_parser.y • What should they do? • Recall that PHP generates bytecode during the parsing process. • Generate bytecode describing the semantics of until in terms of the PHP VM. • Er, wait -- what bytecode do we need to generate? • Diagram: Compiler → Bytecode
  15. 15. Intermission: PHP bytecode intro • opline • Data structure representing a single line of PHP VM “assembly” • Includes opcode + operands: <opcode> <result?> <op1?> <op2?> • opline # associated with each opline • Different variable types, differentiated by prefix: • Variables ($) • Compiled variables (!) • Temporary variables (~) • ZEND_JMP • “goto” • Conditional variants: ZEND_JMPZ, ZEND_JMPNZ • opline #s used as address operand for JMP instructions (->) • ZEND_JMP <op1>: unconditional jump to the opline # in op1, e.g. jump to opline #10: ZEND_JMP ->10 • ZEND_JMPZ <op1> <op2>: conditional jump to the opline # in op2 iff op1 is zero, e.g. jump to opline #3 if ~0 is zero: ZEND_JMPZ ~0 ->3 • ZEND_IS_EQUAL <result> <op1> <op2>: result=1 if op1 == op2, otherwise result=0, e.g. set ~0=1 if !0 == 10: ZEND_IS_EQUAL ~0 !0 10
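
The jump semantics are easier to follow when stepped through by hand. The toy interpreter below is a sketch only and has nothing to do with the real Zend VM: oplines are plain arrays, operands are string names such as `!0` and `~0`, and the `DEC` opcode is invented so the loop body has something to do.

```php
<?php
// Toy interpreter for the opcodes on this slide. NOT the Zend VM:
// the opline layout and the DEC opcode are assumptions for illustration.
function run_oplines(array $oplines, array $vars)
{
    $trace = array();
    $pc = 0; // opline # of the next instruction to execute
    while ($pc < count($oplines)) {
        $op = $oplines[$pc];
        $trace[] = $pc;
        switch ($op[0]) {
            case 'IS_EQUAL': // IS_EQUAL result op1 op2
                $vars[$op[1]] = ($vars[$op[2]] == $vars[$op[3]]) ? 1 : 0;
                $pc++;
                break;
            case 'JMPZ':     // JMPZ op1 target: jump iff op1 is zero
                $pc = ($vars[$op[1]] == 0) ? $op[2] : $pc + 1;
                break;
            case 'JMPNZ':    // JMPNZ op1 target: jump iff op1 is non-zero
                $pc = ($vars[$op[1]] != 0) ? $op[2] : $pc + 1;
                break;
            case 'JMP':      // JMP target: unconditional "goto"
                $pc = $op[1];
                break;
            case 'DEC':      // hypothetical: decrement a variable in place
                $vars[$op[1]]--;
                $pc++;
                break;
            default:
                throw new Exception("unknown opcode {$op[0]}");
        }
    }
    return array($vars, $trace);
}

// Roughly: while ($x == $y) { $x--; } with $x as !0 and $y as !1.
$program = array(
    array('IS_EQUAL', '~0', '!0', '!1'), // 0
    array('JMPZ', '~0', 4),              // 1: leave the loop when ~0 is zero
    array('DEC', '!0'),                  // 2
    array('JMP', 0),                     // 3: back to the condition
);
list($vars, $trace) = run_oplines($program, array('!0' => 3, '!1' => 3));
echo implode(' ', $trace), "\n";
```

Starting with both variables at 3, the oplines execute in the order `0 1 2 3 0 1`: one pass through the body, then the condition fails and ZEND_JMPZ-style logic jumps past the end.
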
  16. 16. Unconditional jump: ZEND_JMP 0: ... 1: ... 2: ZEND_JMP ->0
  20. 20. Conditional jump: ZEND_JMPZ / ZEND_JMPNZ 0: ... 1: ... 2: ZEND_JMPZ ~0 ->0 3: ...
  27. 27. 3. Emit bytecode (cont.) • Zend/zend_compile.c • The Zend language’s code generation logic lives here. • No DSLs here: plain old C source code. • First, let’s try to understand the bytecode for while • How do we need to modify it for until?
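
One plausible answer to "how do we need to modify it" is that only the exit jump changes: while leaves the loop when the condition is zero, until leaves when it is non-zero. The emitter below is a hypothetical sketch of that idea in PHP, not the actual C code in Zend/zend_compile.c. The function names, the array layout, and the `...` placeholder for the loop body are all assumptions.

```php
<?php
// Hypothetical loop emitter. It mimics the shape condition / exit jump /
// body / jump back, and is not the real zend_compile.c API.
function emit_loop($keyword, array $cond, array $body)
{
    $ops = array();
    $start = count($ops); // opline # where the condition begins
    foreach ($cond as $op) {
        $ops[] = $op;
    }
    // The exit target is unknown until the body is emitted, so leave a hole.
    $exitJump = count($ops);
    $opcode = ($keyword === 'until') ? 'ZEND_JMPNZ' : 'ZEND_JMPZ';
    $ops[] = array($opcode, '~0', null);
    foreach ($body as $op) {
        $ops[] = $op;
    }
    $ops[] = array('ZEND_JMP', '->' . $start); // loop back to the condition
    $ops[$exitJump][2] = '->' . count($ops);   // backpatch the exit target
    return $ops;
}

// Render oplines in the "N: OPCODE operands" style used on the slides.
function format_ops(array $ops)
{
    $lines = array();
    foreach ($ops as $i => $op) {
        $lines[] = $i . ': ' . implode(' ', $op);
    }
    return $lines;
}

$cond = array(array('ZEND_IS_EQUAL', '~0', '!0', '!1'));
$body = array(array('...')); // placeholder for whatever the loop body compiles to
echo implode("\n", format_ops(emit_loop('while', $cond, $body))), "\n\n";
echo implode("\n", format_ops(emit_loop('until', $cond, $body))), "\n";
```

The two listings differ in exactly one opline: `1: ZEND_JMPZ ~0 ->4` for while versus `1: ZEND_JMPNZ ~0 ->4` for until. Writing the jump first and filling in its target later is the usual trick when bytecode is emitted in a single pass.
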
  28. 28. Demo! • Time to build! • The usual ./configure && make dance on Linux & OSX. • To be thorough, regenerate data used by the tokenizer extension: (cd ext/tokenizer && ./tokenizer_data_gen.sh) • http://php.net/manual/en/book.tokenizer.php • You’ll need to run make again once you’ve done this. • With a little luck, magic happens and you get a binary in sapi/cli/php • Take until out for a spin!
  29. 29. And exhale. • Lots to take in, right? • In my experience, this stuff is best learned bit-by-bit through practice. • Ask questions! • Google • php-internals • Or hey, ask me...
  30. 30. Thanks! oscon@tomlee.co @tglee http://newrelic.com ... and come see Inside Python @ 5pm in D135 :)