Inside PHP [OSCON 2012]

My slides from "Inside PHP", a talk about how to change the syntax of the PHP programming language.

Modified PHP 5.4.4 source code (with the "until" keyword added during this presentation) is available here:

http://github.com/thomaslee/oscon2012-inside-php

Published in: Technology

    1. Inside PHP
       Tom Lee (@tglee)
       OSCON 2012
       19th July, 2012
    2. Overview
       • About me!
         • New Relic’s PHP Agent escapee.
         • Now on New Projects, doing unspeakably un-PHP things.
         • Wannabe compiler nerd.
       • Terminology & brief intro to compilers:
         • Grammars, Scanners & Parsers
         • General architecture of a bytecode compiler
       • Hands on: Modifying the PHP language
         • PHP/Zend compiler architecture & summary
         • Case study in adding a new keyword
    3. “Zend” vs. “Zend Engine” vs. “PHP”
       • I will use all of these interchangeably throughout this talk.
       • Referring to the bytecode compiler in the “Zend Engine 2” in most cases.
       • The distinction doesn’t really matter here.
    4. Compilers 101: Scanners
       • Or lexical analyzers, or tokenizers
       • Input: raw source code
       • Output: a stream of tokens
       Example:
         while ($x == $y)
           → T_WHILE ( T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") )
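
The scanner slide can be made concrete with a small hand-written tokenizer. This is only a sketch: the `Token` type, the `TOK_*` names and the `scan` function are made up for illustration and are not how the Zend scanner is built (that one is generated by re2c). It also compares keywords case-sensitively, which real PHP does not.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch, loosely named after the slide. */
typedef enum { TOK_WHILE, TOK_IDENT, TOK_VARIABLE, TOK_IS_EQUAL, TOK_CHAR } TokenKind;

typedef struct {
    TokenKind kind;
    char text[32]; /* name without '$', operator text, or a single character */
} Token;

/* Copy an identifier starting at p into buf (truncating if needed) and
 * return a pointer to the first character after it. */
static const char *read_name(const char *p, char *buf, size_t cap)
{
    size_t len = 0;
    while (isalnum((unsigned char)*p) || *p == '_') {
        if (len + 1 < cap)
            buf[len++] = *p;
        p++;
    }
    buf[len] = '\0';
    return p;
}

/* Split src into at most max tokens; returns how many were written. */
size_t scan(const char *src, Token *out, size_t max)
{
    const char *p = src;
    size_t n = 0;

    assert(src != NULL && out != NULL);
    while (*p != '\0' && n < max) {
        Token t;
        if (isspace((unsigned char)*p)) { p++; continue; }
        memset(&t, 0, sizeof t);
        if (*p == '$' && (isalpha((unsigned char)p[1]) || p[1] == '_')) {
            t.kind = TOK_VARIABLE;
            p = read_name(p + 1, t.text, sizeof t.text);
        } else if (isalpha((unsigned char)*p) || *p == '_') {
            p = read_name(p, t.text, sizeof t.text);
            /* Keywords are just identifiers the scanner treats specially. */
            t.kind = strcmp(t.text, "while") == 0 ? TOK_WHILE : TOK_IDENT;
        } else if (p[0] == '=' && p[1] == '=') {
            t.kind = TOK_IS_EQUAL;
            strcpy(t.text, "==");
            p += 2;
        } else {
            t.kind = TOK_CHAR; /* punctuation such as '(' and ')' */
            t.text[0] = *p++;
        }
        out[n++] = t;
    }
    return n;
}
```

Calling `scan("while ($x == $y)", tokens, 16)` yields six tokens in the same order as the slide. Note that the output is flat: nothing in it says that the parenthesised part is the loop condition.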
    5. Compilers 101: Parsers
       • Input: a stream of tokens from the scanner
       • Output is implementation dependent
         • Often an intermediate, in-memory representation of the program in tree form.
           • e.g. Parse Tree or Abstract Syntax Tree
         • Or directly generate bytecode.
       • Goal of a parser is to structure the token stream.
       • Parsers are frequently generated from a DSL
         • See parser generators like Yacc/Bison, ANTLR, etc. or e.g. parser combinators in Haskell, Scala, ML.
       Example:
         T_WHILE ( T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") )
           → 0: ZEND_IS_EQUAL ~0 !0 !1
             1: ZEND_JMPZ ~0 ->3
             2: …
             3: …
    6. Compilers 101: Context-free grammars
       • Or simply “grammar”
       • A grammar describes the complete syntax of a (programming) language.
       • Usually expressed in Extended Backus-Naur Form (EBNF)
         • Or some variant thereof.
       • Variants of EBNF used for a lot of DSL-based parser generators
         • e.g. Yacc/Bison, ANTLR, etc.
    7. Generalized Compiler Architecture*
         Source files
           → (source code) → Scanner
           → (token stream) → Parser
           → (Abstract Syntax Tree) → Code Generator
           → (bytecode) → Bytecode Interpreter
       * Actually a generalized *bytecode* compiler architecture
    8. Generalized *PHP* Compiler Architecture
         Source files
           → (source code) → Scanner [Zend/zend_language_scanner.l]
           → (token stream) → Parser [Zend/zend_language_parser.y]
           → (Abstract Syntax Tree) → Code Generator [Zend/zend_compile.c]
           → (bytecode) → Bytecode Interpreter [Zend/zend_execute.c]
       PHP compiles directly to bytecode!
    9. Case Study: The “until” statement
       It’s basically while (!...) ...
         <?php
         $x = 5;
         until ($x == 0) {
             $x--;
             echo "Oh hi, Mark [$x]\n";
         }
       -- output --
         Oh hi, Mark [4]
         Oh hi, Mark [3]
         Oh hi, Mark [2]
         Oh hi, Mark [1]
         Oh hi, Mark [0]
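
Since `until` is described as "basically while (!...)", the case-study loop can be checked by writing the negated `while` by hand. The sketch below does that in C rather than PHP; `run_until_demo` is a hypothetical helper that collects the output in a buffer so it can be inspected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Runs the case-study loop with `until (x == 0)` rewritten as
 * `while (!(x == 0))`. Output is appended to buf (capacity cap).
 * Returns the number of completed iterations. */
int run_until_demo(int x, char *buf, size_t cap)
{
    size_t used = 0;
    int iterations = 0;

    assert(buf != NULL && cap > 0 && x >= 0);
    buf[0] = '\0';
    while (!(x == 0)) {                 /* until ($x == 0) */
        int written;
        x--;
        written = snprintf(buf + used, cap - used, "Oh hi, Mark [%d]\n", x);
        if (written < 0 || (size_t)written >= cap - used)
            break;                      /* buffer full: stop early */
        used += (size_t)written;
        iterations++;
    }
    return iterations;
}
```

With `x = 5` and a large enough buffer this runs five iterations and produces the same five lines as the slide, from `[4]` down to `[0]`.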
    10. How to add “until” to the PHP language
        1. Tell the scanner how to tokenize new keyword(s)
        2. Describe the syntax of the new construct
        3. Emit bytecode
    11. Before you start...
        • You’ll need the usual gcc toolchain, GNU Bison, etc.
          • Debian/Ubuntu: apt-get install build-essential
          • OSX: Xcode command line tools should give you most of what you need.
        • Also ensure that you have re2c
          • Debian/Ubuntu: apt-get install re2c
          • OSX (Homebrew): brew install re2c
          • Used to generate the scanner
          • Silently ignored if not found by the configure script!
        • And, of course, source code for some recent version of PHP 5.
          • I’m working with PHP 5.4.4
    12. 1. Tell the scanner how to tokenize “until”
        • Zend/zend_language_scanner.l
          • Input for re2c, which will generate the Zend language scanner.
          • Describes how raw source code should be converted into tokens.
          • Note that no structure is implied here: that’s the parser’s job.
        • Tell the scanner that the word “until” is special.
        • The parser also needs to know about new tokens!
        • How is this done for the while keyword?
        Example:
          until ($x == $y)
            → T_UNTIL ( T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") )
    13. 2. Describe the syntax of “until”
        • Zend/zend_language_parser.y
          • Essentially serves as the grammar for the Zend language.
          • Also describes actions to perform during parsing.
          • Input for the parser generator (Bison) used to generate the PHP parser.
        • Tell PHP how until statements are structured syntactically.
        • How was it done for a while statement?
        Syntax:
          T_UNTIL ( expr ) statement
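
To see what "structure" means for `T_UNTIL ( expr ) statement`, here is a minimal recursive-descent recognizer for that rule. It is a sketch under stated assumptions: Bison generates a table-driven LALR(1) parser rather than code like this, and the `Tok` enum, the simplified `expr` and `statement` rules, and every function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token set; the input array must end with T_END. */
typedef enum { T_END, T_UNTIL, T_LPAREN, T_RPAREN,
               T_VARIABLE, T_IS_EQUAL, T_SEMI } Tok;

typedef struct { const Tok *toks; size_t pos; } Parser;

/* Consume the next token if it is t. T_END is never consumed,
 * so pos cannot run past the end of the array. */
static int eat(Parser *p, Tok t)
{
    if (p->toks[p->pos] == t) { p->pos++; return 1; }
    return 0;
}

/* expr := T_VARIABLE [ T_IS_EQUAL T_VARIABLE ]   (simplified) */
static int parse_expr(Parser *p)
{
    if (!eat(p, T_VARIABLE)) return 0;
    if (eat(p, T_IS_EQUAL)) return eat(p, T_VARIABLE);
    return 1;
}

static int parse_statement(Parser *p);

/* until_statement := T_UNTIL '(' expr ')' statement */
static int parse_until(Parser *p)
{
    return eat(p, T_UNTIL) && eat(p, T_LPAREN) && parse_expr(p)
        && eat(p, T_RPAREN) && parse_statement(p);
}

/* statement := until_statement | ';' | expr ';'   (simplified) */
static int parse_statement(Parser *p)
{
    if (p->toks[p->pos] == T_UNTIL) return parse_until(p);
    if (eat(p, T_SEMI)) return 1;
    return parse_expr(p) && eat(p, T_SEMI);
}

/* Returns 1 if toks is exactly one well-formed statement, else 0. */
int parse_program(const Tok *toks)
{
    Parser p;
    assert(toks != NULL);
    p.toks = toks;
    p.pos = 0;
    return parse_statement(&p) && p.toks[p.pos] == T_END;
}
```

The token sequence for `until ($x == $y) ;` is accepted, while the same sequence without the closing parenthesis is rejected. In the real grammar file, each rule would also carry the actions that emit bytecode.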
    14. 3. Emit bytecode
        • Add actions to Zend/zend_language_parser.y
          • What should they do?
        • Recall that PHP generates bytecode during the parsing process.
        • Generate bytecode describing the semantics of until in terms of the PHP VM.
          • Er, wait -- what bytecode do we need to generate?
        Compiler → Bytecode
    15. Intermission: PHP bytecode intro
        • opline
          • Data structure representing a single line of PHP VM “assembly”
          • Includes opcode + operands
          • opline # associated with each opline
        • Different variable types, differentiated by prefix:
          • Variables ($)
          • Compiled variables (!)
          • Temporary variables (~)
        • ZEND_JMP
          • “goto”
          • Conditional variants: ZEND_JMPZ, ZEND_JMPNZ
          • opline #s used as address operand for JMP instructions (->)
        Opline format: <opcode> <result?> <op1?> <op2?>
          ZEND_JMP <op1>
            Unconditional jump to the opline # in op1
            e.g. jump to opline #10
              ZEND_JMP ->10
          ZEND_JMPZ <op1> <op2>
            Conditional jump to the opline # in op2 iff op1 is zero
            e.g. jump to opline #3 if ~0 is zero
              ZEND_JMPZ ~0 ->3
          ZEND_IS_EQUAL <result> <op1> <op2>
            result=1 if op1 == op2, otherwise result=0
            e.g. set ~0=1 if !0 == 10
              ZEND_IS_EQUAL ~0 !0 10
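
The opline format and the jump semantics above can be tried out in a toy interpreter. This is a rough model, not the Zend VM: the `Opline`, `Operand` and `Frame` layouts, the `OP_*` and `K_*` names, and the `run` function are all assumptions made for this sketch, and values are plain integers.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_JMP, OP_JMPZ, OP_JMPNZ, OP_IS_EQUAL, OP_PRE_DEC } Opcode;

/* Operand kinds: compiled variable (!n), temporary (~n),
 * literal constant, or jump target (->n). */
typedef enum { K_UNUSED, K_CV, K_TMP, K_CONST, K_ADDR } OperandKind;

typedef struct { OperandKind kind; int value; } Operand;
typedef struct { Opcode opcode; Operand result, op1, op2; } Opline;
typedef struct { int cv[8]; int tmp[8]; } Frame;

static int load(const Frame *f, Operand o)
{
    switch (o.kind) {
    case K_CV:  assert(o.value >= 0 && o.value < 8); return f->cv[o.value];
    case K_TMP: assert(o.value >= 0 && o.value < 8); return f->tmp[o.value];
    default:    return o.value; /* constants and addresses are literal */
    }
}

static void store(Frame *f, Operand o, int v)
{
    assert(o.value >= 0 && o.value < 8);
    if (o.kind == K_CV) f->cv[o.value] = v;
    else if (o.kind == K_TMP) f->tmp[o.value] = v;
}

/* Execute oplines until control runs off the end of code or max_steps
 * is reached; returns the number of oplines executed. */
int run(const Opline *code, size_t count, Frame *f, int max_steps)
{
    size_t pc = 0; /* opline # of the next instruction */
    int steps = 0;

    while (pc < count && steps < max_steps) {
        const Opline *ol = &code[pc];
        steps++;
        switch (ol->opcode) {
        case OP_JMP:
            pc = (size_t)ol->op1.value;
            continue;
        case OP_JMPZ:
            if (load(f, ol->op1) == 0) { pc = (size_t)ol->op2.value; continue; }
            break;
        case OP_JMPNZ:
            if (load(f, ol->op1) != 0) { pc = (size_t)ol->op2.value; continue; }
            break;
        case OP_IS_EQUAL:
            store(f, ol->result, load(f, ol->op1) == load(f, ol->op2));
            break;
        case OP_PRE_DEC:
            store(f, ol->op1, load(f, ol->op1) - 1);
            break;
        }
        pc++; /* fall through to the next opline */
    }
    return steps;
}
```

A four-opline program (`IS_EQUAL ~0 !0 0`, `JMPNZ ~0 ->4`, `PRE_DEC !0`, `JMP ->0`) counts `!0` down to zero and then jumps past the end, which is the shape of loop the rest of the talk builds on.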
    16. Unconditional jump: ZEND_JMP
          0: ...
          1: ...
          2: ZEND_JMP ->0
    20. Conditional jump: ZEND_JMPZ / ZEND_JMPNZ
          0: ...
          1: ...
          2: ZEND_JMPZ ~0 ->0
          3: ...
    27. 3. Emit bytecode (cont.)
        • Zend/zend_compile.c
          • The Zend language’s code generation logic lives here.
          • No DSLs here: plain old C source code.
        • First, let’s try to understand the bytecode for while
        • How do we need to modify it for until?
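
One plausible answer to the question above, sketched as a toy code generator: emit the same loop skeleton for both statements and only swap the conditional jump, so `while` exits on ZEND_JMPZ-style "condition is zero" and `until` exits on ZEND_JMPNZ-style "condition is non-zero". This is an illustration of the idea, not the code in Zend/zend_compile.c. `OpArray`, `emit`, `compile_loop` and the `OP_COND`/`OP_BODY` placeholders are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* OP_COND and OP_BODY stand in for whatever oplines the condition
 * and the loop body would really compile to. */
typedef enum { OP_COND, OP_BODY, OP_JMP, OP_JMPZ, OP_JMPNZ } Opcode;

typedef struct {
    Opcode opcode;
    int target; /* opline # for jumps, -1 otherwise */
} Opline;

typedef struct { Opline ops[32]; size_t count; } OpArray;

/* Append one opline and return its opline #. */
static size_t emit(OpArray *a, Opcode opcode, int target)
{
    assert(a->count < 32);
    a->ops[a->count].opcode = opcode;
    a->ops[a->count].target = target;
    return a->count++;
}

/* Compile `while (cond) body` when is_until == 0,
 * or `until (cond) body` when is_until != 0. */
void compile_loop(OpArray *a, int is_until)
{
    size_t start = a->count; /* the loop jumps back to here */
    size_t exit_jump;

    emit(a, OP_COND, -1);    /* evaluate the condition into a temporary */
    /* The exit target is not known yet, so emit a placeholder. */
    exit_jump = emit(a, is_until ? OP_JMPNZ : OP_JMPZ, -1);
    emit(a, OP_BODY, -1);
    emit(a, OP_JMP, (int)start);
    /* Backpatch: leave the loop at the first opline after it. */
    a->ops[exit_jump].target = (int)a->count;
}
```

Starting from an empty `OpArray`, both variants produce four oplines with the exit jump at opline 1 targeting opline 4. The only difference is the opcode of that jump, which matches the "while (!...)" description of `until`.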
    28. Demo!
        • Time to build!
          • The usual ./configure && make dance on Linux & OSX.
        • To be thorough, regenerate data used by the tokenizer extension.
            (cd ext/tokenizer && ./tokenizer_data_gen.sh)
          • http://php.net/manual/en/book.tokenizer.php
          • You’ll need to run make again once you’ve done this.
        • With a little luck, magic happens and you get a binary in sapi/cli/php
        • Take until out for a spin!
    29. And exhale.
        • Lots to take in, right?
          • In my experience, this stuff is best learned bit-by-bit through practice.
        • Ask questions!
          • Google
          • php-internals
          • Or hey, ask me...
    30. Thanks!
        oscon@tomlee.co
        @tglee
        http://newrelic.com
        ... and come see Inside Python @ 5pm in D135 :)
