HHVM: Efficient and Scalable
PHP/Hack Execution
Guilherme Ottoni
HHVM Team
Facebook
November 2016
Speaker Intro
• Background in compilers,
performance analysis and
optimizations
• 15+ years working in these areas
• Spent last 5.5 years pushing the
limits of PHP performance at
Facebook
Motivation: Why PHP?
• Many claim it’s a poorly designed language
• But it’s pretty successful and widely used,
especially for server-side web development
• Biggest strengths:
– Integration with web servers
– Rapid development cycle (FAST)
Motivation: Why PHP?
• Top websites [http://www.alexa.com/topsites]:
1. Google
2. YouTube
3. Facebook
4. Baidu
5. Wikipedia
6. Yahoo
• Several of these top sites run PHP
Motivation: Why HHVM?
• High load requires high
performance
• Standard PHP
– Not very efficient
– Interpreter-based
– The only industrial-strength implementation of PHP that existed
• HHVM’s goals:
1. Improve performance of PHP execution
2. Compatibility with Standard PHP
More Efficient… By How Much?
[Chart: performance relative to PHP 5.1 running facebook.com, on a scale from 0x to 28x; annotations mark the HipHop Compiler and the point where HHVM was released to production.]
Motivation: Why HHVM?
• Top websites [http://www.alexa.com/topsites]:
1. Google
2. YouTube
3. Facebook
4. Baidu
5. Wikipedia
6. Yahoo
• Several of these sites now run HHVM
• Many other adopters, e.g. Box, Etsy, Slack, WordPress
• Wikipedia’s CPU usage dropped by 6x when it
switched to HHVM
[http://hhvm.com/blog/7205/wikipedia-on-hhvm]
Overview of HHVM Pipeline
• Ahead of time: PHP source → Parser → AST → Optimize → Bytecode Emitter → HHBC → Optimize, Type Inference
• Runtime: HHBC → HHIR → Optimize → Vasm → Optimize → x86 machine code
Challenges for PHP Performance
1. Lack of static type information
2. Huge amounts of code
a. JIT speed
b. Code locality
3. Reference counting
Challenge #1: Dynamic Typing
• HHVM’s basic principle to achieve
performance: operate on type-
specialized code
• Ahead-of-time type inference
– AST-based
– Bytecode-based
Challenge #1: Dynamic Typing
• Runtime type specialization
1. Tracelet JIT
• Inspect live types & JIT small code blocks as you go
2. Profile-driven Region JIT
• Collect type information in the beginning, then
later compile larger code regions
Example

function addPositive($arr, $n) {
    $sum = 0;
    for ($i = 0; $i < $n; $i++) {
        $elem = $arr[$i];
        if ($elem > 0) {
            $sum = $sum + $elem;
        }
    }
    return $sum;
}

The loop compiles to the following HHBC (locals: 0 = $arr, 1 = $n, 2 = $sum, 3 = $i, 4 = $elem):

 71: CGetL 1           // push $n
 73: CGetL2 3          // push $i beneath it
 75: Lt                // $i < $n
 76: JmpZ 55 (131)     // exit the loop when false
 81: CGetM <L:0 EL:3>  // $arr[$i]
 94: SetL 4            // $elem = ...
 96: PopC
 97: Int 0
106: CGetL2 4          // push $elem beneath the 0
108: Gt                // $elem > 0
109: JmpZ 13 (122)     // skip the add when false
114: CGetL 4           // push $elem
116: CGetL2 2          // push $sum beneath it
118: Add
119: SetL 2            // $sum = $sum + $elem
121: PopC
122: IncDecL 3 PreInc  // $i++
125: PopC
126: Jmp -55           // back to the loop test

The tracelet JIT guards on the live types at each tracelet entry (e.g. "$arr: array; $i: int", or "$elem: uncount; stk: int" vs. "stk: dbl") and compiles one specialized translation of the add tracelet (bytecodes 114–121) per observed type combination:

  $sum: int ; $elem: int
  $sum: int ; $elem: dbl
  $sum: dbl ; $elem: int
  $sum: dbl ; $elem: dbl

The profile-driven region JIT then merges hot tracelets into larger regions: once profiling shows that $sum ends up a double, the whole loop body can be compiled as a single region specialized for "$sum: dbl", leaving only the guards at the region entry.
Challenge #2: Huge Code Size
• HHVM dynamic (JITed) code: ~300 MB
• HHVM static code: ~130 MB
• Intel Ivy Bridge cache sizes:
– L1 i-cache: 32 KB / core
– LLC: 25 MB (shared)
– L1 I-TLB reach: 16.5 MB
• 8 entries x 2 MB (huge pages) = 16 MB
• 128 entries x 4 KB = 512 KB
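The 16.5 MB I-TLB reach above is just the sum of what the two kinds of entries can map; a quick check of the arithmetic:

```php
<?php
// L1 I-TLB reach on Ivy Bridge, per the slide:
// 8 entries mapping 2 MB huge pages, plus 128 entries mapping 4 KB pages.
$hugeReach  = 8 * 2 * 1024 * 1024;   // 16 MB
$smallReach = 128 * 4 * 1024;        // 512 KB
$totalMB = ($hugeReach + $smallReach) / (1024 * 1024);
echo $totalMB, " MB\n"; // 16.5 MB
```

Even that 16.5 MB of reach is dwarfed by ~300 MB of JITed code, which is why the code-locality and huge-page techniques on the next slides matter.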
Challenge #2a: JIT Speed
• Custom JIT compiler
– E.g. LLVM isn’t appropriate
• Translation Cache (TC)
– Reuse machine code across requests
• Only perform more expensive code optimizations where they yield a big improvement in the generated code
Challenge #2b: Code Locality
1. Very selective function inlining in the JIT
2. Profile-guided JITed code layout
3. Hot/cold splitting of JITed code
4. Highly tuned runtime helpers (even hand-written in
assembly)
5. Profile-guided binary layout for the C++ runtime, using the HFSort tool:
https://github.com/facebook/hhvm/tree/master/hphp/tools/hfsort
6. Selective use of Intel x86’s huge (2 MB) pages to
reduce I-TLB misses
Challenge #3: Reference Counting
• PHP uses reference counting for memory
management
– And it’s visible at the language level:
• Precise object destruction
• Copy-on-write of arrays
• Reference-counting optimization in the JIT
– Very complex and expensive optimization
– 5% performance impact
SetL 4
07: t4:Str = LdStack<Str,0> t3:StkPtr
09: StLoc<4> t1:FramePtr, t4:Str
10: IncRef t4:Str
...
PopC
11: DecRef t4:Str
...
(The IncRef/DecRef pair on the same value is the kind of redundancy the JIT's refcount optimization tries to eliminate.)
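Both language-visible behaviors named above can be observed directly; a minimal PHP sketch:

```php
<?php
// 1. Copy-on-write of arrays: assignment only bumps the refcount;
//    the array is copied lazily, at the first write through one alias.
$a = [1, 2, 3];
$b = $a;        // no copy yet; both names share one array
$b[0] = 99;     // the write detaches $b; $a is unaffected
assert($a[0] === 1 && $b[0] === 99);

// 2. Precise object destruction: the destructor runs the moment the
//    last reference disappears, not at some later GC pause.
class Handle {
    public function __destruct() { echo "destroyed\n"; }
}
$h = new Handle();
$h = null;      // refcount hits zero: "destroyed" prints here,
echo "after\n"; // ...before this line runs
```

Because programs rely on these semantics, HHVM cannot simply drop reference counting; it has to optimize it.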
Challenge #3: Reference Counting
• Ongoing effort to move to a tracing garbage collector
– Currently used only to collect reference cycles
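A sketch of why cycles need a tracing collector at all: pure reference counting can never free a self-referential structure, because its count never reaches zero.

```php
<?php
// A self-referential object: its refcount cannot reach zero
// through scope exit alone.
$node = new stdClass();
$node->self = $node;   // the object now references itself
unset($node);          // refcount drops from 2 to 1, never to 0,
                       // so pure refcounting would leak it forever

// Only a collector that traces reachability can reclaim the cycle.
$reclaimed = gc_collect_cycles();
echo $reclaimed, " cycle object(s) reclaimed\n";
```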
More Than Just Performance:
Hack Language Support
• PHP dialect with extended type annotations,
among other language features
• Gradual typing to allow gradual migration of
large code bases
• Powerful static type checker
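A rough sketch of the gradual-typing idea, using PHP 7 scalar declarations for flavor; Hack's annotations and its standalone checker go well beyond this (generics, nullable types, shapes), so treat the syntax here as an approximation, not Hack itself:

```php
<?php
declare(strict_types=1);

// Annotated function: parameter and return types are declared,
// so a checker (or the runtime) can verify every call.
function addPositive(array $arr, int $n): int
{
    $sum = 0;
    for ($i = 0; $i < $n; $i++) {
        if ($arr[$i] > 0) {
            $sum += $arr[$i];
        }
    }
    return $sum;
}

// Unannotated legacy code keeps calling it unchanged; gradual typing
// lets a large codebase adopt annotations file by file.
echo addPositive([1, -2, 3], 3), "\n"; // prints 4
```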
Summary
• High-load websites need high-performance software
• PHP is very popular and used to build many large-scale websites
• Many challenges to efficiently execute PHP
• A lot of effort put into building and optimizing HHVM
• HHVM is open source
(https://github.com/facebook/hhvm)
• And it also supports the Hack language
(http://hacklang.org/)
Questions?

Editor's Notes

  • #12–#13 Philosophy to obtain high performance: only operate on typed code
  • #14–#17 Runs 17% faster with profile-guided trace formation compared to the baseline tracelet translator.