Processor Benchmarking
Brendan Gregg
Senior Performance Engineer
IntelON, Oct 2021
Case Study (2021)
New processor
Popular CPU benchmark: 2.6x faster than Intel
What would you do?
~100% of benchmarks are wrong
Active Benchmarking
Low-level analysis of the benchmark while it is still running
Not just statistical analysis of the results
Flame Graphs
Showed CPU time was in a single function
Flame Graphs are now in Intel VTune!
Instruction-Level Profiling...
linux$ perf top -e cycles:ppp -p 18641
Samples: 274K of event 'cycles:ppp', 4000 Hz, Event count (approx.): 61489970617
│ for(l = 2; l <= t; l++)
0.02 │20290: comisd %xmm2,%xmm1
0.05 │20294: ↑ jb 20270 <cpu_execute_event+0x30>
│ if (c % l == 0)
0.15 │20296: test $0x1,%bl
0.15 │20299: ↑ je 20270 <cpu_execute_event+0x30>
│ for(l = 2; l <= t; l++)
│2029b: mov $0x2,%ecx
│202a0: ↓ jmp 202c4 <cpu_execute_event+0x84>
│202a2: nopw 0x0(%rax,%rax,1)
3.57 │202a8: pxor %xmm0,%xmm0
0.21 │202ac: cvtsi2sd %rcx,%xmm0
0.26 │202b1: comisd %xmm0,%xmm1
3.51 │202b5: ↑ jb 20270 <cpu_execute_event+0x30>
│ if (c % l == 0)
0.09 │202b7: mov %rbx,%rax
0.02 │202ba: xor %edx,%edx
85.00 │202bc: div %rcx
0.12 │202bf: test %rdx,%rdx
85% of cycles in the div instruction
Instruction-level Analysis
● Determined it’s really a div benchmark
● Other processor has a faster div
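For reference, the annotated source lines in the perf output (`for(l = 2; l <= t; l++)` and `if (c % l == 0)`) look like a trial-division prime test. A minimal sketch of such a loop follows, not the benchmark's actual code. The names `c` and `l` come from the perf annotation; the function names `is_prime_trial` and `count_primes` are hypothetical. One deliberate difference: the profiled code compares `l` against a floating-point bound `t` (hence the `cvtsi2sd` and `comisd` instructions), while this sketch swaps in the integer bound `l * l <= c` so it builds without a math library.

```c
#include <stdint.h>

/*
 * Trial-division primality test (illustrative sketch, not the benchmark's
 * source). Each loop iteration performs one 64-bit unsigned modulo by a
 * variable divisor, which x86-64 compilers typically emit as a div.
 */
static int is_prime_trial(uint64_t c)
{
    if (c < 2)
        return 0;
    /* Integer bound used here; the profiled code uses a double bound t. */
    for (uint64_t l = 2; l * l <= c; l++)
        if (c % l == 0)
            return 0;   /* found a divisor: c is composite */
    return 1;           /* no divisor up to sqrt(c): c is prime */
}

/* Count the primes in [2, max_prime] (hypothetical driver for the loop). */
static uint64_t count_primes(uint64_t max_prime)
{
    uint64_t n = 0;
    for (uint64_t c = 2; c <= max_prime; c++)
        n += (uint64_t)is_prime_trial(c);
    return n;
}
```

Almost all of the work in a loop like this is the modulo, so its run time mostly reflects integer-divide latency rather than overall processor performance.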
Netflix Cloud
● <1% div cycles
● Therefore, perf win should be <1% (not 2.6x!)
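The "<1%" bound can be read as an application of Amdahl's law. The symbols below are assumptions introduced for illustration, not from the slides: p is the fraction of production cycles spent in div, s is how much faster the other processor's div is, and S is the resulting overall speedup, assuming nothing else differs between the processors.

```latex
S = \frac{1}{(1 - p) + \dfrac{p}{s}}
\qquad\text{and, since } \frac{p}{s} > 0,\qquad
S < \frac{1}{1 - p}.

p < 0.01 \;\Longrightarrow\; S < \frac{1}{0.99} \approx 1.0101
```

Even an infinitely fast div would yield only about a 1% gain on a workload with under 1% div cycles, which is why the benchmark's 2.6x does not carry over.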
Challenges
● This benchmark is widely used
● Cycle analysis is nearly impossible in the cloud
○ Under hypervisors: Limited PMCs; no PEBS
● Accurate benchmarking needs senior engineers
~100% of benchmarks are wrong
My Benchmarking Checklist
1. Why not double?
2. Was it tuned?
3. Did it break limits?
4. Did it error?
5. Does it reproduce?
6. Does it matter?
7. Did it even happen?
https://www.brendangregg.com/blog/2018-06-30/benchmarking-checklist.html
An Exciting New Era of Processor Innovation
Vertical stacking, new capabilities
More processors & competition
But also a Challenging New Era of Processor Benchmarking
Increased demand
Hard to debug in the cloud
Popular benchmarks can be wrong
Good benchmarking drives innovation
Thank you.
Brendan Gregg
@brendangregg
