How shit works: the CPU

Tomer Gabel, Consulting Engineer at Substrate Software Services
BuildStuff 2016 Lithuania
Image: Telecarlos (CC BY-SA 3.0)
Full Disclosure
Bullshit ahead!
• I’m not an expert
• Explanations may be:
– Simplified
– Inaccurate
– Wrong :-)
• We’ll barely scratch the surface
Image: Public Domain
Are you ready for…
A CONUNDRUM?
Image: Louis Reed (CC BY-SA 4.0)
Setting the Stage
    // Generate a bunch of bytes
    byte[] data = new byte[32768];
    new Random().nextBytes(data);
    Arrays.sort(data);
    // Sum non-negative elements
    long sum = 0;
    for (int i = 0; i < data.length; i++)
        if (data[i] >= 0)
            sum += data[i];
1. Which is faster: with the sort or without it?
2. By how much?
3. And crucially… why?!
Surprise, Terror and Ruthless Efficiency
# Run complete. Total time: 00:00:32
Benchmark      Mode  Cnt    Score   Error  Units
Baseline.sum   avgt    6  115.666 ± 3.137  us/op
Presorted.sum  avgt    6   13.741 ± 0.524  us/op
* Ignoring setup cost
CPUS ARE COMPLEX BEASTS.
Image: Pauli Rautakorpi (CC BY 3.0)
It Is Known
• Your high-level code…
    long sum = 0;
    for (i = 0; i < length; i++)
        if (data[i] >= 0)
            sum += data[i];
• Gets compiled down to…
movsx eax,BYTE PTR [rax+rdx*1+0x10]
cmp eax,0x0
movabs rdx,0x11f3a9f60
movabs rcx,0x128
jl 0x000000010679e077
movabs rcx,0x138
mov r8,QWORD PTR [rdx+rcx*1]
lea r8,[r8+0x1]
mov QWORD PTR [rdx+rcx*1],r8
jl 0x000000010679e092
movsxd rax,eax
add rax,rbx
mov rbx,rax
inc edi
It Is Less Known
• What happens then?
• The instruction goes through phases…
Instruction Stream → Fetch → Decode → Execute → Memory Access → Write-back
CPU Architecture 101
Image: Appaloosa (CC BY-SA 3.0)
CPU Architecture 101
• What does a CPU do?
– Reads the program
– Figures it out
– Executes it
– Talks to memory
– Performs I/O
• Immense complexity!
Execution Units
• Arithmetic-Logic Unit (ALU)
– Boolean algebra
– Arithmetic
– Memory accesses
– Flow control
• Floating Point Unit (FPU)
• Memory Management Unit (MMU)
– Memory mapping
– Paging
– Access control
Images: ALU by Dirk Oppelt (CC BY-SA 3.0), FPU by Konstantin Lanzet (CC BY-SA 3.0), MMU from unknown source
DESIGN CONSIDERATIONS
Image: William M. Plate Jr. (Public Domain)
Pipelining
[Diagram: instructions I0, I1 and I2, each passing through Fetch, Decode, Execute, Memory Access and Write-back; run back to back under sequential execution, overlapped by one stage under pipelined execution]
Sequential Execution
• Latency = 5 cycles
• Throughput = 0.2 ops / cycle
Pipelined Execution
• Latency = 5 cycles
• Throughput = 1 op / cycle
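The latency and throughput figures above can be checked with a little arithmetic. This is a minimal sketch, assuming an idealized 5-stage pipeline in which every stage takes exactly one cycle and nothing ever stalls; the class and method names are made up for illustration.

```java
public class PipelineMath {
    // Assumption: five stages (Fetch, Decode, Execute, Memory Access, Write-back),
    // one cycle per stage, no stalls.
    static final int STAGES = 5;

    // Sequential execution: each instruction finishes all stages
    // before the next one is fetched.
    static long sequentialCycles(long instructions) {
        return instructions * STAGES;
    }

    // Pipelined execution: the first instruction needs STAGES cycles to
    // fill the pipeline, every later one completes one cycle after its predecessor.
    static long pipelinedCycles(long instructions) {
        return instructions == 0 ? 0 : STAGES + (instructions - 1);
    }

    static double throughput(long instructions, long cycles) {
        return (double) instructions / cycles;
    }

    public static void main(String[] args) {
        long n = 1_000;
        long seq = sequentialCycles(n);
        long pipe = pipelinedCycles(n);
        System.out.printf("sequential: %d cycles, %.3f ops/cycle%n", seq, throughput(n, seq));
        System.out.printf("pipelined:  %d cycles, %.3f ops/cycle%n", pipe, throughput(n, pipe));
    }
}
```

Latency per instruction stays at 5 cycles in both cases; only the throughput changes, and it approaches 1 op / cycle as the instruction count grows.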
Pipelining
• A pipeline can stall
• This happens with:
– Branches
    if (i < 0) i++; else i--;
  [Diagram: Memory Load (F D E M W), Test (F D E M) and Conditional Jump (F D E) overlap in the pipeline; what to fetch next is unknown ("?") until the jump resolves]
– Dependent Instructions
    i++;
    x = i + 1;
  [Diagram: i++ is a load from memory, an add of 1 and a store to memory; x = i + 1 stalls in the pipeline until that result is available]
• A.K.A. pipeline bubbling
PRACTICAL RAMIFICATIONS
Image: Hangsna (CC BY-SA 3.0)
1. Memory is Slow
• RAM access is ~60ns
• Random access on a 4GHz, 64-bit CPU:
– 250 cycles / memory access
– 130MB / second bandwidth
• Surely we can do better!
Image: Noah Wieder (Public Domain)
Source: 7-cpu.com
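The numbers on this slide follow from simple arithmetic. A rough sketch, assuming the 62ns main-memory latency quoted from 7-cpu.com, one 8-byte (64-bit) word fetched per access, and accesses that are fully serialized with no overlap; the class and method names are illustrative only.

```java
public class MemoryMath {
    // Cycles spent waiting for one access: nanoseconds times cycles per nanosecond.
    static double cyclesPerAccess(double latencyNs, double clockGhz) {
        return latencyNs * clockGhz;
    }

    // Bandwidth when every access pays the full latency and yields one word.
    static double randomAccessMBPerSecond(double latencyNs, int bytesPerAccess) {
        double accessesPerSecond = 1e9 / latencyNs;
        return accessesPerSecond * bytesPerAccess / 1e6;
    }

    public static void main(String[] args) {
        double latencyNs = 62.0;   // assumed main-memory latency
        double clockGhz = 4.0;     // 4GHz clock
        int wordBytes = 8;         // one 64-bit word per access

        System.out.printf("%.0f cycles / memory access%n",
                cyclesPerAccess(latencyNs, clockGhz));
        System.out.printf("%.1f MB / second%n",
                randomAccessMBPerSecond(latencyNs, wordBytes));
    }
}
```

With these inputs the sketch gives 248 cycles and roughly 129 MB / second, which the slide rounds to 250 and 130.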
Enter: CPU Cache
Level        Size         Latency
L1           32KB + 32KB  1ns
L2           256KB        3ns
L3           4MB          11ns
Main Memory               62ns
Intel i7-6700 “Skylake” at 4 GHz
Image: Ferry24.Milan (CC BY-SA 3.0)
Source: 7-cpu.com
Enter: CPU Cache
• A unit of work is called a cache line
– 64 bytes on x86
– LRU eviction policy
• Why is sequential access fast?
– Cache prefetching
In Real Life
• Let’s rotate an image!
    for (y = 0; y < height; y++)
        for (x = 0; x < width; x++) {
            int from = y * width + x;
            int to = x * height + y;
            target[to] = source[from];
        }
Image: EgoAltere (CC0 Public Domain)
In Real Life
• This is not efficient
• Reads are sequential
• Writes aren’t, though
• Different strides
– Worst case wins :-(
[Diagram: a 10×10 grid. Reads walk the first row (indices 0, 1, 2, 3, … 9); writes walk the first column (indices 0, 10, 20, 30, … 90).]
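To put a number on the different strides, the sketch below counts how many distinct cache lines one pass of the inner loop touches for reads and for writes. It assumes 4-byte elements, 64-byte cache lines, and arrays that start exactly on a cache-line boundary (the JVM does not promise that); the names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class StrideDemo {
    static final int LINE_BYTES = 64;    // assumed cache line size
    static final int ELEMENT_BYTES = 4;  // assumed element size (int)

    // Distinct cache lines touched by one pass of the inner loop
    // (fixed y, all x), either on the read side or on the write side.
    static int linesTouched(int width, int height, int y, boolean writes) {
        Set<Long> lines = new HashSet<>();
        for (int x = 0; x < width; x++) {
            long index = writes
                    ? (long) x * height + y   // to   = x * height + y
                    : (long) y * width + x;   // from = y * width + x
            lines.add(index * ELEMENT_BYTES / LINE_BYTES);
        }
        return lines.size();
    }

    public static void main(String[] args) {
        int width = 1024, height = 1024;
        System.out.println("reads:  " + linesTouched(width, height, 0, false)); // prints "reads:  64"
        System.out.println("writes: " + linesTouched(width, height, 0, true));  // prints "writes: 1024"
    }
}
```

Sixteen consecutive reads share one cache line, while every single write lands on a line of its own.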
Cache-Friendly Algorithms
• Use blocking or tiling
    for (y = 0; y < height; y += blockHeight)
        for (x = 0; x < width; x += blockWidth)
            for (by = 0; by < blockHeight; by++)
                for (bx = 0; bx < blockWidth; bx++) {
                    int from = (y + by) * width + (x + bx);
                    int to = (x + bx) * height + (y + by);
                    target[to] = source[from];
                }
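The slide's loop works as shown when width and height are exact multiples of the block size. Here is a self-contained sketch of both variants that also clamps the last block at the edges; the class name, method signatures and the single square block parameter are assumptions for illustration, not the benchmark code behind the talk.

```java
import java.util.Arrays;

public class Transpose {
    // Straightforward version: sequential reads, strided writes.
    static void transposeNaive(int[] source, int[] target, int width, int height) {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                target[x * height + y] = source[y * width + x];
    }

    // Tiled version: process block x block squares so that both the
    // rows being read and the columns being written stay in cache.
    static void transposeTiled(int[] source, int[] target, int width, int height, int block) {
        for (int y = 0; y < height; y += block) {
            int yEnd = Math.min(y + block, height);      // clamp at the bottom edge
            for (int x = 0; x < width; x += block) {
                int xEnd = Math.min(x + block, width);   // clamp at the right edge
                for (int by = y; by < yEnd; by++)
                    for (int bx = x; bx < xEnd; bx++)
                        target[bx * height + by] = source[by * width + bx];
            }
        }
    }

    public static void main(String[] args) {
        int width = 100, height = 70;   // deliberately not multiples of 16
        int[] source = new int[width * height];
        for (int i = 0; i < source.length; i++) source[i] = i;

        int[] naive = new int[source.length];
        int[] tiled = new int[source.length];
        transposeNaive(source, naive, width, height);
        transposeTiled(source, tiled, width, height, 16);

        System.out.println(Arrays.equals(naive, tiled)); // prints "true"
    }
}
```

Which block size wins depends on the cache sizes of the machine at hand, so it is worth measuring a few candidates rather than guessing.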
Cache-Friendly Algorithms
• The results?
Benchmark                            Mode  Cnt   Score   Error  Units
CachingShowcase.transposeNaive       avgt   10  43.851 ± 6.000  ms/op
CachingShowcase.transposeTiled8x8    avgt   10  20.641 ± 1.646  ms/op
CachingShowcase.transposeTiled16x16  avgt   10  18.515 ± 1.833  ms/op
CachingShowcase.transposeTiled48x48  avgt   10  21.941 ± 1.954  ms/op
x2.37 speedup!
2. Those Pesky Branches
• Do I go left or right?
• Need input!
• … but can’t wait for it
• Maybe...
– Take a guess?
– Based on historic trends?
• Sounds speculative
Image: Michael Dolan (CC BY 2.0)
Those Pesky Branches
• Enter: Branch Prediction
• Concurrently:
– Speculate branch
– Evaluate condition
• It’s now a tradeoff
– Commit is fast
– Rollback is slow
Image: Alejandro C. (CC BY-NC 2.0)
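"Based on historic trends" can be made concrete with a toy model. The sketch below simulates a 2-bit saturating counter, a textbook prediction scheme that is not taken from these slides and is far simpler than what real CPUs do; the class name and the starting state are arbitrary choices.

```java
public class TwoBitPredictor {
    // Counter states 0..3: states 0 and 1 predict "not taken",
    // states 2 and 3 predict "taken". Each real outcome nudges
    // the counter one step toward that outcome.
    static int countMispredictions(boolean[] outcomes) {
        int counter = 1;   // assumption: start at "weakly not taken"
        int misses = 0;
        for (boolean taken : outcomes) {
            boolean predictedTaken = counter >= 2;
            if (predictedTaken != taken) misses++;
            counter = taken ? Math.min(3, counter + 1) : Math.max(0, counter - 1);
        }
        return misses;
    }

    public static void main(String[] args) {
        boolean[] steady = new boolean[1000];
        java.util.Arrays.fill(steady, true);        // branch always taken

        boolean[] alternating = new boolean[1000];
        for (int i = 0; i < alternating.length; i++)
            alternating[i] = (i % 2 == 0);           // taken, not taken, taken, ...

        System.out.println("steady:      " + countMispredictions(steady));      // prints "steady:      1"
        System.out.println("alternating: " + countMispredictions(alternating)); // prints "alternating: 1000"
    }
}
```

A predictable branch costs one rollback and is then nearly free; a branch that keeps flipping pays for the rollback over and over.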
Back to Our Conundrum
    // Generate a bunch of bytes
    byte[] data = new byte[32768];
    new Random().nextBytes(data);
    Arrays.sort(data);
    // Sum non-negative elements
    long sum = 0;
    for (int i = 0; i < data.length; i++)
        if (data[i] >= 0)
            sum += data[i];
• Can you guess?
– 3…
– 2...
– 1...
• Here it is!
Catharsis
Original data array:
54 10 -4 -2 15 41 -37 13 0 -9 14 25 -61 40
After sorting:
-61 -37 -9 -4 -2 | 0 10 13 14 15 25 40 41 54
• Left of 0: data[i] >= 0 is always false!
• From 0 onward: data[i] >= 0 is always true!
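One way to see the same effect on the full 32768-byte array is to count how often the outcome of `data[i] >= 0` flips from one element to the next. This is only a rough proxy for how hard the branch is to predict, not a model of the hardware, and the fixed seed and names are arbitrary choices for the sketch.

```java
import java.util.Arrays;
import java.util.Random;

public class BranchPattern {
    // Number of positions where the outcome of (data[i] >= 0)
    // differs from the outcome for the previous element.
    static int directionChanges(byte[] data) {
        int changes = 0;
        for (int i = 1; i < data.length; i++)
            if ((data[i] >= 0) != (data[i - 1] >= 0))
                changes++;
        return changes;
    }

    public static void main(String[] args) {
        byte[] data = new byte[32768];
        new Random(42).nextBytes(data);   // fixed seed, only for repeatability

        int unsorted = directionChanges(data);
        Arrays.sort(data);
        int sorted = directionChanges(data);

        System.out.println("unsorted: " + unsorted + " direction changes");
        System.out.println("sorted:   " + sorted + " direction changes");
    }
}
```

Random data flips direction on roughly every other element, while sorted data flips at most once, at the boundary between the negative and non-negative values.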
QUESTIONS?
Thank you for listening
tomer@tomergabel.com
@tomerg
http://engineering.wix.com
Sources and Examples:
https://goo.gl/f7NfGT
This work is licensed under a Creative
Commons Attribution-ShareAlike 4.0
International License.
Further Reading
• Jason Robert Carey Patterson: Modern Microprocessors, a 90-Minute Guide
• Igor Ostrovsky: Gallery of Processor Cache Effects
• Piyush Kumar: Cache Oblivious Algorithms