PERFORMANCE AND
PREDICTABILITY
Richard Warburton
@richardwarburto
insightfullogic.com
These days fast code needs to operate in harmony with its environment. At the deepest level this means working well with hardware: RAM, disks and SSDs. A unifying theme is treating memory access patterns in a uniform and predictable way that is sympathetic to the underlying hardware. For example writing to and reading from RAM and Hard Disks can be significantly sped up by operating sequentially on the device, rather than randomly accessing the data.

In this talk we’ll cover why access patterns are important, what kind of speed gain you can get and how you can write simple high level code which works well with these kind of patterns.

  1. PERFORMANCE AND PREDICTABILITY
     Richard Warburton, @richardwarburto, insightfullogic.com

  2. Why care about low level rubbish?
     Branch Prediction
     Memory Access
     Storage
     Conclusions

  3. Performance Discussion
     Product Solutions: "Just use our library/tool/framework, and everything is web-scale!"
     Architecture Advocacy: "Always design your software like this."
     Methodology & Fundamentals: "Here are some principles and knowledge, use your brain."

  4. Do you care?
     DON'T look at access patterns first.
     Many problems are not pattern related:
       Networking
       Database or External Service
       Minimising I/O
       Garbage Collection
       Insufficient Parallelism
     Harvest the low hanging fruit first.

  5. Low hanging fruit is a cost-benefit analysis.

  6. So when does it matter?

  7. Informed Design & Architecture *
     * this is not a call for premature optimisation

  8. That 10% that actually matters.

  9. Performance of Predictable Accesses

 10. Why care about low level rubbish?
     Branch Prediction
     Memory Access
     Storage
     Conclusions

 11. What 4 things do CPUs actually do?

 12. Fetch, Decode, Execute, Writeback.

 13. Pipelining speeds this up.

 14. Pipelined
 15. What about branches?

     public static int simple(int x, int y, int z) {
         int ret;
         if (x > 5) {
             ret = y + z;
         } else {
             ret = y;
         }
         return ret;
     }

 16. Branches cause stalls, stalls kill performance.

 17. Super-pipelined & Superscalar

 18. Can we eliminate branches?

 19. Naive Compilation

     if (x > 5) {
         ret = z;
     } else {
         ret = y;
     }

         cmp x, 5
         jle L1       ; conditional jump to the else branch when x <= 5
         mov z, ret
         jmp L2
     L1: mov y, ret
     L2: ...

 20. Conditional Branches

     if (x > 5) {
         ret = z;
     } else {
         ret = y;
     }

     cmp x, 5
     mov z, ret
     cmovle y, ret
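The conditional-move idea on slide 20 can be sketched in Java. This is illustrative, not from the deck: the ternary is the kind of simple select the JIT compiler can compile to a `cmov` (not guaranteed), and the explicit mask variant is a branch-free alternative that assumes `5 - x` does not overflow.

```java
public class BranchFree {
    // Simple select: a good candidate for the JIT compiler to turn
    // into a conditional move (cmov) rather than a branch.
    static int select(int x, int y, int z) {
        return x > 5 ? z : y;
    }

    // Fully branch-free variant using a sign-bit mask.
    // Assumption: 5 - x does not overflow (x > Integer.MIN_VALUE + 5).
    static int selectBranchFree(int x, int y, int z) {
        int mask = (5 - x) >> 31;       // all ones if x > 5, else zero
        return (z & mask) | (y & ~mask);
    }

    public static void main(String[] args) {
        System.out.println(select(6, 1, 2));           // 2 (x > 5, take z)
        System.out.println(selectBranchFree(6, 1, 2)); // 2
        System.out.println(selectBranchFree(3, 1, 2)); // 1 (x <= 5, take y)
    }
}
```

Measure before preferring the mask trick: when the branch is well predicted, the plain conditional is often just as fast and far easier to read.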
 21. Strategy: predict branches and speculatively execute.

 22. Static Prediction
     A forward branch defaults to not taken.
     A backward branch defaults to taken.

 23. Static Hints (Pentium 4 or later)
     __emit 0x3E defaults to taken
     __emit 0x2E defaults to not taken
     Don't use them; flip the branch instead.

 24. Dynamic prediction: record history and predict the future.

 25. Branch Target Buffer (BTB)
     A log of the history of each branch.
     Also stores the address.
     It's finite!

 26. Local: record per conditional branch histories.
     Global: record shared history of conditional jumps.

 27. Loop: specialised predictor when there's a loop (jumping in a cycle n times).
     Function: specialised buffer for predicted nearby function returns.
     Two level Adaptive Predictor: accounts for patterns of up to 3 if statements.

 28. Measurement and Monitoring
     Use Performance Event Counters (Model Specific Registers).
     Can be configured to store branch prediction information.
     Profilers & Tooling: perf (Linux), VTune, AMD Code Analyst, Visual Studio, Oracle Performance Studio.

 29. Approach: Loop Unrolling

     for (int x = 0; x < 100; x += 5) {
         foo(x);
         foo(x + 1);
         foo(x + 2);
         foo(x + 3);
         foo(x + 4);
     }

 30. Approach: minimise branches in loops.

 31. Approach: Separation of Concerns
     Little branching on a latency critical path or thread.
     Heavily branchy administrative code on a different path.
     Helps adjust for the in-flight branch prediction not tracking many targets.
     Only applicable with low context switching rates.
     Also easier to understand the code.

 32. One more thing ...

 33. Division/Modulus

     index = (index + 1) % SIZE;

     vs

     index = index + 1;
     if (index == SIZE) {
         index = 0;
     }
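A runnable sketch of the two wrap strategies from slide 33, plus a third option the slides do not mention: when the size is a power of two (an assumption of that variant only), the modulus reduces to a single bit mask.

```java
public class RingIndex {
    // Modulus-based wrap from the slide: one % per increment.
    static int nextMod(int index, int size) {
        return (index + 1) % size;
    }

    // Branch-based wrap from the slide: the comparison is taken only
    // once every `size` increments, so it predicts almost perfectly.
    static int nextBranch(int index, int size) {
        index = index + 1;
        if (index == size) {
            index = 0;
        }
        return index;
    }

    // Power-of-two sizes need neither division nor a branch.
    // Assumption: size is a power of two.
    static int nextMask(int index, int size) {
        return (index + 1) & (size - 1);
    }

    public static void main(String[] args) {
        int size = 4;
        for (int i = 0, a = 0, b = 0, c = 0; i < 10; i++) {
            a = nextMod(a, size);
            b = nextBranch(b, size);
            c = nextMask(c, size);
            assert a == b && b == c; // all three agree: 1, 2, 3, 0, 1, ...
        }
    }
}
```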
 34. / or % are 92 cycles on Core i*.

 35. Summary
     CPUs are super-pipelined and superscalar.
     Branches cause stalls.
     Branches can be removed, minimised or simplified.
     Frequently you get easier to read code as well.

 36. Why care about low level rubbish?
     Branch Prediction
     Memory Access
     Storage
     Conclusions

 37. The Problem
     The CPU is very fast; memory is relatively slow.

 38. The Solution: CPU Cache
     Core demands data, looks at its cache.
     If present (a "hit") then the data is returned to a register.
     If absent (a "miss") then the data is looked up from memory and stored in the cache.
     Fast memory is expensive; a small amount is affordable.

 39. Multilevel Cache: Intel Sandybridge
     [Diagram: physical cores 0 to N, each running 2 logical cores (Hyper-Threading),
     each with its own Level 1 Instruction Cache, Level 1 Data Cache and Level 2
     Cache; the Level 3 Cache is shared between all cores.]

 40. How bad is a miss?

     Location      Latency in clock cycles
     Register        0
     L1 Cache        3
     L2 Cache        9
     L3 Cache       21
     Main Memory   150-400

 41. Cache Lines
     Data is transferred in cache lines.
     Fixed size block of memory.
     Usually 64 bytes in current x86 CPUs (between 32 and 256 bytes).
     Purely a hardware consideration.

 42. Prefetching
     Prefetch = eagerly load data.
     Adjacent Cache Line Prefetch.
     Data Cache Unit (Streaming) Prefetch.
     Problem: CPU prediction isn't perfect.
     Solution: arrange data so accesses are predictable.

 43. Sequential Locality: referring to data that is arranged linearly in memory.
     Spatial Locality: referring to data that is close together in memory.
     Temporal Locality: repeatedly referring to the same data in a short time span.

 44. General Principles
     Use smaller data types (-XX:+UseCompressedOops).
     Avoid 'big holes' in your data.
     Make accesses as linear as possible.

 45. Primitive Arrays

     // Sequential Access = Predictable
     for (int i = 0; i < someArray.length; i++)
         someArray[i]++;

 46. Primitive Arrays - Skipping Elements

     // Holes Hurt
     for (int i = 0; i < someArray.length; i += SKIP)
         someArray[i]++;

 47. Primitive Arrays - Skipping Elements

 48. Multidimensional Arrays
     Multidimensional arrays are really arrays of arrays in Java (unlike C).
     Some people realign their accesses:

     for (int col = 0; col < COLS; col++) {
         for (int row = 0; row < ROWS; row++) {
             array[ROWS * col + row]++;
         }
     }

 49. Bad Access Alignment
     Strides the wrong way, bad locality:
     array[COLS * row + col]++;
     Strides the right way, good locality:
     array[ROWS * col + row]++;
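The two stride directions on slides 48 and 49 can be compared with a small self-timing sketch. This code is illustrative, not from the deck, and the nanoTime figures are only indicative; a serious measurement would use a benchmark harness such as JMH.

```java
public class Strides {
    static final int ROWS = 1024, COLS = 1024;

    // Good: with the ROWS * col + row layout, incrementing row in the
    // inner loop walks linearly through the flattened array.
    static long rightWay(int[] array) {
        long sum = 0;
        for (int col = 0; col < COLS; col++)
            for (int row = 0; row < ROWS; row++)
                sum += ++array[ROWS * col + row];
        return sum;
    }

    // Bad: incrementing col in the inner loop jumps ROWS ints at a
    // time, so almost every access touches a new cache line.
    static long wrongWay(int[] array) {
        long sum = 0;
        for (int row = 0; row < ROWS; row++)
            for (int col = 0; col < COLS; col++)
                sum += ++array[ROWS * col + row];
        return sum;
    }

    public static void main(String[] args) {
        int[] array = new int[ROWS * COLS];
        long t0 = System.nanoTime();
        long a = rightWay(array);
        long t1 = System.nanoTime();
        long b = wrongWay(array);
        long t2 = System.nanoTime();
        System.out.printf("right: %d us, wrong: %d us (sums %d, %d)%n",
                (t1 - t0) / 1_000, (t2 - t1) / 1_000, a, b);
    }
}
```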
 50. Full Random Access
     L1D - 5 clocks
     L2 - 37 clocks
     Memory - 280 clocks

     Sequential Access
     L1D - 5 clocks
     L2 - 14 clocks
     Memory - 28 clocks

 51. Data Layout Principles
     Primitive Collections (GNU Trove, GS-Coll).
     Arrays > Linked Lists.
     Hashtable > Search Tree.
     Avoid code bloating (loop unrolling).
     Custom data structures:
       Judy Arrays: an associative array/map.
       kD-Trees: generalised Binary Space Partitioning.
       Z-Order Curve: multidimensional data in one dimension.
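To make slide 51's "Arrays > Linked Lists" point concrete, here is a toy comparison (not from the deck): summing the same values from an int[] versus a LinkedList<Integer>. The array is one contiguous block of memory; the list chases a node pointer and then a boxed Integer pointer for every element.

```java
import java.util.LinkedList;
import java.util.List;

public class PrimitiveVsBoxed {
    // Contiguous primitive storage: sequential, prefetch-friendly reads.
    static long sumArray(int[] values) {
        long sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Node-per-element storage: two dependent pointer loads per value.
    static long sumList(List<Integer> values) {
        long sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        int n = 100_000;
        int[] array = new int[n];
        List<Integer> list = new LinkedList<>();
        for (int i = 0; i < n; i++) {
            array[i] = i;
            list.add(i);
        }
        // Same answer either way; the difference is memory traffic.
        System.out.println(sumArray(array) == sumList(list)); // true
    }
}
```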
 52. Data Locality vs Java Heap Layout

     class Foo {
         Integer count;
         Bar bar;
         Baz baz;
     }

     // No alignment guarantees
     for (Foo foo : foos) {
         foo.count = 5;
         foo.bar.visit();
     }

     [Diagram: foos slots 0, 1, 2, 3 ... each point at a Foo object whose
     count, bar and baz fields are themselves references elsewhere on the heap.]

 53. Data Locality vs Java Heap Layout
     Serious Java weakness: the location of objects in memory is hard to guarantee.
     GC also interferes:
       Copying
       Compaction
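A common workaround for the layout problem on slides 52 and 53 is to flatten the hot fields into parallel primitive arrays (a "structure of arrays"). This sketch reuses the slide's Foo/count names; the flattening itself is a general technique, not something the deck prescribes.

```java
public class FooColumns {
    // Each field of the original Foo class becomes its own primitive
    // array, so a pass over one field is a linear scan of contiguous memory.
    final int[] count;

    FooColumns(int size) {
        count = new int[size];
    }

    // Equivalent of: for (Foo foo : foos) foo.count = 5;
    // but without chasing one object reference per element.
    void setAllCounts(int value) {
        for (int i = 0; i < count.length; i++) {
            count[i] = value;
        }
    }

    public static void main(String[] args) {
        FooColumns foos = new FooColumns(4);
        foos.setAllCounts(5);
        System.out.println(foos.count[0] + foos.count[3]); // 10
    }
}
```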
 54. Summary
     Cache misses cause stalls, which kill performance.
     Measurable via Performance Event Counters.
     Common techniques exist for optimizing code.

 55. Why care about low level rubbish?
     Branch Prediction
     Memory Access
     Storage
     Conclusions

 56. Hard Disks
     Commonly used persistent storage.
     Spinning rust, with a head to read/write.
     Constant Angular Velocity - rotations per minute stays constant.

 57. A simple model
     Zone Constant Angular Velocity (ZCAV) / Zoned Bit Recording (ZBR).
     Operation Time =
       Time to process the command
       + Time to seek
       + Rotational latency
       + Sequential transfer time

 58. ZBR implies faster transfer at the edge than at the centre.

 59. Seeking vs Sequential reads
     Seek and rotation times dominate for small amounts of data.
     Random writes of 4kb can be 300 times slower than the theoretical max data transfer.
     Consider the impact of context switching between applications or threads.
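Slide 59's point maps directly onto everyday Java I/O: batch writes through one buffered, append-only stream so the disk sees a long sequential pattern rather than scattered small writes. A minimal sketch; the file name and record size are illustrative, not from the deck.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SequentialWrite {
    // Write n fixed-size records through one buffered stream: the OS
    // and disk see one sequential write instead of many small ones.
    static void writeRecords(Path file, byte[] record, int n) throws IOException {
        try (OutputStream out = new BufferedOutputStream(
                Files.newOutputStream(file), 64 * 1024)) {
            for (int i = 0; i < n; i++) {
                out.write(record);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("sequential", ".dat");
        writeRecords(file, new byte[512], 100); // 100 x 512 bytes
        System.out.println(Files.size(file));   // 51200
        Files.delete(file);
    }
}
```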
 60. Fragmentation causes unnecessary seeks.

 61. Sector alignment is irrelevant at the disk level.

 62. Summary
     Simple, sequential access patterns win.
     Fragmentation is your enemy.
     Alignment can be relevant in RAID scenarios.
     SSDs are a different game.

 63. Why care about low level rubbish?
     Branch Prediction
     Memory Access
     Storage
     Conclusions

 64. Software runs on hardware, play nice.

 65. Predictability is a running theme.

 66. Huge theoretical speedups.

 67. YMMV: measured incremental improvements >>> going nuts.

 68. More information
     Articles:
     http://www.akkadia.org/drepper/cpumemory.pdf
     https://gmplib.org/~tege/x86-timing.pdf
     http://psy-lob-saw.blogspot.co.uk/
     http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
     http://mechanical-sympathy.blogspot.co.uk
     http://www.agner.org/optimize/microarchitecture.pdf
     Mailing Lists:
     https://groups.google.com/forum/#!forum/mechanical-sympathy
     https://groups.google.com/a/jclarity.com/forum/#!forum/friends
     http://gee.cs.oswego.edu/dl/concurrency-interest/

 69. Q & A
     @richardwarburto
     insightfullogic.com
     tinyurl.com/java8lambdas
