Pitfalls of object_oriented_programming_gcap_09

  1. Sony Computer Entertainment Europe, Research & Development Division
     Pitfalls of Object Oriented Programming
     Tony Albrecht – Technical Consultant, Developer Services
  2. What I will be covering
     • A quick look at Object Oriented (OO) programming
     • A common example
     • Optimisation of that example
     • Summary
  3. Object Oriented (OO) Programming
     • What is OO programming?
       – A programming paradigm that uses "objects" (data structures consisting of data fields and methods, together with their interactions) to design applications and computer programs. (Wikipedia)
     • Includes features such as
       – Data abstraction
       – Encapsulation
       – Polymorphism
       – Inheritance
  4. What’s OOP for?
     • OO programming allows you to think about problems in terms of objects and their interactions.
     • Each object is (ideally) self contained
       – Contains its own code and data.
       – Defines an interface to its code and data.
     • Each object can be perceived as a "black box".
  5. Objects
     • If objects are self contained then they can be
       – Reused.
       – Maintained without side effects.
       – Used without understanding internal implementation/representation.
     • This is good, yes?
  6. Are Objects Good?
     • Well, yes
     • And no.
     • First some history.
  7. A Brief History of C++ (timeline 1979 to 2009)
     • 1979: C++ development started
  8. A Brief History of C++
     • 1983: Named "C++"
  9. A Brief History of C++
     • 1985: First commercial release
  10. A Brief History of C++
     • 1989: Release of v2.0
  11. A Brief History of C++
     • 1989: Release of v2.0, which added
       – multiple inheritance,
       – abstract classes,
       – static member functions,
       – const member functions,
       – protected members.
  12. A Brief History of C++
     • 1998: Standardised
  13. A Brief History of C++
     • 2003: Updated
  14. A Brief History of C++
     • C++0x: after 2009, date marked "?" on the timeline
  15. So what has changed since 1979?
     • Many more features have been added to C++
     • CPUs have become much faster.
     • Transition to multiple cores
     • Memory has become faster.
     (Image credit: http://www.vintagecomputing.com)
  16. CPU performance
     [Chart] Source: Computer Architecture: A Quantitative Approach, by John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau
  17. CPU/Memory performance
     [Chart] Source: Computer Architecture: A Quantitative Approach, by John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau
  18. What has changed since 1979?
     • One of the biggest changes is that memory access speeds are far slower (relative to the CPU)
       – 1980: RAM latency ~ 1 cycle
       – 2009: RAM latency ~ 400+ cycles
     • What can you do in 400 cycles?
  19. What has this to do with OO?
     • OO classes encapsulate code and data.
     • So, an instantiated object will generally contain all data associated with it.
  20. My Claim
     • With modern HW (particularly consoles), excessive encapsulation is BAD.
     • Data flow should be fundamental to your design (Data Oriented Design)
  21. Consider a simple OO Scene Tree
     • Base Object class
       – Contains general data
     • Node
       – Container class
     • Modifier
       – Updates transforms
     • Drawable/Cube
       – Renders objects
  22. Object
     • Each object
       – Maintains bounding sphere for culling
       – Has transform (local and world)
       – Dirty flag (optimisation)
       – Pointer to Parent
  23. Objects
     [Figure: class definition and memory layout. Each square is 4 bytes.]
  24. Nodes
     • Each Node is an object, plus
       – Has a container of other objects
       – Has a visibility flag.
  25. Nodes
     [Figure: class definition and memory layout.]
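The class definitions on slides 23 and 25 were images and did not survive in this transcript. The following is only a sketch of what such classes could look like, built from the bullet points above. The member names m_Transform, m_WorldTransform, m_Dirty and m_Objects appear later in the slides; every other name (Matrix4, BoundingSphere, m_BoundingSphere, m_WorldBoundingSphere, m_Parent, m_Visible, Add) is an assumption made for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical 4x4 float matrix: 64 bytes, half of a 128-byte cache line.
struct Matrix4 { float m[16]; };

// Hypothetical bounding sphere: centre plus radius, 16 bytes.
struct BoundingSphere { float x, y, z, radius; };

class Node;  // forward declaration for the parent pointer

// Base class: every scene element carries all of its own data.
class Object {
public:
    virtual ~Object() = default;  // virtual functions add a vtable pointer per object

    Matrix4        m_Transform{};            // local transform (named in the slides)
    Matrix4        m_WorldTransform{};       // world transform (named in the slides)
    BoundingSphere m_BoundingSphere{};       // assumed name
    BoundingSphere m_WorldBoundingSphere{};  // assumed name
    Node*          m_Parent = nullptr;       // assumed name
    bool           m_Dirty = true;           // dirty flag (named in the slides)
};

// Container class: an Object that also holds pointers to its children.
class Node : public Object {
public:
    // Hypothetical helper: attach a child and record its parent.
    void Add(Object* child) {
        assert(child != nullptr);
        child->m_Parent = this;
        m_Objects.push_back(child);
    }

    std::vector<Object*> m_Objects;  // children live wherever the heap placed them
    bool m_Visible = true;           // assumed name for the visibility flag
};
```

Note that with this layout a single Object already spans more than one 128-byte cache line, and the children of a Node are reached through pointers stored in separately allocated memory.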
  26. Consider the following code…
     • Update the world transform and world space bounding sphere for each object.
  27. Consider the following code…
     • Leaf nodes (objects) return transformed bounding spheres
  28. Consider the following code…
     • Leaf nodes (objects) return transformed bounding spheres
     • What's wrong with this code?
  29. Consider the following code…
     • Leaf nodes (objects) return transformed bounding spheres
     • If m_Dirty=false then we get branch misprediction which costs 23 or 24 cycles.
  30. Consider the following code…
     • Leaf nodes (objects) return transformed bounding spheres
     • Calculation of the world bounding sphere takes only 12 cycles.
  31. Consider the following code…
     • Leaf nodes (objects) return transformed bounding spheres
     • So using a dirty flag here is actually slower than not using one (in the case where it is false)
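The code these slides refer to was shown as an image and is missing here. As a rough sketch of the pattern being criticised, the following shows a virtual GetWorldBoundingSphere() with a dirty-flag early-out on leaf objects and a node that walks its child pointers. It is not the presenter's code: Transform is a simplified stand-in (uniform scale plus translation) for a 4x4 matrix, and Vec3, Sphere, Concat, TransformSphere and Merge are hypothetical helpers.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };
struct Transform { float scale; Vec3 t; };  // simplified stand-in for a 4x4 matrix
struct Sphere { Vec3 c; float r; };

// Compose two transforms: apply `local` first, then `parent`.
inline Transform Concat(const Transform& parent, const Transform& local) {
    return { parent.scale * local.scale,
             { parent.scale * local.t.x + parent.t.x,
               parent.scale * local.t.y + parent.t.y,
               parent.scale * local.t.z + parent.t.z } };
}

// Move a sphere into the space described by `m` (assumes a positive scale).
inline Sphere TransformSphere(const Transform& m, const Sphere& s) {
    return { { m.scale * s.c.x + m.t.x, m.scale * s.c.y + m.t.y, m.scale * s.c.z + m.t.z },
             m.scale * s.r };
}

// Smallest sphere that contains both a and b.
inline Sphere Merge(const Sphere& a, const Sphere& b) {
    const float dx = b.c.x - a.c.x, dy = b.c.y - a.c.y, dz = b.c.z - a.c.z;
    const float d = std::sqrt(dx * dx + dy * dy + dz * dz);
    if (d + b.r <= a.r) return a;  // b lies inside a
    if (d + a.r <= b.r) return b;  // a lies inside b
    assert(d > 0.0f);              // coincident centres are handled above
    const float r = 0.5f * (d + a.r + b.r);
    const float k = (r - a.r) / d;
    return { { a.c.x + dx * k, a.c.y + dy * k, a.c.z + dz * k }, r };
}

class Object {
public:
    virtual ~Object() = default;

    // Leaf: return the bounding sphere in world space.
    virtual const Sphere& GetWorldBoundingSphere(const Transform& parentTransform) {
        if (!m_Dirty) return m_WorldBoundingSphere;  // branch on freshly loaded data
        m_WorldTransform = Concat(parentTransform, m_Transform);
        m_WorldBoundingSphere = TransformSphere(m_WorldTransform, m_BoundingSphere);
        m_Dirty = false;  // the caller must set it again when a transform changes
        return m_WorldBoundingSphere;
    }

    Transform m_Transform{1.0f, {0.0f, 0.0f, 0.0f}};
    Transform m_WorldTransform{1.0f, {0.0f, 0.0f, 0.0f}};
    Sphere    m_BoundingSphere{{0.0f, 0.0f, 0.0f}, 0.0f};
    Sphere    m_WorldBoundingSphere{{0.0f, 0.0f, 0.0f}, 0.0f};
    bool      m_Dirty = true;
};

class Node : public Object {
public:
    // Container: update own transform, then merge the spheres of all children.
    const Sphere& GetWorldBoundingSphere(const Transform& parentTransform) override {
        m_WorldTransform = Concat(parentTransform, m_Transform);
        bool first = true;
        for (Object* child : m_Objects) {  // pointer chase, then a virtual call per child
            const Sphere& s = child->GetWorldBoundingSphere(m_WorldTransform);
            m_WorldBoundingSphere = first ? s : Merge(m_WorldBoundingSphere, s);
            first = false;
        }
        if (first)  // no children: fall back to the node's own sphere
            m_WorldBoundingSphere = TransformSphere(m_WorldTransform, m_BoundingSphere);
        return m_WorldBoundingSphere;
    }

    std::vector<Object*> m_Objects;
};
```

Each call touches the child's own memory, its vtable, and the function's instructions, which is the access pattern the cache walk-through on the following slides traces.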
  32. Let's illustrate cache usage
     [Diagram: main memory and L2 cache.] Each cache line is 128 bytes.
  33. Cache usage: parentTransform is already in the cache (somewhere).
  34. Cache usage: assume this is a 128-byte boundary (start of a cache line).
  35. Cache usage: load m_Transform into cache.
  36. Cache usage: m_WorldTransform is stored via cache (write-back).
  37. Cache usage: next it loads m_Objects.
  38. Cache usage: then a pointer is pulled from somewhere else (memory managed by std::vector).
  39. Cache usage: vtbl ptr loaded into cache.
  40. Cache usage: look up virtual function.
  41. Cache usage: then branch to that code (load in instructions).
  42. Cache usage: new code checks dirty flag then sets world bounding sphere.
  43. Cache usage: Node's world bounding sphere is then expanded.
  44. Cache usage: then the next Object is processed.
  45. Cache usage: first object costs at least 7 cache misses.
  46. Cache usage: subsequent objects cost at least 2 cache misses each.
  47. The Test
     • 11,111 nodes/objects in a tree 5 levels deep
     • Every node being transformed
     • Hierarchical culling of tree
     • Render method is empty
  48. Performance
     • This is the time taken just to traverse the tree!
  49. Why is it so slow?
     • ~22ms
  50. Look at GetWorldBoundingSphere()
  51. Samples can be a little misleading at the source code level
  52. if(!m_Dirty) comparison
  53. Stalls due to the load 2 instructions earlier
  54. Similarly with the matrix multiply
  55. Some rough calculations
     • Branch mispredictions: 50,421 @ 23 cycles each ~= 0.36ms
  56. Some rough calculations
     • Branch mispredictions: 50,421 @ 23 cycles each ~= 0.36ms
     • L2 cache misses: 36,345 @ 400 cycles each ~= 4.54ms
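The rough calculations above can be checked with a few lines of arithmetic. The slides do not state the clock rate; the sketch below assumes 3.2 GHz, chosen because it reproduces the quoted millisecond figures. StallMs and kAssumedClockHz are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed clock rate (not stated on the slides): 3.2 GHz.
constexpr double kAssumedClockHz = 3.2e9;

// Milliseconds lost to `events` stalls of `cyclesEach` cycles at `clockHz`.
inline double StallMs(std::uint64_t events, std::uint64_t cyclesEach,
                      double clockHz = kAssumedClockHz) {
    assert(clockHz > 0.0);
    const double cycles = static_cast<double>(events * cyclesEach);
    return cycles / clockHz * 1000.0;
}

// Figures from the slide:
//   StallMs(50421, 23)   covers the branch mispredictions
//   StallMs(36345, 400)  covers the L2 cache misses
```

Under that assumption the cache misses cost more than ten times as much as the branch mispredictions, which is why the rest of the talk concentrates on memory layout.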
  57. • From Tuner, ~ 3 L2 cache misses per object
       – These cache misses are mostly sequential (more than 1 fetch from main memory can happen at once)
       – Code/cache miss/code/cache miss/code…
  58. Slow memory is the problem here
     • How can we fix it?
     • And still keep the same functionality and interface?
  59. The first step
     • Use homogeneous, sequential sets of data
  60. Homogeneous Sequential Data
  61. Generating Contiguous Data
     • Use custom allocators
       – Minimal impact on existing code
     • Allocate contiguous
       – Nodes
       – Matrices
       – Bounding spheres
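The slides do not show the allocator that was used. One minimal way to get contiguous nodes, matrices and bounding spheres is a fixed-capacity pool per type, sketched below. LinearPool and its methods are hypothetical names, and a real engine allocator would likely manage raw memory and alignment explicitly rather than wrapping std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Example payload type: a 4x4 float matrix (64 bytes).
struct Matrix4 { float m[16]; };

// Hypothetical fixed-capacity pool: every T lives in one contiguous block,
// in allocation order, so walking the objects walks memory linearly.
template <typename T>
class LinearPool {
public:
    explicit LinearPool(std::size_t capacity) { m_Storage.reserve(capacity); }

    // Construct a T in the next free slot and return a stable pointer to it.
    template <typename... Args>
    T* Create(Args&&... args) {
        // Growing past the reserved capacity would reallocate and invalidate
        // every pointer handed out so far, so treat it as a usage error.
        assert(m_Storage.size() < m_Storage.capacity() && "pool exhausted");
        m_Storage.emplace_back(std::forward<Args>(args)...);
        return &m_Storage.back();
    }

    std::size_t Size() const { return m_Storage.size(); }

private:
    std::vector<T> m_Storage;
};
```

Existing code can keep holding `T*` pointers, which fits the "minimal impact on existing code" point: only the place where objects are created changes.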
  62. Performance
     • 19.6ms -> 12.9ms
     • 35% faster just by moving things around in memory!
  63. What next?
     • Process data in order
     • Use implicit structure for hierarchy
       – Minimise to and fro from nodes.
     • Group logic to optimally use what is already in cache.
     • Remove regularly called virtuals.
  64. Hierarchy: we start with a parent Node
  65. Hierarchy: which has child nodes
  66. Hierarchy: and they have a parent
  67. Hierarchy: and they have children
  68. Hierarchy: and they all have parents
  69. Hierarchy: a lot of this information can be inferred
  70. Hierarchy: use a set of arrays, one per hierarchy level
  71. Hierarchy: the parent has 2 children (":2"); those children have 4 children each (":4")
  72. Hierarchy: ensure nodes and their data are contiguous in memory
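The hierarchy slides describe the layout only through diagrams. A sketch of one way to store it: one array per level, with each node recording only how many children it has, so that parent and child links are implied by position. Level, FlatHierarchy, FirstChild, IsConsistent and MakeSlideExample are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One hierarchy level. Nodes are stored contiguously, with siblings adjacent
// and sibling groups in the same order as their parents.
struct Level {
    std::vector<std::uint32_t> childCount;  // children each node has in the next level
};

struct FlatHierarchy {
    std::vector<Level> levels;  // levels[0] holds the root(s)
};

// Index, within level l + 1, of the first child of node n in level l.
// No pointers are stored: the position is a prefix sum of child counts.
inline std::size_t FirstChild(const FlatHierarchy& h, std::size_t l, std::size_t n) {
    assert(l < h.levels.size() && n < h.levels[l].childCount.size());
    std::size_t first = 0;
    for (std::size_t i = 0; i < n; ++i) first += h.levels[l].childCount[i];
    return first;
}

// True when every level holds exactly as many nodes as the level above declares.
inline bool IsConsistent(const FlatHierarchy& h) {
    for (std::size_t l = 0; l + 1 < h.levels.size(); ++l) {
        std::size_t total = 0;
        for (std::uint32_t c : h.levels[l].childCount) total += c;
        if (total != h.levels[l + 1].childCount.size()) return false;
    }
    return h.levels.empty() || [&] {
        for (std::uint32_t c : h.levels.back().childCount) if (c != 0) return false;
        return true;
    }();
}

// The example from the slide: one root with 2 children, each with 4 children.
inline FlatHierarchy MakeSlideExample() {
    FlatHierarchy h;
    h.levels.resize(3);
    h.levels[0].childCount = {2};
    h.levels[1].childCount = {4, 4};
    h.levels[2].childCount.assign(8, 0);
    return h;
}
```

In a sequential pass the prefix sum is never computed explicitly: a running child index is simply incremented while iterating the parents in order.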
  73. • Make the processing global rather than local
       – Pull the updates out of the objects.
         • No more virtuals
       – Easier to understand too: all code in one place.
  74. Need to change some things…
     • OO version
       – Update transform top down and expand WBS (world bounding sphere) bottom up
  75. [Tree diagram] Update transform
  76. [Tree diagram] Update transform
  77. [Tree diagram] Update transform and world bounding sphere
  78. [Tree diagram] Add bounding sphere of child
  79. [Tree diagram] Update transform and world bounding sphere
  80. [Tree diagram] Add bounding sphere of child
  81. [Tree diagram] Update transform and world bounding sphere
  82. [Tree diagram] Add bounding sphere of child
  83. • Hierarchical bounding spheres pass info up
     • Transforms cascade down
     • Data use and code is "striped".
       – Processing is alternating
  84. Conversion to linear
     • To do this with a "flat" hierarchy, break it into 2 passes
       – Update the transforms and bounding spheres (from top down)
       – Expand bounding spheres (bottom up)
  85. Transform and BS updates
     For each node at each level (top down)
     {
         multiply world transform by parent's
         transform wbs by world transform
     }
  86. Update bounding sphere hierarchies
     For each node at each level (bottom up)
     {
         add wbs to parent's
         cull wbs against frustum
     }
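The two passes in the pseudocode above could be written roughly as follows over per-level arrays. This is a sketch under stated assumptions, not the presenter's implementation: Transform is a simplified stand-in (uniform scale plus translation) for a 4x4 matrix, frustum culling is omitted, and all type and function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
struct Transform { float scale; Vec3 t; };  // simplified stand-in for a 4x4 matrix
struct Sphere { Vec3 c; float r; };

inline Transform Concat(const Transform& p, const Transform& l) {
    return { p.scale * l.scale,
             { p.scale * l.t.x + p.t.x, p.scale * l.t.y + p.t.y, p.scale * l.t.z + p.t.z } };
}

inline Sphere TransformSphere(const Transform& m, const Sphere& s) {
    return { { m.scale * s.c.x + m.t.x, m.scale * s.c.y + m.t.y, m.scale * s.c.z + m.t.z },
             m.scale * s.r };
}

// Smallest sphere that contains both a and b.
inline Sphere Merge(const Sphere& a, const Sphere& b) {
    const float dx = b.c.x - a.c.x, dy = b.c.y - a.c.y, dz = b.c.z - a.c.z;
    const float d = std::sqrt(dx * dx + dy * dy + dz * dz);
    if (d + b.r <= a.r) return a;
    if (d + a.r <= b.r) return b;
    const float r = 0.5f * (d + a.r + b.r);
    const float k = (r - a.r) / d;  // d > 0 here, otherwise a case above applies
    return { { a.c.x + dx * k, a.c.y + dy * k, a.c.z + dz * k }, r };
}

// One tree level as parallel contiguous arrays; siblings are adjacent and
// sibling groups follow the order of their parents.
struct Level {
    std::vector<Transform>   local, world;
    std::vector<Sphere>      bs, wbs;     // local and world bounding spheres
    std::vector<std::size_t> childCount;  // children each node has in the next level
};

// Pass 1 (top down): world transform and world bounding sphere of every node.
inline void UpdateTransforms(std::vector<Level>& levels) {
    for (Level& lv : levels) {
        assert(lv.bs.size() == lv.local.size() && lv.childCount.size() == lv.local.size());
        lv.world.resize(lv.local.size());
        lv.wbs.resize(lv.local.size());
    }
    if (levels.empty()) return;
    Level& root = levels[0];
    for (std::size_t i = 0; i < root.local.size(); ++i) {
        root.world[i] = root.local[i];  // roots have no parent transform
        root.wbs[i] = TransformSphere(root.world[i], root.bs[i]);
    }
    for (std::size_t l = 0; l + 1 < levels.size(); ++l) {
        Level& parents = levels[l];
        Level& children = levels[l + 1];
        std::size_t child = 0;  // running index: children are read strictly in order
        for (std::size_t n = 0; n < parents.world.size(); ++n) {
            for (std::size_t k = 0; k < parents.childCount[n]; ++k, ++child) {
                children.world[child] = Concat(parents.world[n], children.local[child]);
                children.wbs[child] = TransformSphere(children.world[child], children.bs[child]);
            }
        }
    }
}

// Pass 2 (bottom up): grow each parent's sphere to enclose its children.
inline void ExpandBoundingSpheres(std::vector<Level>& levels) {
    for (std::size_t l = levels.size(); l-- > 1;) {
        Level& parents = levels[l - 1];
        Level& children = levels[l];
        std::size_t child = 0;
        for (std::size_t n = 0; n < parents.wbs.size(); ++n)
            for (std::size_t k = 0; k < parents.childCount[n]; ++k, ++child)
                parents.wbs[n] = Merge(parents.wbs[n], children.wbs[child]);
    }
}
```

Both passes are plain loops over arrays with no virtual calls and no dirty-flag branch, so the memory accessed next is always the next array element.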
  87. Update Transform and Bounding Sphere
     • How many child nodes to process
  88. Update Transform and Bounding Sphere
     • For each child, update transform and bounding sphere
  89. Update Transform and Bounding Sphere
     • Note the contiguous arrays
  90. So, what’s happening in the cache?
     [Diagram labels: Parent; Children node; Children's data not needed; Parent Data; Children's Data; Unified L2 Cache]
  91. Load parent and its transform
  92. Load child transform and set world transform
  93. Load child BS and set WBS
  94. Load child BS and set WBS
     • Next child is calculated with no extra cache misses!
  95. Load child BS and set WBS
     • The next 2 children incur 2 cache misses in total
  96. Prefetching
     • Because all data is linear, we can predict what memory will be needed in ~400 cycles and prefetch
  97. • Tuner scans show about 1.7 cache misses per node.
     • But, these misses are much more frequent
       – Code/cache miss/cache miss/code
       – Less stalling
  98. Performance
     • 19.6 -> 12.9 -> 4.8ms
  99. Prefetching
     • Data accesses are now predictable
     • Can use prefetch (dcbt) to warm the cache
       – Data streams can be tricky
       – Many reasons for stream termination
       – Easier to just use dcbt blindly
         • (look ahead x number of iterations)
  100. Prefetching example
     • Prefetch a predetermined number of iterations ahead
     • Ignore incorrect prefetches
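The prefetching example on slide 100 was an image. The sketch below shows the same idea, prefetching a fixed number of iterations ahead while walking a contiguous array. It uses the GCC/Clang builtin __builtin_prefetch in place of the PowerPC dcbt instruction the slides mention; the PREFETCH macro, kPrefetchAhead and SumRadii are hypothetical names, and the look-ahead distance of 8 is an arbitrary placeholder that would need tuning by measurement.

```cpp
#include <cassert>
#include <cstddef>

// GCC and Clang provide __builtin_prefetch; on other compilers this is a no-op.
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch(addr)
#else
#define PREFETCH(addr) ((void)0)
#endif

struct Sphere { float x, y, z, r; };

// Assumed look-ahead distance in iterations (placeholder value).
constexpr std::size_t kPrefetchAhead = 8;

// Sum the radii of `count` contiguous spheres, requesting the element
// kPrefetchAhead iterations ahead before working on the current one.
inline float SumRadii(const Sphere* spheres, std::size_t count) {
    assert(spheres != nullptr || count == 0);
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        // The bounds check keeps the address computation valid in portable C++.
        // The slide's approach skips it and simply ignores useless prefetches.
        if (i + kPrefetchAhead < count)
            PREFETCH(&spheres[i + kPrefetchAhead]);
        total += spheres[i].r;
    }
    return total;
}
```

A prefetch is only a hint, so the result is identical with or without it; only the timing changes.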
  101. Performance
     • 19.6 -> 12.9 -> 4.8 -> 3.3ms
  102. A Warning on Prefetching
     • This example makes very heavy use of the cache
     • This can affect other threads' use of the cache
       – Multiple threads with heavy cache use may thrash the cache
  103. The old scan
     • ~22ms
  104. The new scan
     • ~16.6ms
  105. Up close
     • ~16.6ms
  106. Looking at the code (samples)
  107. Performance counters
     • Branch mispredictions: 2,867 (cf. 47,000)
     • L2 cache misses: 16,064 (cf. 36,000)
  108. In Summary
     • Just reorganising data locations was a win
     • Data + code reorganisation = dramatic improvement.
     • + prefetching equals even more WIN.
  109. OO is not necessarily EVIL
     • Be careful not to design yourself into a corner
     • Consider data in your design
       – Can you decouple data from objects?
       – …code from objects?
     • Be aware of what the compiler and HW are doing
  110. It's all about the memory
     • Optimise for data first, then code.
       – Memory access is probably going to be your biggest bottleneck
     • Simplify systems
       – KISS
       – Easier to optimise, easier to parallelise
  111. Homogeneity
     • Keep code and data homogeneous
       – Avoid introducing variations
       – Don't test for exceptions: sort by them.
     • Not everything needs to be an object
       – If you must have a pattern, then consider using Managers
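"Don't test for exceptions, sort by them" can be illustrated with a small sketch. Instead of branching on a flag inside the hot loop, the data is partitioned once so that all elements needing the same treatment are contiguous. The Entity type and both function names are hypothetical; the slides give no code for this point.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Entity {
    bool  visible;
    float radius;
};

// Move all visible entities to the front and return how many there are.
// After this, the hot loop needs no per-element visibility test.
inline std::size_t SortByVisibility(std::vector<Entity>& entities) {
    auto mid = std::partition(entities.begin(), entities.end(),
                              [](const Entity& e) { return e.visible; });
    return static_cast<std::size_t>(mid - entities.begin());
}

// Branch-free inner loop over the homogeneous (all visible) prefix.
inline float SumVisibleRadii(const std::vector<Entity>& entities, std::size_t visibleCount) {
    assert(visibleCount <= entities.size());
    float total = 0.0f;
    for (std::size_t i = 0; i < visibleCount; ++i) total += entities[i].radius;
    return total;
}
```

std::partition does not preserve relative order; std::stable_partition would, at some extra cost, if the order of the visible entities matters.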
  112. Remember
     • You are writing a GAME
       – You have control over the input data
       – Don't be afraid to preformat it, drastically if need be.
     • Design for specifics, not generics (generally).
  113. Data Oriented Design Delivers
     • Better performance
     • Better realisation of code optimisations
     • Often simpler code
     • More parallelisable code
  114. The END
     • Questions?
