Transcript

  • 1. Sony Computer Entertainment Europe, Research & Development Division. Pitfalls of Object Oriented Programming. Tony Albrecht – Technical Consultant, Developer Services
  • 2. What I will be covering • A quick look at Object Oriented (OO) programming • A common example • Optimisation of that example • Summary Slide 2
  • 3. Object Oriented (OO) Programming • What is OO programming? – a programming paradigm that uses "objects" – data structures consisting of data fields and methods together with their interactions – to design applications and computer programs. (Wikipedia) • Includes features such as – Data abstraction – Encapsulation – Polymorphism – Inheritance Slide 3
  • 4. What's OOP for? • OO programming allows you to think about problems in terms of objects and their interactions. • Each object is (ideally) self-contained – Contains its own code and data. – Defines an interface to its code and data. • Each object can be perceived as a 'black box'. Slide 4
  • 5. Objects• If objects are self contained then they can be – Reused. – Maintained without side effects. – Used without understanding internal implementation/representation.• This is good, yes? Slide 5
  • 6. Are Objects Good?• Well, yes• And no.• First some history. Slide 6
  • 7. A Brief History of C++ – C++ development started: 1979. (Timeline: 1979–2009) Slide 7
  • 8. A Brief History of C++ – named "C++": 1983. Slide 8
  • 9. A Brief History of C++ – first commercial release: 1985. Slide 9
  • 10. A Brief History of C++ – release of v2.0: 1989. Slide 10
  • 11. A Brief History of C++ – v2.0 added multiple inheritance, abstract classes, static member functions, const member functions and protected members. Slide 11
  • 12. A Brief History of C++ – standardised: 1998. Slide 12
  • 13. A Brief History of C++ – updated: 2003. Slide 13
  • 14. A Brief History of C++ – C++0x: 2009–? Slide 14
  • 15. So what has changed since 1979? • Many more features have been added to C++ • CPUs have become much faster. • Transition to multiple cores • Memory has become faster. http://www.vintagecomputing.com Slide 15
  • 16. CPU performance Slide 16 Computer architecture: a quantitative approach By John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau
  • 17. CPU/Memory performance Slide 17 Computer architecture: a quantitative approach By John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau
  • 18. What has changed since 1979?• One of the biggest changes is that memory access speeds are far slower (relatively) – 1980: RAM latency ~ 1 cycle – 2009: RAM latency ~ 400+ cycles• What can you do in 400 cycles? Slide 18
  • 19. What has this to do with OO?• OO classes encapsulate code and data.• So, an instantiated object will generally contain all data associated with it. Slide 19
  • 20. My Claim• With modern HW (particularly consoles), excessive encapsulation is BAD.• Data flow should be fundamental to your design (Data Oriented Design) Slide 20
  • 21. Consider a simple OO Scene Tree• Base Object class – Contains general data• Node – Container class• Modifier – Updates transforms• Drawable/Cube – Renders objects Slide 21
  • 22. Object• Each object – Maintains bounding sphere for culling – Has transform (local and world) – Dirty flag (optimisation) – Pointer to Parent Slide 22
  • 23. Objects – Class Definition and Memory Layout diagrams. Each square is 4 bytes. Slide 23
  • 24. Nodes• Each Node is an object, plus – Has a container of other objects – Has a visibility flag. Slide 24
  • 25. Nodes – Class Definition and Memory Layout diagrams. Slide 25
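A minimal sketch of the classes behind these memory-layout diagrams (the structure follows the slides; the exact names and types are my assumptions, not the presentation's code). The point to notice is how much data every instance drags into the cache, and that children are reached through a vector of pointers scattered across the heap:

```cpp
#include <cstddef>
#include <vector>

struct Matrix4 { float m[16]; };              // 64 bytes
struct Sphere  { float x, y, z, radius; };    // 16 bytes

class Object {
public:
    virtual ~Object() {}
    virtual void Update() {}
protected:
    Object*  m_Parent;                // pointer chase to reach the parent
    Matrix4  m_Transform;             // local transform
    Matrix4  m_WorldTransform;        // cached world transform
    Sphere   m_BoundingSphere;        // local-space bounds
    Sphere   m_WorldBoundingSphere;   // world-space bounds used for culling
    bool     m_Dirty;                 // skip recalculation when clean
};

class Node : public Object {
public:
    void AddObject(Object* obj) { m_Objects.push_back(obj); }
protected:
    std::vector<Object*> m_Objects;   // children live elsewhere on the heap
    bool                 m_Visible;   // visibility flag from the slide
};
```

Touching even one field of such an object pulls a whole 128-byte cache line (or more) of mostly unrelated data.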
  • 26. Consider the following code…• Update the world transform and world space bounding sphere for each object. Slide 26
  • 27. Consider the following code…• Leaf nodes (objects) return transformed bounding spheres Slide 27
  • 28. Consider the following code… • Leaf nodes (objects) return transformed bounding spheres. What's wrong with this code? Slide 28
  • 29. Consider the following code… • If m_Dirty == false then we get a branch misprediction, which costs 23 or 24 cycles. Slide 29
  • 30. Consider the following code… • Calculation of the world bounding sphere takes only 12 cycles. Slide 30
  • 31. Consider the following code… • So using a dirty flag here (in the case where it is false) is actually slower than not using one. Slide 31
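A sketch of the dirty-flag pattern these slides critique (function and member names are assumptions; the transform body is a stand-in for the real matrix-by-sphere multiply). When the flag is usually false, the mispredicted early-out branch (~23-24 cycles) costs more than the ~12-cycle recalculation it tries to avoid:

```cpp
struct Matrix4 { float m[16]; };
struct Sphere  { float x, y, z, radius; };

// Stand-in for the real world-space transform (~12 cycles on the hardware
// discussed in the slides); here it only applies the matrix translation.
Sphere TransformSphere(const Matrix4& world, const Sphere& s) {
    Sphere out = s;
    out.x += world.m[12];
    out.y += world.m[13];
    out.z += world.m[14];
    return out;
}

class Object {
public:
    const Sphere& GetWorldBoundingSphere() {
        if (!m_Dirty)                 // mispredicted branch: ~23-24 cycles
            return m_WorldBoundingSphere;
        m_WorldBoundingSphere =
            TransformSphere(m_WorldTransform, m_BoundingSphere);
        m_Dirty = false;
        return m_WorldBoundingSphere;
    }
    // Members left public purely to keep the sketch short.
    Matrix4 m_WorldTransform{};
    Sphere  m_BoundingSphere{};
    Sphere  m_WorldBoundingSphere{};
    bool    m_Dirty = true;
};
```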
  • 32. Let's illustrate cache usage – Main Memory / L2 Cache diagram. Each cache line is 128 bytes. Slide 32
  • 33. Cache usage – parentTransform is already in the cache (somewhere). Slide 33
  • 34. Cache usage – assume this is a 128-byte boundary (start of a cache line). Slide 34
  • 35. Cache usage – load m_Transform into the cache. Slide 35
  • 36. Cache usage – m_WorldTransform is stored via the cache (write-back). Slide 36
  • 37. Cache usage – next it loads m_Objects. Slide 37
  • 38. Cache usage – then a pointer is pulled from somewhere else (memory managed by std::vector). Slide 38
  • 39. Cache usage – vtbl ptr loaded into the cache. Slide 39
  • 40. Cache usage – look up the virtual function. Slide 40
  • 41. Cache usage – then branch to that code (load in instructions). Slide 41
  • 42. Cache usage – new code checks the dirty flag, then sets the world bounding sphere. Slide 42
  • 43. Cache usage – the Node's world bounding sphere is then expanded. Slide 43
  • 44. Cache usage – then the next Object is processed. Slide 44
  • 45. Cache usage – the first object costs at least 7 cache misses. Slide 45
  • 46. Cache usage – subsequent objects cost at least 2 cache misses each. Slide 46
  • 47. The Test• 11,111 nodes/objects in a tree 5 levels deep• Every node being transformed• Hierarchical culling of tree• Render method is empty Slide 47
  • 48. Performance This is the time taken just to traverse the tree! Slide 48
  • 49. Why is it so slow? ~22ms Slide 49
  • 50. Look at GetWorldBoundingSphere() Slide 50
  • 51. Samples can be a little misleading at the source-code level. Slide 51
  • 52. if(!m_Dirty) comparison Slide 52
  • 53. Stalls due to the load 2 instructions earlier. Slide 53
  • 54. Similarly with the matrix multiply. Slide 54
  • 55. Some rough calculations – Branch mispredictions: 50,421 @ 23 cycles each ~= 0.36ms Slide 55
  • 56. Some rough calculations – Branch mispredictions: 50,421 @ 23 cycles each ~= 0.36ms. L2 cache misses: 36,345 @ 400 cycles each ~= 4.54ms Slide 56
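These figures check out if we assume a ~3.2 GHz clock, typical of the consoles of that era (the slides give only cycle counts and the resulting milliseconds, so the clock rate is my assumption):

```cpp
// Converts a count of stall events, each costing a fixed number of cycles,
// into milliseconds at an assumed clock rate (default 3.2 GHz).
double StallMs(double events, double cyclesPerEvent, double hz = 3.2e9) {
    return events * cyclesPerEvent / hz * 1000.0;
}
// StallMs(50421, 23)  ~= 0.36 ms  (branch mispredictions)
// StallMs(36345, 400) ~= 4.54 ms  (L2 cache misses)
```

So of the ~22 ms traversal, cache misses alone plausibly account for over 4.5 ms of pure stalling.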
  • 57. • From Tuner, ~ 3 L2 cache misses per object – These cache misses are mostly sequential (more than 1 fetch from main memory can happen at once) – Code/cache miss/code/cache miss/code… Slide 57
  • 58. Slow memory is the problem here• How can we fix it?• And still keep the same functionality and interface? Slide 58
  • 59. The first step• Use homogenous, sequential sets of data Slide 59
  • 60. Homogeneous Sequential Data Slide 60
  • 61. Generating Contiguous Data• Use custom allocators – Minimal impact on existing code• Allocate contiguous – Nodes – Matrices – Bounding spheres Slide 61
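A minimal arena sketch for the "custom allocators" step (the class name and interface are assumptions, not the presentation's actual code). Everything allocated from one arena ends up back to back in memory, so a linear traversal walks forward through cache lines instead of hopping around the general heap:

```cpp
#include <cstddef>
#include <vector>

// Hands out objects laid out contiguously. Capacity is reserved up front so
// that previously returned pointers are never invalidated by reallocation.
template <typename T>
class Arena {
public:
    explicit Arena(std::size_t capacity) { m_Items.reserve(capacity); }

    T* Allocate() {
        m_Items.push_back(T());
        return &m_Items.back();   // adjacent to the previous allocation
    }
private:
    std::vector<T> m_Items;
};
```

Routing existing `new` calls for nodes, matrices and bounding spheres through arenas like this is the "minimal impact on existing code" the slide refers to: the call sites barely change, only the placement in memory does.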
  • 62. Performance 19.6ms -> 12.9ms 35% faster just by moving things around in memory! Slide 62
  • 63. What next?• Process data in order• Use implicit structure for hierarchy – Minimise to and fro from nodes.• Group logic to optimally use what is already in cache.• Remove regularly called virtuals. Slide 63
  • 64. Hierarchy – we start with a parent Node. Slide 64
  • 65. Hierarchy – which has children nodes. Slide 65
  • 66. Hierarchy – and they have a parent. Slide 66
  • 67. Hierarchy – and they have children. Slide 67
  • 68. Hierarchy – and they all have parents. Slide 68
  • 69. Hierarchy – a lot of this information can be inferred. Slide 69
  • 70. Hierarchy – use a set of arrays, one per hierarchy level. Slide 70
  • 71. Hierarchy – parent has 2 children (:2); children have 4 children each (:4). Slide 71
  • 72. Hierarchy – ensure nodes and their data are contiguous in memory. Slide 72
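The per-level arrays described above can be sketched as follows (type and field names are my assumptions). Each node records only how many children it has; parent/child links are implied by position, so the pointers, and their cache misses, disappear:

```cpp
#include <cstddef>
#include <vector>

struct Matrix4 { float m[16]; };
struct Sphere  { float x, y, z, radius; };

struct FlatNode {
    Matrix4  world;       // world transform, stored contiguously
    Sphere   wbs;         // world bounding sphere
    unsigned childCount;  // the ":2" / ":4" annotations from the slides
};

// One contiguous array per tree level: level[0] = root(s),
// level[1] = their children in order, and so on.
using Hierarchy = std::vector<std::vector<FlatNode>>;

// The invariant that replaces explicit links: the children of level i,
// taken in order, are exactly the nodes of level i+1.
bool Consistent(const Hierarchy& h) {
    for (std::size_t i = 0; i + 1 < h.size(); ++i) {
        std::size_t total = 0;
        for (const FlatNode& n : h[i]) total += n.childCount;
        if (total != h[i + 1].size()) return false;
    }
    return true;
}
```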
  • 73. • Make the processing global rather than local – Pull the updates out of the objects. • No more virtuals – Easier to understand too – all code in one place. Slide 73
  • 74. Need to change some things…• OO version – Update transform top down and expand WBS bottom up Slide 74
  • 75. Update transform. Slide 75
  • 76. Update transform. Slide 76
  • 77. Update transform and world bounding sphere. Slide 77
  • 78. Add bounding sphere of child. Slide 78
  • 79. Update transform and world bounding sphere. Slide 79
  • 80. Add bounding sphere of child. Slide 80
  • 81. Update transform and world bounding sphere. Slide 81
  • 82. Add bounding sphere of child. Slide 82
  • 83. • Hierarchical bounding spheres pass info up • Transforms cascade down • Data use and code is 'striped'. – Processing is alternating Slide 83
  • 84. Conversion to linear • To do this with a 'flat' hierarchy, break it into 2 passes – Update the transforms and bounding spheres (top down) – Expand bounding spheres (bottom up) Slide 84
  • 85. Transform and BS updates – for each node at each level (top down) { multiply world transform by parent's; transform wbs by world transform } Slide 85
  • 86. Update bounding sphere hierarchies – for each node at each level (bottom up) { add wbs to parent's; cull wbs against frustum } Slide 86
  • 87. Update Transform and Bounding Sphere – how many child nodes to process. Slide 87
  • 88. Update Transform and Bounding Sphere – for each child, update transform and bounding sphere. Slide 88
  • 89. Update Transform and Bounding Sphere – note the contiguous arrays. Slide 89
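The two flat passes can be sketched as below. To keep the sketch short, a single scalar offset stands in for a full matrix and spheres live on a line; the traversal pattern (strictly in-order walks over contiguous, per-level arrays) is the part that mirrors the slides, while all names are assumptions:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Level {
    std::vector<float>    local;       // local offset from parent
    std::vector<float>    world;       // computed world position
    std::vector<float>    radius;      // world bounding-sphere radius
    std::vector<unsigned> childCount;  // children in the next level, in order
};

// Pass 1 (top down): world position = parent's world + local offset.
// The parent is read once, then its children are streamed through in order.
void UpdateTransforms(std::vector<Level>& levels) {
    for (std::size_t l = 1; l < levels.size(); ++l) {
        std::size_t child = 0;
        for (std::size_t p = 0; p < levels[l - 1].world.size(); ++p)
            for (unsigned c = 0; c < levels[l - 1].childCount[p]; ++c, ++child)
                levels[l].world[child] =
                    levels[l - 1].world[p] + levels[l].local[child];
    }
}

// Pass 2 (bottom up): grow each parent's sphere to enclose its children.
void ExpandBoundingSpheres(std::vector<Level>& levels) {
    if (levels.size() < 2) return;
    for (std::size_t l = levels.size() - 1; l > 0; --l) {
        std::size_t child = 0;
        for (std::size_t p = 0; p < levels[l - 1].world.size(); ++p)
            for (unsigned c = 0; c < levels[l - 1].childCount[p]; ++c, ++child) {
                float reach = std::fabs(levels[l].world[child] -
                                        levels[l - 1].world[p]) +
                              levels[l].radius[child];
                levels[l - 1].radius[p] =
                    std::max(levels[l - 1].radius[p], reach);
            }
    }
}
```

Nothing here chases a pointer or calls a virtual: every index is predictable, which is also what makes the prefetching discussed later possible.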
  • 90. So, what's happening in the cache? Parent and children nodes share the unified L2 cache; the children's data is not needed yet. Slide 90
  • 91. Load the parent and its transform. Slide 91
  • 92. Load the child transform and set the world transform. Slide 92
  • 93. Load the child BS and set the WBS. Slide 93
  • 94. The next child is calculated with no extra cache misses! Slide 94
  • 95. The next 2 children incur 2 cache misses in total. Slide 95
  • 96. Prefetching – because all data is linear, we can predict what memory will be needed in ~400 cycles and prefetch it. Slide 96
  • 97. • Tuner scans show about 1.7 cache misses per node.• But, these misses are much more frequent – Code/cache miss/cache miss/code – Less stalling Slide 97
  • 98. Performance 19.6 -> 12.9 -> 4.8ms Slide 98
  • 99. Prefetching• Data accesses are now predictable• Can use prefetch (dcbt) to warm the cache – Data streams can be tricky – Many reasons for stream termination – Easier to just use dcbt blindly • (look ahead x number of iterations) Slide 99
  • 100. Prefetching example• Prefetch a predetermined number of iterations ahead• Ignore incorrect prefetches Slide 100
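The slides name the PPU's dcbt instruction; as a portable stand-in, this sketch uses GCC/Clang's `__builtin_prefetch` (the loop shape and look-ahead constant are my assumptions). A fixed look-ahead is used blindly, as the slide suggests, and the tail-end prefetches that would run past the array are simply skipped:

```cpp
#include <cstddef>
#include <vector>

float SumWithPrefetch(const std::vector<float>& data) {
    // Tune so the prefetched line arrives in roughly the ~400 cycles
    // a main-memory fetch takes; the right value is workload-dependent.
    const std::size_t kLookAhead = 16;
    float sum = 0.0f;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i + kLookAhead < data.size())
            __builtin_prefetch(&data[i + kLookAhead]);  // warm the cache
        sum += data[i];   // by now this element should already be resident
    }
    return sum;
}
```

Incorrect or useless prefetches cost little, which is why blind look-ahead beats carefully managed data streams here.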
  • 101. Performance 19.6 -> 12.9 -> 4.8 -> 3.3ms Slide 101
  • 102. A Warning on Prefetching • This example makes very heavy use of the cache • This can affect other threads' use of the cache – Multiple threads with heavy cache use may thrash the cache Slide 102
  • 103. The old scan ~22ms Slide 103
  • 104. The new scan ~16.6ms Slide 104
  • 105. Up close ~16.6ms Slide 105
  • 106. Looking at the code (samples) Slide 106
  • 107. Performance counters – Branch mispredictions: 2,867 (cf. 47,000); L2 cache misses: 16,064 (cf. 36,000). Slide 107
  • 108. In Summary • Just reorganising data locations was a win • Data + code reorganisation = dramatic improvement. • + prefetching equals even more WIN. Slide 108
  • 109. OO is not necessarily EVIL• Be careful not to design yourself into a corner• Consider data in your design – Can you decouple data from objects? – …code from objects?• Be aware of what the compiler and HW are doing Slide 109
  • 110. It's all about the memory • Optimise for data first, then code. – Memory access is probably going to be your biggest bottleneck • Simplify systems – KISS – Easier to optimise, easier to parallelise Slide 110
  • 111. Homogeneity • Keep code and data homogenous – Avoid introducing variations – Don't test for exceptions – sort by them. • Not everything needs to be an object – If you must have a pattern, then consider using Managers Slide 111
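"Don't test for exceptions – sort by them" can be sketched like this (the scenario and names are my illustration, not the presentation's code): instead of branching on a flag inside the hot loop, partition the data into homogeneous lists up front and run a branch-free loop over each:

```cpp
#include <vector>

struct Particle { float x, vx; bool frozen; };

void Integrate(std::vector<Particle>& all, float dt) {
    // One pass of "sorting by the exception"...
    std::vector<Particle*> moving, frozen;
    for (Particle& p : all)
        (p.frozen ? frozen : moving).push_back(&p);

    // ...then the hot loop has no per-element branch at all.
    for (Particle* p : moving)
        p->x += p->vx * dt;

    // Frozen particles need no work whatsoever.
}
```

In a real system the two lists would be maintained incrementally rather than rebuilt each frame; the point is that the exceptional case never pollutes the common-case loop.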
  • 112. Remember • You are writing a GAME – You have control over the input data – Don't be afraid to preformat it – drastically if need be. • Design for specifics, not generics (generally). Slide 112
  • 113. Data Oriented Design Delivers• Better performance• Better realisation of code optimisations• Often simpler code• More parallelisable code Slide 113
  • 114. The END• Questions? Slide 114