Pitfalls of Object Oriented Programming (GCAP 09)
 

    Presentation Transcript

    • Pitfalls of Object Oriented Programming. Tony Albrecht – Technical Consultant, Developer Services. Sony Computer Entertainment Europe, Research & Development Division. Slide 1
    • What I will be covering: a quick look at Object Oriented (OO) programming; a common example; optimisation of that example; summary. Slide 2
    • Object Oriented (OO) Programming. What is OO programming? A programming paradigm that uses "objects" – data structures consisting of data fields and methods together with their interactions – to design applications and computer programs (Wikipedia). Includes features such as data abstraction, encapsulation, polymorphism and inheritance. Slide 3
    • What's OOP for? OO programming allows you to think about problems in terms of objects and their interactions. Each object is (ideally) self contained – contains its own code and data, and defines an interface to its code and data. Each object can be perceived as a 'black box'. Slide 4
    • Objects• If objects are self contained then they can be – Reused. – Maintained without side effects. – Used without understanding internal implementation/representation.• This is good, yes? Slide 5
    • Are Objects Good?• Well, yes• And no.• First some history. Slide 6
    • A Brief History of C++: 1979 – C++ development started. Slide 7
    • A Brief History of C++: 1983 – Named "C++". Slide 8
    • A Brief History of C++: 1985 – First commercial release. Slide 9
    • A Brief History of C++: 1989 – Release of v2.0. Slide 10
    • A Brief History of C++: v2.0 added multiple inheritance, abstract classes, static member functions, const member functions and protected members. Slide 11
    • A Brief History of C++: 1998 – Standardised. Slide 12
    • A Brief History of C++: 2003 – Updated. Slide 13
    • A Brief History of C++: 2009? – C++0x. Slide 14
    • So what has changed since 1979?• Many more features have been added to C++• CPUs have become much faster.• Transition to multiple cores• Memory has become faster. http://www.vintagecomputing.com Slide 15
    • CPU performance Slide 16 Computer architecture: a quantitative approach By John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau
    • CPU/Memory performance Slide 17 Computer architecture: a quantitative approach By John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau
    • What has changed since 1979?• One of the biggest changes is that memory access speeds are far slower (relatively) – 1980: RAM latency ~ 1 cycle – 2009: RAM latency ~ 400+ cycles• What can you do in 400 cycles? Slide 18
    • What has this to do with OO?• OO classes encapsulate code and data.• So, an instantiated object will generally contain all data associated with it. Slide 19
    • My Claim• With modern HW (particularly consoles), excessive encapsulation is BAD.• Data flow should be fundamental to your design (Data Oriented Design) Slide 20
    • Consider a simple OO Scene Tree• Base Object class – Contains general data• Node – Container class• Modifier – Updates transforms• Drawable/Cube – Renders objects Slide 21
    • Object• Each object – Maintains bounding sphere for culling – Has transform (local and world) – Dirty flag (optimisation) – Pointer to Parent Slide 22
    • Objects: class definition and memory layout (each square is 4 bytes). Slide 23
    • Nodes• Each Node is an object, plus – Has a container of other objects – Has a visibility flag. Slide 24
    • Nodes: class definition and memory layout. Slide 25
    • Consider the following code…• Update the world transform and world space bounding sphere for each object. Slide 26
    • Consider the following code…• Leaf nodes (objects) return transformed bounding spheres Slide 27
    • Leaf nodes (objects) return transformed bounding spheres. What's wrong with this code? Slide 28
    • If m_Dirty == false then we get a branch misprediction, which costs 23 or 24 cycles. Slide 29
    • Calculation of the world bounding sphere takes only 12 cycles. Slide 30
    • So using a dirty flag here is actually slower than not using one (in the case where it is false). Slide 31
    • Let's illustrate cache usage: Main Memory and L2 Cache. Each cache line is 128 bytes. Slide 32
    • Cache usage: parentTransform is already in the cache (somewhere). Slide 33
    • Cache usage: assume this is a 128-byte boundary (start of cache line). Slide 34
    • Cache usage: load m_Transform into cache. Slide 35
    • Cache usage: m_WorldTransform is stored via cache (write-back). Slide 36
    • Cache usage: next it loads m_Objects. Slide 37
    • Cache usage: then a pointer is pulled from somewhere else (memory managed by std::vector). Slide 38
    • Cache usage: vtbl ptr loaded into cache. Slide 39
    • Cache usage: look up virtual function. Slide 40
    • Cache usage: then branch to that code (load in instructions). Slide 41
    • Cache usage: new code checks dirty flag then sets world bounding sphere. Slide 42
    • Cache usage: Node's world bounding sphere is then expanded. Slide 43
    • Cache usage: then the next Object is processed. Slide 44
    • Cache usage: first object costs at least 7 cache misses. Slide 45
    • Cache usage: subsequent objects cost at least 2 cache misses each. Slide 46
    • The Test• 11,111 nodes/objects in a tree 5 levels deep• Every node being transformed• Hierarchical culling of tree• Render method is empty Slide 47
    • Performance This is the time taken just to traverse the tree! Slide 48
    • Why is it so slow? ~22ms Slide 49
    • Look at GetWorldBoundingSphere() Slide 50
    • Samples can be a little misleading at the source-code level. Slide 51
    • if(!m_Dirty) comparison. Slide 52
    • Stalls due to the load 2 instructions earlier. Slide 53
    • Similarly with the matrix multiply. Slide 54
    • Some rough calculationsBranch Mispredictions: 50,421 @ 23 cycles each ~= 0.36ms Slide 55
    • Some rough calculationsBranch Mispredictions: 50,421 @ 23 cycles each ~= 0.36msL2 Cache misses: 36,345 @ 400 cycles each ~= 4.54ms Slide 56
    • • From Tuner, ~ 3 L2 cache misses per object – These cache misses are mostly sequential (more than 1 fetch from main memory can happen at once) – Code/cache miss/code/cache miss/code… Slide 57
    • Slow memory is the problem here• How can we fix it?• And still keep the same functionality and interface? Slide 58
    • The first step• Use homogenous, sequential sets of data Slide 59
    • Homogeneous Sequential Data Slide 60
    • Generating Contiguous Data• Use custom allocators – Minimal impact on existing code• Allocate contiguous – Nodes – Matrices – Bounding spheres Slide 61
    • Performance 19.6ms -> 12.9ms 35% faster just by moving things around in memory! Slide 62
    • What next?• Process data in order• Use implicit structure for hierarchy – Minimise to and fro from nodes.• Group logic to optimally use what is already in cache.• Remove regularly called virtuals. Slide 63
    • Hierarchy: we start with a parent Node. Slide 64
    • Hierarchy: which has children nodes. Slide 65
    • Hierarchy: and they have a parent. Slide 66
    • Hierarchy: and they have children. Slide 67
    • Hierarchy: and they all have parents. Slide 68
    • Hierarchy: a lot of this information can be inferred. Slide 69
    • Hierarchy: use a set of arrays, one per hierarchy level. Slide 70
    • Hierarchy: the parent has 2 children (:2); the children have 4 children each (:4). Slide 71
    • Hierarchy: ensure nodes and their data are contiguous in memory. Slide 72
    • Make the processing global rather than local – pull the updates out of the objects. No more virtuals – easier to understand too, all code in one place. Slide 73
    • Need to change some things… OO version: update transform top down and expand WBS bottom up. Slide 74
    • Update transform. Slide 75
    • Update transform. Slide 76
    • Update transform and world bounding sphere. Slide 77
    • Add bounding sphere of child. Slide 78
    • Update transform and world bounding sphere. Slide 79
    • Add bounding sphere of child. Slide 80
    • Update transform and world bounding sphere. Slide 81
    • Add bounding sphere of child. Slide 82
    • Hierarchical bounding spheres pass info up. Transforms cascade down. Data use and code is 'striped' – processing is alternating. Slide 83
    • Conversion to linear: to do this with a 'flat' hierarchy, break it into 2 passes – update the transforms and bounding spheres (from the top down), then expand bounding spheres (from the bottom up). Slide 84
    • Transform and BS updates. For each node at each level (top down) { multiply world transform by parent's transform; transform wbs by world transform }. Slide 85
    • Update bounding sphere hierarchies. For each node at each level (bottom up) { add wbs to parent's; cull wbs against frustum }. Slide 86
    • Update Transform and Bounding Sphere How many children nodes to process Slide 87
    • Update Transform and Bounding Sphere For each child, update transform and bounding sphere Slide 88
    • Update Transform and Bounding Sphere Note the contiguous arrays Slide 89
    • So, what's happening in the cache? Parent node and children nodes; the children's data is not needed yet. (Unified L2 Cache: Parent Data, Children's Data.) Slide 90
    • Load parent and its transform. Slide 91
    • Load child transform and set world transform. Slide 92
    • Load child BS and set WBS. Slide 93
    • The next child is calculated with no extra cache misses! Slide 94
    • The next 2 children incur 2 cache misses in total. Slide 95
    • Prefetching: because all data is linear, we can predict what memory will be needed in ~400 cycles and prefetch. Slide 96
    • Tuner scans show about 1.7 cache misses per node. But these misses are much more frequent – code/cache miss/cache miss/code – less stalling. Slide 97
    • Performance 19.6 -> 12.9 -> 4.8ms Slide 98
    • Prefetching• Data accesses are now predictable• Can use prefetch (dcbt) to warm the cache – Data streams can be tricky – Many reasons for stream termination – Easier to just use dcbt blindly • (look ahead x number of iterations) Slide 99
    • Prefetching example• Prefetch a predetermined number of iterations ahead• Ignore incorrect prefetches Slide 100
    • Performance 19.6 -> 12.9 -> 4.8 -> 3.3ms Slide 101
    • A Warning on Prefetching: this example makes very heavy use of the cache. This can affect other threads' use of the cache – multiple threads with heavy cache use may thrash the cache. Slide 102
    • The old scan ~22ms Slide 103
    • The new scan ~16.6ms Slide 104
    • Up close ~16.6ms Slide 105
    • Looking at the code (samples) Slide 106
    • Performance counters. Branch mispredictions: 2,867 (cf. 47,000). L2 cache misses: 16,064 (cf. 36,000). Slide 107
    • In Summary: just reorganising data locations was a win. Data + code reorganisation = dramatic improvement. Plus prefetching equals even more WIN. Slide 108
    • OO is not necessarily EVIL• Be careful not to design yourself into a corner• Consider data in your design – Can you decouple data from objects? – …code from objects?• Be aware of what the compiler and HW are doing Slide 109
    • It's all about the memory. Optimise for data first, then code – memory access is probably going to be your biggest bottleneck. Simplify systems – KISS – easier to optimise, easier to parallelise. Slide 110
    • Homogeneity: keep code and data homogeneous – avoid introducing variations; don't test for exceptions – sort by them. Not everything needs to be an object – if you must have a pattern, then consider using Managers. Slide 111
    • Remember: you are writing a GAME – you have control over the input data. Don't be afraid to preformat it – drastically if need be. Design for specifics, not generics (generally). Slide 112
    • Data Oriented Design Delivers• Better performance• Better realisation of code optimisations• Often simpler code• More parallelisable code Slide 113
    • The END• Questions? Slide 114