
Advanced Databases: Ben Stopford


  1. Data Storage for Extreme Use Cases: The Lay of the Land and a Peek at ODC. Ben Stopford, RBS
  2. How fast is a HashMap lookup?
  3. That's how long it takes light to travel a room
  4. How fast is a database lookup?
  5. That's how long it takes light to go to Australia and back
  6. Computers really are very fast!
  7. The problem is we're quite good at writing software that slows them down
  8. Question: Is it fair to compare the performance of a database with a HashMap?
  9. Of course not…
  10. Mechanical Sympathy. [Log-scale timeline from ms to ps: cross-continental round trip; Ethernet ping; 1MB disk/Ethernet read; RDMA over InfiniBand; 1MB main-memory read; main-memory reference; L2 cache reference; L1 cache reference. * An L1 ref is about 2 clock cycles, or 0.7ns: the time it takes light to travel 20cm.]
  11. Key Point #1: Simple computer programs, operating in a single address space, are extremely fast.
  12. Why are there so many types of database these days? …because we need different architectures for different jobs
  13. Times are changing
  14. Traditional Database Architecture is Aging
  15. The Traditional Architecture
  16. [Diagram: the architecture space. Traditional, Shared Disk, Shared Nothing, In Memory and Distributed In Memory, arranged along an axis of an increasingly simple contract.]
  17. Key Point #2: Different architectural decisions about how we store and access data are needed in different environments. Our 'context' has changed.
  18. Simplifying the Contract
  19. How big is the internet? 5 exabytes (which is 5,000 petabytes, or 5,000,000 terabytes)
  20. How big is an average enterprise database? 80% < 1TB (in 2009)
  21. The context of our problem has changed
  22. Simplifying the Contract
  23. Databases have huge operational overheads. [Taken from "OLTP Through the Looking Glass, and What We Found There", Harizopoulos et al.]
  24. Avoid that overhead with a simpler contract and by avoiding IO
  25. Key Point #3: For the very top-end data volumes a simpler contract is mandatory. ACID is simply not possible.
  26. Key Point #3 (addendum): But we should always retain ACID properties if our use case allows it.
  27. Options for scaling out the traditional architecture
  28. #1: The Shared Disk Architecture
  29. #2: The Shared Nothing Architecture
  30. Each machine is responsible for a subset of the records, and each record exists on only one machine. [Diagram: a client routing to six nodes, each owning a disjoint range of keys.] A sketch of this routing follows below.
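To make the shared-nothing idea concrete, here is a minimal sketch (not from the deck; all names are illustrative) of how a client might pick the one node that owns a record:

```java
import java.util.List;

// Minimal shared-nothing routing sketch. Each record's key hashes to
// exactly one node, so every node owns a disjoint subset of the records.
public class PartitionRouter {
    private final List<String> nodes; // e.g. host:port addresses

    public PartitionRouter(List<String> nodes) {
        this.nodes = nodes;
    }

    // floorMod keeps the bucket non-negative even for negative hash codes.
    public String nodeFor(Object key) {
        int bucket = Math.floorMod(key.hashCode(), nodes.size());
        return nodes.get(bucket);
    }
}
```

A real grid would use consistent hashing or a partition table so that data does not reshuffle when nodes join, but the ownership principle is the same.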
  31. #3: The In-Memory Database (single address space)
  32. Databases must cache subsets of the data in memory
  33. Not knowing what you don't know: 90% in cache, the rest of the data on disk
  34. If you can fit it ALL in memory you know everything!!
  35. The architecture of an in-memory database
  36. Memory is at least 100x faster than disk. [Log-scale timeline from ms to ps: cross-continental round trip; 1MB disk/network read; cross-network round trip; 1MB main-memory read; main-memory reference; L2 cache reference; L1 cache reference. * An L1 ref is about 2 clock cycles, or 0.7ns: the time it takes light to travel 20cm.]
  37. Random vs. Sequential Access
  38. This makes them very fast!!
  39. The proof is in the stats: TPC-H benchmarks on a 1TB data set
  40. So why haven't in-memory databases taken off?
  41. Address spaces are relatively small and of a finite, fixed size
  42. Durability
  43. One solution is distribution
  44. Distributed In-Memory (Shared Nothing)
  45. Again we spread our data, but this time using only RAM. [Diagram: a client and six nodes, each holding a subset of the keys in memory.]
  46. Distribution solves our two problems
  47. We get massive amounts of parallel processing
  48. But at the cost of losing the single address space
  49. [Diagram repeated: Traditional, Shared Disk, Shared Nothing, In Memory and Distributed In Memory along the simpler-contract axis.]
  50. Key Point #4: There are three key forces. Distribution: gain scalability through a distributed architecture. No Disk: all data is held in RAM. Simplify the contract: improve scalability by picking appropriate ACID properties.
  51. These three non-functional themes lay behind the design of ODC, RBS's in-memory data warehouse
  52. ODC
  53. ODC represents a balance between throughput and latency
  54. What is Latency?
  55. What is Throughput?
  56. Which is best for latency: the Shared Nothing (Distributed) In-Memory Database or the Traditional Database?
  57. Which is best for throughput: the Shared Nothing (Distributed) In-Memory Database or the Traditional Database?
  58. So why do we use distributed in-memory? Memory is plentiful in today's hardware, and it serves both latency and throughput.
  59. ODC: a distributed, shared-nothing, in-memory, semi-normalised, real-time graph DB. 450 processes, 2TB of RAM, with topic-based messaging as the system of record (persistence).
  60. The Layers: an Access Layer (Java client APIs), a Query Layer, a Data Layer (Transactions, MTMs, Cashflows) and a Persistence Layer.
  61. Three Tools of Distributed Data Architecture: Indexing, Partitioning, Replication
  62. How should we use these tools?
  63. Replication puts data everywhere, but your storage is limited by the memory on a node
  64. Partitioning scales: storage, bandwidth and processing all grow with the cluster. But associating data in different partitions implies moving it.
  65. So we have some data, and our data is bound together in a model. [Entity diagram: Desk, Sub Name, Trader, Party, Trade.]
  66. Which we save. [Diagram: the Trade, Trader and Party objects scattered across machines.]
  67. Binding them back together involves a "distributed join" => lots of network hops. [Diagram: hops between the nodes holding Trader, Party and Trade.]
  68. The hops have to be spread over time. [Diagram: hops laid out along a network/time axis.]
  69. Lots of network hops make it slow
  70. OK, what if we held it all together? "Denormalised"
  71. Hence denormalisation is FAST! (for reads)
  72. Denormalisation implies the duplication of some sub-entities
  73. …and that means managing consistency over lots of copies
  74. …and all the duplication means you run out of space really quickly
  75. Space issues are exacerbated further when data is versioned, and you need versioning to do MVCC. [Diagram: the whole Trader/Party/Trade graph copied for versions 1 to 4.] A sketch of such a versioned key follows below.
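As an illustration of why versioning multiplies space (this is not ODC code; the composite key is hypothetical), each MVCC snapshot addresses entities by id plus version, so every new version is a whole extra copy of the entity:

```java
// Hypothetical composite key for MVCC-style versioning: readers pin a
// version number and look up (id, version) pairs, giving consistent
// snapshots at the cost of one full copy per version.
public final class VersionedKey {
    private final String entityId;
    private final int version;

    public VersionedKey(String entityId, int version) {
        this.entityId = entityId;
        this.version = version;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof VersionedKey)) return false;
        VersionedKey k = (VersionedKey) o;
        return version == k.version && entityId.equals(k.entityId);
    }

    @Override
    public int hashCode() {
        return 31 * entityId.hashCode() + version;
    }
}
```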
  76. And reconstituting a previous time slice becomes very difficult. [Diagram: versioned Trade, Trader and Party entries scattered across nodes.]
  77. So we want to hold entities separately (normalised) to alleviate concerns around consistency and space usage
  78. Remember, this means the object graph will be split across multiple machines, with each singleton entity independently versioned. [Diagram: single Trade, Trader and Party entries on different nodes.]
  79. Binding them back together involves a "distributed join" => lots of network hops. [Diagram repeated.]
  80. Whereas in the denormalised model the join is already done
  81. So what we want is the advantages of a normalised store at the speed of a denormalised one! This is what using Snowflake Schemas and the Connected Replication pattern is all about!
  82. Looking more closely: why does normalisation mean we have to spread data around the cluster? Why can't we hold it all together?
  83. It's all about the keys
  84. We can collocate data with common keys, but if the keys crosscut, the only way to collocate is to replicate. [Diagram: crosscutting keys vs. common keys.]
  85. We tackle this problem with a hybrid model: Trade is partitioned, while Trader and Party are replicated.
  86. We adapt the concept of a Snowflake Schema.
  87. Taking the concept of Facts and Dimensions
  88. Everything starts from a core Fact (Trades, for us)
  89. Facts are big, dimensions are small
  90. Facts have one key that relates them all (used to partition)
  91. Dimensions have many keys (which crosscut the partitioning key)
  92. Looking at the data. Facts: big, common keys. Dimensions: small, crosscutting keys.
  93. We remember we are a grid. We should avoid the distributed join.
  94. …so we only want to "join" data that is in the same process. Use a key assignment policy (e.g. KeyAssociations in Coherence) so that Trades and their MTMs share a common key. A sketch follows below.
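Coherence's KeyAssociation interface is one way to express such a policy: a cache key reports an "associated key", and entries with the same associated key land in the same partition. The MtmKey class below is an illustrative sketch, not ODC's real code:

```java
import com.tangosol.net.cache.KeyAssociation;

// Illustrative sketch (not ODC's real classes): the key for an MTM entry
// declares the parent trade id as its associated key, so Coherence puts
// the MTM in the same partition as its Trade. Joins between the two then
// never leave the process.
public class MtmKey implements KeyAssociation {
    private final String mtmId;
    private final String tradeId; // the common, partitioning key

    public MtmKey(String mtmId, String tradeId) {
        this.mtmId = mtmId;
        this.tradeId = tradeId;
    }

    @Override
    public Object getAssociatedKey() {
        return tradeId; // collocate with the parent Trade entry
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MtmKey)) return false;
        MtmKey k = (MtmKey) o;
        return mtmId.equals(k.mtmId) && tradeId.equals(k.tradeId);
    }

    @Override
    public int hashCode() {
        return 31 * mtmId.hashCode() + tradeId.hashCode();
    }
}
```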
  95. So we prescribe different physical storage for Facts and Dimensions: Trade partitioned, Trader and Party replicated.
  96. Facts are partitioned, dimensions are replicated. [Diagram: Trader and Party replicated in the Query Layer; Transactions, MTMs and Cashflows in partitioned fact storage in the Data Layer.]
  97. Facts are partitioned, dimensions are replicated. [Diagram: Dimensions (replicate) vs. Facts (distribute/partition): Transactions, MTMs and Cashflows in partitioned fact storage.]
  98. The data volumes back this up as a sensible hypothesis. Facts: big => distribute. Dimensions: small => replicate.
  99. Key Point: We use a variant on a Snowflake Schema to partition big entities that can be related via a partitioning key, and replicate small stuff whose keys can't map to our partitioning key.
  100. Replicate vs. Distribute
  101. So how does this help us to run queries without distributed joins? Select Transaction, MTM, ReferenceData From MTM, Transaction, Ref Where Cost Centre = 'CC1'
  102. What would this look like without this pattern? Get Cost Centres, Get Ledger Books, Get Source Books, Get Transactions, Get MTMs, Get Legs, Get Cost Centres: a chain of network hops spread over time.
  103. But by balancing Replication and Partitioning we don't need all those hops. [The same chain collapses to a single network step.]
  104. Stage 1: Focus on the where clause: Where Cost Centre = 'CC1'
  105. Stage 1: Get the right keys to query the Facts. [Diagram: the query's dimensions are joined in the Query Layer; Transactions, MTMs and Cashflows sit partitioned below.]
  106. Stage 2: Cluster-join to get the Facts. [Diagram: dimensions joined in the Query Layer; facts joined across the cluster in the partitioned layer.]
  107. Stage 2: Join the facts together efficiently, as we know they are collocated
  108. Stage 3: Augment the raw Facts with the relevant Dimensions. [Diagram: dimensions joined in the Query Layer; facts joined across the cluster.]
  109. Stage 3: Bind the relevant dimensions to the result. The three stages are sketched in code below.
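Put together, the three stages might look like the following sketch (all types and method names here are hypothetical, not ODC's API). Stages 1 and 3 run locally against replicated dimension data; only stage 2 touches the cluster, and that join is collocated by the partitioning key:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the three query stages. Dimensions are
// replicated, so stages 1 and 3 are local; only stage 2 touches the
// cluster, and that fact-to-fact join is collocated.
public class SnowflakeQuery {

    interface FactCluster {
        // Joins Transactions, MTMs and Cashflows partition-side.
        List<Fact> joinCollocatedFacts(Set<String> tradeIds);
    }
    record Fact(String tradeId, String dimensionKey) {}
    record Result(Fact fact, Object dimension) {}

    private final Map<String, Set<String>> costCentreIndex; // replicated index
    private final Map<String, Object> dimensionCache;       // replicated dims
    private final FactCluster cluster;                      // partitioned facts

    SnowflakeQuery(Map<String, Set<String>> idx, Map<String, Object> dims, FactCluster c) {
        this.costCentreIndex = idx; this.dimensionCache = dims; this.cluster = c;
    }

    List<Result> run(String costCentre) {
        // Stage 1: resolve the where-clause against replicated dimensions
        // to get the partitioning keys of the facts we need.
        Set<String> tradeIds = costCentreIndex.get(costCentre);

        // Stage 2: join the facts together inside each partition; since
        // they are collocated by trade id, no data crosses the network.
        List<Fact> facts = cluster.joinCollocatedFacts(tradeIds);

        // Stage 3: bind the relevant dimensions to the result, locally.
        List<Result> results = new ArrayList<>();
        for (Fact f : facts) {
            results.add(new Result(f, dimensionCache.get(f.dimensionKey())));
        }
        return results;
    }
}
```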
  110. Bringing it together: replicated Dimensions, partitioned Facts and Java client APIs. We never have to do a distributed join!
  111. So all the big stuff is held partitioned, and we can join without shipping keys around and having intermediate results
  112. We get to do this… [Diagram: the Trade, Trader and Party graph, bound together.]
  113. …and this… [Diagram: the graph versioned, versions 1 to 4.]
  114. …and this… [Diagram: reconstituting a time slice.]
  115. …without the problems of this…
  116. …or this…
  117. …all at the speed of this… well, almost!
  118. But there is a fly in the ointment…
  119. I lied earlier. These aren't all Facts. [Diagram: one of the 'Facts' is really a dimension: it has a different key to the Facts, and it's BIG.]
  120. We can't replicate really big stuff… we'll run out of space => big Dimensions are a problem.
  121. Fortunately there is a simple solution!
  122. Whilst there are lots of these big dimensions, a large majority are never used. They are not all "connected".
  123. If there are no Trades for Goldmans in the data store, then a Trade query will never need the Goldmans Counterparty
  124. Looking at the Dimension data, some are quite large
  125. But Connected Dimension data is tiny by comparison
  126. One recent independent study from the database community showed that 80% of data remains unused
  127. So we only replicate "Connected" or "Used" dimensions
  128. As data is written to the data store we keep our "Connected Caches" up to date. [Diagram: as new Facts are added to the partitioned fact storage (Transactions, MTMs, Cashflows) in the Data Layer, the relevant Dimensions they reference are moved to the replicated dimension caches in the Processing Layer.]
  129. The Replicated Layer is updated by recursing through the arcs on the domain model when facts change
  130. Saving a trade causes all its 1st-level references to be triggered. [Diagram: the cache store saves the Trade into the partitioned, normalised Data Layer; a trigger caches its Party Alias, Source Book and Ccy.]
  131. This updates the connected caches. [Diagram: Party Alias, Source Book and Ccy now sit in the Query Layer's connected dimension caches.]
  132. The process recurses through the object graph. [Diagram: Party and Ledger Book are pulled in next.]
  133. "Connected Replication": a simple pattern which recurses through the foreign keys in the domain model, ensuring only "Connected" dimensions are replicated. A sketch of the recursion follows below.
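A minimal sketch of that recursion, assuming hypothetical Entity and ReplicatedCache abstractions (this shows the shape of the pattern, not ODC's implementation):

```java
import java.util.HashSet;
import java.util.Set;

// Connected Replication sketch: when a fact is written, walk its
// foreign-key arcs recursively and push any dimension it references
// into the replicated caches. Unreferenced dimensions are never copied.
public class ConnectedReplicator {

    interface Entity {
        String key();
        Set<Entity> references(); // outgoing foreign-key arcs
    }
    interface ReplicatedCache {
        boolean contains(String key);
        void put(Entity dimension);
    }

    private final ReplicatedCache dimensionCache;

    ConnectedReplicator(ReplicatedCache cache) { this.dimensionCache = cache; }

    /** Triggered on every fact write in the partitioned data layer. */
    void onFactWrite(Entity fact) {
        for (Entity dim : fact.references()) {
            replicate(dim, new HashSet<>());
        }
    }

    private void replicate(Entity dim, Set<String> visited) {
        // Stop at cycles and at already-replicated dimensions, so only
        // newly 'connected' data gets pushed out.
        if (!visited.add(dim.key()) || dimensionCache.contains(dim.key())) return;
        dimensionCache.put(dim);
        for (Entity next : dim.references()) {
            replicate(next, visited);
        }
    }
}
```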
  134. With "Connected Replication" only 1/10th of the data needs to be replicated (on average).
  135. Limitations of this approach
  136–142. Conclusion. [A sequence of diagram slides recapping the design; slide 141 highlights Partitioned Storage.]
  143. The End
