Data Storage for Extreme Use Cases: The Lay of the Land and a Peek at ODC
Ben Stopford : RBS
How fast is a HashMap lookup?
That's how long it takes light to travel a room
How fast is a database lookup?
That's how long it takes light to go to Australia and back
Computers really are very fast!
The problem is we're quite good at writing software that slows them down
Question: Is it fair to compare the performance of a Database with a HashMap?
Of course not…
Mechanical Sympathy
[Latency chart, from ps to ms: L1 cache ref (~0.7 ns, about two clock cycles, the time it takes light to travel 20cm), L2 cache ref, main memory ref, 1MB from main memory, 1MB from disk/Ethernet, RDMA over InfiniBand, Ethernet ping, cross-continental round trip.]
Key Point #1: Simple computer programs, operating in a single address space, are extremely fast.
Why are there so many types of database these days? …because we need different architectures for different jobs
Times are changing
Traditional Database Architecture is Aging
The Traditional Architecture
[Diagram: the spectrum of architectures: Traditional, Shared Disk, Shared Nothing, In Memory, Distributed In Memory, moving towards a simpler contract.]
Key Point #2: Different architectural decisions about how we store and access data are needed in different environments. Our 'Context' has changed.
Simplifying the Contract
How big is the internet? 5 exabytes (which is 5,000 petabytes or 5,000,000 terabytes)
How big is an average enterprise database? 80% < 1TB (in 2009)
The context of our problem has changed
Simplifying the Contract
Databases have huge operational overheads (taken from "OLTP Through the Looking Glass, and What We Found There", Harizopoulos et al.)
Avoid that overhead with a simpler contract and by avoiding IO
Key Point #3: For the very top-end data volumes a simpler contract is mandatory. ACID is simply not possible.
Key Point #3 (addendum): But we should always retain ACID properties if our use case allows it.
Options for scaling out the traditional architecture
#1: The Shared Disk Architecture
#2: The Shared Nothing Architecture
Each machine is responsible for a subset of the records. Each record exists on only one machine.
[Diagram: a client routing requests to nodes that each hold a disjoint range of keys: 1, 2, 3…; 97, 98, 99…; 169, 170…; 244, 245…; 333, 334…; 765, 769…]
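The shared-nothing idea can be sketched in a few lines of Java. Everything here (the node names, the simple modulo scheme) is illustrative, not from the talk; a production grid would use a richer partitioning strategy, such as consistent hashing, so that adding a node moves fewer records:

```java
import java.util.List;

// Minimal sketch of shared-nothing routing: each record key maps to
// exactly one node, so every record lives on exactly one machine.
public class ShardRouter {
    private final List<String> nodes;

    public ShardRouter(List<String> nodes) {
        this.nodes = nodes;
    }

    // Deterministic key -> node mapping. Every client computes the same
    // answer, so no central directory is needed.
    public String nodeFor(long recordKey) {
        int idx = (int) Math.floorMod(recordKey, nodes.size());
        return nodes.get(idx);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(List.of("node-a", "node-b", "node-c"));
        // Keys spread across the nodes; each key has exactly one home.
        System.out.println(router.nodeFor(97)); // node-b
        System.out.println(router.nodeFor(98)); // node-c
    }
}
```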
#3: The In Memory Database (single address-space)
Databases must cache subsets of the data in memory
Not knowing what you don't know: 90% in cache, data on disk
If you can fit it ALL in memory you know everything!!
The architecture of an in memory database
Memory is at least 100x faster than disk
[Latency chart, from ps to ms: L1 cache ref (~0.7 ns, about two clock cycles, the time it takes light to travel 20cm), L2 cache ref, main memory ref, 1MB from main memory, 1MB from disk/network, cross-network round trip, cross-continental round trip.]
Random vs. Sequential Access
This makes them very fast!!
The proof is in the stats: TPC-H benchmarks on a 1TB data set
So why haven't in-memory databases taken off?
Address-spaces are relatively small and of a finite, fixed size
Durability
One solution is distribution
Distributed In Memory (Shared Nothing)
Again we spread our data, but this time using only RAM.
[Diagram: a client routing requests to nodes that each hold a disjoint range of keys in memory.]
Distribution solves our two         problems
We get massive amounts of   parallel processing
But at the cost of losing the single address space
[Diagram: the spectrum of architectures: Traditional, Shared Disk, Shared Nothing, In Memory, Distributed In Memory, moving towards a simpler contract.]
Key Point #4: There are three key forces. Distribution: gain scalability through a distributed architecture. No disk: all data is held in RAM. Simplify the contract: improve scalability by picking appropriate ACID properties.
These three non-functional themes lie behind the design of ODC, RBS's in-memory data warehouse
ODC
ODC represents a balance between throughput and latency
What is Latency?
What is Throughput?
Which is best for latency? [Diagram: Shared Nothing (Distributed) In-Memory Database vs. Traditional Database.]
Which is best for throughput? [Diagram: Shared Nothing (Distributed) In-Memory Database vs. Traditional Database.]
So why do we use distributed in-memory? In memory gives us latency; plentiful hardware gives us throughput.
ODC: Distributed, Shared Nothing, In Memory, Semi-Normalised, Realtime Graph DB. 450 processes. 2TB of RAM. Messaging (topic based) as a system of record (persistence).
The Layers: Access Layer (Java client APIs), Query Layer, Data Layer (Transactions, MTMs, Cashflows), Persistence Layer
Three Tools of Distributed Data Architecture: Indexing, Partitioning, Replication
How should we use these tools?
Replication puts data everywhere. But your storage is limited by the memory on a node.
Partitioning scales: scalable storage, bandwidth and processing. But associating data in different partitions implies moving it.
So we have some data. Our data is bound together in a model. [Diagram: the domain model linking Desk, Name, Trader, Sub-Party and Trade.]
Which we save… [Diagram: Trader, Party and Trade entities saved across the grid.]
Binding them back together involves a "distributed join" => lots of network hops
The hops have to be spread over time. [Diagram: network hops plotted against time.]
Lots of network hops make it slow
OK, what if we held it all together?? "Denormalised"
Hence denormalisation is FAST! (for reads)
Denormalisation implies the duplication of some sub-entities
…and that means managing consistency over lots of copies
…and all the duplication means you run out of space really quickly
Space issues are exacerbated further when data is versioned. [Diagram: versions 1 to 4 of the Trader/Party/Trade graph.] …and you need versioning to do MVCC
And reconstituting a previous time slice becomes very difficult. [Diagram: assembling a time slice from entity versions scattered across the graph.]
So we want to hold entities separately (normalised) to alleviate concerns around consistency and space usage
Remember this means the object graph will be split across multiple machines, and the data is independently versioned. [Diagram: a singleton Trader and Party referenced by Trades on different machines.]
Binding them back together involves a "distributed join" => lots of network hops
Whereas in the denormalised model the join is already done
So what we want is the advantages of a normalised store at the speed of a denormalised one! This is what using Snowflake Schemas and the Connected Replication pattern is all about!
Looking more closely: why does normalisation mean we have to spread data around the cluster? Why can't we hold it all together?
It's all about the keys
We can collocate data with common keys, but if the keys crosscut, the only way to collocate is to replicate. [Diagram: crosscutting keys vs. common keys.]
We tackle this problem with a hybrid model: Trader and Party are replicated; Trade is partitioned.
We adapt the concept of a Snowflake Schema.
Taking the concept of Facts and Dimensions
Everything starts from a Core Fact (Trades for us)
Facts are big, dimensions are small
Facts have one key that relates them all (used to partition)
Dimensions have many keys (which crosscut the partitioning key)
Looking at the data: Facts => big, with common keys. Dimensions => small, with crosscutting keys.
We remember we are a grid. We should avoid the distributed join.
…so we only want to 'join' data that is in the same process. Use a key assignment policy (e.g. KeyAssociation in Coherence) so that Trades and MTMs share a common key.
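As a sketch of what a key assignment policy looks like, the hypothetical MtmKey below exposes its parent Trade's id as an "associated key", so any partitioning scheme that hashes the associated key rather than the full key lands an MTM in the same partition (and process) as its Trade. Coherence's KeyAssociation works on this principle; the class and method names here are invented:

```java
import java.util.Objects;

// Sketch of key affinity: the MTM's key carries its parent Trade's id,
// and the grid partitions on that id, not on the whole key.
public class MtmKey {
    final long mtmId;
    final long tradeId;   // the common key used to partition

    MtmKey(long mtmId, long tradeId) {
        this.mtmId = mtmId;
        this.tradeId = tradeId;
    }

    // The grid hashes this value when choosing a partition.
    public Object associatedKey() {
        return tradeId;
    }

    // Stand-in for the grid's key -> partition function.
    static int partitionOf(Object key, int partitionCount) {
        return Math.floorMod(Objects.hashCode(key), partitionCount);
    }

    public static void main(String[] args) {
        MtmKey a = new MtmKey(101, 42);
        MtmKey b = new MtmKey(102, 42);
        // Both MTMs of trade 42 land in the same partition, so the
        // Trade/MTM join never crosses a process boundary.
        System.out.println(partitionOf(a.associatedKey(), 13)
                == partitionOf(b.associatedKey(), 13)); // true
    }
}
```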
So we prescribe different physical storage for Facts and Dimensions: Trader and Party are replicated; Trade is partitioned.
Facts are partitioned, dimensions are replicated. [Diagram: the Query Layer holds Trader, Party and Trade; the Data Layer holds Transactions, MTMs and Cashflows in partitioned fact storage.]
Facts are partitioned, dimensions are replicated. [Diagram: Dimensions (replicated); Facts (distributed/partitioned): Transactions, MTMs, Cashflows.]
The data volumes back this up as a sensible hypothesis: Facts => big => distribute. Dimensions => small => replicate.
Key Point: We use a variant on a Snowflake Schema to partition big entities that can be related via a partitioning key, and replicate small stuff whose keys can't map to our partitioning key.
Replicate / Distribute
So how do they help us to run queries without distributed joins? Select Transaction, MTM, ReferenceData From MTM, Transaction, Ref Where Cost Centre = 'CC1'
What would this look like without this pattern? Get Cost Centers, Get Ledger Books, Get Source Books, Get Transactions, Get MTMs, Get Legs, Get Cost Centers: hop after hop, spread over the network and over time.
But by balancing Replication and Partitioning we don't need all those hops.
Stage 1: Focus on the where clause: Where Cost Centre = 'CC1'
Stage 1: Get the right keys to query the Facts. Select Transaction, MTM, ReferenceData From MTM, Transaction, Ref Where Cost Centre = 'CC1'. Join the Dimensions in the Query Layer; Transactions, MTMs and Cashflows stay partitioned.
Stage 2: Cluster join to get the Facts: join the Facts across the cluster.
Stage 2: Join the facts together efficiently, as we know they are collocated
Stage 3: Augment the raw Facts with the relevant Dimensions.
Stage 3: Bind the relevant dimensions to the result
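The three stages can be sketched with plain in-process maps standing in for the replicated dimension caches and the partitioned fact store. All the data and names here are invented for illustration:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the staged query: dimensions are replicated (in-process),
// facts are partitioned and fetched by their partitioning key.
public class StagedQuery {
    // Replicated dimension: cost centre -> trade ids, present on every query node.
    static final Map<String, List<Long>> tradeIdsByCostCentre =
            Map.of("CC1", List.of(42L, 43L));
    // Partitioned facts, keyed by the common partitioning key (trade id).
    static final Map<Long, String> transactionsByTradeId =
            Map.of(42L, "txn-42", 43L, "txn-43", 99L, "txn-99");
    // Another replicated dimension used to augment the result.
    static final Map<String, String> counterpartyByCostCentre =
            Map.of("CC1", "ACME");

    static List<String> query(String costCentre) {
        // Stage 1: resolve the where-clause against replicated
        // dimensions to get partitioning keys. No network hops.
        List<Long> keys = tradeIdsByCostCentre.getOrDefault(costCentre, List.of());
        // Stage 2: fetch the collocated facts by partitioning key.
        List<String> facts = keys.stream()
                .map(transactionsByTradeId::get)
                .collect(Collectors.toList());
        // Stage 3: bind dimension data to each fact, again in-process.
        String dim = counterpartyByCostCentre.get(costCentre);
        return facts.stream().map(f -> f + "/" + dim).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(query("CC1")); // [txn-42/ACME, txn-43/ACME]
    }
}
```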
Bringing it together: replicated Dimensions, partitioned Facts, Java client APIs. We never have to do a distributed join!
So all the big stuff is held partitioned, and we can join without shipping keys around and having intermediate results
We get to do this… [Diagram: the normalised Trader/Party/Trade graph.]
…and this… [Diagram: versions 1 to 4 of the graph.]
…and this… [Diagram: reconstituting a time slice.]
…without the problems of this…
…or this…
…all at the speed of this… well, almost!
But there is a fly in the ointment…
I lied earlier. These aren't all Facts. This one is a dimension: it has a different key to the Facts, and it's BIG.
We can't replicate really big stuff… we'll run out of space => Big Dimensions are a problem.
Fortunately there is a simple solution!
Whilst there are lots of these big dimensions, a large majority are never used. They are not all "connected".
If there are no Trades for Goldmans in the data store, then a Trade query will never need the Goldmans Counterparty
Looking at the Dimension data, some are quite large
But Connected Dimension Data is tiny by comparison
One recent independent study from the database community showed that 80% of data remains unused
So we only replicate 'Connected' or 'Used' dimensions
As data is written to the data store we keep our 'Connected Caches' up to date. [Diagram: as new Facts are added, the relevant Dimensions they reference are moved to the replicated dimension caches in the processing layer; fact storage remains partitioned.]
The Replicated Layer is updated by recursing through the arcs on the domain model when facts change
Saving a Trade causes all its 1st-level references to be triggered. [Diagram: a saved Trade triggers its Party Alias, Source Book and Ccy into the connected caches.]
This updates the connected caches
The process recurses through the object graph: the Party Alias pulls in its Party, which pulls in its Ledger Book.
'Connected Replication': a simple pattern which recurses through the foreign keys in the domain model, ensuring only 'Connected' dimensions are replicated
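A minimal sketch of that recursion, assuming the domain model's foreign-key arcs are available as a graph. The entity names are taken from the slides; the code itself is illustrative, not the ODC implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of Connected Replication: when a fact is saved, recurse through
// its foreign keys and collect every reachable dimension; only those
// "connected" dimensions are pushed to the replicated caches.
public class ConnectedReplication {
    // Foreign-key arcs of the domain model: entity -> referenced entities.
    static final Map<String, List<String>> refs = Map.of(
            "Trade", List.of("Party", "SourceBook", "Ccy"),
            "Party", List.of("PartyAlias", "LedgerBook"),
            "SourceBook", List.of(),
            "Ccy", List.of(),
            "PartyAlias", List.of(),
            "LedgerBook", List.of());

    static Set<String> connectedDimensions(String fact) {
        Set<String> connected = new LinkedHashSet<>();
        // Saving the fact triggers its 1st-level references first.
        Deque<String> toVisit = new ArrayDeque<>(refs.getOrDefault(fact, List.of()));
        while (!toVisit.isEmpty()) {
            String dim = toVisit.pop();
            if (connected.add(dim)) {            // visit each dimension once
                toVisit.addAll(refs.getOrDefault(dim, List.of()));
            }
        }
        return connected;                        // these get replicated
    }

    public static void main(String[] args) {
        System.out.println(connectedDimensions("Trade"));
    }
}
```

Dimensions unreachable from any fact (a counterparty with no trades, say) never enter the set, which is why only connected data is replicated.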
With 'Connected Replication' only 1/10th of the data needs to be replicated (on average).
Limitations of this approach
Conclusion
[A build-up of diagrams recapping the architecture, including Partitioned Storage.]
The End
  • I started a project back in 2004. It was a trading system back at Barcap. When it came to persisting our data there were three choices: Oracle, Sybase or SQL Server. A lot has changed in that time. Today we are far more likely to look at one of a variety of technologies to satisfy our need to store and re-retrieve our data. So how many of you use a traditional database? What about a distributed database like Oracle RAC? NoSQL? Do you use it with a database or stand-alone? What about an in-memory database, in production? Finally, what about distributed in-memory? This talk is about an in-memory database. It's not really a distributed cache, despite being implemented in Coherence, although you could call it one if you preferred. In truth it has a variety of elements that make it closer to what you might perceive to be a database. It is normalised: that is to say that it holds entities independently from one another and versions them as such. It has some basic guarantees of atomicity when writing certain groups of objects that are collocated. Most importantly, it is both fast and scalable regardless of the join criteria you impose on it, this being something fairly elusive in the world of distributed data storage. I have a few aims for today: I hope you will leave with a broader view of what stores are available to you and what is coming in the future. I hope you'll see the benefits that niche storage solutions can provide through simpler contracts between client and data store. I'd like you to understand the benefits of memory over disk.
  • A better example is Amazon: partition by user so orders and basket are held together; products will be shared by multiple users.
  • Big data sets are held distributed and only joined on the grid to collocated objects. Small data sets are held in replicated caches so they can be joined in process (only 'active' data is held).