Advanced Databases: Ben Stopford
 

  • I started a project back in 2004: a trading system at BarCap. When it came to persisting our data there were three choices: Oracle, Sybase or SQL Server. A lot has changed in that time. Today we are far more likely to look at one of a variety of technologies to satisfy our need to store and retrieve data. So how many of you use a traditional database? What about a distributed database like Oracle RAC? NoSQL? Do you use it with a database or stand-alone? What about an in-memory database, in production? Finally, what about distributed in-memory? This talk is about an in-memory database. It's not really a distributed cache, despite being implemented in Coherence, although you could call it one if you preferred. In truth it has a variety of elements that make it closer to what you might perceive to be a database. It is normalised: that is to say, it holds entities independently from one another and versions them as such. It has some basic guarantees of atomicity when writing certain groups of objects that are collocated. Most importantly, it is both fast and scalable regardless of the join criteria you impose on it, something fairly elusive in the world of distributed data storage. I have a few aims for today: I hope you will leave with a broader view of what stores are available to you and what is coming in the future. I hope you'll see the benefits that niche storage solutions can provide through simpler contracts between client and data store. And I'd like you to understand the benefits of memory over disk.
  • A better example is Amazon: partition by user, so orders and basket are held together. Products will be shared by multiple users.
  • Big data sets are held distributed and only joined on the grid to collocated objects. Small data sets are held in replicated caches so they can be joined in-process (only ‘active’ data is held).

Presentation Transcript

  • Data Storage for Extreme Use Cases: The Lay of the Land and a Peek at ODC. Ben Stopford, RBS
  • How fast is a HashMap lookup?
  • That's how long it takes light to travel a room
  • How fast is a database lookup?
  • That's how long it takes light to go to Australia and back
  • Computers really are very fast!
  • The problem is we're quite good at writing software that slows them down
  • Question: Is it fair to compare the performance of a Database with a HashMap?
  • Of course not…
  • Mechanical Sympathy: a latency scale running from a cross-continental round trip (ms), through an Ethernet ping and a 1MB disk/Ethernet read (μs), to RDMA over Infiniband, a 1MB main-memory read, and main-memory and L1/L2 cache references (ns). * An L1 ref is about 2 clock cycles or 0.7ns: the time it takes light to travel 20cm
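The scale of these numbers is easy to check in-process. Below is a rough, unscientific sketch (my own illustration, not from the talk) that times repeated `HashMap` lookups; `LookupTiming` and its method names are invented, and the result will vary wildly by JVM, hardware and warm-up:

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch: how cheap is an in-process HashMap lookup?
// Illustrative only; no JIT warm-up or statistical rigour here.
public class LookupTiming {
    public static long nanosPerLookup() {
        Map<Integer, String> map = new HashMap<>();
        for (int i = 0; i < 100_000; i++) map.put(i, "value-" + i);

        int iterations = 1_000_000;
        long sink = 0;                       // prevents dead-code elimination
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += map.get(i % 100_000).length();
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 0) throw new AssertionError("unexpected empty results");
        return elapsed / iterations;         // typically tens of nanoseconds
    }

    public static void main(String[] args) {
        System.out.println(nanosPerLookup() + " ns per lookup (approx.)");
    }
}
```

Even this naive measurement lands orders of magnitude below a disk seek or a network round trip, which is the point of the slide.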
  • Key Point #1: Simple computer programs, operating in a single address space, are extremely fast.
  • Why are there so many types of database these days? …because we need different architectures for different jobs
  • Times are changing
  • Traditional Database Architecture is Aging
  • The Traditional Architecture
  • The landscape: Traditional, Shared Disk, Shared Nothing, In Memory and Distributed In Memory architectures, with the contract getting simpler as we move across
  • Key Point #2: Different architectural decisions about how we store and access data are needed in different environments. Our ‘Context’ has changed
  • Simplifying the Contract
  • How big is the internet? 5 exabytes (which is 5,000 petabytes or 5,000,000 terabytes)
  • How big is an average enterprise database? 80% are < 1TB (in 2009)
  • The context of our problem has changed
  • Simplifying the Contract
  • Databases have huge operational overheads. Taken from “OLTP Through the Looking Glass, and What We Found There”, Harizopoulos et al.
  • Avoid that overhead with a simpler contract and by avoiding IO
  • Key Point #3: For the very top end data volumes a simpler contract is mandatory. ACID is simply not possible.
  • Key Point #3 (addendum): But we should always retain ACID properties if our use case allows it.
  • Options for scaling-out the traditional architecture
  • #1: The Shared Disk Architecture
  • #2: The Shared Nothing Architecture
  • Each machine is responsible for a subset of the records. Each record exists on only one machine.
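The routing idea behind shared nothing can be sketched in a few lines. This is a minimal illustration under my own assumptions (the `PartitionRouter` class and its simple hash-modulo scheme are invented; real grids use smarter partition maps and rebalancing):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of shared-nothing routing: each record lives on exactly one
// node, chosen deterministically from its key.
public class PartitionRouter {
    private final int nodeCount;
    private final Map<Integer, Map<String, Object>> nodes = new HashMap<>();

    public PartitionRouter(int nodeCount) {
        this.nodeCount = nodeCount;
        for (int i = 0; i < nodeCount; i++) nodes.put(i, new HashMap<>());
    }

    // Deterministic owner: the same key always routes to the same node.
    public int ownerOf(String key) {
        return Math.floorMod(key.hashCode(), nodeCount);
    }

    public void put(String key, Object value) {
        nodes.get(ownerOf(key)).put(key, value);
    }

    public Object get(String key) {
        return nodes.get(ownerOf(key)).get(key);  // one hop, no scatter-gather
    }

    public static void main(String[] args) {
        PartitionRouter router = new PartitionRouter(6);
        router.put("trade:97", "IBM 100@180.2");
        System.out.println(router.get("trade:97"));
    }
}
```

Because the owner is a pure function of the key, a single-record read costs exactly one hop; the trouble, as the later slides show, starts when related records hash to different nodes.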
  • #3: The In Memory Database (single address-space)
  • Databases must cache subsets of the data in memory
  • Not knowing what you don't know: 90% of the data is in cache, but the rest is on disk
  • If you can fit it ALL in memory you know everything!!
  • The architecture of an in memory database
  • Memory is at least 100x faster than disk: a cross-continental round trip and a 1MB disk/network read sit in the ms/μs range, while a 1MB main-memory read, a cross-network round trip and L1/L2 cache references sit at μs/ns. * An L1 ref is about 2 clock cycles or 0.7ns: the time it takes light to travel 20cm
  • Random vs. Sequential Access
  • This makes them very fast!!
  • The proof is in the stats: TPC-H benchmarks on a 1TB data set
  • So why haven't in-memory databases taken off?
  • Address-spaces are relatively small and of a finite, fixed size
  • Durability
  • One solution is distribution
  • Distributed In Memory (Shared Nothing)
  • Again we spread our data, but this time using only RAM.
  • Distribution solves our two problems
  • We get massive amounts of parallel processing
  • But at the cost of losing the single address space
  • The landscape again: Traditional, Shared Disk, Shared Nothing, In Memory and Distributed In Memory, with a simpler contract as we move across
  • Key Point #4: There are three key forces: No Disk (all data is held in RAM), Distribution (gain scalability through a distributed architecture) and Simplifying the contract (improve scalability by picking appropriate ACID properties)
  • These three non-functional themes lie behind the design of ODC, RBS's in-memory data warehouse
  • ODC
  • ODC represents a balance between throughput and latency
  • What is Latency?
  • What is Throughput?
  • Which is best for latency? The Shared Nothing (Distributed) Database or the Traditional In-Memory Database?
  • Which is best for throughput? The Shared Nothing (Distributed) Database or the Traditional In-Memory Database?
  • So why do we use distributed in-memory? In memory gives us latency; plentiful hardware gives us throughput
  • ODC: a Distributed, Shared Nothing, In Memory, Semi-Normalised, Realtime Graph DB. 450 processes, 2TB of RAM, with topic-based messaging as the system of record (persistence)
  • The Layers: an Access Layer (Java client APIs), a Query Layer, a Data Layer (Transactions, Mtms, Cashflows) and a Persistence Layer
  • Three Tools of Distributed Data Architecture: Indexing, Partitioning, Replication
  • How should we use these tools?
  • Replication puts data everywhere, but your storage is limited by the memory on a node
  • Partitioning scales: storage, bandwidth and processing all scale, but associating data in different partitions implies moving it
  • So we have some data. Our data is bound together in a model: Desks, Traders, Parties, Trades and related entities
  • Which we save..
  • Binding them back together involves a “distributed join” => lots of network hops
  • The hops have to be spread over network time
  • Lots of network hops makes it slow
  • OK – what if we held it all together? “Denormalised”
  • Hence denormalisation is FAST! (for reads)
  • Denormalisation implies the duplication of some sub-entities
  • …and that means managing consistency over lots of copies
  • …and all the duplication means you run out of space really quickly
  • Space issues are exaggerated further when data is versioned: each version repeats the whole graph (and you need versioning to do MVCC)
  • And reconstituting a previous time slice becomes very difficult.
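By contrast, when entities are versioned independently, a time slice is just "the latest version of each entity at or before version v". Here is a minimal sketch of that idea under my own assumptions (`VersionedStore` and its method names are invented for illustration; this is not ODC's actual implementation):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch: each entity keeps its own version history, so nothing is duplicated
// across versions and a previous time slice is a simple floor lookup per entity.
public class VersionedStore {
    private final Map<String, NavigableMap<Long, String>> history = new HashMap<>();

    public void put(String entityKey, long version, String value) {
        history.computeIfAbsent(entityKey, k -> new TreeMap<>()).put(version, value);
    }

    // The value of this entity as of the given version, or null if it did not exist yet.
    public String asOf(String entityKey, long version) {
        NavigableMap<Long, String> versions = history.get(entityKey);
        if (versions == null) return null;
        Map.Entry<Long, String> entry = versions.floorEntry(version);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        store.put("trade:1", 1, "NEW");
        store.put("trade:1", 3, "AMENDED");
        store.put("party:goldmans", 1, "GOLDMANS");  // unchanged since version 1
        // Slice at version 2: trade as of v1, party as of v1 — no graph copied.
        System.out.println(store.asOf("trade:1", 2) + " / " + store.asOf("party:goldmans", 2));
    }
}
```

The unchanged party is stored once and shared by every slice, which is exactly the space argument the slides make for normalisation.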
  • So we want to hold entities separately (normalised) to alleviate concerns around consistency and space usage
  • Remember this means the object graph will be split across multiple machines, with the data independently versioned
  • Binding them back together involves a “distributed join” => lots of network hops
  • Whereas in the denormalised model the join is already done
  • So what we want is the advantages of a normalised store at the speed of a denormalised one! This is what using Snowflake Schemas and the Connected Replication pattern is all about!
  • Looking more closely: why does normalisation mean we have to spread data around the cluster? Why can't we hold it all together?
  • It's all about the keys
  • We can collocate data with common keys, but where keys crosscut the only way to collocate is to replicate
  • We tackle this problem with a hybrid model: Trader and Party are replicated; Trade is partitioned
  • We adapt the concept of a Snowflake Schema.
  • Taking the concept of Facts and Dimensions
  • Everything starts from a Core Fact (Trades for us)
  • Facts are big, dimensions are small
  • Facts have one key that relates them all (used to partition)
  • Dimensions have many keys (which crosscut the partitioning key)
  • Looking at the data: Facts => big, with common keys; Dimensions => small, with crosscutting keys
  • We remember we are a grid. We should avoid the distributed join.
  • … so we only want to ‘join’ data that is in the same process, using a key assignment policy (e.g. KeyAssociations in Coherence) so that Trades and MTMs share a common key
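The key-association idea can be sketched in plain Java. Coherence's real `KeyAssociation` interface works similarly (a key exposes an "associated key" that drives partitioning), but everything below — the `GridKey` interface, the key records and `ownerOf` — is my own illustrative stand-in, not the Coherence API:

```java
// Sketch of key association: a child fact's key declares the parent key it
// should be collocated with, and the router partitions on that associated key.
public class AssociatedKeys {
    public interface GridKey {
        Object associatedKey();  // the key that actually drives partitioning
    }

    public record TradeKey(long tradeId) implements GridKey {
        public Object associatedKey() { return tradeId; }
    }

    public record MtmKey(long mtmId, long tradeId) implements GridKey {
        public Object associatedKey() { return tradeId; }  // collocate with the parent trade
    }

    public static int ownerOf(GridKey key, int partitions) {
        return Math.floorMod(key.associatedKey().hashCode(), partitions);
    }

    public static void main(String[] args) {
        TradeKey trade = new TradeKey(42);
        MtmKey mtm = new MtmKey(9001, 42);
        // Same partition => a trade and its MTMs can be joined in-process.
        System.out.println(ownerOf(trade, 16) == ownerOf(mtm, 16));  // true
    }
}
```

Because both keys report the trade id as their associated key, the trade and all its MTMs land in the same partition, so the join never leaves the process.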
  • So we prescribe different physical storage for Facts and Dimensions: Trader and Party replicated, Trade partitioned
  • Facts are partitioned, dimensions are replicated: the Query Layer holds Trader and Party, while the Data Layer holds Transactions, Mtms and Cashflows in partitioned Fact storage
  • Facts are partitioned, dimensions are replicated: Dimensions (replicate); Facts — Transactions, Mtms, Cashflows — (distribute/partition)
  • The data volumes back this up as a sensible hypothesis. Facts: big => distribute. Dimensions: small => replicate
  • Key Point: We use a variant on a Snowflake Schema to partition big entities that can be related via a partitioning key, and replicate small stuff whose keys can't map to our partitioning key.
  • Replicate / Distribute
  • So how does this help us to run queries without distributed joins? Select Transaction, MTM, ReferenceData From MTM, Transaction, Ref Where Cost Centre = ‘CC1’
  • What would this look like without this pattern? Get Cost Centers, get Ledger Books, get Source Books, get Transactions, get MTMs, get Legs, get Cost Centers: hop after hop, spread over network time
  • But by balancing Replication and Partitioning we don't need all those hops
  • Stage 1: Focus on the where clause: Where Cost Centre = ‘CC1’
  • Stage 1: Get the right keys to query the Facts by joining the dimensions in the Query Layer
  • Stage 2: Cluster join to get the Facts: Transactions, Mtms and Cashflows are joined across the cluster in the partitioned layer
  • Stage 2: Join the facts together efficiently as we know they are collocated
  • Stage 3: Augment the raw Facts with the relevant Dimensions, again joining in the Query Layer
  • Stage 3: Bind the relevant dimensions to the result
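The three stages can be walked through on toy data. The sketch below is my own miniature (the `SnowflakeQuery` class, the `bookToCostCentre` dimension and the `Trade` fact are all invented), collapsing the cluster join of Stage 2 into a local filter for clarity:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Toy walk-through of the three query stages:
// 1) join replicated dimensions in the query layer to resolve keys,
// 2) fetch the matching facts (a collocated cluster join in the real system),
// 3) bind dimension data back onto each fact.
public class SnowflakeQuery {
    // Replicated dimension: available in-process on every query node.
    static final Map<String, String> bookToCostCentre =
            Map.of("bookA", "CC1", "bookB", "CC2");

    // Partitioned facts, keyed by trade id; each trade carries its book key.
    public record Trade(long id, String book, double mtm) {}
    static final Map<Long, Trade> facts = Map.of(
            1L, new Trade(1, "bookA", 10.5),
            2L, new Trade(2, "bookB", -3.2),
            3L, new Trade(3, "bookA", 7.0));

    public static List<String> query(String costCentre) {
        // Stage 1: dimension join in the query layer -> books for this cost centre.
        Set<String> books = bookToCostCentre.entrySet().stream()
                .filter(e -> e.getValue().equals(costCentre))
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
        // Stages 2 and 3: fetch matching facts and bind the dimension back on.
        return facts.values().stream()
                .filter(t -> books.contains(t.book()))
                .map(t -> "trade " + t.id() + " mtm=" + t.mtm() + " costCentre=" + costCentre)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(query("CC1"));  // trades 1 and 3
    }
}
```

The point of the pattern is that Stage 1 runs entirely against replicated, in-process data, so the only cross-machine work is the collocated fact fetch.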
  • Bringing it together: replicated Dimensions and partitioned Facts behind a Java client API. We never have to do a distributed join!
  • So all the big stuff is held partitioned, and we can join without shipping keys around and having intermediate results
  • We get to do this…
  • …and this…
  • ..and this..
  • …without the problems of this…
  • …or this..
  • ..all at the speed of this… well almost!
  • But there is a fly in the ointment…
  • I lied earlier. These aren't all Facts. One of these is a dimension: it has a different key to the Facts, and it's BIG
  • We can't replicate really big stuff… we'll run out of space => big Dimensions are a problem.
  • Fortunately there is a simple solution!
  • Whilst there are lots of these big dimensions, a large majority are never used. They are not all “connected”.
  • If there are no Trades for Goldmans in the data store then a Trade Query will never need the Goldmans Counterparty
  • Looking at the Dimension data, some are quite large
  • But Connected Dimension Data is tiny by comparison
  • One recent independent study from the database community showed that 80% of data remains unused
  • So we only replicate‘Connected’ or ‘Used’ dimensions
  • As data is written to the data store we keep our ‘Connected Caches’ up to date: as new Facts are added, the relevant Dimensions that they reference are moved from partitioned Fact storage into the replicated Dimension Caches in the processing layer
  • The Replicated Layer is updated by recursing through the arcs on the domain model when facts change
  • Saving a trade causes all its 1st-level references (e.g. Party Alias, Source Book, Ccy) to be triggered
  • This updates the connected caches
  • The process recurses through the object graph (e.g. on to Party and Ledger Book)
  • ‘Connected Replication’: a simple pattern which recurses through the foreign keys in the domain model, ensuring only ‘Connected’ dimensions are replicated
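The recursion itself is a plain graph walk. Here is a minimal sketch under my own assumptions (the `ConnectedReplication` class, the `foreignKeys` map and all the entity key names are invented for illustration):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the Connected Replication walk: when a fact is saved, recurse
// through its foreign keys and replicate only the dimensions it connects to.
public class ConnectedReplication {
    // Toy domain model: entity key -> keys of the dimensions it references.
    static final Map<String, List<String>> foreignKeys = Map.of(
            "trade:1", List.of("partyAlias:pa1", "sourceBook:sb1"),
            "partyAlias:pa1", List.of("party:goldmans"),
            "sourceBook:sb1", List.of("ledgerBook:lb1"),
            "party:goldmans", List.of(),
            "ledgerBook:lb1", List.of());

    // Every dimension reachable from the saved fact (the fact itself excluded).
    public static Set<String> connectedDimensions(String factKey) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> toVisit =
                new ArrayDeque<>(foreignKeys.getOrDefault(factKey, List.of()));
        while (!toVisit.isEmpty()) {
            String key = toVisit.pop();
            if (seen.add(key)) {  // guard against cycles in the model
                toVisit.addAll(foreignKeys.getOrDefault(key, List.of()));
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Saving trade:1 pulls its alias, book and their parents into the
        // replicated layer; dimensions no fact references are never copied.
        System.out.println(connectedDimensions("trade:1"));
    }
}
```

A dimension that no fact reaches simply never enters the replicated layer, which is how the pattern keeps replication to the "connected" tenth of the data.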
  • With ‘Connected Replication’ only 1/10th of the data needs to be replicated (on average).
  • Limitations of this approach
  • Conclusion
  • The End