Dissecting Scalable Database Architectures
Doug Judd
CEO, Hypertable Inc.
Talk Outline
• Scalable “NoSQL” Architectures
• Next-generation Architectures
• Future Evolution - Hardware Trends
Scalable NoSQL Architecture Categories
• Auto-sharding
• Dynamo
• Bigtable
Auto-Sharding
Auto-Sharding
Auto-sharding Systems
• Oracle NoSQL Database
• MongoDB
Dynamo
• “Dynamo: Amazon’s Highly Available Key-value Store” – Amazon.com, 2007
• Distributed Hash Table (DHT)
• Handles inter-datacenter replication
• Designed for High Write Availability
Consistent Hashing
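This slide is a diagram in the original deck. As a rough illustration of the idea (not Dynamo's actual implementation), here is a minimal Python sketch of a hash ring with virtual nodes; the class and node names are made up for the example.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring: nodes and keys hash onto the same circular
    space; a key is owned by the first node at or after it, clockwise."""

    def __init__(self, nodes=(), vnodes=100):
        self._ring = []           # sorted list of (hash, node)
        self._vnodes = vnodes     # virtual nodes smooth out the load
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for (h, n) in self._ring if n != node]

    def lookup(self, key):
        h = self._hash(key)
        idx = bisect.bisect_left(self._ring, (h, ""))   # first point >= h
        return self._ring[idx % len(self._ring)][1]     # wrap around the ring

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.lookup("user:1234"))   # only ~1/N of keys move when a node joins or leaves
```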
Eventual Consistency
Vector Clocks
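This slide is also a diagram; the speaker notes at the end of this page give the comparison rule (a version whose counters are all less than or equal to another's is an ancestor and can be dropped). A minimal sketch of that rule, assuming a clock is simply a per-node counter map and using hypothetical node names:

```python
def increment(clock, node):
    """Return a copy of the clock with `node`'s counter bumped."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def descends(a, b):
    """True if clock `a` dominates clock `b`, i.e. `b` is an ancestor of `a`
    (every counter in `b` is <= the matching counter in `a`)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(v1, v2):
    """Keep the newer version if one descends from the other; otherwise the
    versions are concurrent and both must be kept for the client to resolve."""
    if descends(v1, v2):
        return [v1]
    if descends(v2, v1):
        return [v2]
    return [v1, v2]          # conflicting siblings

c1 = increment({}, "sx")                   # {"sx": 1}
c2 = increment(c1, "sy")                   # {"sx": 1, "sy": 1}
print(reconcile(c2, c1))                   # c1 is an ancestor -> only c2 survives
print(reconcile(increment(c1, "sz"), c2))  # concurrent -> both kept
```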
Dynamo-based Systems
• Cassandra
• DynamoDB
• Riak
• Voldemort
Bigtable
• “Bigtable: A Distributed Storage System for Structured Data” - Google, Inc., OSDI ’06
• Ordered
• Consistent
• Not designed to handle inter-datacenter replication
Google Architecture
Google File System
Google File System
Table: Growth Process
Scaling (part 1)
Scaling (part 2)
Scaling (part 3)
System overview
Database Model
• Sparse, two-dimensional table with cell versions
• Cells are identified by a 4-part key
  • Row (string)
  • Column Family
  • Column Qualifier (string)
  • Timestamp
Table: Visual Representation
Table: Actual Representation
Anatomy of a Key
• Column Family is represented with 1 byte
• Timestamp and revision are stored big-endian, ones-complement
• Simple byte-wise comparison
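A hypothetical encoding that illustrates the point of this slide: if the timestamp is stored big-endian and ones-complement, a plain byte-wise comparison sorts newer cell versions first. The exact field layout below is an assumption for illustration, not Hypertable's actual on-disk key format.

```python
import struct

def encode_key(row: str, cf: int, qualifier: str, timestamp: int) -> bytes:
    """Hypothetical key layout: row \\0 cf-byte qualifier \\0 ~timestamp.
    The timestamp is stored big-endian, ones-complement, so a plain
    byte-wise sort orders newer versions of a cell first."""
    flipped = (~timestamp) & 0xFFFFFFFFFFFFFFFF     # ones-complement, 64-bit
    return (row.encode() + b"\0" +
            bytes([cf]) +
            qualifier.encode() + b"\0" +
            struct.pack(">Q", flipped))             # big-endian unsigned 64-bit

k_old = encode_key("com.example/index.html", 2, "anchor", 1_000)
k_new = encode_key("com.example/index.html", 2, "anchor", 2_000)
assert k_new < k_old    # byte-wise comparison: newest timestamp sorts first
```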
Log Structured Merge Tree
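The LSM tree is shown as a diagram in the deck; the following is a rough sketch of the general idea (an in-memory map that is periodically frozen into immutable sorted runs, with reads consulting the newest data first), not Hypertable's actual implementation.

```python
class ToyLSMTree:
    """Minimal LSM sketch: writes go to an in-memory map; when it grows too
    large it is frozen into an immutable, sorted run.  Reads check the
    memtable first, then the runs from newest to oldest."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []                 # list of sorted lists of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.runs.append(sorted(self.memtable.items()))   # flush
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):        # newest run wins
            for k, v in run:                   # real systems binary-search blocks
                if k == key:
                    return v
        return None

lsm = ToyLSMTree()
for i in range(10):
    lsm.put(f"row{i}", i)
print(lsm.get("row3"))    # found in a flushed run
```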
Range Server: CellStore
• Sequence of 65K blocks of compressed key/value pairs
Bloom Filter
• Associated with each Cell Store
• Dramatically reduces disk access
• Tells you if key is definitively not present
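A toy Bloom filter illustrating the property in the last bullet: a negative answer is definitive, so the range server can skip a CellStore entirely, and only a "maybe" requires touching disk. The sizes and hash choices below are arbitrary assumptions.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions set k bits per key.  Any clear bit
    on lookup means the key is definitively not present; all bits set means
    "maybe present" (false positives are possible)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter()
bf.add("row17:cf1:qualifier")
assert bf.might_contain("row17:cf1:qualifier")    # definitely added
# A miss lets the range server skip reading the CellStore from disk:
print(bf.might_contain("row99:cf1:qualifier"))    # almost certainly False
```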
Request Routing
Bigtable-based Systems
• Accumulo
• HBase
• Hypertable
Next-generation Architectures
• PNUTS (Yahoo, Inc.)
• Spanner (Google, Inc.)
• Dremel (Google, Inc.)
PNUTS
• Geographically distributed database
• Designed for low-latency access
• Manages hashed or ordered tables of records
  • Hashed tables implemented via proprietary disk-based hash
  • Ordered tables implemented with MySQL+InnoDB
• Not optimized for bulk storage (images, videos, …)
• Runs as a hosted service inside Yahoo!
PNUTS System Architecture
Record-level Mastering
• Provides per-record timeline consistency
• Master is adaptively changed to suit workload
• Region names are two bytes associated with each record
PNUTS API
• Read-any
• Read-critical(required_version)
• Read-latest
• Write
• Test-and-set-write(required_version)
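The semantics of these calls are described in the PNUTS paper; the toy single-record model below is only meant to illustrate them. The class, its in-memory "history", and the simulated replica lag are inventions for the example, not Yahoo!'s client library.

```python
from typing import Any, Optional, Tuple

class ToyPnutsRecord:
    """In-memory toy of PNUTS' per-record API.  A real deployment keeps
    replicas in multiple regions; here `history` stands in for the record's
    timeline and `replica_lag` simulates reading a stale replica."""

    def __init__(self):
        self.history = []          # history[v-1] = value written at version v
        self.replica_lag = 0       # how many versions the local replica is behind

    def write(self, value: Any) -> int:
        self.history.append(value)
        return len(self.history)                       # new version number

    def test_and_set_write(self, value: Any, required_version: int) -> Optional[int]:
        if len(self.history) != required_version:      # someone wrote in between
            return None
        return self.write(value)

    def read_latest(self) -> Tuple[Any, int]:
        return self.history[-1], len(self.history)

    def read_any(self) -> Tuple[Any, int]:
        v = len(self.history) - self.replica_lag       # possibly stale version
        return self.history[v - 1], v

    def read_critical(self, required_version: int) -> Tuple[Any, int]:
        value, v = self.read_any()
        if v >= required_version:
            return value, v
        return self.read_latest()                      # fall back to the master

rec = ToyPnutsRecord()
v1 = rec.write({"cart": ["book"]})
rec.replica_lag = 1
v2 = rec.write({"cart": ["book", "pen"]})
print(rec.read_any())                  # may return the stale version 1
print(rec.read_critical(v2))           # guaranteed at least version 2
print(rec.test_and_set_write({"cart": []}, required_version=1))   # None: stale
```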
Spanner
• Globally distributed database (cross-datacenter replication)
• Synchronously Replicated
• Externally-consistent distributed transactions
• Globally distributed transaction management
• SQL-based query language
Spanner Server Organization
Spanserver
• Manages 100-1000 tablets
• A tablet is similar to a Bigtable tablet and manages a bag of mappings: (key:string, timestamp:int64) -> string
• Single Paxos state machine implemented on top of each tablet
• Tablet may contain multiple directories
  • Set of contiguous keys that share a common prefix
  • Unit of data placement
  • Can be moved between Tablets for performance reasons
TrueTime
• Universal Clock
• Set of time master servers per-datacenter
  • GPS clock via GPS receivers with dedicated antennas
  • Atomic clock
• Time daemon runs on every machine
• TrueTime API:
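A toy version of the mapping in the second bullet, (key:string, timestamp:int64) -> string, with point-in-time reads; this illustrates the data model only, not Spanner's Paxos-replicated tablets.

```python
import bisect

class ToyTablet:
    """Bag of (key:string, timestamp:int64) -> string mappings, with
    point-in-time reads ("what was the value of key as of timestamp t?")."""

    def __init__(self):
        self.versions = {}        # key -> sorted list of (timestamp, value)

    def write(self, key: str, timestamp: int, value: str):
        bisect.insort(self.versions.setdefault(key, []), (timestamp, value))

    def read(self, key: str, timestamp: int):
        """Return the newest value with a timestamp <= the read timestamp."""
        entries = self.versions.get(key, [])
        idx = bisect.bisect_right(entries, (timestamp, chr(0x10FFFF)))
        return entries[idx - 1][1] if idx > 0 else None

t = ToyTablet()
t.write("users/42/name", 100, "Ada")
t.write("users/42/name", 200, "Ada L.")
print(t.read("users/42/name", 150))   # -> "Ada"   (snapshot read at t=150)
print(t.read("users/42/name", 250))   # -> "Ada L."
```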
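The API itself appears as a table in the original slide; per the Spanner paper it consists of TT.now(), which returns an interval [earliest, latest] guaranteed to contain absolute time, plus TT.after(t) and TT.before(t). The mock below is a sketch with an arbitrary assumed uncertainty, including the "commit wait" idiom the API enables.

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float
    latest: float

class ToyTrueTime:
    """Mock of the TrueTime API: now() returns an interval that the real
    system guarantees contains absolute time."""

    def __init__(self, uncertainty_ms: float = 4.0):   # assumed epsilon
        self.eps = uncertainty_ms / 1000.0

    def now(self) -> TTInterval:
        t = time.time()
        return TTInterval(earliest=t - self.eps, latest=t + self.eps)

    def after(self, t: float) -> bool:
        """True if t has definitely passed."""
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        """True if t has definitely not arrived yet."""
        return self.now().latest < t

tt = ToyTrueTime()
commit_ts = tt.now().latest
while not tt.after(commit_ts):      # Spanner-style "commit wait"
    time.sleep(0.001)
print("safe to report commit at", commit_ts)
```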
Spanner Software Stack
Externally-consistent Operations
• Read-Write Transaction
• Read-Only Transaction
• Snapshot Read (client-provided timestamp)
• Snapshot Read (client-provided bound)
• Schema Change Transaction
Dremel
• Scalable, interactive ad-hoc query system
• Designed to operate on read-only data
• Handles nested data (Protocol Buffers)
• Can run aggregation queries over trillion-row tables in seconds
Columnar Storage Format
• Novel format for storing lists of nested records (Protocol Buffers)
• Highly space-efficient
• Algorithm for dissecting list of nested records into columns
• Algorithm for reassembling columns into list of records
Multi-level Execution Trees
• Execution model for one-pass aggregations returning small and medium-sized results (very common at Google)
• Query gets re-written as it passes down the execution tree.
• On the way up, intermediate servers perform a parallel aggregation of partial results.
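A toy sketch of the bullets above for a SUM/COUNT-style aggregation: leaves compute partial aggregates over their tablets and each level merges partial results on the way back up. The tree layout and field names are made up; this is not Dremel's actual code.

```python
# Toy multi-level aggregation tree for a query like
#   SELECT SUM(x), COUNT(*) FROM T
# Leaves scan their tablets; intermediate nodes merge partial results.

def leaf_execute(tablet_rows):
    """Leaf server: compute a partial aggregate over local rows."""
    return {"sum": sum(r["x"] for r in tablet_rows),
            "count": len(tablet_rows)}

def merge(partials):
    """Intermediate/root server: combine partial aggregates."""
    return {"sum": sum(p["sum"] for p in partials),
            "count": sum(p["count"] for p in partials)}

def execute(node):
    """Recursively push the query down the tree, aggregate on the way up."""
    if node["kind"] == "leaf":
        return leaf_execute(node["tablet"])
    return merge([execute(child) for child in node["children"]])

tree = {"kind": "inner", "children": [
    {"kind": "inner", "children": [
        {"kind": "leaf", "tablet": [{"x": 1}, {"x": 2}]},
        {"kind": "leaf", "tablet": [{"x": 3}]}]},
    {"kind": "leaf", "tablet": [{"x": 10}, {"x": 20}]}]}

result = execute(tree)
print(result, "avg =", result["sum"] / result["count"])   # {'sum': 36, 'count': 5} avg = 7.2
```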
Performance
Example Queries
• SELECT SUM(CountWords(txtField)) / COUNT(*) FROM T1
• SELECT country, SUM(item.amount) FROM T2 GROUP BY country
• SELECT domain, SUM(item.amount) FROM T2 WHERE domain CONTAINS ’.net’ GROUP BY domain
• SELECT COUNT(DISTINCT a) FROM T5
Future Evolution - Hardware Trends
• SSD Drives
• Disk Drives
• Networking
Flash Memory Rated Lifetime (P/E Cycles)
Source: Bleak Future of NAND Flash Memory, Grupp et al., FAST 2012
Flash Memory Average BER at Rated Lifetime
Source: Bleak Future of NAND Flash Memory, Grupp et al., FAST 2012
Disk: Areal Density Trend
Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011
Disk: Maximum Sustained Bandwidth Trend
Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011
Time Required to Sequentially Fill a SATA Drive
Average Seek Time
Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011
Average Rotational Latency
Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011
Time Required to Randomly Read a SATA Drive
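The two "time required" slides are charts; the arithmetic behind them is simple, and the sketch below reproduces it for one illustrative drive. Every drive parameter here is an assumption chosen for the example, not a number taken from the slides.

```python
# Back-of-the-envelope arithmetic behind the "time to fill / time to randomly
# read a SATA drive" charts.  All drive parameters are illustrative assumptions.

TB = 10**12

capacity_bytes   = 4 * TB        # assumed drive capacity
sustained_bw     = 150e6         # bytes/s sequential, assumed
avg_seek_s       = 0.008         # 8 ms average seek, assumed
avg_rotational_s = 0.004         # ~4.2 ms at 7,200 RPM, rounded
random_read_size = 64 * 1024     # bytes per random read, assumed

# Sequential fill: capacity / sustained bandwidth
fill_hours = capacity_bytes / sustained_bw / 3600
print(f"sequential fill: {fill_hours:.1f} hours")

# Random read: one seek + rotational latency (+ transfer) per read
reads = capacity_bytes / random_read_size
per_read_s = avg_seek_s + avg_rotational_s + random_read_size / sustained_bw
random_days = reads * per_read_s / 86400
print(f"random read of whole drive: {random_days:.0f} days")
```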
Ethernet
• 10GbE
  • Starting to replace 1GbE for server NICs
  • De facto network port for new servers in 2014
• 40GbE
  • Data center core & aggregation
  • Top-of-rack server aggregation
• 100GbE
  • Service Provider core and aggregation
  • Metro and large Campus core
  • Data center core & aggregation
• No technology currently exists to transport 40 Gbps or 100 Gbps as a single stream over existing copper or fiber
• 40GbE & 100GbE solved using either 4 or 10 parallel 10GbE “lanes”
10GbE Adoption Curve (?)
Source: CREHAN RESEARCH Inc. © Copyright 2012
The End
Thank you!
Presentation by Doug Judd, co-founder of Hypertable Inc, at Groupon office in Palo Alto, CA on November 15th, 2012.

Published in: Technology
0 Comments
6 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
2,757
On SlideShare
0
From Embeds
0
Number of Embeds
19
Actions
Shares
0
Downloads
58
Comments
0
Likes
6
Embeds 0
No embeds

No notes for slide
• Auto-sharding – Strengths: multiple servers can satisfy reads. Weaknesses: 1. failover, 2. mapping service runs on a single machine, 3. irregular growth patterns can cause imbalance.
• Dynamo – Designed for Amazon's “Shopping Cart” service.
• Consistent hashing – An important part of any DHT is the mechanism by which keys get mapped … Supports incremental scalability; a gossip protocol is used to propagate membership changes.
• Eventual consistency – Dynamo was designed for high write availability (the Shopping Cart service). This is also how it handles inter-datacenter replication. Read repair (the downside is that reading becomes more expensive).
• Vector clocks – Dynamo uses vector clocks to assist with the reconciliation of divergent copies of objects in the system. Vector clocks track the revision history of objects for the purposes of reconciliation in the event of a divergence. Any storage node in Dynamo is eligible to receive client get and put operations for any key, and one vector clock is associated with every version of every object. If the counters on the first object's clock are less than or equal to all of the counters in the second clock, then the first is an ancestor of the second and can be forgotten.
• Dynamo-based systems – Strengths: low-latency writes, handles inter-datacenter replication. Weaknesses: not ordered; read repair can impact read latency.
• Bigtable – Strengths: ordered, consistent. Weaknesses: does not handle inter-datacenter replication.
• PNUTS system architecture – This diagram shows two regions … Data replication happens via the Yahoo! Message Broker (YMB), a pub/sub system that offers reliable message delivery. Storage units manage tablets. The tablet controller contains the authoritative mapping of tablets to storage units and also orchestrates tablet movement.
• Record-level mastering – PNUTS offers relaxed consistency guarantees across the dataset, but per-record timeline consistency through the use of record-level mastering. Yahoo! has found this to be sufficient for most of their web use cases.
• PNUTS API – Read-any reads any (possibly stale) recent version, at low latency. Read-critical(required_version) reads a version of the record that is at least as new as required_version.
• Spanner server organization – The placement driver handles automated movement of data across zones.
• TrueTime – Time daemons synchronize with time masters every 30 seconds and apply a worst-case drift rate of 200 microseconds/second between synchronizations. The reason for the two clock types is that they have different failure modes.
• Externally-consistent operations – Read-write transactions use pessimistic concurrency control (a lock table). Reads are lock-free and can happen at any replica that is sufficiently up-to-date.
• Multi-level execution trees – Another key aspect of Dremel is how it handles certain common aggregation queries.
• Flash memory – SLC: Single-Level Cell.
• Disk areal density – Kryder's law.
• Disk bandwidth – 40% CAGR for areal density, 15% CAGR for sustained write bandwidth.
• Rotational latency – drive spindle speeds of 7,200 RPM, 10,000 RPM, and 15,000 RPM.