Dissecting Scalable Database Architectures

Presentation by Doug Judd, co-founder of Hypertable Inc., at the Groupon office in Palo Alto, CA, on November 15th, 2012.


1. Dissecting Scalable Database Architectures
   Doug Judd, CEO, Hypertable Inc.
2. Talk Outline
   • Scalable “NoSQL” Architectures
   • Next-generation Architectures
   • Future Evolution - Hardware Trends
3. Scalable NoSQL Architecture Categories
   • Auto-sharding
   • Dynamo
   • Bigtable
4. Auto-Sharding
5. Auto-Sharding
6. Auto-sharding Systems
   • Oracle NoSQL Database
   • MongoDB
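   The diagrams on the two preceding slides are not reproduced here. The core idea of auto-sharding is that the system itself partitions a table across servers and routes each key to the shard that owns it. A minimal sketch of hash-based routing follows; the names (NUM_SHARDS, SHARD_HOSTS, route) and the fixed-modulo scheme are illustrative assumptions, not any particular product's API, and real systems also split and rebalance shards automatically.

    import hashlib

    # Illustrative only: a fixed-modulo router. Real auto-sharding systems
    # also split, migrate, and rebalance shards as data and load grow.
    NUM_SHARDS = 4
    SHARD_HOSTS = [f"shard{i}.example.com" for i in range(NUM_SHARDS)]

    def route(key: str) -> str:
        """Map a record key to the host currently responsible for it."""
        digest = hashlib.md5(key.encode("utf-8")).digest()
        shard_id = int.from_bytes(digest[:8], "big") % NUM_SHARDS
        return SHARD_HOSTS[shard_id]

    print(route("user:1234"))   # every client computes the same placement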
7. Dynamo
   • “Dynamo: Amazon’s Highly Available Key-value Store” – Amazon.com, 2007
   • Distributed Hash Table (DHT)
   • Handles inter-datacenter replication
   • Designed for High Write Availability
8. Consistent Hashing
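   The slide itself is a diagram. Below is a minimal sketch of the ring placement described in the Dynamo paper: nodes and keys hash onto the same circular space, and a key is stored on the first node encountered clockwise. Virtual nodes and replication to the next N-1 nodes are omitted for brevity.

    import bisect
    import hashlib

    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.md5(value.encode()).digest(), "big")

    class ConsistentHashRing:
        """Toy consistent-hash ring: a key goes to the first node clockwise."""

        def __init__(self, nodes):
            self._points = sorted((_hash(n), n) for n in nodes)
            self._hashes = [h for h, _ in self._points]

        def node_for(self, key: str) -> str:
            idx = bisect.bisect(self._hashes, _hash(key)) % len(self._points)
            return self._points[idx][1]

    ring = ConsistentHashRing(["A", "B", "C"])
    print(ring.node_for("shopping-cart:42"))
    # Adding or removing a node only remaps the keys adjacent to it on the ring.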
9. Eventual Consistency
10. Vector Clocks
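   Again the slide is a diagram. A small sketch of the vector-clock bookkeeping Dynamo uses to detect conflicting versions: each version carries one counter per node that has coordinated a write on it, and one version supersedes another only if it is at least as new on every node. The function names here are illustrative.

    def increment(clock: dict, node: str) -> dict:
        """Return a copy of the clock with `node`'s counter bumped."""
        out = dict(clock)
        out[node] = out.get(node, 0) + 1
        return out

    def descends(a: dict, b: dict) -> bool:
        """True if version `a` is causally equal to or newer than `b`."""
        return all(a.get(n, 0) >= c for n, c in b.items())

    v1 = increment({}, "Sx")      # written via node Sx
    v2 = increment(v1, "Sy")      # later write via Sy, based on v1
    v3 = increment(v1, "Sz")      # concurrent write via Sz, also based on v1

    print(descends(v2, v1))                          # True
    print(descends(v2, v3) or descends(v3, v2))      # False: a conflict,
    # so Dynamo returns both versions and the client reconciles them.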
11. Dynamo-based Systems
   • Cassandra
   • DynamoDB
   • Riak
   • Voldemort
12. Bigtable
   • “Bigtable: A Distributed Storage System for Structured Data” - Google, Inc., OSDI ’06
   • Ordered
   • Consistent
   • Not designed to handle inter-datacenter replication
13. Google Architecture
14. Google File System
15. Google File System
16. Table: Growth Process
17. Scaling (part 1)
18. Scaling (part 2)
19. Scaling (part 3)
20. System Overview
21. Database Model
   • Sparse, two-dimensional table with cell versions
   • Cells are identified by a 4-part key:
     • Row (string)
     • Column Family
     • Column Qualifier (string)
     • Timestamp
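   A minimal sketch of this data model as an in-memory map; the class and method names are illustrative, not Bigtable's or Hypertable's API. Each cell is addressed by the 4-part key, and a read returns the most recent version unless an explicit timestamp is supplied.

    class SparseTable:
        """Toy model: {(row, column_family, qualifier): {timestamp: value}}."""

        def __init__(self):
            self._cells = {}

        def put(self, row, family, qualifier, timestamp, value):
            self._cells.setdefault((row, family, qualifier), {})[timestamp] = value

        def get(self, row, family, qualifier, timestamp=None):
            versions = self._cells.get((row, family, qualifier), {})
            if not versions:
                return None
            ts = max(versions) if timestamp is None else timestamp
            return versions.get(ts)

    t = SparseTable()
    t.put("com.example/index.html", "anchor", "cnnsi.com", 1, "CNN")
    t.put("com.example/index.html", "anchor", "cnnsi.com", 2, "CNN Sports")
    print(t.get("com.example/index.html", "anchor", "cnnsi.com"))  # newest version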
22. Table: Visual Representation
23. Table: Actual Representation
24. Anatomy of a Key
   • Column Family is represented with 1 byte
   • Timestamp and revision are stored big-endian, ones' complement
   • Simple byte-wise comparison
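   A minimal sketch of why that encoding matters; the field order and delimiters below are illustrative, not Hypertable's exact on-disk layout. Storing the ones'-complement timestamp big-endian means a plain byte-wise comparison sorts newer versions of the same cell first.

    import struct

    def encode_key(row: bytes, family: int, qualifier: bytes, timestamp: int) -> bytes:
        """Illustrative serialization: byte-wise comparison orders keys by
        (row, family, qualifier) ascending, then by timestamp descending."""
        complemented_ts = ~timestamp & 0xFFFFFFFFFFFFFFFF    # ones' complement
        return (row + b"\x00" +
                bytes([family]) +
                qualifier + b"\x00" +
                struct.pack(">Q", complemented_ts))          # big-endian

    k_new = encode_key(b"row1", 1, b"qual", timestamp=200)
    k_old = encode_key(b"row1", 1, b"qual", timestamp=100)
    print(k_new < k_old)   # True: the newer version sorts first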
25. Log Structured Merge Tree
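   The slide is a diagram; here is a very small sketch of the log-structured-merge idea it depicts. Writes go into an in-memory map, which is periodically flushed as an immutable sorted run; reads consult the memtable first, then the runs from newest to oldest. A real implementation also binary-searches each run and periodically merges (compacts) runs, which is what the CellStores on the next slide correspond to.

    class TinyLSM:
        """Toy LSM tree: one memtable plus a list of sorted, immutable runs."""

        def __init__(self, memtable_limit=4):
            self.memtable = {}
            self.runs = []                 # newest run first
            self.limit = memtable_limit

        def put(self, key, value):
            self.memtable[key] = value
            if len(self.memtable) >= self.limit:
                self.runs.insert(0, sorted(self.memtable.items()))  # flush
                self.memtable = {}

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for run in self.runs:          # newest to oldest
                for k, v in run:
                    if k == key:
                        return v
            return None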
26. Range Server: CellStore
   • Sequence of 65 KB blocks of compressed key/value pairs
27. Bloom Filter
   • Associated with each CellStore
   • Dramatically reduces disk access
   • Tells you if a key is definitively not present
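   A minimal Bloom filter sketch illustrating the property on the slide: a negative answer is definitive, so the Range Server can skip the disk read entirely, while a positive answer may occasionally be a false positive. The sizes and hash scheme below are illustrative.

    import hashlib

    class BloomFilter:
        """Toy Bloom filter with k hash functions over an m-bit array."""

        def __init__(self, m=1024, k=3):
            self.m, self.k = m, k
            self.bits = bytearray(m // 8)

        def _positions(self, key: bytes):
            for i in range(self.k):
                h = hashlib.sha256(bytes([i]) + key).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, key: bytes):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key: bytes) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    bf = BloomFilter()
    bf.add(b"row17:tag")
    print(bf.might_contain(b"row17:tag"))   # True
    print(bf.might_contain(b"row99:tag"))   # almost certainly False -> skip disk read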
28. Request Routing
29. Bigtable-based Systems
   • Accumulo
   • HBase
   • Hypertable
30. Next-generation Architectures
   • PNUTS (Yahoo, Inc.)
   • Spanner (Google, Inc.)
   • Dremel (Google, Inc.)
31. PNUTS
   • Geographically distributed database
   • Designed for low-latency access
   • Manages hashed or ordered tables of records
     • Hashed tables implemented via a proprietary disk-based hash
     • Ordered tables implemented with MySQL+InnoDB
   • Not optimized for bulk storage (images, videos, …)
   • Runs as a hosted service inside Yahoo!
32. PNUTS System Architecture
33. Record-level Mastering
   • Provides per-record timeline consistency
   • The master is adaptively changed to suit the workload
   • The master's region name (two bytes) is stored with each record
34. PNUTS API
   • Read-any
   • Read-critical(required_version)
   • Read-latest
   • Write
   • Test-and-set-write(required_version)
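   A sketch of the conditional-write semantics the last call implies; the call names come from the API list above, but the plain dict standing in for the storage layer is an illustrative assumption. The write succeeds only if the record is still at the version the caller previously read.

    # Each record is stored as (version, value); the version advances on every write.
    store = {"alice": (3, {"status": "hello"})}

    def test_and_set_write(key, required_version, new_value):
        """Apply the write only if the record is still at required_version."""
        current_version, _ = store.get(key, (0, None))
        if current_version != required_version:
            return False                      # lost the race; caller must re-read
        store[key] = (current_version + 1, new_value)
        return True

    print(test_and_set_write("alice", 3, {"status": "out to lunch"}))  # True
    print(test_and_set_write("alice", 3, {"status": "stale write"}))   # False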
35. Spanner
   • Globally distributed database (cross-datacenter replication)
   • Synchronously replicated
   • Externally-consistent distributed transactions
   • Globally distributed transaction management
   • SQL-based query language
36. Spanner Server Organization
37. Spanserver
   • Manages 100-1000 tablets
   • A tablet is similar to a Bigtable tablet and manages a bag of mappings: (key:string, timestamp:int64) -> string
   • A single Paxos state machine is implemented on top of each tablet
   • A tablet may contain multiple directories
     • Set of contiguous keys that share a common prefix
     • Unit of data placement
     • Can be moved between tablets for performance reasons
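   A minimal sketch of the tablet's versioned mapping, (key:string, timestamp:int64) -> string, with a snapshot read that returns the latest value at or before a given timestamp. Paxos replication and directories are omitted; the class and method names are illustrative.

    import bisect
    from collections import defaultdict

    class Tablet:
        """Toy versioned map: key -> sorted list of (timestamp, value)."""

        def __init__(self):
            self._versions = defaultdict(list)

        def write(self, key: str, timestamp: int, value: str):
            bisect.insort(self._versions[key], (timestamp, value))

        def snapshot_read(self, key: str, timestamp: int):
            result = None
            for ts, value in self._versions[key]:   # ascending by timestamp
                if ts <= timestamp:
                    result = value
            return result

    t = Tablet()
    t.write("user/42", 100, "v1")
    t.write("user/42", 200, "v2")
    print(t.snapshot_read("user/42", 150))   # "v1": the state as of timestamp 150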
38. TrueTime
   • Universal clock
   • Set of time master servers per datacenter
     • GPS clocks via GPS receivers with dedicated antennas
     • Atomic clocks
   • Time daemon runs on every machine
   • TrueTime API (sketched below)
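   A sketch of the TrueTime interface described in the Spanner paper: TT.now() returns an interval [earliest, latest] guaranteed to contain absolute time, and a transaction's commit timestamp is not made visible until TT.after(commit_ts) holds ("commit wait"). The uncertainty value below is an illustrative stand-in for the bound the time masters actually provide.

    import time
    from dataclasses import dataclass

    @dataclass
    class TTInterval:
        earliest: float
        latest: float

    class TrueTime:
        """Toy TrueTime: a clock that admits bounded uncertainty (epsilon)."""

        def __init__(self, epsilon_seconds=0.004):   # illustrative ~4 ms bound
            self.epsilon = epsilon_seconds

        def now(self) -> TTInterval:
            t = time.time()
            return TTInterval(t - self.epsilon, t + self.epsilon)

        def after(self, t: float) -> bool:
            return self.now().earliest > t      # t has definitely passed

        def before(self, t: float) -> bool:
            return self.now().latest < t        # t has definitely not arrived yet

    tt = TrueTime()
    commit_ts = tt.now().latest        # pick a timestamp at or after absolute now
    while not tt.after(commit_ts):     # "commit wait" enforces external consistency
        time.sleep(0.001)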
39. Spanner Software Stack
40. Externally-consistent Operations
   • Read-Write Transaction
   • Read-Only Transaction
   • Snapshot Read (client-provided timestamp)
   • Snapshot Read (client-provided bound)
   • Schema Change Transaction
41. Dremel
   • Scalable, interactive ad-hoc query system
   • Designed to operate on read-only data
   • Handles nested data (Protocol Buffers)
   • Can run aggregation queries over trillion-row tables in seconds
42. Columnar Storage Format
   • Novel format for storing lists of nested records (Protocol Buffers)
   • Highly space-efficient
   • Algorithm for dissecting a list of nested records into columns
   • Algorithm for reassembling columns into a list of records
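   A deliberately simplified sketch of the column-striping idea, for flat records only. Dremel's actual algorithm additionally records repetition and definition levels so that arbitrarily nested, repeated Protocol Buffer records can be split apart and reassembled losslessly; that bookkeeping is omitted here.

    records = [
        {"docid": 10, "country": "us", "amount": 3.5},
        {"docid": 20, "country": "de", "amount": 1.0},
        {"docid": 30, "country": "us", "amount": 7.25},
    ]

    def shred(records, fields):
        """Dissect a list of records into one value list per column."""
        return {f: [r.get(f) for r in records] for f in fields}

    def assemble(columns):
        """Reassemble the column lists back into records."""
        names = list(columns)
        return [dict(zip(names, row)) for row in zip(*columns.values())]

    cols = shred(records, ["docid", "country", "amount"])
    print(cols["country"])            # a query touching one column reads only this list
    assert assemble(cols) == records  # round-trips for flat records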
43. Multi-level Execution Trees
   • Execution model for one-pass aggregations returning small and medium-sized results (very common at Google)
   • The query gets rewritten as it passes down the execution tree
   • On the way up, intermediate servers perform a parallel aggregation of partial results
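   A sketch of that one-pass aggregation pattern: the root rewrites the query into partial aggregations pushed to the leaves, and each intermediate level merges partial (sum, count) pairs on the way back up. The two-level tree and the AVG example are illustrative choices, not taken from the slide.

    # Query: SELECT AVG(x) FROM T  ->  leaves compute partial (sum, count),
    # intermediate servers add the partials, and the root finalizes sum/count.

    leaf_partitions = [[1.0, 2.0, 3.0], [10.0], [4.0, 4.0]]   # tablets held by leaves

    def leaf_execute(rows):
        return (sum(rows), len(rows))                 # partial aggregate

    def merge(partials):
        return (sum(s for s, _ in partials), sum(c for _, c in partials))

    # Level 1: leaf servers; level 2: intermediate servers, each owning a few leaves.
    intermediate = [merge([leaf_execute(p) for p in leaf_partitions[:2]]),
                    merge([leaf_execute(p) for p in leaf_partitions[2:]])]
    total_sum, total_count = merge(intermediate)      # root
    print(total_sum / total_count)                    # 4.0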
44. Performance
45. Example Queries
   • SELECT SUM(CountWords(txtField)) / COUNT(*) FROM T1
   • SELECT country, SUM(item.amount) FROM T2 GROUP BY country
   • SELECT domain, SUM(item.amount) FROM T2 WHERE domain CONTAINS ’.net’ GROUP BY domain
   • SELECT COUNT(DISTINCT a) FROM T5
46. Future Evolution - Hardware Trends
   • SSD Drives
   • Disk Drives
   • Networking
47. Flash Memory Rated Lifetime (P/E Cycles)
    Source: Bleak Future of NAND Flash Memory, Grupp et al., FAST 2012
48. Flash Memory Average BER at Rated Lifetime
    Source: Bleak Future of NAND Flash Memory, Grupp et al., FAST 2012
49. Disk: Areal Density Trend
    Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011
50. Disk: Maximum Sustained Bandwidth Trend
    Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011
51. Time Required to Sequentially Fill a SATA Drive
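   The chart itself is not reproduced here; the underlying arithmetic is simply capacity divided by sustained bandwidth. The capacity and transfer rate below are illustrative round numbers, not figures taken from the slide.

    capacity_bytes = 3e12        # assume a 3 TB SATA drive
    bandwidth_bps  = 150e6       # assume ~150 MB/s sustained sequential transfer
    hours = capacity_bytes / bandwidth_bps / 3600
    print(f"{hours:.1f} hours to fill the drive sequentially")   # ~5.6 hours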
52. Average Seek Time
    Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011
53. Average Rotational Latency
    Source: GPFS Scans 10 Billion Files in 43 Minutes. © Copyright IBM Corporation 2011
54. Time Required to Randomly Read a SATA Drive
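   Similarly, the random-read chart reduces to capacity divided by (request size × random IOPS). The values below are illustrative assumptions chosen only to show why the result is measured in months rather than hours.

    capacity_bytes = 3e12      # assume the same 3 TB drive
    request_bytes  = 4096      # 4 KB random reads
    ms_per_request = 8 + 4     # assume ~8 ms seek + ~4 ms rotational latency
    requests = capacity_bytes / request_bytes
    days = requests * (ms_per_request / 1000) / 86400
    print(f"{days:.0f} days to read the drive with 4 KB random reads")   # ~100 days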
55. Ethernet
   • 10GbE
     • Starting to replace 1GbE for server NICs
     • De facto network port for new servers in 2014
   • 40GbE
     • Data center core & aggregation
     • Top-of-rack server aggregation
   • 100GbE
     • Service provider core and aggregation
     • Metro and large campus core
     • Data center core & aggregation
   • No technology currently exists to transport 40 Gbps or 100 Gbps as a single stream over existing copper or fiber
   • 40GbE and 100GbE are achieved by running either 4 or 10 parallel 10GbE “lanes”
56. 10GbE Adoption Curve (?)
    Source: CREHAN RESEARCH Inc. © Copyright 2012
57. The End
    Thank you!
