Non-Relational Databases: This hurts. I like it.

Delivered at San Luis Obispo .NET Users Group on October 13th, 2009.

Published in: Technology

  1. Non-Relational Databases: This hurts. I like it.
     Christopher Groskopf / bouvard / @onyxfish
  2. Outline
     • First!
         • A Hypothetical
     • Second!
         • Platforms
     • Third!
         • Voter's Daily and CouchDB
  3. First! A Hypothetical
  4. I want to query space.
  5. The Kepler Mission
     • NASA's search for extra-solar planets
     • 100,000 stars
     • 3.5 years of constant observation
     • Sensitive measurements
     • How would you store this data so that your researchers can analyze it effectively?
     • (Hint: It is probably not SQLite on a thumb drive.)
  11. The Relational Model
  12. Pros and Cons
      Pros:
      • SQL lets you query all the data at once
      • Enforces data integrity
      • Minimizes repetition
      • Proven
      • Familiar
          • To your DBA
          • To your users
      Cons:
      • Rigidly schematic
      • Joins rapidly become a bottleneck
      • Difficult to scale up
      • Gets in the way of parallelization
      • Optimization may undermine the benefits of normalization
  22. The Non-Relational Model
  23. Pros and Cons
      Pros:
      • Schema-less
      • Master ↔ Master replication
      • Scales well
      • Map/Reduce means everything runs in parallel
      • Built for the web
      Cons:
      • No SQL
      • Integrity-enforcement migrates to code
      • Limited ORM tooling
      • Significant learning curve
      • Proven only in a subset of cases
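
The point that integrity enforcement migrates to code can be made concrete with a small check that runs before a document is saved. This is only a sketch: the function name `validateEvent` and the list of required fields are assumptions chosen for illustration, not part of any particular database's API.

```javascript
// Hypothetical required fields for an "event" document (an assumption for this sketch).
const REQUIRED_FIELDS = ["datetime", "title", "branch", "entity"];

// Return a list of problems; an empty list means the document is acceptable.
function validateEvent(doc) {
  const errors = [];
  for (const field of REQUIRED_FIELDS) {
    if (doc[field] === undefined || doc[field] === null || doc[field] === "") {
      errors.push("missing field: " + field);
    }
  }
  // A well-formed ISO 8601 timestamp parses to a number; garbage parses to NaN.
  if (doc.datetime && Number.isNaN(Date.parse(doc.datetime))) {
    errors.push("datetime is not a valid timestamp");
  }
  return errors;
}
```

An application would call a check like this before every write and refuse to store a document whose error list is not empty, which is the job a schema and constraints would otherwise do.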
  32. Second! Platforms
  33. Traits of NRDBs
      • Usually they are a key/value datastore
      • Often, they offer Master ↔ Master replication
      • In most cases they store schema-less data
      • Typically they scale by “automatic” sharding
      • Sometimes they offer “eventual consistency”
      • For the most part they are fast
      • Generally they are targeted at web applications
      • Frequently we can't define what they are
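
To illustrate the sharding trait listed above, here is a minimal sketch of hash-based key partitioning. The names `hashKey`, `shardFor`, and `ShardedStore` are hypothetical, and the in-memory Maps stand in for separate servers. Real platforms typically use more elaborate schemes (for example consistent hashing with rebalancing), so treat this as an illustration of the idea only.

```javascript
// Simple deterministic string hash (a djb2 variant); an assumption for this sketch.
function hashKey(key) {
  let hash = 5381;
  for (let i = 0; i < key.length; i++) {
    // ">>> 0" keeps the running value an unsigned 32-bit integer.
    hash = ((hash * 33) ^ key.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Pick a shard index in the range [0, shardCount) for a key.
function shardFor(key, shardCount) {
  return hashKey(key) % shardCount;
}

class ShardedStore {
  constructor(shardCount) {
    // Each Map stands in for one server holding a slice of the keyspace.
    this.shards = Array.from({ length: shardCount }, () => new Map());
  }

  put(key, value) {
    this.shards[shardFor(key, this.shards.length)].set(key, value);
  }

  get(key) {
    return this.shards[shardFor(key, this.shards.length)].get(key);
  }
}
```

Because the shard is computed from the key alone, any client can find a value without consulting a central index. The cost is that changing the shard count with plain modulo hashing moves most keys, which is one reason the quotation marks around “automatic” are deserved.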
  41. Used Memcache?
      • Memcache is a high-availability key/value store
      • Imagine if Memcache were your database
      • That is more or less what an NRDB is
      • Except that everything is permanently “cached” to disk
      • And only the most common result sets are held in RAM (it could be all of them)
      • In most cases this is faster than computing fresh results based on indices (that is, SQL)
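
The Memcache analogy above can be sketched as a key/value store that writes everything to durable storage and keeps only the most recently used entries in memory. Everything here is an assumption for illustration: `CachedStore` is a hypothetical name, and a plain Map stands in for the disk.

```javascript
class CachedStore {
  constructor(cacheSize) {
    this.disk = new Map(); // stand-in for durable storage: every entry lives here
    this.ram = new Map();  // hot entries only, kept in least-recently-used order
    this.cacheSize = cacheSize;
  }

  put(key, value) {
    this.disk.set(key, value);
    this.touch(key, value);
  }

  get(key) {
    if (this.ram.has(key)) {
      const hot = this.ram.get(key);
      this.touch(key, hot); // refresh recency on a cache hit
      return hot;
    }
    if (!this.disk.has(key)) return undefined;
    const cold = this.disk.get(key);
    this.touch(key, cold); // promote the entry into RAM on a cache miss
    return cold;
  }

  // Mark a key as most recently used and evict the oldest entry if over capacity.
  touch(key, value) {
    this.ram.delete(key);
    this.ram.set(key, value);
    if (this.ram.size > this.cacheSize) {
      // A Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.ram.keys().next().value;
      this.ram.delete(oldest);
    }
  }
}
```

Reads of popular keys never touch the slow path, and nothing is lost when an entry is evicted from RAM because the durable copy is always written first.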
  47. Top 10 NRDBs...
      • Azure Table Storage
      • Berkeley DB
      • BigTable
      • Cassandra
      • CouchDB
      • HyperTable
      • MongoDB
      • Project Voldemort
      • SimpleDB
      • Tokyo Cabinet
  57. ...and their backers
      • Azure Table Storage -> Microsoft (2, 6)
      • Berkeley DB -> Oracle (3)
      • BigTable -> Google (4, 1)
      • Cassandra -> Facebook (2)
      • CouchDB -> IBM (1)
      • HyperTable -> Baidu (9)
      • MongoDB -> SourceForge (182)
      • Project Voldemort -> LinkedIn (65)
      • SimpleDB -> Amazon (29)
      • Tokyo Cabinet -> Mixi (88)
      Blue: Largest software companies according to Forbes (2009)
      Red: Highest traffic websites according to Alexa (as of 9/17)
  67. This is not a fad.
  68. Primary Use-cases
      • Ridiculous scale
      • Unstructured data
      • Massive datasets (broad > deep)
      • Fuzzy and/or fault tolerant data
      • Versioned data
      • Logging
      • When eventual consistency is good enough
  75. If you are storing a JSON or XML string in your SQL database:
      I Have Your Medicine
  76. Mis-use Cases
      • SQL is a prerequisite
      • Deeply hierarchical datasets
      • Data integrity that must be enforced by a DBA
      • High security applications where the database must enforce that security (LAN/WAN facing)
      • Transactional data (banking, analytics, etc.)
      • Usage is highly unpredictable, combinatorial, or likely to change suddenly
  82. Third! Voter's Daily and CouchDB
  83. My Requirements
      • Loss-less data structures (non-uniform data)
      • Loosely coupled dependency
          • Portable
          • RESTful
      • Scalable without refactoring
      • Understood by the Gov2.0 community
      • Reusable / Educational / Transparent
  88. CouchDB
      • Schema-less
      • “Speaks” JSON
      • “Thinks” JavaScript (optionally, Python)
      • RESTful API
      • Pre-collates Views (on insert) for fast reads
      • Supports Master ↔ Master replication
      • “Futon” management interface
      • Written in Erlang
  96. An Example JSON Document
        {
          "_id": "2006-12-06T00:00:00Z - C-SPAN House Ways and Means Committee Schedule Scraper",
          "_rev": "1-2ca577e0a4a25ad2704fdf5a20161f9f",
          "datetime": "2006-12-06T00:00:00Z",
          "end_datetime": null,
          "title": "Hearing on Patient Safety and Quality Issues in End Stage Renal Disease Treatment",
          "description": null,
          "branch": "Legislative",
          "entity": "House of Representatives",
          "source_url": "http://www3.capwiz.com/c-span/dbq/officials/schedule.dbq?committee=hways&command=committee_schedules&chambername=House&chamber=H&period=",
          "source_text": "<span class=\"cwnormalbold\">DECEMBER 06, 2006<br /></span>\u000a\u0009<span class=\"cwnormal\">Hearing on Patient Safety and Quality Issues in End Stage Renal Disease Treatment<br /></span>",
          "access_datetime": "2009-09-28T04:19:02Z",
          "parser_name": "C-SPAN House Ways and Means Committee Schedule Scraper",
          "parser_version": "0.1"
        }
  97. Views with Map/Reduce I
      All events scraped for the Supreme Court
      Map:
        // Emit one row per Supreme Court event, keyed by its start time.
        function (doc) {
          if (doc.entity === "Supreme Court") {
            emit(doc.datetime, doc);
          }
        }
      Reduce: None
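
One way to get a feel for what this view produces is to run the map function outside CouchDB against a few in-memory documents. The sketch below assumes a stand-in `emit` and a hypothetical `queryView` helper. CouchDB supplies its own `emit` and sorts rows with its own collation rules, which plain string comparison only approximates.

```javascript
// Stand-in for the emit() that CouchDB provides to map functions.
let rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// The map function from the slide, given a name so it can be passed around.
function supremeCourtMap(doc) {
  if (doc.entity === "Supreme Court") {
    emit(doc.datetime, doc);
  }
}

// Hypothetical helper: run a map function over every document and
// return the emitted rows ordered by key, roughly as a view query would.
function queryView(docs, mapFn) {
  rows = [];
  docs.forEach(mapFn);
  return rows.slice().sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}
```

Calling `queryView(docs, supremeCourtMap)` on an array of event documents returns only the rows whose `entity` is "Supreme Court", in chronological order, because ISO 8601 timestamps sort correctly as strings.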
  98. Views with Map/Reduce II
      All events counted by event month
      Map:
        // Emit a 1 for each event, keyed by "YYYY-MM".
        function (doc) {
          var month = doc.datetime.substr(0, 7);
          emit(month, 1);
        }
      Reduce:
        // Add up the 1s emitted for each key.
        function (key, values) {
          return sum(values);
        }
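
The same kind of offline emulation shows how the map and reduce steps combine into per-month counts. This is a sketch under stated assumptions: `emit` and `sum` are stand-ins for the helpers CouchDB provides, `countByMonth` is a hypothetical driver that groups emitted rows by key before reducing (roughly what a grouped view query does), and the names `mapByMonth` and `reduceCount` are mine.

```javascript
// Stand-ins for helpers that CouchDB makes available to view functions.
const emitted = [];
function emit(key, value) {
  emitted.push([key, value]);
}
function sum(values) {
  return values.reduce((total, n) => total + n, 0);
}

// Map: one row per event, keyed by the "YYYY-MM" prefix of its timestamp.
function mapByMonth(doc) {
  const month = doc.datetime.slice(0, 7);
  emit(month, 1);
}

// Reduce: add up the values that share a key.
function reduceCount(keys, values) {
  return sum(values);
}

// Hypothetical driver: map every document, group rows by key, reduce each group.
function countByMonth(docs) {
  emitted.length = 0;
  docs.forEach(mapByMonth);
  const groups = new Map();
  for (const [key, value] of emitted) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  const counts = {};
  for (const [key, values] of groups) {
    counts[key] = reduceCount(null, values);
  }
  return counts;
}
```

Each document is mapped independently of the others and each key is reduced independently, which is what lets the real system spread the work across processes or machines.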
  99. Lessons
      • Unlearning normalization is very difficult
      • Harnessing “high availability” requires a large up-front investment of development time
      • Map/Reduce and SQL shouldn't even be used in the same sentence (GQL is a stupid name)
      • Schema-less data is fantastic
      • Integrity checking in code is not so bad (that is what abstraction is for)
      • Doing Joins in code is actually very liberating
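
As an illustration of doing a join in code, the sketch below attaches parser metadata to event documents by matching on `parser_name`, a field that appears in the example document. The separate parser documents, their `name` and `maintainer` fields, and the function name `joinEventsToParsers` are all assumptions made for this example.

```javascript
// Join event documents to (hypothetical) parser documents on the parser name.
function joinEventsToParsers(events, parsers) {
  // Index one side by the join key so each lookup is constant time.
  const parsersByName = new Map(parsers.map((parser) => [parser.name, parser]));
  return events.map((event) => ({
    title: event.title,
    parser_name: event.parser_name,
    // null plays the role of an unmatched row in a SQL LEFT JOIN.
    maintainer: parsersByName.has(event.parser_name)
      ? parsersByName.get(event.parser_name).maintainer
      : null,
  }));
}
```

The application decides here what an unmatched row means and which fields survive, choices that a SQL join would make declaratively.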
  105. Conclusions
       • You (probably) do not need an NRDB
       • But you ought to learn one anyway
       • It's not just for Twitter and bleeding edge startups
       • Amazon, Facebook, Google, IBM, and Microsoft all get this
       • Sometimes it is simply the right tool for the job
  110. Links
       • CouchDB: http://couchdb.apache.org/
       • CouchDB & Map/Reduce Emulator: http://labs.mudynamics.com/wp-content/uploads/2009/04/icouch.html
       • NASA's Kepler Mission: http://kepler.nasa.gov/
       • ReadWriteWeb on NRDBs: http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php
       • Voter's Daily: http://github.com/bouvard/votersdaily
