Non-Relational Databases: This hurts. I like it.

Delivered at San Luis Obispo .NET Users Group on October 13th, 2009.

Usage Rights

CC Attribution License

    Presentation Transcript

    • Non-Relational Databases: This hurts. I like it. Christopher Groskopf / bouvard / @onyxfish
    • Outline
      • First!
        • A Hypothetical
      • Second!
        • Platforms
      • Third!
        • Voter's Daily and CouchDB
    • First! A Hypothetical
    • I want to query space.
    • The Kepler Mission
      • NASA's search for extra-solar planets
      • 100,000 stars
      • 3.5 years of constant observation
      • Sensitive measurements
      • How would you store this data so that your researchers can analyze it effectively?
      • (Hint: It is probably not sqlite on a thumb drive.)
    • The Relational Model
    • Pros and Cons
      • SQL lets you query all the data at once
      • Enforces data integrity
      • Minimizes repetition
      • Proven
      • Familiar
        • To your DBA
        • To your users
      • Rigidly schematic
      • Joins rapidly become a bottleneck
      • Difficult to scale up
      • Gets in the way of parallelization
      • Optimization may mitigate the benefits of normalization
    • The Non-Relational Model
    • Pros and Cons
      • Schema-less
      • Master ↔ Master replication
      • Scales well
      • Map/Reduce means everything runs in parallel
      • Built for the web
      • No SQL
      • Integrity-enforcement migrates to code
      • Limited ORM tooling
      • Significant learning curve
      • Proven only in a subset of cases
    • Second! Platforms
    • Traits of NRDBs
      • Usually they are a key/value datastore
      • Often, they offer Master ↔ Master replication
      • In most cases they store schema-less data
      • Typically they scale by “automatic” sharding
      • Sometimes they offer “eventual consistency”
      • For the most part they are fast
      • Generally they are targeted at web applications
      • Frequently we can't define what they are
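The "automatic" sharding trait above usually means the store chooses a node from the key itself. The following is only a minimal sketch of that idea, assuming a fixed list of hypothetical node names and a simple string hash; real systems tend to use consistent hashing so that adding a node relocates fewer keys.

```javascript
// Hypothetical node names; not taken from any real deployment.
const NODES = ["node-a", "node-b", "node-c"];

// Simple deterministic string hash (djb2-style), kept as an unsigned 32-bit integer.
function hashKey(key) {
  let h = 5381;
  for (let i = 0; i < key.length; i++) {
    h = ((h * 33) ^ key.charCodeAt(i)) >>> 0;
  }
  return h;
}

// The same key always maps to the same node.
function shardFor(key, nodes = NODES) {
  return nodes[hashKey(key) % nodes.length];
}

// In-memory stand-in for a cluster: one Map per node.
const cluster = new Map(NODES.map((name) => [name, new Map()]));

function put(key, value) {
  cluster.get(shardFor(key)).set(key, value);
}

function get(key) {
  return cluster.get(shardFor(key)).get(key);
}

put("star:0001", { brightness: 11.2 });
console.log(shardFor("star:0001"), get("star:0001"));
```

The caller never names a node; routing falls out of the key, which is why these stores can add capacity without application changes.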
    • Used Memcache?
      • Memcache is a high-availability key/value store
      • Imagine if Memcache were your database
      • That is more or less what an NRDB is
      • Except that everything is permanently “cached” to disk
      • And only the most common result sets are held in RAM (it could be all of them)
      • In most cases this is faster than computing fresh results based on indices (that is, SQL)
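The Memcache analogy above can be sketched as a key/value store that writes everything to durable storage and keeps a bounded set of entries in RAM. This is an illustration under stated assumptions, not how any particular NRDB works: the class name `PersistentCache` is hypothetical, a `Map` stands in for the disk, and least-recently-used eviction approximates "most common".

```javascript
class PersistentCache {
  constructor(ramCapacity) {
    this.ramCapacity = ramCapacity;
    this.disk = new Map(); // stand-in for durable storage
    this.ram = new Map();  // Map insertion order doubles as recency order
  }

  // Mark a key as most recently used and evict the oldest entry if over capacity.
  _touch(key, value) {
    this.ram.delete(key);
    this.ram.set(key, value);
    if (this.ram.size > this.ramCapacity) {
      const oldest = this.ram.keys().next().value;
      this.ram.delete(oldest);
    }
  }

  // Every write reaches "disk"; RAM holds only the hot subset.
  set(key, value) {
    this.disk.set(key, value);
    this._touch(key, value);
  }

  // Serve from RAM when possible, otherwise fall back to "disk".
  get(key) {
    if (this.ram.has(key)) {
      const value = this.ram.get(key);
      this._touch(key, value);
      return value;
    }
    if (!this.disk.has(key)) return undefined;
    const value = this.disk.get(key);
    this._touch(key, value);
    return value;
  }
}

const cache = new PersistentCache(2);
cache.set("a", 1);
cache.set("b", 2);
cache.set("c", 3); // "a" leaves RAM but stays on "disk"
console.log(cache.get("a"));
```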
    • Top 10 NRDBs...
      • Azure Table Storage
      • Berkeley DB
      • BigTable
      • Cassandra
      • CouchDB
      • HyperTable
      • MongoDB
      • Project Voldemort
      • SimpleDB
      • Tokyo Cabinet
    • ...and their backers
      • Azure Table Storage -> Microsoft (2, 6)
      • Berkeley DB -> Oracle (3)
      • BigTable -> Google (4, 1)
      • Cassandra -> Facebook (2)
      • CouchDB -> IBM (1)
      • HyperTable -> Baidu (9)
      • MongoDB -> SourceForge (182)
      • Project Voldemort -> LinkedIn (65)
      • SimpleDB -> Amazon (29)
      • Tokyo Cabinet -> Mixi (88)
      Blue: Largest software companies according to Forbes (2009). Red: Highest traffic websites according to Alexa (as of 9/17).
    • This is not a fad.
    • Primary Use-cases
      • Ridiculous scale
      • Unstructured data
      • Massive datasets (broad > deep)
      • Fuzzy and/or fault tolerant data
      • Versioned data
      • Logging
      • When eventual consistency is good enough
    • If you are storing a JSON or XML string in your SQL database: I Have Your Medicine
    • Mis-use Cases
      • SQL is a prerequisite
      • Deeply hierarchical datasets
      • Data integrity that must be enforced by a DBA
      • High security applications where the database must enforce that security (LAN/WAN facing)
      • Transactional data (banking, analytics, etc.)
      • Usage is highly unpredictable, combinatorial, or likely to change suddenly
    • Third! Voter's Daily and CouchDB
    • My Requirements
      • Loss-less data structures (non-uniform data)
      • Loosely coupled dependency
        • Portable
        • RESTful
      • Scalable without refactoring
      • Understood by the Gov2.0 community
      • Reusable / Educational / Transparent
    • CouchDB
      • Schema-less
      • “Speaks” JSON
      • “Thinks” JavaScript (optionally, Python)
      • RESTful API
      • Pre-collates Views (on insert) for fast reads
      • Supports Master ↔ Master replication
      • “Futon” management interface
      • Written in Erlang
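Because CouchDB is RESTful and speaks JSON, storing a document or reading a view is an ordinary HTTP request. The sketch below only builds request descriptions and sends nothing. The helper names, the database name `votersdaily`, the design document `events`, and the view `by_month` are all assumptions for illustration; the URL shapes (`PUT /{db}/{docid}` and `GET /{db}/_design/{ddoc}/_view/{view}`) and the default port 5984 follow CouchDB's HTTP API.

```javascript
// Assumed defaults: CouchDB's standard port and a hypothetical database name.
const BASE_URL = "http://localhost:5984";
const DB = "votersdaily";

// Describe the request that would create or update one document.
function putDocRequest(doc, baseUrl = BASE_URL, db = DB) {
  return {
    method: "PUT",
    url: `${baseUrl}/${db}/${encodeURIComponent(doc._id)}`,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(doc),
  };
}

// Describe the request that would query a view. Query values such as `key`
// must be JSON-encoded, so strings keep their surrounding quotes.
function viewRequest(designDoc, view, options = {}, baseUrl = BASE_URL, db = DB) {
  const params = Object.entries(options)
    .map(([name, value]) => `${name}=${encodeURIComponent(JSON.stringify(value))}`)
    .join("&");
  const path = `${baseUrl}/${db}/_design/${designDoc}/_view/${view}`;
  return { method: "GET", url: params ? `${path}?${params}` : path };
}

console.log(putDocRequest({ _id: "example event", branch: "Legislative" }).url);
console.log(viewRequest("events", "by_month", { key: "2009-10" }).url);
```

Against a running CouchDB instance, each description could be handed to an HTTP client, for example `fetch(req.url, req)`.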
    • An Example JSON Document
        {
          "_id": "2006-12-06T00:00:00Z - C-SPAN House Ways and Means Committee Schedule Scraper",
          "_rev": "1-2ca577e0a4a25ad2704fdf5a20161f9f",
          "datetime": "2006-12-06T00:00:00Z",
          "end_datetime": null,
          "title": "Hearing on Patient Safety and Quality Issues in End Stage Renal Disease Treatment",
          "description": null,
          "branch": "Legislative",
          "entity": "House of Representatives",
          "source_url": "http://www3.capwiz.com/c-span/dbq/officials/schedule.dbq?committee=hways&command=committee_schedules&chambername=House&chamber=H&period=",
          "source_text": "<span class=\"cwnormalbold\">DECEMBER 06, 2006<br /></span>\u000a\u0009<span class=\"cwnormal\">Hearing on Patient Safety and Quality Issues in End Stage Renal Disease Treatment<br /></span>",
          "access_datetime": "2009-09-28T04:19:02Z",
          "parser_name": "C-SPAN House Ways and Means Committee Schedule Scraper",
          "parser_version": "0.1"
        }
    • Views with Map/Reduce I
      All events scraped for the Supreme Court
      Map:
        function (doc) {
          if (doc.entity == "Supreme Court") {
            emit(doc.datetime, doc);
          }
        }
      Reduce:
        None
    • Views with Map/Reduce II
      All events counted by event month
      Map:
        function (doc) {
          var month = doc.datetime.substr(0, 7);
          emit(month, 1);
        }
      Reduce:
        function (key, values) {
          return sum(values);
        }
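To see what the two views above compute without a running CouchDB, the map and reduce steps can be emulated in memory. This is a rough sketch, not CouchDB's implementation: `runView` is a hypothetical helper, the sample documents are invented, and map functions receive `emit` as a second argument here, whereas CouchDB provides `emit` and `sum` as built-ins.

```javascript
// Minimal in-memory stand-in for a CouchDB view.
function runView(docs, mapFn, reduceFn) {
  const rows = [];
  const emit = (key, value) => rows.push({ key, value });
  for (const doc of docs) mapFn(doc, emit);

  // Views return rows ordered by key.
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  if (!reduceFn) return rows;

  // Group emitted values by key, then reduce each group to a single value.
  const groups = new Map();
  for (const { key, value } of rows) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return [...groups].map(([key, values]) => ({ key, value: reduceFn(key, values) }));
}

// Stand-in for CouchDB's built-in sum() helper.
const sum = (values) => values.reduce((a, b) => a + b, 0);

// Invented documents carrying only the fields the two views read.
const docs = [
  { entity: "Supreme Court", datetime: "2009-10-05T14:00:00Z" },
  { entity: "House of Representatives", datetime: "2009-10-06T15:00:00Z" },
  { entity: "Supreme Court", datetime: "2009-11-02T14:00:00Z" },
];

// View I: all events for the Supreme Court (map only, no reduce).
const supremeCourtEvents = runView(docs, (doc, emit) => {
  if (doc.entity == "Supreme Court") emit(doc.datetime, doc);
});

// View II: events counted by month (map emits 1 per event, reduce sums).
const eventsByMonth = runView(
  docs,
  (doc, emit) => emit(doc.datetime.slice(0, 7), 1),
  (key, values) => sum(values)
);

console.log(supremeCourtEvents.length); // 2
console.log(eventsByMonth); // [ { key: '2009-10', value: 2 }, { key: '2009-11', value: 1 } ]
```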
    • Lessons
      • Unlearning normalization is very difficult
      • Harnessing “high availability” requires a large up-front investment of development time
      • Map/Reduce and SQL shouldn't even be used in the same sentence (GQL is a stupid name)
      • Schema-less data is fantastic
      • Integrity checking in code is not so bad (that is what abstraction is for)
      • Doing Joins in code is actually very liberating
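The last two lessons, integrity checking and joins done in application code, might look something like the sketch below. It is an illustration only: the list of required fields is an assumption drawn from the keys of the example document, and the parser-metadata records and both function names are hypothetical.

```javascript
// Assumed required fields, taken from the example document's keys.
const REQUIRED_STRINGS = ["_id", "datetime", "title", "branch", "entity", "parser_name"];

// Integrity check in code: return a list of problems instead of relying on a schema.
function validateEvent(doc) {
  const errors = [];
  for (const field of REQUIRED_STRINGS) {
    if (typeof doc[field] !== "string" || doc[field].length === 0) {
      errors.push(`${field} must be a non-empty string`);
    }
  }
  if (typeof doc.datetime === "string" && Number.isNaN(Date.parse(doc.datetime))) {
    errors.push("datetime must be a parseable date");
  }
  return errors;
}

// "Join" in code: attach a hypothetical parser-metadata record to each event.
function joinParsers(events, parsersByName) {
  return events.map((event) => ({
    ...event,
    parser: parsersByName.get(event.parser_name) ?? null,
  }));
}

const event = {
  _id: "2006-12-06T00:00:00Z - Example Scraper",
  datetime: "2006-12-06T00:00:00Z",
  title: "Example hearing",
  branch: "Legislative",
  entity: "House of Representatives",
  parser_name: "Example Scraper",
};
const parsers = new Map([["Example Scraper", { version: "0.1" }]]);

console.log(validateEvent(event));
console.log(joinParsers([event], parsers)[0].parser);
```

Keeping both steps in small functions like these is one way to get the abstraction the lessons mention: the rules live in one place and can change without a schema migration.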
    • Conclusions
      • You (probably) do not need an NRDB
      • But you ought to learn one anyway
      • It's not just for Twitter and bleeding edge startups
      • Amazon, Facebook, Google, IBM, and Microsoft all get this
      • Sometimes it is simply the right tool for the job
    • Links
      • CouchDB: http://couchdb.apache.org/
      • CouchDB & Map/Reduce Emulator: http://labs.mudynamics.com/wp-content/uploads/2009/04/icouch.html
      • NASA's Kepler Mission: http://kepler.nasa.gov/
      • ReadWriteWeb on NRDBs: http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php
      • Voter's Daily: http://github.com/bouvard/votersdaily