Nonrelational Databases

My improvised/copied preso for some short talk I gave.

Nonrelational Databases Presentation Transcript

  • 1. Non-relational Databases: a new kind of database for handling web scale
  • 2. Agenda
    • The problem
    • The solution
    • Benefits
    • Cost
    • Example: Cassandra
  • 7. The problem
    • The Web introduces a new scale for applications, in terms of:
      • Concurrent users (millions of requests/second)
      • Data (petabytes generated daily)
      • Processing (all this data needs processing)
      • Exponential growth (surging, unpredictable demand)
  • 11. The problem (contd.)
    • Websites with very large traffic cannot handle this with existing RDBMS solutions:
      • Oracle
      • MS SQL
      • Sybase
      • MySQL
      • PostgreSQL
    • Even with their high-end clustering solutions
  • 16. The problem (contd.)
    • Why?
      • Applications using a normalized database schema require joins, which don't perform well with large amounts of data and/or many nodes
      • Existing RDBMS clustering solutions rely on scale-up, which has hard limits & cannot keep pace with exponential growth
      • Machines have upper limits on capacity, & sharding the data & processing across machines is very complex & app-specific
  • 19. The problem (contd.)
    • Why not just use sharding?
      • Very problematic when adding/removing nodes
      • Basically, you end up denormalizing everything & losing all benefits of relational databases
  • 21. Who faced this problem?
    • Web applications dealing with high traffic, massive data, large user-base & user-generated content, such as:
      • Google
      • Yahoo!
      • Amazon
      • Facebook
      • Twitter
      • LinkedIn
      • & many more
  • 28. One difference, though
    • Compared to traditional large applications (telco, financial, &c), these web applications are usually free & therefore:
      • can sacrifice data integrity / consistency
        • No one will sue them for not receiving the most current:
          • Status of their friends (Facebook/Twitter)
          • Web search result (Google/Yahoo!)
          • Item added to cart (Amazon)
  • 31. The solution
    • These companies had to come up with a new kind of DBMS, capable of handling web scale
      • Possibly sacrificing some level of consistency or some other feature
  • 32. Must we sacrifice something?
    • In 2000, Eric Brewer (co-founder of Inktomi) formulated the CAP theorem, claiming that you can only optimize 2 out of these 3:
      • Consistency
      • Availability
      • Partition-tolerance
    • The theorem was formally proved in 2002 by Seth Gilbert & Nancy Lynch of MIT
  • 35. Simple example
    • When you have a lot of data which needs to be highly available, you'll usually need to partition it across machines & also replicate it to be more fault-tolerant
    • This means that when writing a record, all replicas must be updated too
    • Now you need to choose between:
      • Lock all relevant replicas during update => be less available
      • Don't lock the replicas => be less consistent
  • 39. The consequence
    • You need to either:
      • Drop partition tolerance (CA)
      • Drop availability (CP)
      • Drop consistency (AP)
    • “Drop” here is usually not meant as binary, but rather tunable
  • 42. Non-relational databases
    • The solution these companies came up with is a family of databases for handling web scale:
      • BigTable (developed at Google)
      • HBase (open-source BigTable clone, started at Powerset)
      • Dynamo (developed at Amazon)
      • Cassandra (developed at Facebook)
      • Voldemort (developed at LinkedIn)
      • & a few more:
        • Riak, Redis, CouchDB, MongoDB, Hypertable
  • 48. Benefits
    • Massively scalable
    • Extremely fast
    • Highly available, decentralized & fault tolerant (no single-point-of-failure)
    • Transparent sharding (consistent hashing)
    • Elasticity
    • Parallel processing
    • Dynamic schema
    • Automatic conflict resolution
  • 56. Consistent hashing
  • 57. Replication
  • 58. Replication – node joining
  • 59. Replication – node leaving
  • 60. Scale-out / elasticity?
    • O(1) Distributed Hashtable
    • Runs on a large number of cheap commodity machines
    • Replication
    • Gossip protocol
    • Transparently handles adding/removing nodes
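
The consistent hashing idea behind transparent sharding can be sketched roughly as follows. This is a minimal illustration, not any particular database's implementation: the `HashRing` class, the node names, the use of MD5 and the number of virtual nodes are all assumptions made for the example. It shows that adding a node moves only the keys the new node takes over.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a position on the ring (a large integer)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place `vnodes` positions for this node on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        """Drop all ring positions belonging to this node."""
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        """Return the node owning `key`: the first position clockwise from its hash."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding node-d")
```

Every client can compute the owner of a key locally from the ring, with no central directory; here the lookup is a binary search over the ring positions. Removing `node-d` again returns every key to its previous owner.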
  • 65. Tunable consistency?
    • Levels of consistency:
      • Strict consistency
      • Read your writes consistency
      • Session consistency
      • Monotonic read consistency
      • Eventual consistency
    • Tunable means: how many replicas to lock on write
      • N, R, W parameters
      • Quorum
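
As a rough illustration of the N, R, W parameters, assume the Dynamo-style reading: N is the number of replicas of a record, W is how many replicas must acknowledge a write, and R is how many are consulted on a read. Under that assumption, a read is guaranteed to see the latest write when R + W > N, because every read set then overlaps every write set. The function names below are made up for the sketch, which checks the rule by brute force.

```python
from itertools import combinations


def quorum_overlaps(n: int, r: int, w: int) -> bool:
    """Rule of thumb: read and write sets share a replica when R + W > N."""
    return r + w > n


def always_overlap_brute_force(n: int, r: int, w: int) -> bool:
    """Check every possible write set of size w against every read set of size r."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


for n, r, w in [(3, 2, 2), (3, 1, 1), (3, 1, 3), (5, 2, 3)]:
    print(n, r, w, quorum_overlaps(n, r, w), always_overlap_brute_force(n, r, w))
```

Lower R or W gives lower latency and higher availability at the price of possibly stale reads, which is the trade-off the "tunable" knob exposes.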
  • 71. Dealing with inconsistency
    • Read-repair (when encountering inconsistency)
    • Vector clock conflict resolution
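
A minimal sketch of how vector clocks can tell an outdated version from a genuine conflict. The representation (a dict from node name to counter) and the function names are assumptions for illustration; real systems differ in detail.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a: dict, b: dict) -> str:
    """Classify version a relative to b: 'equal', 'before', 'after' or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a is an ancestor of b; b can safely replace it
    if b_le_a:
        return "after"
    return "concurrent"      # neither descends from the other: a real conflict


def merge(a: dict, b: dict) -> dict:
    """Clock for a reconciled version: element-wise maximum of both clocks."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


v1 = increment({}, "node-a")      # {'node-a': 1}
v2 = increment(v1, "node-a")      # written again through node-a
v3 = increment(v1, "node-b")      # written concurrently through node-b

print(compare(v1, v2))            # before
print(compare(v2, v3))            # concurrent
print(merge(v2, v3))
```

When two versions are concurrent, the store (or the application) must reconcile them; a read-repair step would then write the reconciled version back to the stale replicas.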
  • 73. Dynamic schema
    • Column families (basically a sparse table)
  • 74. Dynamic schema (contd.)
    • “Supercolumn” is a collection of columns
    • Record can have several “supercolumns”
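
One way to picture column families and supercolumns is as nested maps. The sketch below is only a mental model under that assumption; the `insert` helper, the sample rows and the last-write-wins rule on timestamps are illustrative, not a description of any database's storage engine.

```python
def insert(column_family: dict, row_key: str, column: str, value, timestamp: int) -> None:
    """Store a column as (value, timestamp); the newest timestamp wins (an assumption)."""
    row = column_family.setdefault(row_key, {})
    current = row.get(column)
    if current is None or timestamp >= current[1]:
        row[column] = (value, timestamp)


# Column family: row key -> column name -> (value, timestamp).
# Rows are sparse: each row holds only the columns it actually has.
users = {}
insert(users, "alice", "email", "alice@example.com", 1)
insert(users, "alice", "city", "Haifa", 2)
insert(users, "bob", "email", "bob@example.com", 3)
insert(users, "alice", "email", "old@example.com", 0)  # stale write, ignored

# Super column family: row key -> supercolumn name -> column name -> (value, timestamp).
friends = {
    "alice": {
        "bob": {"since": ("2009", 4), "group": ("work", 4)},
        "carol": {"since": ("2010", 5)},
    }
}

print(sorted(users["alice"]))           # columns present for alice
print(sorted(users["bob"]))             # bob has no 'city' column at all
print(friends["alice"]["bob"]["group"])
```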
  • 76. Data processing
    • Map/Reduce: an API exposed by non-relational databases to process data
      • A functional programming pattern for parallelizing work
      • Brings the workers to the data: an excellent fit for non-relational databases
      • Minimizes the programming to 2 simple functions (map & reduce)
      • Example: count appearances of a word in a giant table of large texts
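
The word-count example can be sketched on a single machine as below. The `run_map_reduce` driver is a hypothetical stand-in for what a database would do in parallel across nodes; only `map_fn` and `reduce_fn` correspond to the two functions a user would write.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Emit (word, 1) for every word in one text."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Sum all the partial counts emitted for one word."""
    return word, sum(counts)


def run_map_reduce(records, map_fn, reduce_fn):
    """Toy single-process driver: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for key, value in records.items():
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())


texts = {
    "doc1": "the cat and the dog",
    "doc2": "The end",
}
word_counts = run_map_reduce(texts, map_fn, reduce_fn)
print(word_counts["the"])  # 3
```

In a real deployment the map calls would run on the nodes that hold each record, and only the small (word, count) pairs would travel over the network.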
  • 80. Map/Reduce (contd.)
  • 81. Storage
  • 82. Cost
    • Sacrifices consistency (ACID) in certain circumstances (though this can be dealt with)
    • Non-standard new API model
    • Non-standard new schema model
    • New knowledge required to tune/optimize
    • Less mature
  • 87. API model
    • Usually, similar to Key-Value map:
      • Get(key)
      • Put(key, value)
      • Delete(key)
      • Execute(operation, key_list)
    • “value” can be
      • an opaque serialized object
      • a record (list of “columns”: <name, value, timestamp>)
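
A minimal in-memory sketch of such a key-value API. The class and method names mirror the four operations on the slide, but the semantics of `execute` (apply a caller-supplied function to each listed key) and the `Column` record type are assumptions made for illustration.

```python
from typing import Any, Callable, Iterable, List, NamedTuple


class Column(NamedTuple):
    """One column of a record: <name, value, timestamp>."""
    name: str
    value: Any
    timestamp: int


class KeyValueStore:
    """Hypothetical single-node store exposing Get/Put/Delete/Execute."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str) -> Any:
        return self._data.get(key)

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

    def execute(self, operation: Callable[[str, Any], Any], key_list: Iterable[str]) -> List[Any]:
        """Run `operation(key, value)` for each key and collect the results."""
        return [operation(key, self._data.get(key)) for key in key_list]


store = KeyValueStore()
store.put("session:42", b"\x00\x01opaque-bytes")  # opaque serialized object
store.put("user:alice", [Column("email", "alice@example.com", 1),
                         Column("city", "Haifa", 2)])  # record made of columns

column_counts = store.execute(
    lambda key, value: len(value) if isinstance(value, list) else 0,
    ["user:alice", "session:42", "missing"],
)
print(column_counts)  # [2, 0, 0]

store.delete("session:42")
print(store.get("session:42"))  # None
```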
  • 92. Schema model
    • Kind of sparse table
    • No schema
  • 94. Example: Cassandra
    • Features:
      • O(1) DHT
      • Eventual consistency
        • tunable: consistency vs. latency
      • Values are structured, indexed
      • Columns / column families
      • Slicing with predicates (queries)
      • PartitionOrderer
  • 99. Cassandra performance
    • Benchmark against MySQL (50GB)
      • MySQL:
        • 300ms write
        • 350ms read
      • Cassandra:
        • 0.12ms write
        • 15ms read
    • Why are writes so fast?
      • Writes involve no reads/seeks
      • Use any node (closest to you)
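
The "no reads/seeks" point can be illustrated with a toy write path: each write is appended sequentially to a log and recorded in an in-memory table, and existing data is never read first. The `WritePath` class and the JSON log format are assumptions for the sketch, not Cassandra's actual on-disk format.

```python
import json
import os
import tempfile


class WritePath:
    """Toy write path: sequential append to a log plus an in-memory table."""

    def __init__(self, log_path: str) -> None:
        self._log = open(log_path, "a", encoding="utf-8")
        self.memtable = {}

    def write(self, key: str, value: str) -> None:
        # Append-only: the old value is never looked up or rewritten in place.
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        self.memtable[key] = value

    def close(self) -> None:
        self._log.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "commit.log")
    store = WritePath(path)
    store.write("user:1", "alice")
    store.write("user:1", "alicia")  # overwrite is just another append
    store.close()
    with open(path, encoding="utf-8") as f:
        log_lines = f.read().splitlines()

print(len(log_lines), store.memtable)
```

Both writes land in the log in order, while the in-memory table holds only the latest value for the key.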
  • 103. Cassandra API
  • 104. Cassandra API (contd.)
  • 105. Example: Cassandra (contd.)
    • Java API
      • Simple DAO
      • Simple client
  • 107. Cassandra usage
    • Very high-traffic sites:
      • Facebook
      • Digg
      • Twitter
  • 110. Further information
    • The Dynamo paper:
      • http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
    • NoSQL patterns:
      • http://horicky.blogspot.com/2009/11/nosql-patterns.html
    • NoSQL conference videos:
      • https://nosqleast.com/2009/
    • Hebrew podcast covering NoSQL & Cassandra (episodes 56, 57 & more):
      • http://www.reversim.com/
  • 111. Further information (contd.)
    • Ran Tavori's lecture (video + slides):
      • http://prettyprint.me/2010/01/09/introduction-to-nosql-and-cassandra-part-1/
      • http://prettyprint.me/2010/01/20/introduction-to-nosql-and-cassandra-part-2/