Nonrelational Databases

My improvised/copied preso for some short talk I gave.

Published in: Technology

  1. 1. Non-relational Databases: A New Kind of Database for Handling Web Scale
  2. 2. Agenda <ul><li>The problem
  3. 3. The solution
  4. 4. Benefits
  5. 5. Cost
  6. 6. Example: Cassandra </li></ul>
  7. 7. The problem <ul><li>The Web introduces a new scale for applications, in terms of: </li><ul><li>Concurrent users (millions of requests/second)
  8. 8. Data (petabytes generated daily)
  9. 9. Processing (all this data needs processing)
  10. 10. Exponential growth (surging unpredictable demands) </li></ul></ul>
  11. 11. The problem (contd.) <ul><li>Web sites with very large traffic have no way to deal with this using existing RDBMS solutions: </li><ul><li>Oracle
  12. 12. MS SQL
  13. 13. Sybase
  14. 14. MySQL
  15. 15. PostgreSQL </li></ul><li>Even with their high-end clustering solutions </li></ul>
  16. 16. The problem (contd.) <ul><li>Why? </li><ul><li>Applications using a normalized database schema require joins, which don't perform well with large amounts of data and/or many nodes
  17. 17. Existing RDBMS clustering solutions require scale-up, which is limited & not really scalable when dealing with exponential growth
  18. 18. Machines have upper limits on capacity, & sharding the data & processing across machines is very complex & app-specific </li></ul></ul>
  19. 19. The problem (contd.) <ul><li>Why not just use sharding? </li><ul><li>Very problematic when adding/removing nodes
  20. 20. Basically, you end up denormalizing everything & losing all the benefits of relational databases </li></ul></ul>
  21. 21. Who faced this problem? <ul><li>Web applications dealing with high traffic, massive data, large user-base & user-generated content, such as: </li><ul><li>Google
  22. 22. Yahoo!
  23. 23. Amazon
  24. 24. Facebook
  25. 25. Twitter
  26. 26. LinkedIn
  27. 27. & many more </li></ul></ul>
  28. 28. 1 difference though <ul><li>Compared to traditional large applications (telco, financial, etc.), these web applications are usually free & therefore: </li><ul><li>can sacrifice data integrity / consistency </li><ul><li>No one will sue them for not receiving the most current: </li><ul><li>status of their friends (Facebook/Twitter)
  29. 29. Web search result (Google/Yahoo!)
  30. 30. Item added to cart (Amazon) </li></ul></ul></ul></ul>
  31. 31. The solution <ul><li>These companies had to come up with a new kind of DBMS, capable of handling web scale </li><ul><li>Possibly sacrificing some level of consistency or some other feature </li></ul></ul>
  32. 32. Must we sacrifice something? <ul><li>In 2000, Eric Brewer (co-founder of Inktomi) formulated the CAP theorem, claiming that you can only optimize 2 out of these 3: </li><ul><li>Consistency
  33. 33. Availability
  34. 34. Partition tolerance </li></ul><li>BTW, the theorem was later proved by MIT scientists (Seth Gilbert & Nancy Lynch) in 2002 </li></ul>
  35. 35. Simple example <ul><li>When you have a lot of data which needs to be highly available, you'll usually need to partition it across machines & also replicate it to be more fault-tolerant
  36. 36. This means that when writing a record, all replicas must be updated too
  37. 37. Now you need to choose between: </li><ul><li>Lock all relevant replicas during update => be less available
  38. 38. Don't lock the replicas => be less consistent </li></ul></ul>
  39. 39. The consequence <ul><li>You need to either: </li><ul><li>Drop partition tolerance (CA)
  40. 40. Drop availability (CP)
  41. 41. Drop consistency (AP) </li></ul><li>“Drop” here is usually not meant as binary, but rather tunable </li></ul>
  42. 42. Non-relational databases <ul><li>The solution these companies came up with is a family of databases for handling web scale: </li><ul><li>BigTable (developed at Google)
  43. 43. HBase (open-source BigTable implementation, started at Powerset)
  44. 44. Dynamo (developed at Amazon)
  45. 45. Cassandra (developed at Facebook)
  46. 46. Voldemort (developed at LinkedIn)
  47. 47. & a few more: </li><ul><li>Riak, Redis, CouchDB, MongoDB, Hypertable </li></ul></ul></ul>
  48. 48. Benefits <ul><li>Massively scalable
  49. 49. Extremely fast
  50. 50. Highly available, decentralized & fault tolerant (no single-point-of-failure)
  51. 51. Transparent sharding (consistent hashing)
  52. 52. Elasticity
  53. 53. Parallel processing
  54. 54. Dynamic schema
  55. 55. Automatic conflict resolution </li></ul>
  56. 56. Consistent hashing
  57. 57. Replication
  58. 58. Replication – node joining
  59. 59. Replication – node leaving
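The consistent hashing and replication slides above (including nodes joining & leaving) can be sketched roughly as follows. This is a minimal illustration, not Cassandra's or Dynamo's actual implementation: the `HashRing` class, its method names & the use of MD5 with 64 virtual points per node are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Any stable hash works; MD5 is used here only for its even spread.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # A joining node takes over only the key ranges just before its points.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node):
        # A leaving node hands its ranges to the next nodes on the ring.
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def replicas_for(self, key, n=1):
        """Walk clockwise from the key's position and collect n distinct nodes."""
        if not self._points:
            return []
        start = bisect.bisect(self._points, _hash(key))
        found = []
        for step in range(len(self._points)):
            point = self._points[(start + step) % len(self._points)]
            node = self._owner[point]
            if node not in found:
                found.append(node)
                if len(found) == n:
                    break
        return found
```

Calling `replicas_for("user:1", n=3)` returns the node that owns the key plus the next two distinct nodes on the ring, which is one simple way to place replicas. When a node joins, only the keys that now fall just before its points change owner; everything else stays put.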
  60. 60. Scale-out / elasticity? <ul><li>O(1) Distributed Hashtable
  61. 61. Runs on a large number of cheap commodity machines
  62. 62. Replication
  63. 63. Gossip protocol
  64. 64. Transparently handles adding/removing nodes </li></ul>
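To give a feel for how a gossip protocol spreads cluster membership without a central coordinator, here is a toy simulation. It is only a sketch under simplifying assumptions (synchronous rounds, no failures, one random peer per node per round); the function names and the "heartbeat" dictionaries are hypothetical and not taken from any real database.

```python
import random


def gossip_round(views, rng):
    """Each node exchanges its view with one random peer; both keep the newest entries."""
    names = list(views)
    for name in names:
        peer = rng.choice([n for n in names if n != name])
        known = set(views[name]) | set(views[peer])
        merged = {k: max(views[name].get(k, 0), views[peer].get(k, 0)) for k in known}
        views[name] = dict(merged)
        views[peer] = dict(merged)


def rounds_until_converged(node_names, seed=0, max_rounds=1000):
    """Count gossip rounds until every node knows about every other node."""
    rng = random.Random(seed)
    # Initially every node knows only its own heartbeat counter.
    views = {name: {name: 1} for name in node_names}
    for round_number in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(set(view) == set(node_names) for view in views.values()):
            return round_number
    raise RuntimeError("views did not converge")
```

For example, `rounds_until_converged([f"n{i}" for i in range(8)])` reports how many rounds it took 8 nodes to learn about each other; in this kind of push-pull exchange the number typically grows slowly with cluster size.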
  65. 65. Tunable consistency? <ul><li>Levels of consistency: </li><ul><li>Strict consistency
  66. 66. Read your writes consistency
  67. 67. Session consistency
  68. 68. Monotonic read consistency
  69. 69. Eventual consistency </li></ul><li>Tunable means: how many replicas to lock on write </li><ul><li>N, R, W parameters
  70. 70. Quorum </li></ul></ul>
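The N, R, W parameters are commonly read as: N replicas per record, R replicas consulted on a read, W replicas that must acknowledge a write. When R + W > N, every read set overlaps every write set, so a read sees the latest acknowledged write. A minimal sketch of that rule follows; the helper names and the `(version, value)` representation are assumptions made for illustration.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum intersects every write quorum (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return r + w > n


def read_latest(replicas, indexes):
    """Return the (version, value) pair with the highest version among the replicas read."""
    return max((replicas[i] for i in indexes), key=lambda pair: pair[0])


# N = 3 replicas, each holding a (version, value) pair.
replicas = [(1, "old"), (1, "old"), (1, "old")]

# A write with W = 2 reaches replicas 0 and 1 only.
for i in (0, 1):
    replicas[i] = (2, "new")

# R = 2: any two replicas include at least one that saw the write.
fresh = read_latest(replicas, (1, 2))

# R = 1: reading only replica 2 can return stale data.
stale = read_latest(replicas, (2,))
```

With N = 3, choosing R = W = 2 (a quorum) gives overlapping reads and writes, while R = W = 1 trades that guarantee for lower latency.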
  71. 71. Dealing with inconsistency <ul><li>Read-repair (when encountering inconsistency)
  72. 72. Vector clock conflict resolution </li></ul>
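Vector clocks let a store tell whether one version of a record descends from another or whether the two were written concurrently (a real conflict). A small sketch of the comparison, assuming clocks are plain dictionaries mapping a node name to a counter; the function names are hypothetical.

```python
def increment(clock, node):
    """Return a copy of the clock with the given node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a, b):
    """Classify clock a relative to b: 'equal', 'before', 'after' or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def merge(a, b):
    """Clock for a value that reconciles both versions (element-wise maximum)."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
```

If `compare` returns "before" or "after", the older version can simply be discarded (which is also what read-repair would push to lagging replicas). Only the "concurrent" case needs actual conflict resolution, after which the reconciled value carries the merged clock.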
  73. 73. Dynamic schema <ul><li>Column families (basically a sparse table) </li></ul>
  74. 74. Dynamic schema (contd.) <ul><li>“Supercolumn” is a collection of columns
  75. 75. Record can have several “supercolumns” </li></ul>
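One way to picture column families & supercolumns is as nested maps. The sketch below uses plain Python dictionaries; the family names, row keys and values are made up for illustration and do not reflect any real storage format.

```python
# Column family "Users": row key -> {column name -> value}.
# Rows are sparse: each row stores only the columns it actually has.
users = {
    "alice": {"email": "alice@example.com", "city": "Haifa"},
    "bob": {"email": "bob@example.com", "twitter": "@bob"},
}

# Adding a new column to one row needs no schema change.
users["alice"]["phone"] = "555-0100"

# Supercolumn family "Posts": row key -> {supercolumn -> {column -> value}}.
posts = {
    "alice": {
        "post-1": {"title": "Hello", "body": "First post"},
        "post-2": {"title": "CAP", "body": "Pick two", "tags": "nosql"},
    },
}


def columns_of(family, row_key):
    """Return the sorted column (or supercolumn) names present in one row."""
    return sorted(family.get(row_key, {}))
```

Here `columns_of(users, "bob")` lists only the columns that row actually holds, which is what "sparse table" means in practice.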
  76. 76. Data processing <ul><li>Map/Reduce: an API exposed by non-relational databases to process data </li><ul><li>A functional programming pattern for parallelizing work
  77. 77. Brings the workers to the data – excellent fit for non-relational databases
  78. 78. Minimizes the programming to 2 simple functions (map & reduce)
  79. 79. Example: count appearances of a word in a giant table of large texts </li></ul></ul>
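The word-count example can be sketched in a few lines. This single-process version only illustrates the shape of the two functions; a real Map/Reduce framework would run the map step on the nodes holding the data and shuffle the intermediate pairs between machines. All names below are illustrative.

```python
from collections import defaultdict


def map_words(row_key, text):
    """Map step: emit (word, 1) for every word in one row's text."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def map_reduce(table, mapper, reducer):
    """Run mapper over every row, group emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in table.items():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in groups.items())


table = {
    "doc1": "the quick fox",
    "doc2": "The lazy dog the end",
}
word_counts = map_reduce(table, map_words, reduce_counts)
```

The programmer writes only `map_words` and `reduce_counts`; the grouping in the middle is the part the framework takes care of.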
  80. 80. Map/Reduce (contd.)
  81. 81. Storage
  82. 82. Cost <ul><li>Sacrifices consistency (ACID) under certain circumstances (though this can be dealt with)
  83. 83. Non-standard new API model
  84. 84. Non-standard new Schema model
  85. 85. New knowledge required to tune/optimize
  86. 86. Less mature </li></ul>
  87. 87. API model <ul><li>Usually, similar to Key-Value map: </li><ul><li>Get(key)
  88. 88. Put(key, value)
  89. 89. Delete(key)
  90. 90. Execute(operation, key_list) </li></ul><li>“value” can be </li><ul><li>an opaque serialized object
  91. 91. a record (list of “columns”: <name, value, timestamp>) </li></ul></ul>
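The API model above can be mocked up as an in-memory class to show how small the surface is. This is a hypothetical sketch, not any product's client library; in particular, treating `execute` as "apply a function to the value of each key" is just one possible reading of Execute(operation, key_list).

```python
import time


class KeyValueStore:
    """Toy in-memory store exposing the four operations from the slide."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

    def execute(self, operation, key_list):
        """Apply operation to the value of each key and return the results in order."""
        return [operation(self._data.get(key)) for key in key_list]


def column(name, value, timestamp=None):
    """Build a <name, value, timestamp> column; defaults to the current time."""
    return (name, value, time.time() if timestamp is None else timestamp)


store = KeyValueStore()
# A value can be an opaque serialized object...
store.put("session:42", b"\x00\x01opaque-bytes")
# ...or a record made of columns.
store.put("user:alice", [column("email", "alice@example.com", 1), column("city", "Haifa", 2)])
```

For instance, `store.execute(len, ["session:42", "user:alice"])` runs `len` over both stored values in one call.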
  92. 92. Schema model <ul><li>Kind of sparse table
  93. 93. No schema </li></ul>
  94. 94. Example: Cassandra <ul><li>Features: </li><ul><li>O(1) DHT
  95. 95. Eventual consistency </li><ul><li>tunable: consistency vs. latency </li></ul><li>Values are structured, indexed
  96. 96. Columns / column families
  97. 97. Slicing with predicates (queries)
  98. 98. PartitionOrderer </li></ul></ul>
  99. 99. Cassandra performance <ul><li>Benchmark against MySQL (50GB) </li><ul><li>MySQL: </li><ul><li>300ms write
  100. 100. 350ms read </li></ul><li>Cassandra: </li><ul><li>0.12ms write
  101. 101. 15ms read </li></ul></ul><li>How come writes are so fast? </li><ul><li>Writes involve no reads/seeks
  102. 102. Use any node (closest to you) </li></ul></ul>
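"Writes involve no reads/seeks" can be illustrated with a log-structured write path: a write is a sequential append to a log plus an update of an in-memory table, and sorted immutable files are produced later in bulk. The class below is a heavily simplified sketch of that general idea; the names (`commit_log`, `memtable`, `sstables`) follow common usage, but the code itself is an assumption for illustration, not Cassandra's implementation.

```python
class LogStructuredStore:
    """Toy write path: append to a log, update memory, flush sorted batches."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stands in for a sequential, append-only file
        self.memtable = {}     # recent writes, held in memory
        self.sstables = []     # flushed, immutable, sorted tables (oldest first)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # No lookup of the existing value: just append and overwrite in memory.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the memtable as one sorted batch, then start a fresh one.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Reads are the expensive side: check memory, then newest table first.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None
```

The asymmetry in the benchmark follows from this design: a write touches only the log and memory, while a read may have to consult several flushed tables.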
  103. 103. Cassandra API
  104. 104. Cassandra API (contd.)
  105. 105. Example: Cassandra (contd.) <ul><li>Java API </li><ul><li>Simple DAO
  106. 106. Simple client </li></ul></ul>
  107. 107. Cassandra usage <ul><li>Very high-traffic sites: </li><ul><li>Facebook
  108. 108. Digg
  109. 109. Twitter </li></ul></ul>
  110. 110. Further information <ul><li>The Dynamo paper: </li><ul><li>http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html </li></ul><li>NoSQL patterns: </li><ul><li>http://horicky.blogspot.com/2009/11/nosql-patterns.html </li></ul><li>NoSQL conference videos: </li><ul><li>https://nosqleast.com/2009/ </li></ul><li>Hebrew podcast covering NoSQL & Cassandra (episodes 56, 57 & more): </li><ul><li>http://www.reversim.com/ </li></ul></ul>
  111. 111. Further information (contd.) <ul><li>Ran Tavory's lecture (video + slides): </li><ul><li>http://prettyprint.me/2010/01/09/introduction-to-nosql-and-cassandra-part-1/
  112. 112. http://prettyprint.me/2010/01/20/introduction-to-nosql-and-cassandra-part-2/ </li></ul></ul>