Front Range PHP NoSQL Databases

The presentation I did for the FrontRange PHP User Group on 3/10/2010.

  • Introduction: disclose that I work for Basho and have been working on a Dynamo clone for the last couple of years.

Front Range PHP NoSQL Databases Presentation Transcript

  • NoSQL Databases Jon Meredith [email_address]
  • What isn't NoSQL?
    • NOT a standard.
    • NOT a product.
    • NOT a single technology.
  • Well, what is it?
      It's a buzzword.
    • A banner for non-relational databases to organize under.
    • Mostly created in response to scaling and reliability problems.
    • Huge differences between 'NoSQL' systems – but have elements in common.
  • Where did it come from?
    • They've been around for a while
      • Local key/value stores
      • Object databases
      • Graph databases
      • XML databases
    • New problems are emerging
      • Internet search
      • e-commerce
      • Social networking
  • Where did it come from?
    • Some efforts came from scaling the web...
    • Several papers published
      • 2006 – Google BigTable
      • 2007 – Dynamo Paper
    • In 2008 - explosion of data storage projects
    • All shambling under the NoSQL banner.
  • Really, why not use an RDBMS?
    • I need to perform arbitrary queries
    • My application needs transactions
    • Data needs to be nicely normalized
    • I have replication for scalability/reliability
  • Data Mapping Woes
    • Relational databases divide data into tables made up of columns.
    • Programmers use complex nested data structures
      • Hashes
      • Sets
      • Arrays
      • Things of things
    • Have to map between the two
  • Data Mapping Woes (2)
    • Data in systems evolves over time … which means changes to the schema.
    • Upgrade/rollback scripts have to operate on the whole database – could be millions of rows.
    • Doing phased rollouts is hard … the application needs to do work
  • Alternative!
    • Let the application do it
    • Use convenient language features
      • PHP serialize/unserialize
    • … or use standards for mixed platforms
      • JSON very popular and well supported
      • Google's protocol buffers
      • … even XML
    • Design for forward compatibility
      • Preserve unknown fields
      • Version objects
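A minimal sketch of "preserve unknown fields" and "version objects" using JSON. The record layout, the `version` field name and the v1 to v2 change (splitting `name` into `first`/`last`) are all hypothetical, chosen only to illustrate the idea of upgrading records lazily in the application instead of with a whole-database script.

```python
import json

CURRENT_VERSION = 2  # hypothetical schema version for this example


def load_profile(raw: str) -> dict:
    """Decode a stored record, upgrading old versions on read."""
    record = json.loads(raw)
    if record.get("version", 1) < CURRENT_VERSION:
        # Hypothetical v1 -> v2 change: a single "name" became "first"/"last".
        first, _, last = record.pop("name", "").partition(" ")
        record["first"], record["last"] = first, last
        record["version"] = CURRENT_VERSION
    return record


def save_profile(record: dict) -> str:
    """Encode the whole dict, including fields this code does not know about."""
    return json.dumps(record, sort_keys=True)


# A v1 record carrying a field ("theme") that this code never looks at.
old = '{"name": "Ada Lovelace", "theme": "dark"}'
upgraded = load_profile(old)
stored = save_profile(upgraded)
```

Because the code passes the decoded dict through untouched, the unknown `theme` field survives a read/modify/write cycle, which is what lets old and new application versions share the same data during a phased rollout.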
  • Scalability and Availability
    • Scalability
      • How many requests you can process
    • Availability
      • How your service degrades as things break.
    • RDBMS solutions - replication and sharding
  • Scaling RDBMS - Replication
    • Master-Slave replication is easiest
    • Every change on the master happens on the slave.
    • Slaves are read-only. Does not scale INSERT, UPDATE, DELETE queries.
    • Application responsible for distributing queries to correct server.
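As a rough illustration of the application-side routing this implies, the sketch below sends writes to the master and spreads reads across the slaves round-robin. The `ReplicatedPool` class and the server names are hypothetical, and a real application would hand back database connections instead of strings.

```python
import itertools

WRITE_VERBS = {"INSERT", "UPDATE", "DELETE"}


class ReplicatedPool:
    """Hypothetical helper that picks a server for each SQL statement."""

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)  # round-robin over read replicas

    def server_for(self, sql: str) -> str:
        # The first word of the statement decides where it goes.
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in WRITE_VERBS:
            return self.master
        return next(self._slaves)


pool = ReplicatedPool("master", ["slave1", "slave2"])
write_target = pool.server_for("UPDATE users SET name = 'x' WHERE id = 1")
read_targets = [pool.server_for("SELECT * FROM users") for _ in range(3)]
```

Adding slaves increases read capacity only; every write still lands on the single master.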
  • Scaling RDBMS - Replication
    • Multi-master ring replication
      • Can update any master
      • Updates travel around the ring
      • What happens when it fails?
        • Reconfigure the ring
      • What happens when it returns?
        • Synchronize the master
        • Add back in to the ring
  • Replication
    • Replication is usually asynchronous for performance – you don't want to wait for the slowest slave on each update.
    • Replication takes time – there is time lag between the first and last server to see an update.
    • You may not read your own writes – you no longer get the C (consistency) in ACID.
  • Scaling RDBMS – Sharding
    • Do application level splitting of data
      • Split large table into N smaller tables
      • Use Id modulo N to find the right table
    • Tables could be spread across multiple database servers
      • But the application needs to know where to query
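The "Id modulo N" rule can be sketched in a few lines. The host names and the `users_<index>` table naming are assumptions made up for this example; the only part taken from the slide is the modulo lookup itself.

```python
from typing import Tuple

# Hypothetical database hosts, one per shard.
SHARDS = ["db0.example.internal", "db1.example.internal", "db2.example.internal"]


def shard_for(record_id: int, table: str = "users") -> Tuple[str, str]:
    """Return (host, table name) for a record using id modulo N."""
    index = record_id % len(SHARDS)
    return SHARDS[index], f"{table}_{index}"


host, table = shard_for(7)
```

Note that changing the number of shards changes the result for most ids, so resharding means moving data; that is part of why the application "needs to know where to query".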
  • Availability
    • If you want availability you need multiple servers – maybe even multiple sites.
    • In the real world you get network partitions
      • Just because you can't see your other data center doesn't mean users can't.
    • What should you do if you can't see the other data center?
  • Availability
    • Degrade one site to read-only
      • Defeats availability
    • If you allow both sites to operate
      • There's a chance two users could modify the same data.
      • The application needs to know how to resolve it
  • The bottom line...
    • Building systems that are
      • ...Scalable...
      • ...Available...
      • ...Maintainable...
    • ...with an RDBMS requires a large effort from application developers and operational staff
  • It's hard because...
    • Significant work for developers.
      • App needs to convert data to table/columns
      • App needs to know data location
      • App needs to handle failover
      • App needs to handle inconsistency
    • Work for operational staff
      • Fixing replication topologies and synchronizing servers is fiddly work.
  • Last decade's bleeding edge is here
    • Organizations with big problems started experimenting with alternatives
    • Developed internal systems during the mid 2000s
      • Distributed by design
      • Different data models
    • Published details in 2006/2007
  • Amazon
    • Huge e-commerce vendor.
    • Amazon cares about customer experience
      • Availability
      • Latency at the 99th percentile
    • Built as an SOA – pages built from hundreds of services.
    • Amazon runs multiple data centers.
      • Hardware failure is their normal state
      • Network partitions common
  • Amazon Requirements
    • Shopping cart service must always be available
    • Customers should be able to view and add to their carts (in their words)
      • If disks are failing
      • Network routes are flapping
      • Data centers are being destroyed by tornadoes
  • Amazon Observations
    • Many services just stored state.
      • Access by primary key
      • No queries
    • Examples
      • Shopping carts
      • Best seller lists
      • Customer profiles
    • Hard to scale out relational databases
  • Amazon Solution: Dynamo
    • Primary key access only
    • Fault tolerant: Keeps N copies of the data
    • Designed for inconsistency
    • Totally decentralized – nodes 'gossip' state
    • Self-healing
  • Eventual Consistency 1
    • Brewer's CAP Theorem
      • Consistency
      • Availability
      • Partition tolerance
    • Pick two out of three!
    • Amazon chose A-P over C
  • Eventual Consistency 2
    • N copies of each value
    • Read operations (get) require 'R' nodes to respond
    • Write operations (put) require 'W' nodes to respond
    • If R+W > N, you will read your writes (if no failure)
    • N, R and W tune the cluster – typically (N,R,W) = (3,2,2)
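Why R+W > N matters: any set of R replicas that answers a read must then share at least one replica with any set of W replicas that acknowledged the write. The sketch below checks that claim by brute force over every possible pair of replica sets; the function names are illustrative and not taken from Dynamo or any library.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """The rule from the slide: read and write quorums overlap when R + W > N."""
    return r + w > n


def every_read_sees_write(n: int, r: int, w: int) -> bool:
    """Brute force: does every R-node read set intersect every W-node write set?"""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


typical = every_read_sees_write(3, 2, 2)   # the (3,2,2) setting
fast_but_loose = every_read_sees_write(3, 1, 1)
```

With (3,2,2) every read set touches a replica that took the write; with (3,1,1) a read can land entirely on replicas that have not seen it yet.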
  • Eventual Consistency 3
    • Consequence of availability: Conflicts
    • Conflicts can come from
      • Network partitions
      • Applications themselves – no transactions or locking
    • Applications must handle conflicts
    • Dynamo minimizes conflicts with vector clocks
  • Vector Clocks
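A minimal vector clock sketch, assuming a clock is simply a dict mapping a node name to an update counter. The helper names (`increment`, `descends`, `compare`, `merge`) and node names are chosen for this example and do not come from any particular implementation.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter bumped by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every update that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    """Classify two versions as equal, ordered, or concurrent (a conflict)."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "after"
    if descends(b, a):
        return "before"
    return "concurrent"


def merge(a: dict, b: dict) -> dict:
    """Clock for a value that reconciles both versions: per-node maximum."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in set(a) | set(b)}


base = increment({}, "node_a")       # first write, handled by node_a
left = increment(base, "node_b")     # update made on one side of a partition
right = increment(base, "node_c")    # update made on the other side
relation = compare(left, right)      # neither descends from the other
resolved = merge(left, right)
```

When one clock descends from the other, the store can drop the older version on its own; only the "concurrent" case has to be handed back to the application.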
  • Partitioning
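The slide text for this topic did not survive in the transcript. Dynamo-style stores partition keys with consistent hashing, so the following is a small sketch of that technique under stated assumptions: SHA-256 as the hash, 8 virtual points per node, and a hypothetical `Ring` class. It is not a description of how any specific product lays out its ring.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the hash ring."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 8):
        # Each physical node claims several points on the ring.
        self._points = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._positions = [position for position, _ in self._points]

    def preference_list(self, key: str, n: int = 3):
        """Walk clockwise from the key's position, collecting n distinct nodes."""
        start = bisect.bisect_right(self._positions, ring_position(key))
        owners = []
        for offset in range(len(self._points)):
            node = self._points[(start + offset) % len(self._points)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == n:
                break
        return owners


ring = Ring(["node_a", "node_b", "node_c", "node_d"])
owners = ring.preference_list("cart:42")
```

The appeal over "Id modulo N" is that adding or removing a node only moves the keys next to that node's points on the ring.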
  • Example: Shopping Cart
    • User browses site – adds 3 widgets
  • Shopping Cart - Conflict Network Failure
  • Shopping Cart - Merge
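The diagrams for these slides are not in the transcript, so here is one possible merge, assuming each conflicting version of the cart is a dict of item to quantity and that the application resolves conflicts by keeping every item at the largest quantity seen. The item names other than the widgets are invented for the example.

```python
def merge_carts(siblings):
    """Combine conflicting cart versions: keep every item, largest quantity wins."""
    merged = {}
    for cart in siblings:
        for item, quantity in cart.items():
            merged[item] = max(merged.get(item, 0), quantity)
    return merged


# Two versions written on either side of a network failure.
site_a = {"widget": 3, "gadget": 1}
site_b = {"widget": 3, "gizmo": 2}
cart = merge_carts([site_a, site_b])
```

This policy never loses an "add to cart", but an item removed on one side can reappear after the merge. Whether that trade-off is acceptable is an application decision.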
  • Open Source Dynamo
    • Dynamo is internal to Amazon
    • Open source options
      • Riak from Basho
      • Project Voldemort
  • Google BigTable
    • Used internally at Google
      • Indexing the web
      • Google Earth
      • Finance
    • Distributed storage system for structured data
  • Data representation
    • Data stored in tables.
    • Table indexed by {key,timestamp} and a variable number of sparse columns
    • Columns are grouped into column families. Columns in a family are stored together.
    • Each table is broken into tablets.
    • Tablets are stored on a cluster file system (GFS).
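To make the data model concrete, the sketch below models a table as nested maps: row key, then column family, then column, then timestamp. The `SparseTable` class and the example row are hypothetical and in-memory only; tablets and the cluster file system are ignored.

```python
from collections import defaultdict


class SparseTable:
    """Toy model: row key -> column family -> column -> {timestamp: value}."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

    def put(self, row, family, column, timestamp, value):
        # Only cells that are written take up space, so rows can be sparse.
        self._rows[row][family][column][timestamp] = value

    def get(self, row, family, column, timestamp=None):
        """Return the newest value at or before timestamp (newest overall if None)."""
        versions = self._rows.get(row, {}).get(family, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]


table = SparseTable()
table.put("com.example.www", "contents", "html", 1, "<html>v1</html>")
table.put("com.example.www", "contents", "html", 2, "<html>v2</html>")
latest = table.get("com.example.www", "contents", "html")
as_of_1 = table.get("com.example.www", "contents", "html", timestamp=1)
```

Different rows can carry entirely different columns within a family, which is what "a variable number of sparse columns" means in practice.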
  • BigTable – Column Families (figure, copyright Google)
  • Map/Reduce
    • Processing framework that sits on top of BigTable.
    • Programmers write two functions map() and reduce().
    • Table is mapped, then reduced.
    • Job control system monitors and resubmits.
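A single-process sketch of the map-then-reduce flow, using word counting as the job. The `run_job` driver is a stand-in invented for this example; a real framework distributes the map and reduce calls across machines and resubmits failed tasks.

```python
from collections import defaultdict


def map_fn(key, value):
    """Emit (word, 1) for every word in one row's text."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Combine all the counts emitted for one word."""
    return key, sum(values)


def run_job(rows):
    """Toy driver: map every row, group by emitted key, then reduce each group."""
    grouped = defaultdict(list)
    for key, value in rows.items():
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


rows = {"doc1": "NoSQL is a buzzword", "doc2": "a banner for NoSQL"}
counts = run_job(rows)
```

The programmer supplies only `map_fn` and `reduce_fn`; the grouping step in between is the framework's job.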
  • Map/Reduce (figure, source: institutes.lanl.gov)
  • BigTable has inspired...
    • Hadoop/HBase
    • Cassandra
    • Riak
    • CouchDB
    Map/Reduce
  • Explosion of NoSQL DBs
    • Too many projects
    • Two good resources
      • http://nosql.mypopescu.com/
      • http://www.vineetgupta.com/2010/01/nosql-databases-part-1-landscape.html
  • So many projects! Dynamo, BigTable, Redis, Riak, Voldemort, CouchDB, PNUTS, Hadoop/HBase, Cassandra, Hypertable, MongoDB, Terrastore, Scalaris, BerkeleyDB, MemcacheDB, Dynomite, Neo4j, Tokyo Cabinet … and more
  • NoSQL Characteristics
    • Broad types
      • Key/Value
      • Sparse Column Family
      • Document oriented
    • Persistence
      • In memory
      • On disk
    • Distribution
      • Replicated
      • Decentralized
  • Riak from Basho http://riak.basho.com
    • Dynamo clone written in Erlang
    • RESTful HTTP interface
    • Fully distributed
    • Clients for multiple languages
    • Multiple storage backends
      • In-memory
      • Filesystem
      • Embedded InnoDB
    • I work there now!
  • Redis 1.2
    • http://code.google.com/p/redis/
    • Key/Value Store with structured values
    • Written in C
    • Memcache-like protocol
    • In use at
      • Github
      • Engine Yard
      • VideoWiki
  • Redis 1.2 (cont)
    • Values can be strings, sets, ordered sets, lists
    • Operations like increment, decrement, intersection, push, pop
    • In-memory (can be backed by disk)
    • Auto-sharding in client libraries
    • No fault tolerance (coming after 2.0)
    • Example: retwis – Twitter clone in PHP
  • Cassandra
    • http://incubator.apache.org/cassandra/
    • BigTable ColumnFamily data model
    • Dynamo data distribution
    • Written in Java
    • Thrift based interface
    • In use at
      • Facebook
      • Twitter
  • CouchDB
    • Document oriented database
      • All JSON documents
    • Written in Erlang
    • Used by Ubuntu One
    • HTTP interface
    • Uses JavaScript for indexing/mapreduce
    • Incremental replication
  • BerkeleyDB
    • Sleepycat now owned by Oracle
    • Key/Value Store
      • Multi-threaded
      • Multi-process
      • Replicated
      • Transactional
    • Alternative: Tokyo Cabinet
  • I'm out of time
    • MongoDB
    • Neo4j – Graph Database
    • PNUTS – Yahoo
  • This is all great but...
    • Relational databases provide a lot of functionality.
      • Giving up queries
      • Even range queries are hard for distributed hash systems.
      • No transactions – rules out some classes of applications.
      • Space is still evolving
  • Conclusion
    • NoSQL systems give applications the tools they need for scalability/availability
    • They force you to think about distributed design issues like consistency.
    • Play with them!