Front Range PHP NoSQL Databases
Published on

The presentation I did for the FrontRange PHP User Group on 3/10/2010.


  • Speaker notes: introduce myself, disclose that I work for Basho – I've been working on a Dynamo clone for the last couple of years.
  • Transcript

    • 1. NoSQL Databases – Jon Meredith [email_address]
    • 2. What isn't NoSQL?
      • NOT a standard.
      • NOT a product.
      • NOT a single technology.
    • 5. Well, what is it?
      • It's a buzzword.
      • A banner for non-relational databases to organize under.
      • Mostly created in response to scaling and reliability problems.
      • Huge differences between 'NoSQL' systems – but they have elements in common.
    • 8. Where did it come from?
      • They've been around for a while:
        • Local key/value stores
        • Object databases
        • Graph databases
        • XML databases
      • New problems are emerging:
        • Internet search
        • e-commerce
        • Social networking
    • 14. Where did it come from?
      • Some efforts came from scaling the web...
      • Several papers were published:
        • 2006 – Google BigTable
        • 2007 – Amazon Dynamo
      • In 2008 came an explosion of data storage projects, all shambling under the NoSQL banner.
    • 18. Really, why not use an RDBMS?
      • I need to perform arbitrary queries.
      • My application needs transactions.
      • Data needs to be nicely normalized.
      • I have replication for scalability/reliability.
    • 22. Data Mapping Woes
      • Relational databases divide data into tables made up of columns.
      • Programmers use complex nested data structures.
      • The application has to map between the two.
    • 27. Data Mapping Woes (2)
      • Data in systems evolves over time... which means changes to the schema.
      • Upgrade/rollback scripts have to operate on the whole database – that could be millions of rows.
      • Phased rollouts are hard... the application needs to do extra work.
    • 30. Alternative!
      • Let the application do it.
      • Use convenient language features:
        • PHP serialize/unserialize
      • ...or use standards for mixed platforms:
        • JSON – very popular and well supported
        • Google's Protocol Buffers
        • ...even XML
      • Design for forward compatibility:
        • Preserve unknown fields
        • Version objects
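The forward-compatibility advice above (preserve unknown fields, version objects) can be sketched roughly like this; the `_version` field, the upgrade rule, and the record shape are illustrative assumptions, not from the deck (the talk's own example was PHP serialize/unserialize):

```python
import json

CURRENT_VERSION = 2

def upgrade(record):
    """Migrate old records lazily in the application, instead of running
    a whole-database schema-change script."""
    if record.get("_version", 1) < 2:
        # hypothetical v2 change: add a 'nickname' field derived from 'name'
        record["nickname"] = record.get("name", "")
        record["_version"] = 2
    return record

def load(blob):
    # json.loads keeps every field, including ones this version of the
    # application has never heard of - they ride along untouched.
    return upgrade(json.loads(blob))

def save(record):
    record["_version"] = CURRENT_VERSION
    return json.dumps(record)

# A version-1 record that also carries a field from some future version:
old = '{"_version": 1, "name": "jon", "future_field": 42}'
rec = load(old)   # upgraded to version 2, future_field preserved
```

Because load() keeps fields it does not recognize, a record written by a newer version of the application survives a round-trip through an older reader.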
    • 35. Scalability and Availability
      • Scalability
        • How many requests you can process.
      • Availability
        • How your service degrades as things break.
      • RDBMS solutions: replication and sharding.
    • 36. Scaling RDBMSs – Replication
      • Master-slave replication is the easiest.
      • Every change on the master happens on the slave.
      • Slaves are read-only. This does not scale INSERT, UPDATE, or DELETE queries.
      • The application is responsible for distributing queries to the correct server.
    • 40. Scaling RDBMSs – Replication
      • Multi-master ring replication:
        • Can update any master.
        • Updates travel around the ring.
        • What happens when a master fails?
          • Reconfigure the ring.
        • What happens when it returns?
          • Synchronize the master.
          • Add it back into the ring.
    • 44. Replication
      • Replication is usually asynchronous for performance – you don't want to wait for the slowest slave on each update.
      • Replication takes time – there is a lag between the first and last server to see an update.
      • You may not read your own writes – you're not getting ACID properties any more.
    • 47. Scaling RDBMSs – Sharding
      • Do application-level splitting of data:
        • Split a large table into N smaller tables.
        • Use id modulo N to find the right table.
      • Tables can be spread across multiple database servers.
        • But the application needs to know where to query.
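The sharding scheme above fits in a few lines; the shard count, server names, and table-name pattern here are made up for illustration:

```python
# Hypothetical layout: 4 shards spread across 2 database servers.
SHARD_COUNT = 4
SHARD_TO_SERVER = {0: "db1", 1: "db1", 2: "db2", 3: "db2"}

def shard_for(user_id):
    """Id modulo N picks the table; the application then has to know
    which server holds that table - the database won't route for you."""
    shard = user_id % SHARD_COUNT
    return SHARD_TO_SERVER[shard], f"users_{shard}"

server, table = shard_for(1234)   # 1234 % 4 == 2, so ("db2", "users_2")
```

Rebalancing is the catch: changing SHARD_COUNT moves almost every row to a different shard, which is one reason the Dynamo-style systems later in the talk use consistent hashing instead.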
    • 49. Availability
      • If you want availability you need multiple servers – maybe even multiple sites.
      • In the real world you get network partitions.
        • Just because you can't see your other data center doesn't mean users can't.
      • What should you do if you can't see the other data center?
    • 51. Availability
      • Degrade one site to read-only?
        • That defeats availability.
      • If you allow both sites to operate:
        • There's a chance two users could modify the same data.
        • The application needs to know how to resolve the conflict.
    • 53. The bottom line...
      • Building systems that are
        • ...scalable...
        • ...available...
        • ...maintainable...
      • with an RDBMS requires large efforts by application developers and operational staff.
    • 57. It's hard because...
      • There's significant work for developers:
        • The app needs to convert data to tables/columns.
        • The app needs to know data locations.
        • The app needs to handle failover.
        • The app needs to handle inconsistency.
      • And work for operational staff:
        • Fixing replication topologies and synchronizing servers is fiddly work.
    • 61. Last decade's bleeding edge is here
      • Organizations with big problems started experimenting with alternatives.
      • They developed internal systems during the mid-2000s:
        • Distributed by design
        • Different data models
      • Details were published in 2006/2007.
    • 64. Amazon
      • Huge e-commerce vendor.
      • Amazon cares about customer experience:
        • Availability
        • Latency at the 99th percentile
      • Built as an SOA – pages are built from hundreds of services.
      • Amazon runs multiple data centers.
        • Hardware failure is their normal state.
        • Network partitions are common.
    • 69. Amazon Requirements
      • The shopping cart service must always be available.
      • Customers should be able to view and add to their carts (in their words) even if:
        • Disks are failing
        • Network routes are flapping
        • Data centers are being destroyed by tornadoes
    • 73. Amazon Observations
      • Many services just stored state:
        • Access by primary key
        • No queries
      • Examples:
        • Shopping carts
        • Best-seller lists
        • Customer profiles
      • Relational databases are hard to scale out.
    • 77. Amazon's Solution: Dynamo
      • Primary-key access only.
      • Fault tolerant: keeps N copies of the data.
      • Designed for inconsistency.
      • Totally decentralized – nodes 'gossip' state.
      • Self-healing.
    • 82. Eventual Consistency 1
      • Brewer's CAP theorem:
        • Consistency
        • Availability
        • Partition tolerance
      • Pick two out of three!
      • Amazon chose A-P over C.
    • 86. Eventual Consistency 2
      • N copies of each value.
      • Read operations (get) require 'R' nodes to respond.
      • Write operations (put) require 'W' nodes to respond.
      • If R+W > N, reads will see writes (if there are no failures).
      • (N,R,W) tunes the cluster – typically (3,2,2).
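The R+W > N claim can be brute-forced for small clusters; this check is a sketch of the reasoning, not code from the talk:

```python
from itertools import combinations

def always_overlap(n, r, w):
    """True when every read quorum of size r shares at least one node with
    every write quorum of size w - i.e. any read must touch some node
    that saw the latest write."""
    nodes = range(n)
    return all(set(rq) & set(wq)
               for rq in combinations(nodes, r)
               for wq in combinations(nodes, w))

print(always_overlap(3, 2, 2))  # True: 2 + 2 > 3, reads see writes
print(always_overlap(3, 1, 1))  # False: 1 + 1 <= 3, a read can miss a write
```

Tuning the trade-off works the same way: lowering R speeds up reads but makes stale reads possible, and raising W makes writes slower but more durable.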
    • 91. Eventual Consistency 3
      • A consequence of availability: conflicts.
      • Conflicts can come from:
        • Network partitions
        • The applications themselves – no transactions or locking
      • Applications must handle conflicts.
      • Dynamo minimizes them with vector clocks.
    • 95. Vector Clocks
    • 96. Partitioning
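Slide 96 is a diagram; Dynamo-style partitioning places both nodes and keys on a consistent-hash ring, where a key belongs to the first node clockwise from its hash and the following nodes hold the replicas. A minimal sketch (the node names and the choice of MD5 are illustrative, not from the deck):

```python
import bisect
import hashlib

class Ring:
    """Toy consistent-hash ring: a key is owned by the first node clockwise
    from its hash; the next nodes after that hold the replicas."""
    def __init__(self, nodes):
        self.points = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def nodes_for(self, key, n_replicas=1):
        i = bisect.bisect(self.points, (self._hash(key),))
        # Walk clockwise (wrapping around the ring) to collect the owners.
        return [self.points[(i + j) % len(self.points)][1]
                for j in range(n_replicas)]

ring = Ring(["node-a", "node-b", "node-c"])
owners = ring.nodes_for("cart:1234", n_replicas=2)  # two distinct nodes
```

Unlike id-modulo-N sharding, adding or removing one node only moves the keys on the arc next to it, so the cluster can grow without reshuffling everything.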
    • 97. Example: Shopping Cart
      • User browses the site – adds 3 widgets.
    • 98. Shopping Cart – Conflict: Network Failure
    • 99. Shopping Cart – Merge
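Slides 95 and 97–99 are diagrams, but the situation they depict can be sketched: each replica carries a vector clock, and when neither clock descends from the other the writes were concurrent, so the application must merge. For a shopping cart, Dynamo merges by set union; the cart contents and node names below are invented for illustration:

```python
def descends(a, b):
    """True if vector clock a has seen every update that clock b has."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def resolve(cart1, clock1, cart2, clock2):
    """Pick a winner when one clock dominates; otherwise merge."""
    if descends(clock1, clock2):
        return cart1
    if descends(clock2, clock1):
        return cart2
    # Concurrent siblings - e.g. both sides of a network partition took
    # writes. Dynamo's shopping-cart answer is the union of the items
    # (a deleted item can resurface; Amazon accepted that trade-off).
    return cart1 | cart2

cart_a = {"widget", "gadget"}     # updated via node A during the partition
cart_b = {"widget", "sprocket"}   # updated via node B during the partition
merged = resolve(cart_a, {"A": 2, "B": 1}, cart_b, {"A": 1, "B": 2})
# merged == {"widget", "gadget", "sprocket"}
```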
    • 100. Open Source Dynamo
      • Dynamo is internal to Amazon.
      • Open-source options:
        • Riak from Basho
        • Project Voldemort
    • 103. Google BigTable
      • Used internally at Google.
      • A distributed storage system for structured data.
    • 106. Data representation
      • Data is stored in tables.
      • A table is indexed by {key, timestamp} and a variable number of sparse columns.
      • Columns are grouped into column families. Columns in a family are stored together.
      • Each table is broken into tablets.
      • Tablets are stored on a cluster file system (GFS).
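The model above maps naturally onto nested maps: row key → column ("family:qualifier") → timestamped versions. A toy sketch (the webtable-style row and column names follow the BigTable paper's running example; the class itself is invented):

```python
from collections import defaultdict

class Table:
    """Toy BigTable row model: row key -> 'family:qualifier' -> {timestamp:
    value}. Columns are sparse - a cell exists only in rows that wrote it."""
    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self.rows[row][column][timestamp] = value

    def get(self, row, column):
        """Return the latest version of a cell, or None if never written."""
        versions = self.rows[row][column]
        return versions[max(versions)] if versions else None

t = Table()
t.put("com.cnn.www", "contents:", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents:", 2, "<html>v2</html>")
t.put("com.cnn.www", "anchor:cnnsi.com", 1, "CNN")
```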
    • 111. BigTable – Column Families (diagram © Google)
    • 112. Map/Reduce
      • A processing framework that sits on top of BigTable.
      • Programmers write two functions: map() and reduce().
      • The table is mapped, then reduced.
      • A job-control system monitors and resubmits failed tasks.
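The two-function shape of the framework fits in a dozen lines of single-process code; this toy version only mimics the programming model, not the distribution or the job-control system the slide mentions:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Map each record to (key, value) pairs, group by key, then reduce
    each group. The real framework runs these phases across many machines
    and resubmits tasks that fail."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# The canonical example: word count.
docs = ["nosql is not a standard", "nosql is a buzzword"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(word, 1) for word in doc.split()],
    reducer=lambda word, ones: sum(ones),
)
# counts["nosql"] == 2, counts["standard"] == 1
```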
    • 116. Map/Reduce (diagram source: institutes.lanl.gov)
    • 117. BigTable has inspired... Map/Reduce
    • 120. Explosion of NoSQL DBs
      • Too many projects to cover.
      • Two good resources:
        • http://nosql.mypopescu.com/
        • http://www.vineetgupta.com/2010/01/nosql-databases-part-1-landscape.html
    • 123. So many projects! Dynamo, BigTable, Redis, Riak, Voldemort, CouchDB, PNUTS, Hadoop/HBase, Cassandra, Hypertable, MongoDB, Terrastore, Scalaris, BerkeleyDB, MemcacheDB, Dynomite, Neo4j, Tokyo Cabinet... and more
    • 124. NoSQL Characteristics
      • Broad types
      • Persistence
      • Distribution:
        • Replicated
        • Decentralized
    • 129. Riak from Basho – http://riak.basho.com
      • A Dynamo clone written in Erlang.
      • RESTful HTTP interface.
      • Fully distributed.
      • Clients for multiple languages.
      • Multiple storage backends.
      • I work there now!
    • 136. Redis 1.2
    • 143. Redis 1.2 (cont.)
      • Values can be strings, sets, ordered sets, or lists.
      • Operations like increment, decrement, intersection, push, pop.
      • In-memory (can be backed by disk).
      • Auto-sharding in client libraries.
      • No fault tolerance (coming after 2.0).
      • Example: retwis – a Twitter clone in PHP.
    • 149. Cassandra
      • http://incubator.apache.org/cassandra/
      • BigTable ColumnFamily data model.
      • Dynamo data distribution.
      • Written in Java.
      • Thrift-based interface.
      • In use at...
    • 156. CouchDB
      • A document-oriented database:
        • All documents are JSON.
      • Written in Erlang.
      • Used by Ubuntu One.
      • HTTP interface.
      • Uses JavaScript for indexing/map-reduce.
      • Incremental replication.
    • 161. BerkeleyDB
      • From Sleepycat, now owned by Oracle.
      • Key/value store.
      • Alternative: Tokyo Cabinet.
    • 166. I'm out of time
      • MongoDB
      • Neo4j – graph database
      • PNUTS – Yahoo
    • 169. This is all great but...
      • Relational databases provide a lot of functionality.
        • You give up queries – even range queries are hard for distributed hash systems.
        • No transactions – that rules out some classes of applications.
        • The space is still evolving.
    • 173. Conclusion
      • NoSQL systems give applications the tools they need for scalability/availability.
      • They force you to think about distributed design issues like consistency.
      • Play with them!
