SDEC2011 NoSQL Data modelling
Presentation Transcript

  • NoSQL Data Modeling: Concepts and Cases • Shashank Tiwari • blog: | twitter:
  • NoSQL?
  • NoSQL: Various Shapes and Sizes • Document Databases • Column-family Oriented Stores • Key/value Data stores • XML Databases • Object Databases • Graph Databases
  • Key Questions • How do I model data for my application? • How do I determine which one is right for me? • Can I easily shift from one database to the other? • Is there a standard way of storing, accessing, and querying data?
  • Agenda for this session • Explore some of the main NoSQL products • Understand how they are similar and different • Learn how best to use these products in the stack
  • Document Databases• also GenieDB, SimpleDB
  • What is a document db? • One that stores documents • Popular options: MongoDB (C++), CouchDB (Erlang), also Amazon’s SimpleDB • ...what exactly is a document?
  • In the real world• (Source:
  • In terms of JSON • {name: "John Doe", zip: 10001}
  • What about db schema? • Schema-less • Different documents could be stored in a single collection
  • Data types: MongoDB • Essential JSON types: string, integer, boolean, double
  • Data types: MongoDB (...cont) • Additional JSON types: null, array and object • BSON types (binary-encoded serialization of JSON-like documents): date, binary data, object id, regular expression and code • (Reference:
  • A BSON example: object id
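The slide's figure did not survive in this transcript, so here is a small sketch of one well-known property of a BSON object id: its first 4 bytes hold the creation time in seconds since the Unix epoch. The function name `objectIdTimestamp` is hypothetical and the id is assumed to arrive as a 24-character hex string; this is plain JavaScript, not a driver API.

```javascript
// Sketch: read the creation time embedded in a 12-byte BSON object id.
// Assumption: the id is supplied as a 24-character hex string.
// objectIdTimestamp is an illustrative helper, not part of any MongoDB driver.
function objectIdTimestamp(hexId) {
  if (!/^[0-9a-fA-F]{24}$/.test(hexId)) {
    throw new Error("expected 24 hex characters");
  }
  // The first 4 bytes (8 hex digits) are big-endian seconds since the Unix epoch.
  const seconds = parseInt(hexId.slice(0, 8), 16);
  return { seconds, date: new Date(seconds * 1000) };
}

// Sample id taken from the read examples later in this deck.
const sampleId = objectIdTimestamp("4c97053abe67000000003857");
console.log(sampleId.seconds, sampleId.date.toISOString());
```

The remaining 8 bytes make the id unique across machines and processes; their exact layout is not covered here.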
  • Data types: CouchDB• Everything JSON• Large objects: attachments
  • CRUD operations for documents• Create• Read• Update• Delete
  • MongoDB: Create Document • use mydb • w = {name: "John Doe", zip: 10001}; •;
  • Create db and collection • Lazily created • Implicitly created • use mydb
  • MongoDB: Read Document• db.location.find({zip: 10001});• { "_id" : ObjectId("4c97053abe67000000003857"), "name" : "John Doe", "zip" : 10001 }
  • MongoDB: Read Document (...cont)• db.location.find({name: "John Doe"});• { "_id" : ObjectId("4c97053abe67000000003857"), "name" : "John Doe", "zip" : 10001 }
  • MongoDB: Update Document• Atomic operations on single documents• db.location.update( { name:"John Doe" }, { $set: { name: "Jane Doe" } } );
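To make the create, read, update and delete slides concrete without a running server, the following is a minimal in-memory sketch. `TinyCollection` and its methods are hypothetical stand-ins that only mimic the shape of the shell calls shown above (equality-only queries, `$set` only); they are not the MongoDB API or its implementation.

```javascript
// In-memory stand-in for a schema-less collection.
// Assumption: equality-only queries and $set-only updates, for illustration.
class TinyCollection {
  constructor() {
    this.docs = [];
    this.nextId = 1; // simplified ids; real object ids are 12-byte values
  }

  // Create: store any object, whatever its shape.
  insert(doc) {
    const stored = { _id: this.nextId++, ...doc };
    this.docs.push(stored);
    return stored;
  }

  // Read: return documents whose fields equal every field in the query.
  find(query = {}) {
    return this.docs.filter((d) =>
      Object.entries(query).every(([key, value]) => d[key] === value));
  }

  // Update: apply {$set: {...}} to the first matching document only.
  update(query, modifier) {
    const doc = this.find(query)[0];
    if (!doc) return 0;
    Object.assign(doc, modifier.$set || {});
    return 1;
  }

  // Delete: drop every matching document and report how many were removed.
  remove(query) {
    const doomed = new Set(this.find(query));
    this.docs = this.docs.filter((d) => !doomed.has(d));
    return doomed.size;
  }
}

const locationColl = new TinyCollection();
locationColl.insert({ name: "John Doe", zip: 10001 });
locationColl.insert({ city: "Palo Alto", tags: ["hq"] }); // different shape, same collection
locationColl.update({ name: "John Doe" }, { $set: { name: "Jane Doe" } });
const foundDocs = locationColl.find({ zip: 10001 });
const removedCount = locationColl.remove({ city: "Palo Alto" });
console.log(foundDocs, removedCount);
```

Because each update touches one document object at a time, the sketch also hints at why single-document operations are the natural unit of atomicity in a document store.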
  • CouchDB: RESTful• Supports REST verbs: GET, HEAD, PUT, POST, DELETE• Supports Replication• Supports the notion of attachments• Could work in offline modes and supports small footprint profiles
  • Sorted Ordered Column-family Datastores• Sorted• Ordered• Distributed• Map
  • Essential schema
  • Multi-dimensional View
  • A Map/Hash View • { "row_key_1" : { "name" : { "first_name" : "Jolly", "last_name" : "Goodfellow" }, "location" : { "zip": "94301" } } }
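The nested map on this slide can be read as row key, then column family, then column. A rough sketch of that access pattern follows; the helper names `putCell`, `getCell` and `scanRows` are hypothetical, and a real column-family store keeps rows physically sorted rather than sorting on every scan as this toy does.

```javascript
// Row key -> column family -> column -> value, mirroring the nested map view.
// Assumption: a single in-memory Map stands in for the distributed sorted map.
const table = new Map();

function putCell(rowKey, family, column, value) {
  if (!table.has(rowKey)) table.set(rowKey, {});
  const row = table.get(rowKey);
  row[family] = row[family] || {};
  row[family][column] = value;
}

function getCell(rowKey, family, column) {
  const row = table.get(rowKey);
  return row && row[family] ? row[family][column] : undefined;
}

// Range scan over row keys in sorted order: [startKey, stopKey).
function scanRows(startKey, stopKey) {
  return [...table.keys()].sort().filter((k) => k >= startKey && k < stopKey);
}

putCell("row_key_2", "name", "first_name", "Another");
putCell("row_key_1", "name", "first_name", "Jolly");
putCell("row_key_1", "name", "last_name", "Goodfellow");
putCell("row_key_1", "location", "zip", "94301");

console.log(getCell("row_key_1", "location", "zip"));
console.log(scanRows("row_key_1", "row_key_3"));
```

Keeping rows ordered by key is what makes range scans cheap, which is why row-key design matters so much in this family of stores.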
  • Architectural View (HBase)
  • The Persistence Mechanism
  • Model Wrappers (The GAE Way) • Python: Model, Expando, PolyModel • Java: JDO, JPA
  • HBase Data Access • Thrift + Avro • Java API -- HTable, HBaseAdmin • Hive (SQL like) • MapReduce -- sink and/or source
  • Transactions • Atomic row level • GAE Entity Groups
  • Indexes • Row ordered • Secondary indexes • GAE style multiple indexes: thinking from output to query
  • Use cases • Many of Google’s products • Facebook Messaging • StumbleUpon (OpenTSDB) • Mahalo, Ning, Meetup, Twitter, Yahoo! • Lily -- open source CMS built on HBase & Solr
  • Brewer’s CAP Theorem
  • Distributed Systems & Consistency (case: success)
  • Distributed Systems & Consistency (case: failure)
  • Binding by Transactions
  • Consistency Spectrum
  • Inconsistency Window
  • RWN Math • R – Number of nodes that are read from. • W – Number of nodes that are written to. • N – Total number of nodes in the cluster. • In general: R < N and W < N for higher availability
  • R + W > N • Easy to determine consistent state • R + W = 2N (that is, R = N and W = N) • absolutely consistent, can provide ACID guarantee • In all cases when R + W > N there is some overlap between read and write nodes.
  • R = 1, W = N • Suited to workloads with more reads than writes • W = N • 1 node failure = entire system unavailable for writes
  • R = N, W = 1 • W = 1 • Chance of data inconsistency quite high • R = N • Read only possible when all nodes in the cluster are available
  • R = W = ceiling((N + 1)/2) • Effective quorum for eventual consistency
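The R/W/N arithmetic from the preceding slides can be checked with a few lines of code. This is a sketch only; the function names `quorumSize`, `readWriteOverlap` and `toleratedFailures` are made up for illustration, and "tolerated failures" here simply means N minus the number of nodes an operation needs.

```javascript
// Smallest R = W that still forces read and write sets to intersect:
// ceiling((N + 1) / 2), as given on the slide.
function quorumSize(n) {
  return Math.ceil((n + 1) / 2);
}

// True when every read set must share at least one node with every write set.
function readWriteOverlap(r, w, n) {
  return r + w > n;
}

// How many node failures reads and writes can each survive.
function toleratedFailures(r, w, n) {
  return { reads: n - r, writes: n - w };
}

const clusterSize = 5;
const quorum = quorumSize(clusterSize);

// Quorum configuration: overlap holds and both reads and writes survive failures.
console.log(quorum, readWriteOverlap(quorum, quorum, clusterSize),
  toleratedFailures(quorum, quorum, clusterSize));

// R = 1, W = N: overlap holds, but writes cannot survive a single failure.
console.log(readWriteOverlap(1, clusterSize, clusterSize),
  toleratedFailures(1, clusterSize, clusterSize));

// R = 1, W = 1: no guaranteed overlap, so a read may miss the latest write.
console.log(readWriteOverlap(1, 1, clusterSize));
```

The quorum setting is the middle ground: it keeps the overlap guarantee while leaving headroom for node failures on both the read and the write path.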
  • Eventual consistency variants • Causal consistency -- A writes and informs B, then B always sees the updated value • Read-your-writes consistency -- after A writes a new value, A never sees the old one • Session consistency -- read-your-writes consistency within a client session • Monotonic read consistency -- once a process has seen a new value, it never sees a previous value • Monotonic write consistency -- writes by the same process are serialized
  • Dynamo Techniques • Consistent Hashing (incremental scalability) • Vector clocks (high availability for writes) • Sloppy quorum and hinted handoff (recover from temporary failure) • Gossip-based membership protocol (periodic, pairwise, inter-process interactions, low reliability, random peer selection) • Anti-entropy using Merkle trees • (source: dynamo-sosp2007.pdf)
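Of the techniques listed, vector clocks are the easiest to show in a few lines. The sketch below assumes a clock is a plain object mapping a node id to a counter; `tick`, `compareClocks` and `mergeClocks` are hypothetical names, and this is not Dynamo's code.

```javascript
// A vector clock maps node id -> counter. Missing entries count as 0.

// Return a new clock with the given node's counter advanced by one.
function tick(clock, node) {
  return { ...clock, [node]: (clock[node] || 0) + 1 };
}

// Compare two clocks: "before", "after", "equal" or "concurrent".
function compareClocks(a, b) {
  let aAhead = false;
  let bAhead = false;
  for (const node of new Set([...Object.keys(a), ...Object.keys(b)])) {
    const x = a[node] || 0;
    const y = b[node] || 0;
    if (x > y) aAhead = true;
    if (y > x) bAhead = true;
  }
  if (aAhead && bAhead) return "concurrent"; // neither descends from the other
  if (aAhead) return "after";
  if (bAhead) return "before";
  return "equal";
}

// Element-wise maximum, used once conflicting versions have been reconciled.
function mergeClocks(a, b) {
  const out = { ...a };
  for (const [node, count] of Object.entries(b)) {
    out[node] = Math.max(out[node] || 0, count);
  }
  return out;
}

const v1 = tick({}, "A");  // written via node A
const v2 = tick(v1, "B");  // updated via node B
const v3 = tick(v1, "C");  // updated via node C, without seeing v2
console.log(compareClocks(v2, v1), compareClocks(v2, v3), mergeClocks(v2, v3));
```

Accepting both `v2` and `v3` and sorting them out later is what keeps writes available; the price is that a reader may have to reconcile concurrent versions.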
  • Consistent Hashing
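The diagram for this slide is missing from the transcript, so here is a minimal sketch of a hash ring with virtual nodes. Everything in it is an assumption made for illustration: the `HashRing` class, the FNV-1a style hash over UTF-16 code units, and the choice of 50 virtual nodes per server. Production systems use their own hash functions and placement rules.

```javascript
// 32-bit FNV-1a style hash over UTF-16 code units (illustrative choice only).
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0; // force an unsigned 32-bit result
}

class HashRing {
  constructor(nodes = [], replicas = 50) {
    this.replicas = replicas; // virtual nodes per physical node
    this.ring = [];           // entries {point, node}, sorted by point
    nodes.forEach((n) => this.addNode(n));
  }

  addNode(node) {
    for (let i = 0; i < this.replicas; i++) {
      this.ring.push({ point: fnv1a(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  removeNode(node) {
    this.ring = this.ring.filter((e) => e.node !== node);
  }

  // A key belongs to the first ring entry at or after its hash, wrapping around.
  nodeFor(key) {
    if (this.ring.length === 0) return undefined;
    const h = fnv1a(key);
    const entry = this.ring.find((e) => e.point >= h) || this.ring[0];
    return entry.node;
  }
}

const ring = new HashRing(["node-a", "node-b", "node-c"]);
const sampleKeys = Array.from({ length: 200 }, (_, i) => `key-${i}`);
const ownersBefore = sampleKeys.map((k) => ring.nodeFor(k));
ring.addNode("node-d");
const ownersAfter = sampleKeys.map((k) => ring.nodeFor(k));
const movedKeys = sampleKeys.filter((_, i) => ownersBefore[i] !== ownersAfter[i]).length;
console.log(`keys that changed owner after adding node-d: ${movedKeys} of ${sampleKeys.length}`);
```

Adding a node only moves keys onto that new node; no key changes hands between existing nodes, which is the incremental-scalability property the Dynamo slide refers to.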
  • CouchDB MVCC Style• (Source:
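The figure for this slide is lost, but the core of the MVCC idea in CouchDB is that an update must name the revision it was based on, and a stale revision is rejected as a conflict. A toy sketch of that check follows. `RevStore` is hypothetical and uses integer revisions for brevity, whereas real CouchDB revisions are strings and the check happens over HTTP.

```javascript
// Toy revision-checked store: writers never block readers, and a writer
// holding a stale revision gets a conflict instead of silently overwriting.
// Assumption: integer revisions; this is not the CouchDB API.
class RevStore {
  constructor() {
    this.docs = new Map(); // id -> { rev, body }
  }

  get(id) {
    const entry = this.docs.get(id);
    return entry ? { _id: id, _rev: entry.rev, ...entry.body } : undefined;
  }

  // rev must equal the stored revision (or be omitted for a new document).
  put(id, body, rev) {
    const current = this.docs.get(id);
    const currentRev = current ? current.rev : undefined;
    if (currentRev !== rev) return { ok: false, error: "conflict" };
    const nextRev = (currentRev || 0) + 1;
    this.docs.set(id, { rev: nextRev, body });
    return { ok: true, rev: nextRev };
  }
}

const store = new RevStore();
const created = store.put("doc1", { name: "John Doe" });

// Two clients read the same revision, then both try to update.
const readerA = store.get("doc1");
const readerB = store.get("doc1");
const firstWrite = store.put("doc1", { name: "Jane Doe" }, readerA._rev);
const secondWrite = store.put("doc1", { name: "Jim Doe" }, readerB._rev);
console.log(created, firstWrite, secondWrite);
```

The losing client is expected to re-read the document, merge its change and retry, which is the optimistic alternative to taking a lock.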
  • Key/value Stores• Memcached• Membase• Redis• Tokyo Cabinet• Kyoto Cabinet• Berkeley DB
  • Questions? • blog: | twitter: @tshanky