NO-SQL: WHY, WHAT, HOW
Igor Moochnick
Director, Cloud Platforms
BlueMetal Architects
igorm@bluemetal.com
Blog: igorshare.wordpress.com
What is wrong with SQL?
 Is it answering your needs?
 Does it fit your solution?
 Do you reap the benefits of
relational storage?
 Will it support the needs of your
projects in the future?
 More users?
 More data?
Gov't data stored in the US (2009):
more than 800 petabytes
Assumptions
 The data doesn’t fit on one node
 The data may not fit one rack
 Each machine operates independently, with minimal
coordination with the others
 Conclusion:
 There is a need to partition data across lots of machines
Scale up or out?
There is a limit to RDBMS scale
 Scaling up doesn't work
 Scaling out with traditional RDBMSs isn't so hot either
 Sharding scales, but you lose all the features that make RDBMSs
useful!
 Sharding and Table partitioning are operationally heavy
 If we don't need relational features, we want a distributed
NRDBMS.
Fallacies of Distributed Computing
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn’t change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing
The magic CAP
C (Consistency)
A (Availability)
P (Partition tolerance)
AP
Commodity Hardware
Calxeda's EnergyCard atop an HP Redstone server
prototype. Source: Jon Snyder
CNET: Google uncloaks once-secret server
RAM is new Disk, Disk is new Tape
- Jim Gray, (former) manager of Microsoft Research’s eScience Group
Why it is important
 New levels of scalability
 Rapid development
 Cloud ready
 Distributed by nature
 There’s no need for DBA, no need for complicated SQL
queries and it is fast. Hooray, freedom for the people!
 WORLD, PEACE!
Battle of the viewpoints
Who is using NOSQL?
Beware!
 Data models are still important
 Data duplication
 Interfaces and interoperability - nonexistent
 Understand limitations of the technology
 OPS are screwed
Advantages of NOSQL
 Cheap, easy to implement
 Removes impedance mismatch between objects and tables
 Quickly process large amounts of data
 Data Modeling Flexibility (including schema evolution)
Disadvantages of NOSQL
 New Technology
 Data is generally duplicated, potential for inconsistency
 No standard language or format for queries
 Depends on the application layer to enforce data integrity
NOSQL categories
 Document Databases
 Based loosely on documents / POCO
 Data model – collections of documents
 Graph Databases
 Based on Graph theory
 Data model – graph, nodes, edges, properties
NOSQL categories
 Key Value Stores
 Based on DHT (Distributed Hash Table),
Amazon’s Dynamo design
 Data model – collection of key value pairs
 Column Stores
 Based on Google’s BigTable design
 Data model - big table, column families
Types of NOSQL Databases
 Document (examples: MongoDB, CouchDB, RavenDB)
 Graph (examples: Neo4j, Sones, TinkerGraph)
 Key/Value (examples: Cassandra, SimpleDB, Dynamo,
Voldemort, Riak, Redis)
 Tabular/Wide Column (examples: BigTable, HBase, Cassandra)
 Search (example: Lucene)
http://NOSQL-databases.org
Consistency Models
 Full Consistency
 Read-what-I-wrote
 Session Consistency
 Monotonic Read Consistency
 Eventual Consistency
Write collision resolutions
 Timestamps
 Vector Clocks
 HiLo algorithm
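A minimal sketch of how vector clocks detect a write collision, in Python. The slide only names the technique; the function names (increment, compare, merge) are made up for this illustration and do not come from any particular database.

```python
from typing import Dict

VectorClock = Dict[str, int]  # node name -> number of writes seen from that node


def increment(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of the clock with one more write recorded for `node`."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: VectorClock, b: VectorClock) -> bool:
    """True if `a` has seen every write that `b` has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify two versions: equal, one newer, or concurrent (a collision)."""
    a_covers_b, b_covers_a = descends(a, b), descends(b, a)
    if a_covers_b and b_covers_a:
        return "equal"
    if a_covers_b:
        return "a_newer"
    if b_covers_a:
        return "b_newer"
    return "conflict"  # neither version saw the other: the application must resolve


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Clock for a resolved value: the per-node maximum of both clocks."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}
```

Two writers that both start from the same version and update it on different nodes produce clocks where neither descends from the other, which is exactly the case that timestamps alone cannot distinguish from an ordinary overwrite.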
Quorum
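The slide gives only the heading. The usual rule in Dynamo-style stores is that, with N replicas, a read that consults R replicas is guaranteed to overlap a write acknowledged by W replicas when R + W > N. A small Python sketch of that check (function names are hypothetical):

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if every read quorum of size r shares a replica with every write quorum of size w."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return r + w > n


def majority(n: int) -> int:
    """Smallest quorum size that any two quorums of this size must share a replica."""
    return n // 2 + 1
```

With N = 3, choosing W = 2 and R = 2 overlaps, while W = 1 and R = 1 trades that guarantee for lower latency.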
Sharding
 Is NOT
 Replication
 Clustering
 Backup
 It is a smart way of splitting
data across databases
 Requires aggregation
 Enables parallelization
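A rough Python sketch of the last two bullets: each key lives on exactly one shard, a query that spans shards has to aggregate partial results, and those partial results can be computed in parallel. ShardedStore and shard_for are hypothetical names, and in-memory dictionaries stand in for separate databases.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def shard_for(key: str, shard_count: int) -> int:
    """Pick a shard with a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


class ShardedStore:
    def __init__(self, shard_count: int):
        # Each dict stands in for one independent database.
        self.shards = [dict() for _ in range(shard_count)]

    def put(self, key: str, value: int) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)

    def total(self) -> int:
        # Scatter: every shard sums its own values in parallel.
        with ThreadPoolExecutor() as pool:
            partials = list(pool.map(lambda shard: sum(shard.values()), self.shards))
        # Gather: the caller aggregates the partial sums.
        return sum(partials)
```

Simple modulo placement like this moves most keys when the shard count changes, which is one reason lookup tables or consistent hashing are used to locate data.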
Where is my data?
 Lookup tables
 (Consistent) Hash
functions
Node A
Node B
Node C
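A minimal consistent-hashing sketch in Python, using the three node names from the slide. The HashRing class, the use of MD5, and the 100 virtual points per node are assumptions made for this illustration, not details from the talk.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), replicas: int = 100):
        self.replicas = replicas   # virtual points per node, to even out the load
        self._hashes = []          # sorted ring positions
        self._owners = {}          # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self.replicas):
            position = _hash(f"{node}#{i}")
            self._owners[position] = node
            bisect.insort(self._hashes, position)

    def remove(self, node: str) -> None:
        self._hashes = [h for h in self._hashes if self._owners[h] != node]
        self._owners = {h: n for h, n in self._owners.items() if n != node}

    def node_for(self, key: str) -> str:
        if not self._hashes:
            raise LookupError("ring is empty")
        # First ring position clockwise from the key, wrapping around at the end.
        index = bisect.bisect(self._hashes, _hash(key)) % len(self._hashes)
        return self._owners[self._hashes[index]]
```

When "Node C" is removed, only the keys it owned move to another node; keys that lived on "Node A" or "Node B" stay where they were.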
Gossip
[Diagrams: a cluster of eight nodes shown over gossip rounds 1 through 4]
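A toy simulation of gossip in Python, assuming the simplest push variant: in every round, each node that already knows a piece of information tells one randomly chosen peer. The function name, the fanout parameter, and the fixed random seed are illustrative choices, not part of the talk.

```python
import random


def gossip_rounds(node_count: int, fanout: int = 1, seed: int = 0, max_rounds: int = 100):
    """Return how many nodes are informed after each round (index 0 is the start)."""
    rng = random.Random(seed)   # seeded so that runs are repeatable
    informed = {0}              # node 0 starts with the information
    history = [len(informed)]
    for _ in range(max_rounds):
        if len(informed) == node_count:
            break
        heard = set()
        for node in sorted(informed):
            peers = [p for p in range(node_count) if p != node]
            heard.update(rng.sample(peers, fanout))  # push to `fanout` random peers
        informed |= heard
        history.append(len(informed))
    return history
```

With a fanout of one, the informed set can at most double per round, so an eight-node cluster needs at least three rounds and typically a few more, since some messages land on nodes that already know.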
Don’t forget backups !!!
Replication ≠ Backups
Modeling
 Stop thinking relational
 Start thinking about how your data will be used
 Usage scenarios
 Optimize for reads? Writes?
 Think about your domain objects and business logic in native .NET
(POCO) classes
 Denormalize if needed
 Reference entities to other entities, or collections of them
 Identify aggregate root(s)
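A small sketch of these modeling bullets. The slide talks about .NET POCO classes; the example below uses plain Python dataclasses as a stand-in, and the Order and OrderLine names and fields are invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class OrderLine:
    product_id: str     # reference to another entity, by id only
    product_name: str   # denormalized copy, so reading an order needs no join
    quantity: int
    unit_price: float


@dataclass
class Order:
    """Aggregate root: the order and its lines are stored and loaded as one document."""
    id: str
    customer_id: str    # reference to the customer aggregate
    lines: List[OrderLine] = field(default_factory=list)

    def total(self) -> float:
        return sum(line.quantity * line.unit_price for line in self.lines)


def to_document(order: Order) -> str:
    """Serialize the whole aggregate as a single JSON document."""
    return json.dumps(asdict(order))
```

The trade-off is the one listed under the disadvantages: the copied product name can go stale, so the application has to decide whether and when to refresh it.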
Map-Reduce
 First impressions, but it gets better over time
Original
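For readers new to the idea, a single-process Python sketch of the map, shuffle, and reduce phases using the classic word count. Real systems run these phases across many machines; the function names here are illustrative only.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Because each map call sees one document and each reduce call sees one key, both phases can be spread over shards without coordination.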
Play Nice
 Know and use the tools you need for the job at hand
Extra Links and References
 NOSQL debrief
 Distributed Storage -
http://horicky.blogspot.com/2008/08/distributed-storage.html
 Images from here

Editor's Notes

  • #5 Why are we here? What is wrong with the status quo?
  • #8 What is wrong with scaling up? It worked before. Moore's law still holds and, according to predictions, will continue to hold for some time to come.
  • #11 CAP theorem, also known as (Eric) Brewer's theorem. Consistency: all clients always have the same view of the data. Availability: each client can always read and write. Partition tolerance: the system works well despite physical network partitions.
  • #12 Commodity hardware: in fact, real commodity hardware. But the industry acknowledges the need to package many “commodity” machines into a small space, and new computer-on-a-chip models are coming.
  • #14 NOSQL-databases.org lists 122+ databases: Next Generation Databases address some of the following points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. The movement began in early 2009 and is growing rapidly. Often more characteristics apply, such as: schema-free, replication support, easy API, eventual consistency, and more. So the misleading term "NOSQL" (the community now translates it mostly as "not only SQL") should be seen as an alias for something like the definition above.
  • #15 Twitter generates 7 TB/day (2 PB+ per year): Hadoop for data analysis, Scribe for logging. LinkedIn: Voldemort.
  • #17 Scalability:  relational databases were not designed to handle and do not generally cope well with Internet-scale, “big data” applications.  Most of the big Internet companies (e.g., Google, Yahoo, Facebook) do not rely on RDBMS technology for this reason.
  • #20 Cassandra: Facebook Inbox Search. Amazon Dynamo: not open source. Voldemort: open-source implementation of Amazon's Dynamo key-value store. Google BigTable: a sparse, distributed multi-dimensional sorted map.
  • #21 Distributed Storage Consistency Models - http://horicky.blogspot.com/2008/08/distributed-storage.html There are a number of client consistency models (http://horicky.blogspot.com/2009/11/nosql-patterns.html). Strict Consistency (one-copy serializability): This provides the semantics as if there is only one copy of the data. Any update is observed instantaneously. Read-your-write consistency: This allows the client to see his own updates immediately (and the client can switch servers between requests), but not the updates made by other clients. Session consistency: This provides read-your-write consistency only when the client issues requests under the same session scope (which is usually bound to the same server). Monotonic Read Consistency: This provides the time monotonicity guarantee that the client will only see more up-to-date versions of the data in future requests. Eventual Consistency: This provides the weakest form of guarantee. The client can see an inconsistent view while updates are in progress. This model works when concurrent access to the same data is very unlikely, and the client needs to wait for some time if he needs to see his previous update.
  • #25 The Simple Magic of Consistent Hashing - http://www.paperplanes.de/2011/12/9/the-magic-of-consistent-hashing.html Consistent Hashing - http://michaelnielsen.org/blog/consistent-hashing/