July 11th, 2010
Apples, Oranges and NOSQL

   Roi Aldaag Architect & Consultant
   Nadav Wiener Architect & Consultant

Seminar.2010.NoSql

  1. July 11th, 2010
  2. Apples, Oranges and NOSQL
      Roi Aldaag, Architect & Consultant
      Nadav Wiener, Architect & Consultant
  3. Agenda
      Introduction
      » What is NoSQL?
      » What’s “wrong” with RDBMS?
      » Why now?
  4. Agenda
      RDBMS vs. NoSQL
      » Scaling
      » CAP Theorem
      » ACID vs. BASE
  5. Agenda
      NoSQL Taxonomy
      » Key / Value
      » Column
      » Document
      » Graph
  6. Agenda
      How to choose?
      » Comparing Apples to Oranges
      » Polyglot Persistence
  7. Introduction
  8. Introduction
      Question: What do they all have in common?
  9. Introduction
      Before we answer – some facts:
  10. Introduction
      Before we answer – some facts:

                           Site 1      Site 2      Site 3      Site 4      Site 5
      Daily Page Views     7.8×10^9    7.1×10^9    550×10^6    350×10^6    82×10^6
      Daily Visitors       620×10^6    500×10^6    56×10^6     37×10^6     12×10^6
      Data size            Petabytes   Petabytes   Petabytes   Terabytes   Terabytes

      (Site logos are not preserved in the transcript.)
      Source: http://www.alexa.com, July 2010
  11. Introduction
      Answer: They use NoSQL data stores
  12. Introduction
      Why!?
  13. Introduction
      Relational DBs Have Scaling Limitations
      » ACID doesn’t scale well horizontally
          ▪ Sharding breaks relations
          ▪ Joins are inefficient
      » Transaction overhead
      » Schema is not flexible
          ▪ Predefined
          ▪ Hard to evolve
  14. Introduction
      What is NoSQL?
      » NO SQL / Not Only SQL
      » A collective description of open source, non-relational data stores
          ▪ Highly distributed
          ▪ Highly scalable
          ▪ Not ACID and... doesn’t use SQL
      » Term coined in 2009 at a convention called “NoSQL” (Eric Evans)
      » Started a movement that is gaining momentum
  15. Introduction
  16. Introduction
      Why now?
      » NoSQL data stores predate RDBMS (1970)
          ▪ But remained a niche
      » RDBMS – most popular and generic option
      » Web 2.0 introduced new requirements:
          ▪ Exponential increase in data
          ▪ Information connectivity
          ▪ Semi-structured data
      » NoSQL data stores had answers
          ▪ When time was right
          ▪ When RDBMSs didn’t
  17. Introduction
      It’s theory time:
  18. Scaling
  19. Scaling
      Scaling Up
      » Adding resources to a single node in a system
          » Add more CPUs or memory
      » Move system to a larger machine
      » Pros:
          ▪ Quick and simple
      » Cons:
          ▪ Outgrowing the capacity of the largest system available (Moore’s law)
          ▪ Expensive
          ▪ Creates vendor lock-in
  20. Scaling
      Scaling Out
      » Add more nodes to a system
      » Functional Scaling (vertical)
          ▪ Grouping data by function and spreading functional groups across databases
      » Sharding (horizontal)
          ▪ Splitting same functional data across multiple databases
      » Pros: More flexible
      » Cons: More complex
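
A minimal sketch of the sharding idea on the Scaling Out slide, assuming hash-based placement (the slides do not say how keys are assigned to databases). The three in-memory dictionaries and the names `shard_for`, `put` and `get` are hypothetical stand-ins for real database nodes.

```python
import hashlib

def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash (not Python's salted hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count

# Assumption: three databases, each modelled as an in-memory dict.
shards = [{} for _ in range(3)]

def put(key, value):
    """Store the value on the shard that owns the key."""
    shards[shard_for(key, len(shards))][key] = value

def get(key):
    """Read from the single shard that owns the key; no other shard is contacted."""
    return shards[shard_for(key, len(shards))].get(key)

put("user:john", {"role": "admin"})
put("user:fred", {"age": 25})
```

Each lookup touches exactly one shard, which is what lets the system grow by adding nodes. A query that needs rows from several shards has to be assembled by the application, which is the "sharding breaks relations" cost named on the earlier slide.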
  21. Distributed Databases
  22. Distributed Databases
      » Many nodes
      » Same database
      (Diagram: Node 1, Node 2, Node 3)
  23. Distributed Databases
      What are the requirements from distributed databases?
      » Consistency
          ▪ All clients can see the same data
      » Availability
          ▪ All clients can always access data
      » Partition tolerance
          ▪ The ability to continue working when the network topology is broken
          ▪ The ability to recover once the network is healed
  24. Distributed Databases
      CAP Theorem (E. Brewer, N. Lynch)
      » You can fully satisfy at most 2 out of 3
          ▪ Compromise on 3rd
      » Not “all or nothing”
          ▪ Choose various levels of consistency, availability or partition tolerance
      » Recognize which of the CAP rules your business needs for the task
  25. Distributed Databases
      CA: Consistency & Availability
      » Partition Tolerance is compromised
      » Single site clusters (easier to ensure all nodes are always in contact)
      » When a network partition occurs, the system blocks
      » e.g. Two Phase Commit (2PC)
  26. Distributed Databases
      CP: Consistency & Partitioning
      » Availability is compromised
      » Access to some data may be temporarily limited
      » The rest is still consistent/accurate
      » e.g. Sharded database
      » TBD sample
  27. Distributed Databases
      AP: Availability & Partitioning
      » Consistency is compromised
      » System is still available under partitioning
      » Some data returned may be temporarily not up-to-date
      » Requires conflict resolution strategy
      » e.g. DNS, caches, Master/Slave replication
      » TBD sample
  28. ACID vs. BASE
  29. ACID vs. BASE
      ACID – a quick recap
      » Atomicity
          ▪ When a part of the transaction fails -> the entire transaction fails; database state is left unchanged
      » Consistency
          ▪ A transaction takes database from one consistent state to another
      » Isolation
          ▪ A transaction can't see dirty state from other transactions
      » Durability
          ▪ Commit means commit.
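
To make the atomicity and consistency bullets concrete, here is a small sketch using Python's built-in `sqlite3` module. The `accounts` table, the `CHECK` constraint and the `transfer` helper are assumptions made up for this illustration; they are not from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 0))"  # consistency rule: no negative balances
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.execute("INSERT INTO accounts VALUES ('bob', 0)")
conn.commit()

def transfer(src, dst, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

def balances():
    """Return the current balances as a dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

When the second update violates the constraint, the credit already applied by the first update is rolled back as well, so the database stays in its previous consistent state.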
  30. ACID vs. BASE
      BASE
      » The CAP complement of ACID
          ▪ Just had to be called BASE
          ▪ Backronym:
      » Basically Available
      » Soft State
      » Eventually Consistent
  31. ACID vs. BASE
      RDBMS & ACID / NoSQL & BASE
      » RDBMSs strive to provide ACID guarantees
          ▪ ACID forces consistency
      » NoSQL solutions often scale through BASE
          ▪ BASE accepts that conflicts will happen
  32. Taxonomy
  33. Taxonomy
      (Diagram: Key / Value, Column, Document (XML, TXT, BIN), Graph)
  34. Taxonomy
      Key / Value Databases
  35. Taxonomy
      Key/Value Stores
      » Simple Key / Value lookups (DHT)
      » Value is opaque
      » Focus on scaling to huge amounts of data
      » Designed to handle massive load
      » E.g.
          ▪ Riak (based on Amazon’s Dynamo paper)
          ▪ Project Voldemort (based on Amazon’s Dynamo paper)
          ▪ Redis
  36. Taxonomy
      Key/Value e.g.: Riak
      » No single point of failure
      » No machines are special or central
      » MapReduce queries (Erlang / JavaScript)
      » HTTP/JSON API
      » Ring cluster with automatic replication
      » Elastic / partition rebalancing
      » Written in: Erlang, C, JavaScript
      » Developed by: Basho Technologies
      » Java client: (jonjlee / riak-java-client)
  37. Key/Value e.g.: Riak
      Data Model
      » Key / Value pairs are stored in a Bucket
      » A Bucket ~ a namespace
      Versioning
      » Each update is tracked by a Vector Clock
          ▪ An algorithm for determining ordering and detecting conflicts
      » When in conflict
          ▪ Last write wins / manual resolution
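
The vector clock bullet can be illustrated with a short sketch. This is a generic textbook version, not Riak's actual implementation: a clock is modelled as a dict from node name to update counter, and `increment` and `compare` are hypothetical helper names.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def compare(a: dict, b: dict) -> str:
    """Order two versions: 'before', 'after', 'equal', or 'conflict'."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"  # concurrent updates: neither version descends from the other
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"

v1 = increment({}, "n1")       # first write, handled by node n1
v2 = increment(v1, "n1")       # a later write on the same node
left = increment(v1, "n2")     # two clients update v1 concurrently
right = increment(v1, "n3")    # through different nodes
```

Here `compare(v2, v1)` reports an ordering, while `compare(left, right)` reports a conflict, which is the case where the store must fall back to last write wins or hand both versions to the application.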
  38. Key/Value e.g.: Riak
      Example: REST API
      » Read an object

          GET /riak/bucket/key

      » Store a new object

          POST /riak/bucket

      » Store an object with existing key (update)

          PUT /riak/bucket/key
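
As a rough illustration of the three calls above, the sketch below builds the matching HTTP requests with Python's standard `urllib`. The base URL, the bucket name `posts`, the key `3`, the JSON bodies and the `build_request` helper are all assumptions; the requests are only constructed, not sent, so the sketch runs without a server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8098"  # assumption: address of a local node

def build_request(method, bucket, key=None, value=None):
    """Build (but do not send) a request following the /riak/bucket/key URL scheme."""
    path = f"/riak/{bucket}" + (f"/{key}" if key is not None else "")
    body = json.dumps(value).encode("utf-8") if value is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return urllib.request.Request(BASE_URL + path, data=body, headers=headers, method=method)

read = build_request("GET", "posts", "3")                             # read an object
create = build_request("POST", "posts", value={"title": "hello"})     # store a new object
update = build_request("PUT", "posts", "3", value={"title": "hello again"})  # update

# urllib.request.urlopen(read) would send the request to a running node.
```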
  39. Key/Value e.g.: Riak
      MapReduce
      » A framework supporting distributed computing on large data sets on clusters of machines
      » Leverage parallel processing power
      » Introduced by Google
      » Inspired by map / reduce functions in functional programming
      » Map step
      » Reduce step
  40. Key/Value e.g.: Riak
      MapReduce example: Inverted Index
      » Map
          » Parse each document
          » Emit a sequence of <word, doc_id> pairs

      <doc_id, doc_text>    Node      <word, doc_id>
      <100, TXT1>           Node 1    <word1, 100>, <word2, 100>
      <200, TXT2>           Node 2    <word2, 200>
      <300, TXT3>           Node 3    <word3, 300>
  41. Key/Value e.g.: Riak
      MapReduce example: Inverted Index
      » Reduce
          » Accept all pairs for a given word
          » Sort the corresponding document IDs
          » Emit a <word, list(document ID)> pair

      <word, list(document_id)>
      <word1, (100)>
      <word2, (100, 200)>
      <word3, (300)>
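
The inverted index example can be sketched in a few lines of single-process Python. This is only an illustration of the map and reduce steps, not Riak's MapReduce API: the function names are hypothetical, the grouping step that a real framework performs across nodes is done with a local dictionary, and the sample documents are chosen to reproduce the reduce output listed on the slide.

```python
from collections import defaultdict

def map_phase(doc_id, doc_text):
    """Map: parse one document and emit <word, doc_id> pairs."""
    return [(word, doc_id) for word in doc_text.lower().split()]

def reduce_phase(word, doc_ids):
    """Reduce: accept all pairs for one word and emit <word, sorted document IDs>."""
    return word, sorted(set(doc_ids))

def inverted_index(documents):
    """Run map over every document, group the pairs by word, then reduce each group."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, emitted_id in map_phase(doc_id, text):
            grouped[word].append(emitted_id)
    return dict(reduce_phase(word, ids) for word, ids in grouped.items())

docs = {100: "word1 word2", 200: "word2", 300: "word3"}
index = inverted_index(docs)
```

Because each `map_phase` call depends on a single document and each `reduce_phase` call on a single word, both steps can run in parallel on different nodes, which is the property the framework exploits.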
  42. Taxonomy
      BigTable and Column Oriented Databases
  43. Taxonomy
      Column Stores – BigTable derivatives
      » Conceptually a single, infinitely large table
      » Each row can have a different number of columns
      » Table is sparse: |rows|*|columns| > |values|
      » Based on Google’s BigTable paper
      » E.g.
          ▪ Cassandra
          ▪ HBase
          ▪ Hypertable
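
The sparseness inequality can be checked on a toy table. In this sketch a table is modelled as a dict of rows, each row holding only the columns it actually has; the sample rows and the `sparsity` helper are made up for illustration.

```python
# Each row stores only the columns it has; absent columns take no space.
table = {
    "john": {"role": "admin", "status": "offline", "nick": "dude1934"},
    "fred": {"nick": "freddy", "email": "fred@example.com", "age": "25", "gender": "male"},
}

def sparsity(rows):
    """Return (|rows| * |columns|, |values|) for a dict-of-dicts table."""
    columns = set().union(*(row.keys() for row in rows.values()))
    cells = len(rows) * len(columns)
    values = sum(len(row) for row in rows.values())
    return cells, values

cells, values = sparsity(table)
```

With two rows and six distinct column names, a fixed-schema table would need 12 cells, while only 7 values are stored.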
  44. Taxonomy
      Use Case: Manage products with diverse attributes
      » RDBMS:
          ▪ Create a central table with common attributes
          ▪ Create a table per product with unique attributes
          ▪ Use a join query
          ▪ Alternatively create a table that holds metadata on products
      » NoSQL:
          ▪ Column oriented database
          ▪ Use arbitrary columns
  45. Taxonomy
      Column Store e.g.: Cassandra
      » Data model: Google’s BigTable
      » Infrastructure: Amazon Dynamo
      » Incremental scalability
      » Flexible schema
      » No single point of failure (Distributed P2P)
      » Optimistic replication (Gossip protocol)
      » Written in: Java
      » Developed by: Facebook
      » Java client: e.g. Hector / Thrift
  46. Column e.g.: Cassandra
      Data Model
      » Column
          ▪ Smallest increment of data: tuple of name, value, timestamp

      {
        name: "emailAddress",
        value: "nosql@alphacsp.com",
        timestamp: 123456789
      }
  47. Column e.g.: Cassandra
      » SuperColumn
          ▪ A sorted, associative, unbounded array of columns

      { // this is a SuperColumn
        name: "homeAddress",
        // with an unbounded array of Columns
        value: {
          // each key is the name of the Column
          street: {name: "street", value: "s", timestamp:...},
          city: {name: "city", value: "c", timestamp:...},
          zip: {name: "zip", value: "z", timestamp:...}
        }
      }
  48. Column e.g.: Cassandra
      » ColumnFamily
          ▪ A container (~Table) for columns sorted by their names
          ▪ Column Families are referenced and sorted by row keys

      Users = { // ColumnFamily
        john: { // key to row in CF
          "role" : "admin",
          "status" : "offline",
          "nick" : "dude1934"
        }, // end row
        fred: { // another row
          "nick" : "freddy",
          "email" : "fred@example.com",
          "age" : "25",
          "gender" : "male",…
        },… // more rows
      }
  49. Column e.g.: Cassandra
      » Keyspace
          ▪ The outermost grouping of data (~DB Schema)
          ▪ Contains ColumnFamilies
          ▪ There is no imposed relationship between ColumnFamilies
  50. Column e.g.: Cassandra
      » Example
      (Diagram: a Keyspace containing a Tweets CF and a Timeline CF)
  51. Taxonomy
      Document Oriented Databases
      (Icons: XML, TXT, BIN)
  52. Taxonomy
      Document Store
      » Store semi-structured documents (think JSON)
      » Document versioning
      » Map/Reduce based queries, sorting, aggregation, etc.
      » DB is aware of internal structure
      » E.g.
          ▪ MongoDB
          ▪ CouchDB
          ▪ Jackrabbit (JCR JSR 170)
  53. Taxonomy
      Use Case: Blog with tagged posts and comments
      » RDBMS:
          ▪ Table for each: posts, comments, tags
          ▪ Foreign relations
      » NoSQL:
          ▪ Document storage
          ▪ Store post + tags + comments as a document
  54. Taxonomy
      Document Store e.g.: MongoDB
      » MongoDB (from "humongous")
      » Manages collections of JSON-like documents (BSON)
      » Queries can return specific fields of documents
      » Supports secondary indexes
      » Atomic operations on single documents
      » Developed by: 10gen
      » Written in: C++
      » Clients: Java, Scala and more
  55. Document e.g.: MongoDB
      Example: Blog posts
      » Suppose you host a blog, where each post is tagged:

      db.posts.save({
        _id : 3,
        author : "john",
        title : "Apples, Oranges and NOSQL",
        text : "This article will…",
        tags : [ "database", "nosql" ]
      });

      » Notice how posts have an array of tags
  56. Document e.g.: MongoDB
      » MongoDB supports secondary indexes and a query optimizer
          ▪ Compound indexes are also supported

      db.posts.ensureIndex({ tags: 1 });
      db.posts.ensureIndex({ author: 1 });
      db.posts.find({ author: "john", tags: "nosql" });

      // Result:
      { "_id" : 3, "author" : "john",
        "title" : "Apples, Oranges and NOSQL",
        "text" : "This article will…",
        "tags" : [ "database", "nosql" ] }
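
The `find` call above matches the post even though `tags` is an array and the query names a single tag. A simplified in-memory model of that matching rule is sketched below; it covers only exact-value conditions, and the `matches` helper is hypothetical, not part of any MongoDB driver.

```python
def matches(document: dict, query: dict) -> bool:
    """Return True if every query field equals the document's value,
    or is contained in it when the document's value is an array."""
    for field, wanted in query.items():
        actual = document.get(field)
        if isinstance(actual, list):
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True

post = {
    "_id": 3,
    "author": "john",
    "title": "Apples, Oranges and NOSQL",
    "tags": ["database", "nosql"],
}

found = [doc for doc in [post] if matches(doc, {"author": "john", "tags": "nosql"})]
```

This sketch scans every document; the indexes created on the slide exist so that the database can avoid such a scan.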
  57. Document e.g.: MongoDB
      » Let's update our posts to include some comments:

      db.posts.update({ _id: 3 }, {
        $inc: { comments_count: 4 },
        $pushAll : { comments: [
          { text: "Comment 1" },
          { text: "Comment 2", author: "Mr. T" },
          { text: "Comment 3" },
          { text: "Comment 4" }
        ] }
      });
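
To show what the two operators in this update do to the stored document, here is a small in-memory sketch. It models only `$inc` and `$pushAll` on top-level fields; the `apply_update` helper is hypothetical and is not how the server applies updates internally.

```python
import copy

def apply_update(document: dict, update: dict) -> dict:
    """Return a new document with $inc and $pushAll applied to top-level fields."""
    result = copy.deepcopy(document)
    for field, amount in update.get("$inc", {}).items():
        result[field] = result.get(field, 0) + amount  # a missing counter starts at 0
    for field, items in update.get("$pushAll", {}).items():
        result.setdefault(field, []).extend(copy.deepcopy(items))  # append every item
    return result

post = {"_id": 3, "author": "john", "tags": ["database", "nosql"]}
updated = apply_update(post, {
    "$inc": {"comments_count": 4},
    "$pushAll": {"comments": [
        {"text": "Comment 1"},
        {"text": "Comment 2", "author": "Mr. T"},
        {"text": "Comment 3"},
        {"text": "Comment 4"},
    ]},
})
```

Both changes land in the same document, which is why the "atomic operations on single documents" guarantee is enough to keep the comment counter and the comment list in step.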
  58. Taxonomy
      Graph Databases
  59. Taxonomy
      Graph databases
      » Inspired by mathematical graph theory G=(V,E)
      » Models the structure of data
      » Navigational data model
      » Scalability / data complexity
      » Data model: Key-Value pairs on Edges / Nodes
      » Relationships: Edges between Nodes
      » E.g.
          ▪ Neo4j
          ▪ Pregel (Google’s PageRank)
          ▪ AllegroGraph
  60. Taxonomy
      Use Case: Connected data - deep relationship links between users in a social network
      » RDBMS
          ▪ Complex recursive algorithm
          ▪ Multiple self joins
          ▪ Round trips to DB / bulk read and resolve in RAM
      » NoSQL:
          ▪ Graph storage
          ▪ Network traversal
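
The "network traversal" approach can be sketched as a breadth-first search over an adjacency list. This is a generic in-memory illustration, not the Neo4j API; the sample friendship data and the `within_depth` helper are made up.

```python
from collections import deque

def within_depth(graph: dict, start: str, max_depth: int) -> dict:
    """Breadth-first traversal: map each user reachable from start
    in at most max_depth hops to their distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if distance[user] == max_depth:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(user, ()):
            if friend not in distance:
                distance[friend] = distance[user] + 1
                queue.append(friend)
    del distance[start]
    return distance

friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
}

friends_of_friends = within_depth(friends, "ann", 2)
```

Each extra level of depth is one more loop iteration over neighbours, where the relational version needs one more self join or round trip.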
  61. Taxonomy
      Graph e.g.: Neo4j
      » High-performance graph engine
      » Embedded / disk based
      » Work with OO model: nodes, relationships, properties
      » ACID Transactions
          ▪ JTA support – participate in 2PC with your RDBMS
      » Developed by: Neo Technology
      » Written in: Java
      » Clients: Java, client libraries in other platforms
  62. Graph e.g.: Neo4j
      http://neo4j.org/
  63. Comparing Apples to Oranges
  64. Comparing Apples to Oranges
      Comparing Data Structures
      » RDBMS
          ▪ Databases contain tables, columns and rows
          ▪ All rows have the same structure
          ▪ Inherent ORM mismatch
      » NoSQL
          ▪ Choose your data structure
          ▪ Data is stored in natural structure (e.g. Documents, Graphs, Objects)
  65. Comparing Apples to Oranges
      Comparing Schema Flexibility
      » RDBMS
          ▪ Strict schema, difficult to evolve
          ▪ Maintains relations and forces data integrity
      » NoSQL
          ▪ Structure of data can be changed dynamically
              • e.g. Column stores – Cassandra
          ▪ Data can sometimes be completely opaque
              • e.g. Key/Value – Project Voldemort
  66. Comparing Apples to Oranges
      Comparing Normalization & Relations
      » RDBMS
          ▪ The data model is normalized to remove data duplication
          ▪ Normalization establishes table relations
      » NoSQL
          ▪ Denormalization is not a dirty word
          ▪ Relations are not explicitly defined
          ▪ Related data is usually grouped and stored as one unit
              • E.g. document, column
  67. Comparing Apples to Oranges
      Comparing Data Access
      » RDBMS
          ▪ CRUD operations using SQL
          ▪ Access data from multiple tables using SQL joins
          ▪ Generic API such as JDBC
      » NoSQL
          ▪ Proprietary APIs and DSLs (e.g. Pig / Hive / Gremlin)
          ▪ MapReduce, graph traversals
          ▪ REST APIs, portable serialization formats
              • BSON, JSON, Apache Thrift, Memcached
  68. Comparing Apples to Oranges
      Comparing Reporting Capabilities
      » RDBMS
          ▪ Slice and dice data, then reassemble any way you like
      » NoSQL
          ▪ Hard to repurpose data for ad-hoc usage
              • Plan ahead
          ▪ Think in advance
              • How and what you store
              • Data access patterns
  69. Summary
  70. Summary
      Why NOSQL / BASE
      » ACID ruled exclusively in the last 40 years
          ▪ Doesn’t compromise on consistency
      » Database industry neglected distributed DBs w/ availability
      » Vacuum was filled with “NoSQL” BASE architectures
          ▪ Strict A and P, minimize C compromise
      » Relational databases are now trying to catch up
  71. Summary
      NoSQL Limitations
      » Missing some query capabilities
          ▪ Joins / composite transactions
      » Eventual consistency -- not for every problem
      » Not a drop-in replacement for RDBMS “on ACID”
      » No standardization -> product lock-in
      » Relatively immature (support, bugs, community)
  72. Summary
      Choose the right tool for the job
      » Relational databases and NoSQL databases are designed to meet different needs
      » RDBMS-only should not be a default
      » NOSQL databases outperform RDBMSs in their particular niche
      » No one-size-fits-all / silver bullet
      ...but you don’t have to choose one
  73. Summary
      Polyglot Persistence
      » Poly: many, Glot: language
      » Meshing up persistence mechanisms to best meet requirements
      » Good integration stories:
          ▪ E.g. Neo4j + JDBC using JTA
