July 11th, 2010
Apples, Oranges and NOSQL

   Roi Aldaag Architect & Consultant
   Nadav Wiener Architect & Consultant
Transcript of "Seminar.2010.NoSql"

  1. 1. July 11th, 2010
  2. 2. Apples, Oranges and NOSQL Roi Aldaag Architect & Consultant Nadav Wiener Architect & Consultant
  3. 3. Agenda Introduction » What is NoSQL? » What’s “wrong” with RDBMS? » Why now? 3
  4. 4. Agenda RDBMS vs. NoSQL » Scaling » CAP Theorem » ACID vs. BASE 4
  5. 5. Agenda NoSQL Taxonomy » Key / Value » Column » Document » Graph 5
  6. 6. Agenda How to choose? » Comparing Apples to Oranges » Polyglot Persistence 6
  7. 7. Introduction
  8. 8. Introduction Question: What do they all have in common? 8
  9. 9. Introduction Before we answer – some facts: 9
  10. 10. Introduction Before we answer – some facts: Daily Page Views: 7.8×10^9, 7.1×10^9, 550×10^6, 350×10^6, 82×10^6. Daily Visitors: 620×10^6, 500×10^6, 56×10^6, 37×10^6, 12×10^6. Data size: Petabytes, Petabytes, Petabytes, Terabytes, Terabytes. (July, 2010: http://www.alexa.com) 10
  11. 11. Introduction Answer: They use NoSQL data stores 11
  12. 12. Introduction Why!? 12
  13. 13. Introduction Relational DBs Have Scaling Limitations » ACID doesn’t scale well horizontally ▪ Sharding breaks relations ▪ Joins are inefficient » Transaction overhead » Schema is not flexible ▪ Predefined ▪ Hard to evolve 13
  14. 14. Introduction What is NoSQL? » NO SQL / Not Only SQL » A collective description of open source, non-relational data stores ▪ Highly distributed ▪ Highly scalable ▪ Not ACID and... doesn’t use SQL » Term coined at a 2009 convention called “NoSQL” (Eric Evans) » Started a movement that is gaining momentum 14
  15. 15. Introduction 15
  16. 16. Introduction Why now? » NoSQL data stores predate RDBMS (1970) ▪ But remained a niche » RDBMS – most popular and generic option » Web 2.0 introduced new requirements: ▪ Exponential increase in data ▪ Information connectivity ▪ Semi-structured data » NoSQL data stores had answers ▪ When the time was right ▪ When RDBMSs didn’t 16
  17. 17. Introduction It’s theory time: 17
  18. 18. Scaling 18
  19. 19. Scaling Scaling Up » Adding resources to a single node in a system » Add more CPUs or memory » Move system to a larger machine » Pros: ▪ Quick and simple » Cons: ▪ Outgrowing the capacity of the largest system available (Moore’s law) ▪ Expensive ▪ Creates vendor lock-in 19
  20. 20. Scaling Scaling Out » Add more nodes to a system » Functional Scaling (vertical) ▪ Grouping data by function and spreading functional groups across databases » Sharding (horizontal) ▪ Splitting same functional data across multiple databases » Pros: More flexible » Cons: More complex 20
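
      The sharding bullet above can be sketched as follows. This is a minimal illustration, not taken from the slides: the names `shard_for`, `put`, and `get` are hypothetical, three in-memory dictionaries stand in for three database nodes, and a simple hash-modulo scheme is assumed (real systems often use consistent hashing instead).

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (not Python's salted hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Assumption: three dictionaries stand in for three database nodes.
shards = [{} for _ in range(3)]

def put(key: str, value) -> None:
    """Store the value on the shard that owns the key."""
    shards[shard_for(key, len(shards))][key] = value

def get(key: str):
    """Look up the value on the shard that owns the key."""
    return shards[shard_for(key, len(shards))].get(key)

# The same functional data (users) is split across all shards.
for user in ["john", "fred", "mary", "anna"]:
    put(user, {"name": user})
```

      Every lookup touches exactly one shard, which is why single-key access scales well. A query that spans users has to visit every shard, which is the "more complex" side of the trade-off.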
  21. 21. Distributed Databases
  22. 22. Distributed Databases » Many nodes » Same database (diagram: Node 1, Node 2, Node 3) 22
  23. 23. Distributed Databases What are the requirements of distributed databases? » Consistency ▪ All clients can see the same data » Availability ▪ All clients can always access data » Partition tolerance ▪ The ability to continue working when the network topology is broken ▪ The ability to recover once the network is healed 23
  24. 24. Distributed Databases CAP Theorem (E. Brewer, N. Lynch) » You can fully satisfy at most 2 out of 3 ▪ Compromise on 3rd » Not “all or nothing” ▪ Choose various levels of consistency, availability or partition tolerance » Recognize which of the CAP rules your business needs for the task 24
  25. 25. Distributed Databases CA: Consistency & Availability » Partition Tolerance is compromised » Single site clusters (easier to ensure all nodes are always in contact) » When a network partition occurs, the system blocks » e.g. Two Phase Commit (2PC) Partition Tolerance 25
  26. 26. Distributed Databases CP: Consistency & Partitioning » Availability is compromised » Access to some data may be temporarily limited » The rest is still consistent/accurate » e.g. Sharded database » TBD sample Partition Tolerance 26
  27. 27. Distributed Databases AP: Availability & Partitioning » Consistency is compromised » System is still available under partitioning » Some data returned may be temporarily not up-to-date » Requires conflict resolution strategy » e.g. DNS, caches, Master/Slave replication » TBD sample Partition Tolerance 27
  28. 28. ACID vs. BASE
  29. 29. ACID vs. BASE ACID – a quick recap » Atomicity ▪ When a part of the transaction fails -> the entire transaction fails; database state is left unchanged » Consistency ▪ A transaction takes the database from one consistent state to another » Isolation ▪ A transaction can't see dirty state from other transactions » Durability ▪ Commit means commit. 29
  30. 30. ACID vs. BASE BASE » The CAP complement of ACID ▪ Just had to be called BASE ▪ Backronym: » Basically Available » Soft State » Eventually Consistent 30
  31. 31. ACID vs. BASE RDBMS & ACID / NoSQL & BASE » RDBMSs strive to provide ACID guarantees ▪ ACID forces consistency » NoSQL solutions often scale through BASE ▪ BASE accepts that conflicts will happen 31
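
      Atomicity from the recap above can be illustrated with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `accounts` table, the `transfer` helper, and the balances are invented for illustration and do not appear in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint makes an overdraft fail, which triggers the rollback.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

# The credit to bob succeeds first, then the debit from alice violates the
# CHECK constraint, so the whole transaction is rolled back.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

      After the failed transfer both balances are unchanged, even though the first `UPDATE` had already run inside the transaction.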
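
      "BASE accepts that conflicts will happen" can be made concrete with a small sketch. It assumes one particular reconciliation policy, last-write-wins by timestamp, which is only one of several strategies. The `merge` function and the replica contents are hypothetical.

```python
def merge(replica_a: dict, replica_b: dict) -> dict:
    """Reconcile two replicas; for each key keep the (value, timestamp)
    pair with the newer timestamp (last-write-wins, an assumed policy)."""
    merged = dict(replica_a)
    for key, (value, ts) in replica_b.items():
        if key not in merged or ts > merged[key][1]:
            merged[key] = (value, ts)
    return merged

# Two replicas accepted writes independently while partitioned.
replica_a = {"cart": (["apple"], 1), "nick": ("dude1934", 5)}
replica_b = {"cart": (["apple", "orange"], 2)}

# Soft state: the replicas disagree until they exchange data;
# after merging, both converge to the same view.
converged = merge(replica_a, replica_b)
```

      Until the merge runs, a client reading `cart` from the first replica sees stale data. That is the consistency compromise the AP slide describes.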
  32. 32. Taxonomy
  33. 33. Taxonomy Key / Value, Column, Document (XML, TXT, BIN), Graph 33
  34. 34. Taxonomy Key / Value Databases 34
  35. 35. Taxonomy Key/Value Stores » Simple Key / Value lookups (DHT) » Value is opaque » Focus on scaling to huge amounts of data » Designed to handle massive load » E.g. ▪ Riak ▪ Project Voldemort ▪ Redis (callout on the slide: “Based on Amazon’s Dynamo paper”) 35
  36. 36. Taxonomy Key/Value e.g.: Riak » No single point of failure » No machines are special or central » MapReduce queries (Erlang / JavaScript) » HTTP/JSON API » Ring cluster with automatic replication » Elastic / partition rebalancing » Written in: Erlang, C, JavaScript » Developed by: Basho Technologies » Java client: (jonjlee / riak-java-client) 36
  37. 37. Key/Value e.g.: Riak Data Model » Key / Value pairs are stored in a Bucket » A Bucket ~ a namespace Versioning » Each update is tracked by a Vector Clock ▪ An algorithm for determining ordering and detecting conflicts » When in conflict ▪ Last write wins / manual resolution 37
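
      The vector clock idea above can be sketched in a few lines. This illustrates the general algorithm rather than Riak's internal implementation. The function names and the node names "A" and "B" are assumptions.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    new_clock = dict(clock)
    new_clock[node] = new_clock.get(node, 0) + 1
    return new_clock

def compare(a: dict, b: dict) -> str:
    """Order two vector clocks: 'before', 'after', 'equal' or 'conflict'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "conflict"  # concurrent updates: neither version descends from the other

v1 = increment({}, "A")   # first write, handled by node A
v2 = increment(v1, "A")   # second write via node A, descends from v1
v3 = increment(v1, "B")   # concurrent write via node B, also based on v1
```

      Here `v1` orders before `v2`, while `v2` and `v3` are siblings that need either last-write-wins or manual resolution.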
  38. 38. Key/Value e.g.: Riak Example: REST API » Read an object GET /riak/bucket/key » Store a new object POST /riak/bucket » Store an object with existing key (update) PUT /riak/bucket/key 38
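
      The three REST calls above could be issued from Python's standard library roughly as follows. This sketch only builds the requests and does not send them. The base URL (host and port), the JSON content type, and the helper names are assumptions, not taken from the slides.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8098"  # assumed address of a local Riak node

def read_object(bucket: str, key: str) -> Request:
    """GET /riak/bucket/key"""
    return Request(f"{BASE_URL}/riak/{bucket}/{key}", method="GET")

def store_new_object(bucket: str, value: dict) -> Request:
    """POST /riak/bucket (the server is expected to choose the key)."""
    return Request(f"{BASE_URL}/riak/{bucket}",
                   data=json.dumps(value).encode("utf-8"),
                   headers={"Content-Type": "application/json"},
                   method="POST")

def update_object(bucket: str, key: str, value: dict) -> Request:
    """PUT /riak/bucket/key"""
    return Request(f"{BASE_URL}/riak/{bucket}/{key}",
                   data=json.dumps(value).encode("utf-8"),
                   headers={"Content-Type": "application/json"},
                   method="PUT")

req = update_object("users", "john", {"role": "admin"})
```

      Passing a request to `urllib.request.urlopen` would actually send it, which needs a running server.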
  39. 39. Key/Value e.g.: Riak MapReduce » A framework supporting distributed computing on large data sets on clusters of machines » Leverage parallel processing power » Introduced by Google » Inspired by map / reduce functions in functional programming » Map step » Reduce step 39
  40. 40. Key/Value e.g.: Riak MapReduce example: Inverted Index » Map » Parse each document » Emit a sequence of <word, doc_id> pairs. Input <doc_id, doc_text> -> output <word, doc_id>: Node 1: <100, TXT1> -> <word1, 100>, <word2, 100>; Node 2: <200, TXT2> -> <word2, 200>; Node 3: <300, TXT3> -> <word3, 300> 40
  41. 41. Key/Value e.g.: Riak MapReduce example: Inverted Index » Reduce » Accept all pairs for a given word » Sort the corresponding document IDs » Emit a <word, list(document ID)> pair. Output <word, list(document_id)>: <word1, (100)>, <word2, (100, 200)>, <word3, (300)> 41
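
      The map and reduce steps of the inverted index can be sketched in a single process as below. This is not Riak's MapReduce API. The function names and the sample documents are made up, and the grouping step that a real framework performs between the two phases is done here with a dictionary.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Parse one document and emit a sequence of (word, doc_id) pairs."""
    return [(word, doc_id) for word in text.lower().split()]

def reduce_phase(word, doc_ids):
    """Accept all pairs for one word and emit (word, sorted document IDs)."""
    return word, sorted(set(doc_ids))

def inverted_index(documents):
    # Shuffle step: group the emitted pairs by word.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, emitted_id in map_phase(doc_id, text):
            grouped[word].append(emitted_id)
    return dict(reduce_phase(word, ids) for word, ids in grouped.items())

docs = {100: "apples oranges", 200: "oranges", 300: "nosql"}
index = inverted_index(docs)
# index == {"apples": [100], "oranges": [100, 200], "nosql": [300]}
```

      In a cluster each node would run `map_phase` on its own documents in parallel, and the framework would route all pairs for a given word to the same reducer.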
  42. 42. Taxonomy BigTable and Column Oriented Databases 42
  43. 43. Taxonomy Column Stores – BigTable derivatives » Conceptually a single, infinitely large table » Each row can have a different number of columns » Table is sparse: |rows|*|columns| > |values| » Based on Google’s BigTable paper » E.g. ▪ Cassandra ▪ HBase ▪ Hypertable 43
  44. 44. Taxonomy Use Case: Manage products with diverse attributes » RDBMS: ▪ Create a central table with common attributes ▪ Create a table per product with unique attributes ▪ Use a join query ▪ Alternatively create a table that holds metadata on products » NoSQL: ▪ Column oriented database ▪ Use arbitrary columns 44
  45. 45. Taxonomy Column Store e.g.: Cassandra » Data model: Google’s BigTable » Infrastructure: Amazon Dynamo » Incremental scalability » Flexible schema » No single point of failure (Distributed P2P) » Optimistic replication (Gossip protocol) » Written in: Java » Developed by: Facebook » Java client: e.g. Hector / Thrift 45
  46. 46. Column e.g.: Cassandra Data Model » Column ▪ Smallest increment of data: tuple of name, value, timestamp
      {
        name: "emailAddress",
        value: "nosql@alphacsp.com",
        timestamp: 123456789
      }
  47. 47. Column e.g.: Cassandra » SuperColumn ▪ A sorted, associative, unbounded array of columns
      { // this is a SuperColumn
        name: "homeAddress",
        // with an unbounded array of Columns
        value: {
          // each key is the name of the Column
          street: {name: "street", value: "s", timestamp:...},
          city: {name: "city", value: "c", timestamp:...},
          zip: {name: "zip", value: "z", timestamp:...}
        }
      }
  48. 48. Column e.g.: Cassandra » ColumnFamily ▪ A container (~Table) for columns sorted by their names ▪ Column Families are referenced and sorted by row keys
      Users = { // ColumnFamily
        john: { // key to row in CF
          "role" : "admin",
          "status" : "offline",
          "nick" : "dude1934"
        }, // end row
        fred: { // another row
          "nick" : "freddy",
          "email" : "fred@example.com",
          "age" : "25",
          "gender" : "male",…
        },… // more rows
      }
  49. 49. Column e.g.: Cassandra » Keyspace ▪ The outermost grouping of data (~DB Schema) ▪ Contains ColumnFamilies ▪ There is no imposed relationship between ColumnFamilies 49
  50. 50. Column e.g.: Cassandra » Example (diagram): a Keyspace containing a Tweets CF and a Timeline CF 50
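
      The Column and ColumnFamily model above can be mimicked with nested dictionaries. This is a toy in-memory sketch, not the Cassandra or Hector API. The `ColumnFamily` class and its methods are hypothetical, and the rule that the newer timestamp wins is an assumption for the example.

```python
import time
from collections import defaultdict

class ColumnFamily:
    """Toy column family: row key -> {column name -> (value, timestamp)}."""

    def __init__(self, name):
        self.name = name
        self.rows = defaultdict(dict)

    def insert(self, row_key, column, value, timestamp=None):
        """Write one column; an older timestamp never overwrites a newer one."""
        ts = time.time() if timestamp is None else timestamp
        current = self.rows[row_key].get(column)
        if current is None or ts >= current[1]:
            self.rows[row_key][column] = (value, ts)

    def get_row(self, row_key):
        """Return the row's columns sorted by column name, without timestamps."""
        columns = self.rows.get(row_key, {})
        return {name: value for name, (value, _) in sorted(columns.items())}

users = ColumnFamily("Users")
users.insert("john", "role", "admin", timestamp=1)
users.insert("john", "status", "offline", timestamp=1)
users.insert("john", "nick", "dude1934", timestamp=1)
# Rows are sparse: fred has columns that john does not have.
users.insert("fred", "email", "fred@example.com", timestamp=1)
users.insert("fred", "age", "25", timestamp=1)
```

      No schema change is needed to give `fred` an `email` column. Each row simply stores the columns it has.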
  51. 51. Taxonomy Document Oriented Databases (XML, TXT, BIN) 51
  52. 52. Taxonomy Document Store » Store semi-structured documents (think JSON) » Document versioning » Map/Reduce based queries, sorting, aggregation, etc. » DB is aware of internal structure » E.g. ▪ MongoDB ▪ CouchDB ▪ Jackrabbit (JCR JSR 170) 52
  53. 53. Taxonomy Use Case: Blog with tagged posts and comments » RDBMS: ▪ Table for each: posts, comments, tags ▪ Foreign key relations » NoSQL: ▪ Document storage ▪ Store post + tags + comments as a document 53
  54. 54. Taxonomy Document Store e.g.: MongoDB » MongoDB (from "humongous") » Manages collections of JSON-like documents (BSON) » Queries can return specific fields of documents » Supports secondary indexes » Atomic operations on single documents » Developed by: 10gen » Written in: C++ » Clients: Java, Scala and more 54
  55. 55. Document e.g.: MongoDB Example: Blog posts » Suppose you host a blog, where each post is tagged:
      db.posts.save({
        _id : 3,
        author : "john",
        title : "Apples, Oranges and NOSQL",
        text : "This article will…",
        tags : [ "database", "nosql" ]
      });
      » Notice how posts have an array of tags
  56. 56. Document e.g.: MongoDB » MongoDB supports secondary indexes and a query optimizer ▪ Compound indexes are also supported
      db.posts.ensureIndex({ tags: 1 });
      db.posts.ensureIndex({ author: 1 });
      db.posts.find({ author: "john", tags: "nosql" });
      // Result:
      { "_id" : 3, "author" : "john", "title" : "Apples, Oranges and NOSQL", "text" : "This article will…", "tags" : ["database", "nosql"] }
  57. 57. Document e.g.: MongoDB » Let's update our post to include some comments:
      db.posts.update({ _id: 3 }, {
        $inc: { comments_count: 4 },
        $pushAll: { comments: [
          { text: "Comment 1" },
          { text: "Comment 2", author: "Mr. T" },
          { text: "Comment 3" },
          { text: "Comment 4" }
        ] }
      });
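
      The effect of the save, find, and update calls on the stored document can be followed with plain Python dictionaries. This is an emulation for illustration only. It does not use a MongoDB driver, and the helpers `save`, `find`, and `add_comments` are hypothetical stand-ins for the shell commands on the slides.

```python
posts = {}  # stands in for the posts collection, keyed by _id

def save(doc):
    """Insert or replace a document by its _id."""
    posts[doc["_id"]] = doc

def find(**criteria):
    """Match on equality; for array fields, match if the array contains the value."""
    def matches(doc):
        for field, wanted in criteria.items():
            actual = doc.get(field)
            if isinstance(actual, list):
                if wanted not in actual:
                    return False
            elif actual != wanted:
                return False
        return True
    return [doc for doc in posts.values() if matches(doc)]

def add_comments(post_id, comments):
    """Emulate the update: bump the counter and append to the comments array."""
    post = posts[post_id]
    post["comments_count"] = post.get("comments_count", 0) + len(comments)
    post.setdefault("comments", []).extend(comments)

save({"_id": 3, "author": "john", "title": "Apples, Oranges and NOSQL",
      "tags": ["database", "nosql"]})
add_comments(3, [{"text": "Comment 1"},
                 {"text": "Comment 2", "author": "Mr. T"}])
found = find(author="john", tags="nosql")
```

      The post, its tags, and its comments live in one document, so reading a post with its comments needs no join.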
  58. 58. Taxonomy Graph Databases 58
  59. 59. Taxonomy Graph databases » Inspired by mathematical graph theory G=(V,E) » Models the structure of data » Navigational data model » Scalability / data complexity » Data model: Key-Value pairs on Edges / Nodes » Relationships: Edges between Nodes » E.g. ▪ Neo4j ▪ Pregel (Google’s PageRank) ▪ AllegroGraph 59
  60. 60. Taxonomy Use Case: Connected data - deep relationship links between users in a social network » RDBMS ▪ Complex recursive algorithm ▪ Multiple Self joins ▪ Round trips to DB / bulk read and resolve in RAM » NoSQL: ▪ Graph Storage ▪ Network traversal 60
  61. 61. Taxonomy Graph e.g.: Neo4j » High-performance graph engine » Embedded / disk based » Work with OO model: nodes, relationships, properties » ACID Transactions ▪ JTA support – participate in 2PC with your RDBMS » Developed by: Neo Technology » Written in: Java » Clients: Java, client libraries in other platforms 61
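
      The "network traversal" bullet can be illustrated with a breadth-first search over an adjacency list. This is a generic sketch, not the Neo4j traversal API. The sample social graph and the function name `within_depth` are invented.

```python
from collections import deque

# Assumed sample data: who is directly connected to whom.
friends = {
    "john": ["fred", "mary"],
    "fred": ["john", "anna"],
    "mary": ["john"],
    "anna": ["fred", "zoe"],
    "zoe": ["anna"],
}

def within_depth(graph, start, max_depth):
    """Return {person: distance} for everyone reachable within max_depth hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbour in graph.get(person, []):
            if neighbour not in distance:
                distance[neighbour] = distance[person] + 1
                queue.append(neighbour)
    del distance[start]
    return distance

friends_of_friends = within_depth(friends, "john", 2)
# friends_of_friends == {"fred": 1, "mary": 1, "anna": 2}
```

      Each extra level of depth is one more step of the same loop. In the relational version each level would typically add another self join or another round trip to the database.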
  62. 62. Graph e.g.: Neo4j http://neo4j.org/ 62
  63. 63. Comparing Apples to Oranges
  64. 64. Comparing Apples to Oranges Comparing Data Structures » RDBMS ▪ Databases contain tables, columns and rows ▪ All rows have the same structure ▪ Inherent ORM mismatch » NoSQL ▪ Choose your data structure ▪ Data is stored in natural structure (e.g. Documents, Graphs, Objects) 64
  65. 65. Comparing Apples to Oranges Comparing Schema Flexibility » RDBMS ▪ Strict schema, difficult to evolve ▪ Maintains relations and enforces data integrity » NoSQL ▪ Structure of data can be changed dynamically • e.g. Column stores – Cassandra ▪ Data can sometimes be completely opaque • e.g. Key/Value – Project Voldemort 65
  66. 66. Comparing Apples to Oranges Comparing Normalization & Relations » RDBMS ▪ The data model is normalized to remove data duplication ▪ Normalization establishes table relations » NoSQL ▪ Denormalization is not a dirty word ▪ Relations are not explicitly defined ▪ Related data is usually grouped and stored as one unit • E.g. document, column 66
  67. 67. Comparing Apples to Oranges Comparing Data Access » RDBMS ▪ CRUD operations using SQL ▪ Access data from multiple tables using SQL joins ▪ Generic API such as JDBC » NoSQL ▪ Proprietary APIs and DSLs (e.g. Pig / Hive / Gremlin) ▪ MapReduce, graph traversals ▪ REST APIs, portable serialization formats • BSON, JSON, Apache Thrift, Memcached 67
  68. 68. Comparing Apples to Oranges Comparing Reporting Capabilities » RDBMS ▪ Slice and dice data, then reassemble any way you like » NoSQL ▪ Hard to repurpose data for ad-hoc usage • Plan ahead ▪ Think in advance • How and what you store • Data access patterns 68
  69. 69. Summary
  70. 70. Summary Why NOSQL / BASE » ACID ruled exclusively for the last 40 years ▪ Doesn’t compromise on consistency » Database industry neglected distributed DBs w/ availability » Vacuum was filled with “NoSQL” BASE architectures ▪ Strict A and P, minimize C compromise » Relational databases are now trying to catch up 70
  71. 71. Summary NoSQL Limitations » Missing some query capabilities ▪ joins / composite transactions » Eventual consistency -- not for every problem » Not a drop-in replacement for RDBMS “on ACID” » No standardization -> product lock-in » Relatively immature (support, bugs, community) 71
  72. 72. Summary Choose the right tool for the job » Relational databases and NoSQL databases are designed to meet different needs » RDBMS-only should not be a default » NOSQL databases outperform RDBMSs in their particular niche » No one size fits all / Silver bullet ...but you don’t have to choose one 72
  73. 73. Summary Polyglot Persistence » Poly: many, Glot: language » Meshing up persistence mechanisms to best meet requirements » Good integration stories: ▪ E.g. Neo4j + JDBC using JTA 73