July 11th, 2010
Apples, Oranges and NOSQL

   Roi Aldaag Architect & Consultant
   Nadav Wiener Architect & Consultant
Agenda

Introduction
» What is NoSQL?
» What’s “wrong” with RDBMS?
» Why now?





RDBMS vs. NoSQL
» Scaling
» CAP Theorem
» ACID vs. BASE





NoSQL Taxonomy
»   Key / Value
»   Column
»   Document
»   Graph





How to choose?
» Comparing Apples to Oranges
» Polyglot Persistence




Introduction

Question: What do they all have in common?





Before we answer – some facts:








[One column per web site; the site logos on the slide were images and are not preserved]

Daily Page Views    7.8×10^9     7.1×10^9     550×10^6     350×10^6     82×10^6
Daily Visitors      620×10^6     500×10^6     56×10^6      37×10^6      12×10^6
Data size           Petabytes    Petabytes    Petabytes    Terabytes    Terabytes




   July, 2010: http://www.alexa.com


Answer: They use NoSQL data stores








               Why!?





Relational DBs Have Scaling Limitations
» ACID doesn’t scale well horizontally
    Sharding breaks relations
    Joins are inefficient
» Transactions overhead
» Schema is not flexible
    Predefined
    Hard to evolve





What is NoSQL?
» NO SQL / Not Only SQL
» A collective description of Open Source, Non-relational,
  data stores
    Highly distributed
    Highly scalable
    Not ACID and... doesn’t use SQL
» Term popularized by a 2009 meetup named “NoSQL” (name suggested by Eric Evans)
» Started a movement that is gaining momentum




Why now?
» Non-relational data stores predate the relational model (1970)
    But remained a niche
» RDBMS – most popular and generic option
» Web 2.0 introduced new requirements:
    Exponential increase in data
    Information connectivity
    Semi-structured data
» NoSQL data stores had answers
    When time was right
    When RDBMSs didn’t


               It’s theory time:




Scaling

Scaling Up
» Adding resources to a single node in a system
   » Add more CPUs or memory
» Move system to a larger machine
» Pros:
    Quick and Simple
» Cons:
    Outgrowing the capacity of the largest
     system available (Moore’s law)
    Expensive
    Creates vendor lock-in


Scaling Out
» Add more nodes to a system
» Functional Scaling (vertical)
    Grouping data by function and spreading
     functional groups across databases
» Sharding (horizontal)
    Splitting same functional data across
     multiple databases
» Pros: More flexible
» Cons: More complex
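
The two scale-out styles above can be sketched as two routing functions. This is a minimal illustration only: the database names, the routing table and the hash function are assumptions made up for the example, not part of any product.

```javascript
// Hypothetical database names, for illustration only.
const functionalRoutes = { users: "users-db", orders: "orders-db" };
const userShards = ["users-db-0", "users-db-1", "users-db-2"];

// Functional scaling: each functional group lives in its own database.
function routeByFunction(group) {
  const db = functionalRoutes[group];
  if (db === undefined) throw new Error("unknown functional group: " + group);
  return db;
}

// Simple deterministic string hash (not cryptographic).
function hashKey(key) {
  let h = 0;
  for (const ch of String(key)) {
    h = (h * 31 + ch.codePointAt(0)) >>> 0; // keep an unsigned 32-bit integer
  }
  return h;
}

// Sharding: rows of the same functional group are split across databases by key.
function routeByShard(key, shards) {
  return shards[hashKey(key) % shards.length];
}

console.log(routeByFunction("orders"));      // orders-db
console.log(routeByShard("john", userShards)); // always the same shard for "john"
```

Modulo routing is the simplest choice, but changing the number of shards remaps most keys; Dynamo-style stores use consistent hashing to limit that movement.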


Distributed Databases


» Many nodes
» Same database

[Diagram: Node 1, Node 2, Node 3]





What are the requirements from distributed databases?
» Consistency
    All clients can see the same data
» Availability
    All clients can always access data
» Partition tolerance
    The ability to continue working when the network topology is
     broken
    The ability to recover once the network is healed





CAP Theorem (conjectured by E. Brewer; proved by S. Gilbert and N. Lynch)
» You can fully satisfy at most 2 out of 3
    Compromise on 3rd
» Not “all or nothing”
    Choose various levels of consistency, availability or partition
     tolerance
» Recognize which of the CAP rules your business needs for the
  task





CA: Consistency & Availability
» Partition Tolerance is compromised
» Single site clusters (easier to ensure all nodes are always in
  contact)
» When a network partition occurs, the system blocks
» e.g. Two Phase Commit (2PC)





CP: Consistency & Partitioning
»   Availability is compromised
»   Access to some data may be temporarily limited
»   The rest is still consistent/accurate
»   e.g. Sharded database
»   TBD sample





AP: Availability & Partitioning
»   Consistency is compromised
»   System is still available under partitioning
»   Some data returned may be temporarily not up-to-date
»   Requires conflict resolution strategy
»   e.g. DNS, caches, Master/Slave replication
»   TBD sample


ACID vs. BASE

ACID – a quick recap
» Atomicity
    When a part of the transaction fails -> the entire transaction fails;
     Database state is left unchanged
» Consistency
    A transaction takes database from one consistent state to another
» Isolation
    A transaction can't see dirty state from other transactions
» Durability
    Commit means commit.



BASE
» The CAP complement of ACID
    Just had to be called BASE
    Backronym:
» Basically Available
» Soft State
» Eventually Consistent





RDBMS & ACID / NoSQL & BASE
» RDBMSs strive to provide ACID guarantees
    ACID forces consistency


» NoSQL solutions often scale through BASE
    BASE accepts that conflicts will happen




Taxonomy

[Diagram: Key / Value, Column, Document (XML, TXT, BIN), Graph]








           Key / Value Databases





Key/Value Stores
»   Simple Key / Value lookups (DHT)
»   Value is opaque
»   Focus on scaling to huge amounts of data
»   Designed to handle massive load
»   E.g.
     Riak, Project Voldemort (based on Amazon’s Dynamo paper)
     Redis




Key/Value e.g.: Riak
»   No single point of failure
»   No machines are special or central
»   MapReduce queries (Erlang / Javascript)
»   HTTP/JSON API
»   Ring cluster with automatic replication
»   Elastic / partition rebalancing
» Written in: Erlang, C, Javascript
» Developed by: Basho Technologies
» Java client: (jonjlee / riak-java-client)



Data Model
» Key / Value pairs are stored in a Bucket
» A Bucket ~ a namespace

Versioning
» Each update is tracked by a Vector Clock
    An algorithm for determining ordering and detecting conflicts
» When in conflict
    Last wins / manual resolution
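
How a vector clock orders updates and detects conflicts can be sketched as follows. This is a generic illustration of the algorithm, not Riak's implementation; the node names and the `tick` / `compare` helpers are assumptions for the example.

```javascript
// A vector clock is modeled as a plain object: { nodeId: counter }.

// Return a copy of `clock` with `nodeId`'s counter incremented.
function tick(clock, nodeId) {
  return { ...clock, [nodeId]: (clock[nodeId] || 0) + 1 };
}

// Compare two clocks: "equal", "before", "after", or "concurrent" (a conflict).
function compare(a, b) {
  let aAhead = false;
  let bAhead = false;
  const nodes = new Set([...Object.keys(a), ...Object.keys(b)]);
  for (const node of nodes) {
    const x = a[node] || 0;
    const y = b[node] || 0;
    if (x > y) aAhead = true;
    if (x < y) bAhead = true;
  }
  if (aAhead && bAhead) return "concurrent";
  if (aAhead) return "after";
  if (bAhead) return "before";
  return "equal";
}

const v1 = tick({}, "nodeA"); // first write, handled by nodeA
const v2 = tick(v1, "nodeA"); // later update that has seen v1
const v3 = tick(v1, "nodeB"); // update on nodeB that has also only seen v1

console.log(compare(v1, v2)); // before
console.log(compare(v2, v3)); // concurrent
```

Only the "concurrent" case needs a resolution strategy (last wins or manual); "before" and "after" have an unambiguous winner.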




Example: REST API
» Read an object

   GET /riak/bucket/key

» Store a new object

   POST /riak/bucket

» Store an object with existing key (update)
   PUT /riak/bucket/key



MapReduce
» A framework supporting distributed computing on large data
  sets on clusters of machines
» Leverage parallel processing power
» Introduced by Google
» Inspired by map / reduce functions in functional programming
   » Map step
   » Reduce step





MapReduce example: Inverted Index
» Map
  » Parse each document
  » Emit a sequence of <word, doc_id> pairs

Node      Input <doc_id, doc_text>     Output <word, doc_id>
Node 1    <100, TXT1>                  <word1, 100>, <word2, 100>
Node 2    <200, TXT2>                  <word2, 200>
Node 3    <300, TXT3>                  <word3, 300>

» Reduce
   » Accept all pairs for a given word
   » Sort the corresponding document IDs
   » Emit a <word, list(document ID)> pair


              <word,     list(document_id)>
              < word1   ,(100)    >,
              < word2   ,(100,200)>,
              < word3   ,(300)    >
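
The map and reduce steps of the inverted index can be sketched in plain JavaScript. This is a single-process stand-in for the cluster, not Riak's MapReduce API; the function names and the sample document texts are assumptions chosen so that the result matches the table above.

```javascript
// Map step: parse one document and emit [word, docId] pairs.
function mapDocument(docId, text) {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  return words.map((word) => [word, docId]);
}

// Reduce step: given all doc IDs for one word, emit [word, sorted unique IDs].
function reduceWord(word, docIds) {
  const unique = [...new Set(docIds)].sort((a, b) => a - b);
  return [word, unique];
}

// Driver standing in for the cluster: group the mapped pairs by word
// (the "shuffle"), then reduce each group.
function buildInvertedIndex(docs) {
  const groups = new Map();
  for (const [docId, text] of docs) {
    for (const [word, id] of mapDocument(docId, text)) {
      if (!groups.has(word)) groups.set(word, []);
      groups.get(word).push(id);
    }
  }
  const index = {};
  for (const [word, ids] of groups) {
    const [key, list] = reduceWord(word, ids);
    index[key] = list;
  }
  return index;
}

const index = buildInvertedIndex([
  [100, "word1 word2"],
  [200, "word2"],
  [300, "word3"],
]);
console.log(index); // { word1: [ 100 ], word2: [ 100, 200 ], word3: [ 300 ] }
```

On a real cluster each node would run the map step on its own documents in parallel, and only the grouped pairs would cross the network.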






           BigTable and Column Oriented Databases





Column Stores – BigTable derivatives
»   Conceptually a single, infinitely large table
»   Each row can have a different number of columns
»   Table is sparse: |rows|*|columns| > |values|
»   Based on Google’s BigTable paper
»   E.g.
      Cassandra
      HBase
      Hypertable




Use Case: Manage products with diverse attributes
» RDBMS:
    Create a central table with common attributes
    Create a table per product with unique attributes
    Use a join query
    Alternatively create a table that holds meta data on products
» NoSQL:
    Column oriented database
    Use arbitrary columns




Column Store e.g.: Cassandra
»   Data model: Google’s BigTable
»   Infrastructure: Amazon Dynamo
»   Incremental scalability
»   Flexible schema
»   No single point of failure (Distributed P2P)
»   Optimistic replication (Gossip protocol)
»   Written in: Java
»   Developed by: Facebook
»   Java client: e.g. Hector / Thrift


Data Model
» Column
   Smallest increment of data: tuple of name, value, timestamp

   {
        name: "emailAddress",
        value: "nosql@alphacsp.com",
        timestamp: 123456789
   }






» SuperColumn
    A sorted, associative, unbounded
     array of columns


{ // this is a SuperColumn
  name: "homeAddress",
  // with an unbounded array of Columns
  value: {
    // each key is the name of its Column
    street: {name: "street", value: "s", timestamp:...},
    city:   {name: "city", value: "c", timestamp:...},
    zip:    {name: "zip", value: "z", timestamp:...}
  }
}


» ColumnFamily
    A container (~Table) for columns sorted by their names
    Column Families are referenced and sorted by row keys
Users = { // ColumnFamily
  john: {   // key to row in CF
   "role"   : "admin",
   "status" : "offline",
   "nick"   : "dude1934"
  }, // end row
  fred: {   // another row
   "nick"    : "freddy",
   "email"   : "fred@example.com",
   "age"     : "25",
   "gender"  : "male",…
  },… // more rows
}

» Keyspace
    The outermost grouping of data (~DB Schema)
    Contains ColumnFamilies
    There is no imposed relationship between ColumnFamilies
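
The nesting of Keyspace, ColumnFamily, row and Column can be modeled with plain objects. This is an in-memory sketch only, not the Cassandra client API: `insertColumn` and `getRow` are hypothetical helpers, and the rule that a newer timestamp wins is a simplification of how timestamps are used to reconcile writes.

```javascript
// Keyspace -> ColumnFamily -> row key -> column name -> { name, value, timestamp }
const keyspace = { Users: {} };

// Write one column; a write older than the stored column is ignored.
function insertColumn(ks, columnFamily, rowKey, name, value, timestamp) {
  const cf = ks[columnFamily];
  if (cf === undefined) throw new Error("unknown ColumnFamily: " + columnFamily);
  const row = cf[rowKey] || (cf[rowKey] = {});
  const existing = row[name];
  if (existing === undefined || timestamp >= existing.timestamp) {
    row[name] = { name, value, timestamp };
  }
}

// Read a row as { columnName: value }, with columns sorted by name.
function getRow(ks, columnFamily, rowKey) {
  const row = (ks[columnFamily] || {})[rowKey] || {};
  const result = {};
  for (const name of Object.keys(row).sort()) {
    result[name] = row[name].value;
  }
  return result;
}

// Rows in the same ColumnFamily may carry different columns.
insertColumn(keyspace, "Users", "john", "role", "admin", 1);
insertColumn(keyspace, "Users", "john", "nick", "dude1934", 1);
insertColumn(keyspace, "Users", "fred", "email", "fred@example.com", 1);
insertColumn(keyspace, "Users", "fred", "age", "25", 1);
insertColumn(keyspace, "Users", "fred", "nick", "freddy", 1);

console.log(getRow(keyspace, "Users", "john")); // { nick: 'dude1934', role: 'admin' }
```

Because a row is just a map of columns, adding a new attribute to one row needs no schema change for the others.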





» Example: a Keyspace containing two ColumnFamilies, Tweets CF and Timeline CF







           Document Oriented Databases





Document Store
»   Store semi-structured documents (think JSON)
»   Document versioning
»   Map/Reduce based queries, sorting, aggregation, etc.
»   DB is aware of internal structure
»   E.g.
     MongoDB
     CouchDB
     JackRabbit (JCR JSR 170)





Use Case: Blog with tagged posts and comments
» RDBMS:
    Table for each: posts, comments, tags
    Foreign relations
» NoSQL:
    Document storage
    Store post + tags + comments as a document
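
The document approach can be sketched with plain JavaScript objects: each post carries its own tags and comments, so finding posts by tag needs no join. The sample posts and the `findByTag` helper are assumptions for illustration, not a database API.

```javascript
// Each blog post is one document; tags and comments are embedded in it.
const posts = [
  {
    _id: 1,
    title: "Hello",
    tags: ["intro"],
    comments: [],
  },
  {
    _id: 3,
    author: "john",
    title: "Apples, Oranges and NOSQL",
    tags: ["database", "nosql"],
    comments: [{ text: "Comment 1" }, { text: "Comment 2", author: "Mr. T" }],
  },
];

// Find posts carrying a tag: no join, the tags are inside each document.
function findByTag(collection, tag) {
  return collection.filter((post) => post.tags.includes(tag));
}

const nosqlPosts = findByTag(posts, "nosql");
console.log(nosqlPosts.map((post) => post.title)); // [ 'Apples, Oranges and NOSQL' ]
```

The trade-off is that a post and everything embedded in it is read and written as one unit.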





Document Store e.g.: MongoDB
»   MongoDB (from "humongous")
»   Manages collections of JSON-like documents (BSON)
»   Queries can return specific fields of documents
»   Supports secondary indexes
»   Atomic operations on single documents

» Developed by: 10gen
» Written in: C++
» Clients: Java, Scala and more



Example: Blog posts
» Suppose you host a blog, where each post is tagged:

   db.posts.save({
       _id   : 3,
       author:"john",
       title : "Apples, Oranges and NOSQL",
       text : "This article will…",
       tags : [ "database", "nosql" ]
   });


» Notice how posts have an array of tags


» MongoDB supports secondary indexes and a query optimizer
    Compound indexes are also supported

   db.posts.ensureIndex({ tags: 1 });
   db.posts.ensureIndex({ author: 1});

   db.posts.find({ author: "john", tags: "nosql" });

   // Result:
   {
           "_id"      :   3,
           "author"   :   "john",
           "title"    :   "Apples, Oranges and NOSQL",
           "text"     :   "This article will…",
           "tags"     :   ["database", "nosql"]
   }



» Let's update our posts to include some comments:

  db.posts.update({ _id: 3 }, {
      $inc: { comments_count: 4},
      $pushAll : {
           comments: [
               { text: "Comment 1" },
               { text: "Comment 2", author: "Mr. T" },
               { text: "Comment 3" },
               { text: "Comment 4" }
         ]
      }
  });
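
What the two update operators do to a single document can be imitated in plain JavaScript. This is a simplified sketch of the `$inc` and `$pushAll` semantics, not MongoDB's implementation; `applyUpdate` is a hypothetical helper.

```javascript
// Return a new document with the $inc and $pushAll parts of `update` applied.
function applyUpdate(doc, update) {
  const result = { ...doc };
  for (const [field, amount] of Object.entries(update.$inc || {})) {
    result[field] = (result[field] || 0) + amount; // a missing field starts at 0
  }
  for (const [field, values] of Object.entries(update.$pushAll || {})) {
    result[field] = [...(result[field] || []), ...values]; // a missing array is created
  }
  return result;
}

const before = { _id: 3, author: "john" };
const after = applyUpdate(before, {
  $inc: { comments_count: 4 },
  $pushAll: {
    comments: [
      { text: "Comment 1" },
      { text: "Comment 2", author: "Mr. T" },
      { text: "Comment 3" },
      { text: "Comment 4" },
    ],
  },
});

console.log(after.comments_count, after.comments.length); // 4 4
```

In the database both changes are applied to the stored document as one atomic operation. Later MongoDB releases replaced `$pushAll` with `$push` combined with `$each`.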







           Graph Databases





Graph databases
»   Inspired by mathematical graph theory G=(V,E)
»   Models the structure of data
»   Navigational data model
»   Scalability / data complexity
»   Data model: Key-Value pairs on Edges / Nodes
»   Relationships: Edges between Nodes
»   E.g.
     Neo4j
     Pregel (Google; used for PageRank)
     AllegroGraph


Use Case: Connected data - deep relationship links
between users in a social network

» RDBMS
    Complex recursive algorithm
    Multiple Self joins
    Round trips to DB / bulk read and resolve in RAM
» NoSQL:
    Graph Storage
    Network traversal
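
The traversal approach can be sketched as a breadth-first search over an adjacency list. The social graph and the `withinHops` helper below are made up for illustration and are not the Neo4j API.

```javascript
// Hypothetical social graph as an adjacency list: user -> direct friends.
const friends = {
  alice: ["bob", "carol"],
  bob: ["alice", "dave"],
  carol: ["alice", "dave"],
  dave: ["bob", "carol", "erin"],
  erin: ["dave"],
};

// Breadth-first traversal: everyone reachable from `start` within `maxDepth`
// hops, mapped to the number of hops on the shortest path.
function withinHops(graph, start, maxDepth) {
  const depth = new Map([[start, 0]]);
  const queue = [start];
  while (queue.length > 0) {
    const user = queue.shift();
    const d = depth.get(user);
    if (d === maxDepth) continue; // do not expand beyond the hop limit
    for (const friend of graph[user] || []) {
      if (!depth.has(friend)) {
        depth.set(friend, d + 1);
        queue.push(friend);
      }
    }
  }
  depth.delete(start);
  return depth;
}

const network = withinHops(friends, "alice", 2);
console.log([...network.keys()]); // [ 'bob', 'carol', 'dave' ]
```

Each hop follows stored edges directly, whereas the relational version needs one more self join (or one more round trip) per level of depth.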



Graph e.g.: Neo4J
»   High-performance graph engine
»   Embedded / disk based
»   Work with OO model: nodes, relationships, properties
»   ACID Transactions
     JTA support – participate in 2PC with your RDBMS
» Developed by: Neo Technology
» Written in: Java
» Clients: Java, client libraries in other platforms







                    http://neo4j.org/

Comparing Apples to Oranges

Comparing Data Structures
» RDBMS
    Databases contain tables, columns and rows
    All rows have the same structure
    Inherent ORM mismatch
» NoSQL
    Choose your data structure
    Data is stored in natural structure (e.g. Documents, Graphs,
     Objects)




Comparing Schema Flexibility
» RDBMS
    Strict schema, difficult to evolve
    Maintains relations and forces data integrity
» NoSQL
    Structure of data can be changed dynamically
      • e.g. Column stores – Cassandra
    Data can sometimes be completely opaque
      • e.g. Key/Value – Project Voldemort




Comparing Normalization & Relations
» RDBMS
    The data model is normalized to remove data duplication
    Normalization establishes table relations
» NoSQL
    Denormalization is not a dirty word
    Relations are not explicitly defined
    Related data is usually grouped and stored as one unit
      • E.g. document, column




Comparing Data Access
» RDBMS
    CRUD operations using SQL
    Access data from multiple tables using SQL joins
    Generic API such as JDBC
» NoSQL
    Proprietary API and DSLs (e.g. Pig / Hive / Gremlin)
    MapReduce, graph traversals
    REST APIs, portable serialization formats
       • BSON, JSON, Apache Thrift, Memcached



Comparing Reporting Capabilities
» RDBMS
    Slice and Dice data, then reassemble any way you like
» NoSQL
    Hard to repurpose data for ad-hoc usage
      • Plan ahead
    Think in advance
      • How and what you store
      • Data access patterns



Summary

Why NOSQL / BASE
» ACID has ruled exclusively for the last 40 years
    It doesn’t compromise on consistency
» Database industry neglected distributed DBs w/ availability
» Vacuum was filled with “NoSQL” BASE architectures
    Strict A and P, minimize C compromise
» Relational databases are now trying to catch up





NoSQL Limitations
» Missing some query capabilities
     Joins / composite transactions
»   Eventual consistency -- not for every problem
»   Not a drop-in replacement for RDBMS “on ACID”
»   No standardization -> product lock-in
»   Relatively immature (support, bugs, community)





Choose the right tool for the job
» Relational databases and NoSQL databases are designed to
  meet different needs
» RDBMS-only should not be a default
» NoSQL databases outperform RDBMSs
  in their particular niche
» No one size fits all / Silver bullet

  ...but you don’t have to choose one





Polyglot Persistence
» Poly: many Glot: language
» Meshing up persistence mechanisms to best meet
  requirements
» Good integration stories:
    E.g. Neo4j + JDBC using JTA




                                                   73

More Related Content

What's hot

State of Cassandra 2012
State of Cassandra 2012State of Cassandra 2012
State of Cassandra 2012jbellis
 
London + Dublin Cassandra 2.0
London + Dublin Cassandra 2.0London + Dublin Cassandra 2.0
London + Dublin Cassandra 2.0jbellis
 
Erlang Cache
Erlang CacheErlang Cache
Erlang Cacheice j
 
Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDB
Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDBBig Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDB
Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDBBigDataCloud
 
On Cassandra Development: Past, Present and Future
On Cassandra Development: Past, Present and FutureOn Cassandra Development: Past, Present and Future
On Cassandra Development: Past, Present and Futurepcmanus
 
NOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLNOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLRamakant Soni
 
Embedded Database Technology | Interbase From Embarcadero Technologies
Embedded Database Technology | Interbase From Embarcadero TechnologiesEmbedded Database Technology | Interbase From Embarcadero Technologies
Embedded Database Technology | Interbase From Embarcadero TechnologiesMichael Findling
 
Massively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache CassandraMassively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache Cassandrajbellis
 
Compaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloCompaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloHortonworks
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperDavid Walker
 
Why Virtualization is important by Tom Phelan of BlueData
Why Virtualization is important by Tom Phelan of BlueDataWhy Virtualization is important by Tom Phelan of BlueData
Why Virtualization is important by Tom Phelan of BlueDataData Con LA
 
Cloud lockin and interoperability v2 indic threads cloud computing conferen...
Cloud lockin and interoperability v2   indic threads cloud computing conferen...Cloud lockin and interoperability v2   indic threads cloud computing conferen...
Cloud lockin and interoperability v2 indic threads cloud computing conferen...IndicThreads
 
JATSPack and JATSPAN, a packaging format specification and a web site
JATSPack and JATSPAN, a packaging format specification and a web siteJATSPack and JATSPAN, a packaging format specification and a web site
JATSPack and JATSPAN, a packaging format specification and a web siteKlortho
 
Getting Started with jClouds: Multi Cloud Framework
Getting Started with jClouds: Multi Cloud FrameworkGetting Started with jClouds: Multi Cloud Framework
Getting Started with jClouds: Multi Cloud FrameworkIndicThreads
 
Conference tutorial: MySQL Cluster as NoSQL
Conference tutorial: MySQL Cluster as NoSQLConference tutorial: MySQL Cluster as NoSQL
Conference tutorial: MySQL Cluster as NoSQLSeveralnines
 
DBArtisan® vs Quest Toad with DB Admin Module
DBArtisan® vs Quest Toad with DB Admin ModuleDBArtisan® vs Quest Toad with DB Admin Module
DBArtisan® vs Quest Toad with DB Admin ModuleEmbarcadero Technologies
 
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Samsung Business USA
 

What's hot (20)

State of Cassandra 2012
State of Cassandra 2012State of Cassandra 2012
State of Cassandra 2012
 
London + Dublin Cassandra 2.0
London + Dublin Cassandra 2.0London + Dublin Cassandra 2.0
London + Dublin Cassandra 2.0
 
Scaling Up vs. Scaling-out
Scaling Up vs. Scaling-outScaling Up vs. Scaling-out
Scaling Up vs. Scaling-out
 
Erlang Cache
Erlang CacheErlang Cache
Erlang Cache
 
Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDB
Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDBBig Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDB
Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDB
 
On Cassandra Development: Past, Present and Future
On Cassandra Development: Past, Present and FutureOn Cassandra Development: Past, Present and Future
On Cassandra Development: Past, Present and Future
 
NOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLNOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQL
 
Embedded Database Technology | Interbase From Embarcadero Technologies
Embedded Database Technology | Interbase From Embarcadero TechnologiesEmbedded Database Technology | Interbase From Embarcadero Technologies
Embedded Database Technology | Interbase From Embarcadero Technologies
 
Massively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache CassandraMassively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache Cassandra
 
Compaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloCompaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache Accumulo
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - Paper
 
NuoDB Product Brochure
NuoDB Product BrochureNuoDB Product Brochure
NuoDB Product Brochure
 
Why Virtualization is important by Tom Phelan of BlueData
Why Virtualization is important by Tom Phelan of BlueDataWhy Virtualization is important by Tom Phelan of BlueData
Why Virtualization is important by Tom Phelan of BlueData
 
Cloud lockin and interoperability v2 indic threads cloud computing conferen...
Cloud lockin and interoperability v2   indic threads cloud computing conferen...Cloud lockin and interoperability v2   indic threads cloud computing conferen...
Cloud lockin and interoperability v2 indic threads cloud computing conferen...
 
JATSPack and JATSPAN, a packaging format specification and a web site
JATSPack and JATSPAN, a packaging format specification and a web siteJATSPack and JATSPAN, a packaging format specification and a web site
JATSPack and JATSPAN, a packaging format specification and a web site
 
Getting Started with jClouds: Multi Cloud Framework
Getting Started with jClouds: Multi Cloud FrameworkGetting Started with jClouds: Multi Cloud Framework
Getting Started with jClouds: Multi Cloud Framework
 
Conference tutorial: MySQL Cluster as NoSQL
Conference tutorial: MySQL Cluster as NoSQLConference tutorial: MySQL Cluster as NoSQL
Conference tutorial: MySQL Cluster as NoSQL
 
DBArtisan® vs Quest Toad with DB Admin Module
DBArtisan® vs Quest Toad with DB Admin ModuleDBArtisan® vs Quest Toad with DB Admin Module
DBArtisan® vs Quest Toad with DB Admin Module
 
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
 
NoSQL
NoSQLNoSQL
NoSQL
 

Similar to Seminar.2010.NoSql

NoSQL overview implementation free
NoSQL overview implementation freeNoSQL overview implementation free
NoSQL overview implementation freeBenoit Perroud
 
Real-world consistency explained
Real-world consistency explainedReal-world consistency explained
Real-world consistency explainedUwe Friedrichsen
 
1. Lecture1_NOSQL_Introduction.pdf
1. Lecture1_NOSQL_Introduction.pdf1. Lecture1_NOSQL_Introduction.pdf
1. Lecture1_NOSQL_Introduction.pdfShaimaaMohamedGalal
 
Presentacion redislabs-ihub
Presentacion redislabs-ihubPresentacion redislabs-ihub
Presentacion redislabs-ihubssuser9d7c90
 
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...DataStax Academy
 
Latency and Consistency Tradeoffs in Modern Distributed Databases
Latency and Consistency Tradeoffs in Modern Distributed DatabasesLatency and Consistency Tradeoffs in Modern Distributed Databases
Latency and Consistency Tradeoffs in Modern Distributed DatabasesScyllaDB
 
Introduction to Apache Cassandra
Introduction to Apache CassandraIntroduction to Apache Cassandra
Introduction to Apache CassandraRobert Stupp
 
Архитектура приложений с использованием MySQL, Петр Зайцев (Percona)
Архитектура приложений с использованием MySQL, Петр Зайцев (Percona)Архитектура приложений с использованием MySQL, Петр Зайцев (Percona)
Архитектура приложений с использованием MySQL, Петр Зайцев (Percona)Ontico
 
MySQL Cluster Scaling to a Billion Queries
MySQL Cluster Scaling to a Billion QueriesMySQL Cluster Scaling to a Billion Queries
MySQL Cluster Scaling to a Billion QueriesBernd Ocklin
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQLDon Demcsak
 
Codemotion 2015 Infinispan Tech lab
Codemotion 2015 Infinispan Tech labCodemotion 2015 Infinispan Tech lab
Codemotion 2015 Infinispan Tech labUgo Landini
 
NoSQL Intro with cassandra
NoSQL Intro with cassandraNoSQL Intro with cassandra
NoSQL Intro with cassandraBrian Enochson
 
Navigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skiesNavigating NoSQL in cloudy skies
Navigating NoSQL in cloudy skiesshnkr_rmchndrn
 
Beyond The Data Grid: Coherence, Normalisation, Joins and Linear Scalability
Beyond The Data Grid: Coherence, Normalisation, Joins and Linear ScalabilityBeyond The Data Grid: Coherence, Normalisation, Joins and Linear Scalability
Beyond The Data Grid: Coherence, Normalisation, Joins and Linear ScalabilityBen Stopford
 

Similar to Seminar.2010.NoSql (20)

NoSQL overview implementation free

Seminar.2010.NoSql

  • 2. Apples, Oranges and NOSQL Roi Aldaag Architect & Consultant Nadav Wiener Architect & Consultant
  • 3. Agenda Introduction » What is NoSQL? » What’s “wrong” with RDBMS? » Why now?
  • 4. Agenda RDBMS vs. NoSQL » Scaling » CAP Theorem » ACID vs. BASE
  • 5. Agenda NoSQL Taxonomy » Key / Value » Column » Document » Graph
  • 6. Agenda How to choose? » Comparing Apples to Oranges » Polyglot Persistence
  • 8. Introduction Question: What do they all have in common?
  • 9. Introduction Before we answer – some facts:
  • 10. Introduction Before we answer – some facts (July 2010, http://www.alexa.com):
        Daily Page Views | 7.8x10^9  | 7.1x10^9  | 550x10^6  | 350x10^6  | 82x10^6
        Daily Visitors   | 620x10^6  | 500x10^6  | 56x10^6   | 37x10^6   | 12x10^6
        Data size        | Petabytes | Petabytes | Petabytes | Terabytes | Terabytes
  • 11. Introduction Answer: They use NoSQL data stores
  • 12. Introduction Why!?
  • 13. Introduction Relational DBs Have Scaling Limitations » ACID doesn’t scale well horizontally ▪ Sharding breaks relations ▪ Joins are inefficient » Transaction overhead » Schema is not flexible ▪ Predefined ▪ Hard to evolve
  • 14. Introduction What is NoSQL? » NO SQL / Not Only SQL » A collective description of open source, non-relational data stores ▪ Highly distributed ▪ Highly scalable ▪ Not ACID and... doesn’t use SQL » Term coined in 2009 at a convention called “NoSQL” (Eric Evans) » Started a movement that is gaining momentum
  • 16. Introduction Why now? » NoSQL data stores predate RDBMS (1970) ▪ But remained a niche » RDBMS – most popular and generic option » Web 2.0 introduced new requirements: ▪ Exponential increase in data ▪ Information connectivity ▪ Semi-structured data » NoSQL data stores had answers ▪ When the time was right ▪ When RDBMSs didn’t
  • 17. Introduction It’s theory time:
  • 18. Scaling
  • 19. Scaling Scaling Up » Adding resources to a single node in a system » Add more CPUs or memory » Move system to a larger machine » Pros: ▪ Quick and simple » Cons: ▪ Outgrowing the capacity of the largest system available (Moore’s law) ▪ Expensive ▪ Creates vendor lock-in
  • 20. Scaling Scaling Out » Add more nodes to a system » Functional Scaling (vertical) ▪ Grouping data by function and spreading functional groups across databases » Sharding (horizontal) ▪ Splitting the same functional data across multiple databases » Pros: More flexible » Cons: More complex
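
As a rough illustration of horizontal sharding, the sketch below routes each key to one of several databases by hashing it. The three in-memory dictionaries standing in for databases, and the helper names `shard_for`, `put` and `get`, are assumptions made for this example only; they are not taken from the slides.

```python
import hashlib

def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count

# Hypothetical setup: three "databases" that all hold the same kind of data (users).
shards = {i: {} for i in range(3)}

def put(key, value):
    """Store the value on whichever shard owns the key."""
    shards[shard_for(key, len(shards))][key] = value

def get(key):
    """Read the value back from the owning shard only."""
    return shards[shard_for(key, len(shards))].get(key)

for name in ["john", "fred", "alice", "bob"]:
    put(name, {"nick": name})
```

Every single-key read or write touches exactly one shard, which is where the flexibility comes from. A query that relates users living on different shards has to be assembled by the application, which is one way to read "sharding breaks relations".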
  • 22. Distributed Databases » Many nodes » Same database (diagram: Node 1, Node 2, Node 3)
  • 23. Distributed Databases What are the requirements of distributed databases? » Consistency ▪ All clients can see the same data » Availability ▪ All clients can always access data » Partition tolerance ▪ The ability to continue working when the network topology is broken ▪ The ability to recover once the network is healed
  • 24. Distributed Databases CAP Theorem (E. Brewer, N. Lynch) » You can fully satisfy at most 2 out of 3 ▪ Compromise on the 3rd » Not “all or nothing” ▪ Choose various levels of consistency, availability or partition tolerance » Recognize which of the CAP rules your business needs for the task
  • 25. Distributed Databases CA: Consistency & Availability » Partition Tolerance is compromised » Single site clusters (easier to ensure all nodes are always in contact) » When a network partition occurs, the system blocks » e.g. Two Phase Commit (2PC)
  • 26. Distributed Databases CP: Consistency & Partitioning » Availability is compromised » Access to some data may be temporarily limited » The rest is still consistent/accurate » e.g. Sharded database » TBD sample
  • 27. Distributed Databases AP: Availability & Partitioning » Consistency is compromised » System is still available under partitioning » Some data returned may be temporarily not up-to-date » Requires conflict resolution strategy » e.g. DNS, caches, Master/Slave replication » TBD sample
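
The AP trade-off can be sketched with two toy replicas that keep accepting reads and writes while partitioned and reconcile afterwards. The `Replica` class, the integer timestamps and the "newest timestamp wins" rule are assumptions chosen for illustration; real stores use a variety of conflict resolution strategies.

```python
class Replica:
    """A toy replica storing key -> (timestamp, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, timestamp):
        # Keep the incoming value only if it is newer than what we already hold.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

def reconcile(a, b):
    """Exchange entries in both directions once the partition heals; the newer timestamp wins."""
    for src, dst in ((a, b), (b, a)):
        for key, (ts, value) in list(src.data.items()):
            dst.write(key, value, ts)

r1, r2 = Replica(), Replica()
r1.write("status", "offline", timestamp=1)
r2.write("status", "offline", timestamp=1)

# Network partition: only r1 receives the update, yet r2 keeps serving reads.
r1.write("status", "online", timestamp=2)
stale = r2.read("status")

# Partition heals: the replicas converge.
reconcile(r1, r2)
```

During the partition `r2` stays available but returns the old value; after `reconcile` both replicas agree again.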
  • 29. ACID vs. BASE ACID – a quick recap » Atomicity ▪ When a part of the transaction fails -> the entire transaction fails; database state is left unchanged » Consistency ▪ A transaction takes the database from one consistent state to another » Isolation ▪ A transaction can't see dirty state from other transactions » Durability ▪ Commit means commit.
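
Atomicity and consistency can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch: the `accounts` table, the `CHECK` constraint and the `transfer` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(amount):
    """Move money from alice to bob; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'",
                (amount,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; bob's credit is rolled back too.
        return False

ok = transfer(150)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The failed transfer leaves both balances untouched even though the first `UPDATE` had already run, which is the "database state is left unchanged" guarantee in miniature.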
  • 30. ACID vs. BASE BASE » The CAP complement of ACID ▪ Just had to be called BASE ▪ Backronym: » Basically Available » Soft State » Eventually Consistent
  • 31. ACID vs. BASE RDBMS & ACID / NoSQL & BASE » RDBMSs strive to provide ACID guarantees ▪ ACID forces consistency » NoSQL solutions often scale through BASE ▪ BASE accepts that conflicts will happen
  • 33. Taxonomy » Key / Value » Column » Document » Graph
  • 34. Taxonomy Key / Value Databases
  • 35. Taxonomy Key/Value Stores » Simple Key / Value lookups (DHT) » Value is opaque » Focus on scaling to huge amounts of data » Designed to handle massive load » E.g. ▪ Riak ▪ Project Voldemort (both based on Amazon’s Dynamo paper) ▪ Redis
  • 36. Taxonomy Key/Value e.g.: Riak » No single point of failure » No machines are special or central » MapReduce queries (Erlang / Javascript) » HTTP/JSON API » Ring cluster with automatic replication » Elastic / partition rebalancing » Written in: Erlang, C, Javascript » Developed by: Basho Technologies » Java client: (jonjlee / riak-java-client)
  • 37. Key/Value e.g.: Riak Data Model » Key / Value pairs are stored in a Bucket » A Bucket ~ a namespace Versioning » Each update is tracked by a Vector Clock ▪ An algorithm for determining ordering and detecting conflicts » When in conflict ▪ Last wins / manual resolution
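
The vector clock idea can be sketched in a few lines. This is a generic textbook version, not Riak's actual implementation; the node names and the `increment`, `descends` and `compare` helpers are assumptions for the example.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def descends(a, b):
    """True if clock a has seen every update that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    """Order two versions, or report a conflict when neither descends from the other."""
    if a == b:
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict"

base = increment({}, "node1")      # first write, handled by node1
left = increment(base, "node1")    # a later update handled by node1
right = increment(base, "node2")   # a concurrent update handled by node2
```

`left` and `right` both descend from `base`, but neither descends from the other, so `compare(left, right)` reports a conflict. That is the point at which a store has to fall back to "last wins" or hand both versions to the application.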
  • 38. Key/Value e.g.: Riak Example: REST API » Read an object: GET /riak/bucket/key » Store a new object: POST /riak/bucket » Store an object with existing key (update): PUT /riak/bucket/key
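
The three calls above could be prepared from Python with the standard library as follows. This sketch only builds the requests and does not send them. The base URL (localhost, port 8098), the JSON content type, and the bucket and key names are assumptions for illustration; only the `/riak/bucket/key` path shapes come from the slide.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8098"  # assumed address of a local node

def read_object(bucket, key):
    """GET /riak/bucket/key"""
    return Request(f"{BASE_URL}/riak/{bucket}/{key}", method="GET")

def store_new_object(bucket, value):
    """POST /riak/bucket (no key supplied in the URL)"""
    body = json.dumps(value).encode("utf-8")
    return Request(
        f"{BASE_URL}/riak/{bucket}",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )

def update_object(bucket, key, value):
    """PUT /riak/bucket/key (store under an existing key)"""
    body = json.dumps(value).encode("utf-8")
    return Request(
        f"{BASE_URL}/riak/{bucket}/{key}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

request = update_object("users", "john", {"nick": "dude1934"})
```

Passing any of these objects to `urllib.request.urlopen` would perform the call against a running node.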
  • 39. Key/Value e.g.: Riak MapReduce » A framework supporting distributed computing on large data sets on clusters of machines » Leverage parallel processing power » Introduced by Google » Inspired by map / reduce functions in functional programming » Map step » Reduce step
  • 40. Key/Value e.g.: Riak MapReduce example: Inverted Index » Map » Parse each document » Emit a sequence of <word, doc_id> pairs
        Input <doc_id, doc_text> | Node   | Output <word, doc_id>
        <100, TXT1>              | Node 1 | <word1, 100>, <word2, 100>
        <200, TXT2>              | Node 2 | <word2, 200>
        <300, TXT3>              | Node 3 | <word3, 300>
  • 41. Key/Value e.g.: Riak MapReduce example: Inverted Index » Reduce » Accept all pairs for a given word » Sort the corresponding document IDs » Emit a <word, list(document ID)> pair
        Output <word, list(document_id)>: <word1, (100)>, <word2, (100, 200)>, <word3, (300)>
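
The inverted index example can be run end to end on a single machine. In this sketch the document texts are invented so that the result matches the reduce slide, emitting each distinct word once per document is a simplifying assumption, and `shuffle` stands in for the grouping a real framework performs between the map and reduce steps.

```python
from collections import defaultdict

def map_phase(doc_id, doc_text):
    """Parse one document and emit a (word, doc_id) pair for each distinct word."""
    for word in sorted(set(doc_text.lower().split())):
        yield word, doc_id

def shuffle(pairs):
    """Group the emitted pairs by word, as the framework would between the two steps."""
    grouped = defaultdict(list)
    for word, doc_id in pairs:
        grouped[word].append(doc_id)
    return grouped

def reduce_phase(word, doc_ids):
    """Accept all document IDs for one word and emit them sorted."""
    return word, sorted(doc_ids)

# Hypothetical document contents standing in for TXT1..TXT3.
documents = {100: "word1 word2", 200: "word2", 300: "word3"}

pairs = [
    pair
    for doc_id, text in documents.items()
    for pair in map_phase(doc_id, text)
]
index = dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())
```

On a cluster each node would run `map_phase` over its own documents in parallel; only the grouped pairs travel across the network to the reducers.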
  • 42. Taxonomy BigTable and Column Oriented Databases
  • 43. Taxonomy Column Stores – BigTable derivatives » Conceptually a single, infinitely large table » Each row can have a different number of columns » Table is sparse: |rows|*|columns| > |values| » Based on Google’s BigTable paper » E.g. ▪ Cassandra ▪ HBase ▪ Hypertable
  • 44. Taxonomy Use Case: Manage products with diverse attributes » RDBMS: ▪ Create a central table with common attributes ▪ Create a table per product with unique attributes ▪ Use a join query ▪ Alternatively create a table that holds meta data on products » NoSQL: ▪ Column oriented database ▪ Use arbitrary columns
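
The sparse-table idea behind this use case can be made concrete with plain dictionaries. The product rows and attribute names below are invented for illustration.

```python
# Each row stores only the columns it actually has.
products = {
    "book-1": {"name": "NoSQL Primer", "price": "30", "pages": "250"},
    "tv-7": {"name": "42in TV", "price": "500", "screen": "42", "hdmi_ports": "3"},
}

# The conceptual table spans every column that appears in any row.
all_columns = sorted({column for row in products.values() for column in row})

cells = len(products) * len(all_columns)                # |rows| * |columns|
values = sum(len(row) for row in products.values())     # |values| actually stored
```

Here the conceptual table has 10 cells but only 7 stored values. A fixed relational schema would need NULLs, per-product tables or a metadata table to cover the difference.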
  • 45. Taxonomy Column Store e.g.: Cassandra » Data model: Google’s BigTable » Infrastructure: Amazon Dynamo » Incremental scalability » Flexible schema » No single point of failure (Distributed P2P) » Optimistic replication (Gossip protocol) » Written in: Java » Developed by: Facebook » Java client: e.g. Hector / Thrift
  • 46. Column e.g.: Cassandra Data Model » Column ▪ Smallest increment of data: tuple of name, value, timestamp
        {
          name: "emailAddress",
          value: "nosql@alphacsp.com",
          timestamp: 123456789
        }
  • 47. Column e.g.: Cassandra » SuperColumn ▪ A sorted, associative, unbounded array of columns
        { // this is a SuperColumn
          name: "homeAddress",
          // with an unbounded array of Columns
          value: {
            // each key is the name of the Column
            street: {name: "street", value: "s", timestamp:...},
            city: {name: "city", value: "c", timestamp:...},
            zip: {name: "zip", value: "z", timestamp:...}
          }
        }
  • 48. Column e.g.: Cassandra » ColumnFamily ▪ A container (~Table) for columns sorted by their names ▪ Column Families are referenced and sorted by row keys
        Users = { // ColumnFamily
          john: { // key to row in CF
            "role" : "admin",
            "status" : "offline",
            "nick" : "dude1934"
          }, // end row
          fred: { // another row
            "nick" : "freddy",
            "email" : "fred@example.com",
            "age" : "25",
            "gender" : "male",…
          },… // more rows
        }
  • 49. Column e.g.: Cassandra » Keyspace ▪ The outermost grouping of data (~DB Schema) ▪ Contains ColumnFamilies ▪ There is no imposed relationship between ColumnFamilies
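
The nesting Keyspace > ColumnFamily > row > Column can be modelled with nested dictionaries. This is an in-memory sketch of the data model only, not a client for Cassandra: the `insert` and `get_row` helpers are hypothetical, and letting the highest timestamp win on a repeated write is stated here as an assumption about how the timestamp field is used.

```python
import time

# Keyspace: ColumnFamily name -> row key -> column name -> Column
keyspace = {"Users": {}}

def insert(column_family, row_key, name, value, timestamp=None):
    """Write one Column (name, value, timestamp); an older timestamp never overwrites a newer one."""
    ts = timestamp if timestamp is not None else time.time_ns()
    row = keyspace[column_family].setdefault(row_key, {})
    existing = row.get(name)
    if existing is None or ts >= existing["timestamp"]:
        row[name] = {"name": name, "value": value, "timestamp": ts}

def get_row(column_family, row_key):
    """Return the row's values with columns sorted by name."""
    row = keyspace[column_family].get(row_key, {})
    return {name: row[name]["value"] for name in sorted(row)}

insert("Users", "john", "role", "admin", timestamp=1)
insert("Users", "john", "status", "offline", timestamp=1)
insert("Users", "john", "nick", "dude1934", timestamp=1)
insert("Users", "john", "status", "online", timestamp=2)   # newer write wins
insert("Users", "john", "status", "stale", timestamp=1)    # late, older write is ignored
insert("Users", "fred", "email", "fred@example.com", timestamp=1)

john = get_row("Users", "john")
```

Rows in the same ColumnFamily carry different columns (`john` has no email, `fred` has nothing else), which is the flexible schema the slides refer to.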
  • 50. Column e.g.: Cassandra » Example: a Keyspace containing a Tweets CF and a Timeline CF
  • 51. Taxonomy Document Oriented Databases
  • 52. Taxonomy Document Store » Store semi-structured documents (think JSON) » Document versioning » Map/Reduce based queries, sorting, aggregation, etc. » DB is aware of internal structure » E.g. ▪ MongoDB ▪ CouchDB ▪ JackRabbit (JCR JSR 170)
  • 53. Taxonomy Use Case: Blog with tagged posts and comments » RDBMS: ▪ Table for each: posts, comments, tags ▪ Foreign relations » NoSQL: ▪ Document storage ▪ Store post + tags + comments as a document
  • 54. Taxonomy Document Store e.g.: MongoDB » MongoDB (from "humongous") » Manages collections of JSON-like documents (BSON) » Queries can return specific fields of documents » Supports secondary indexes » Atomic operations on single documents » Developed by: 10gen » Written in: C++ » Clients: Java, Scala and more
  • 55. Document e.g.: MongoDB Example: Blog posts » Suppose you host a blog, where each post is tagged:
        db.posts.save({
          _id : 3,
          author : "john",
          title : "Apples, Oranges and NOSQL",
          text : "This article will…",
          tags : [ "database", "nosql" ]
        });
    » Notice how posts have an array of tags
  • 56. Document e.g.: MongoDB » MongoDB supports secondary indexes and a query optimizer ▪ Compound indexes are also supported
        db.posts.ensureIndex({ tags: 1 });
        db.posts.ensureIndex({ author: 1 });
        db.posts.find({ author: "john", tags: "nosql" });
        // Result:
        { "_id" : 3, "author" : "john", "title" : "Apples, Oranges and NOSQL", "text" : "This article will…", "tags" : [ "database", "nosql" ] }
  • 57. Document e.g.: MongoDB » Let's update our posts to include some comments:
        db.posts.update({ _id: 3 }, {
          $inc: { comments_count: 4 },
          $pushAll : { comments: [
            { text: "Comment 1" },
            { text: "Comment 2", author: "Mr. T" },
            { text: "Comment 3" },
            { text: "Comment 4" }
          ] }
        });
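
To make the semantics of the query and update in the MongoDB slides concrete, here is a small pure-Python emulation on a plain dictionary. It is not MongoDB code and does not use any driver; `matches` and `apply_update` are hypothetical helpers that cover only equality matching (including a scalar matched against an array field) and the `$inc` and `$pushAll` operators.

```python
def matches(document, query):
    """Equality match; a scalar condition also matches any element of an array field."""
    for field, expected in query.items():
        actual = document.get(field)
        if isinstance(actual, list) and not isinstance(expected, list):
            if expected not in actual:
                return False
        elif actual != expected:
            return False
    return True

def apply_update(document, update):
    """Apply $inc and $pushAll in place; missing fields are created."""
    for field, amount in update.get("$inc", {}).items():
        document[field] = document.get(field, 0) + amount
    for field, items in update.get("$pushAll", {}).items():
        document.setdefault(field, []).extend(items)
    return document

post = {"_id": 3, "author": "john", "tags": ["database", "nosql"]}

found = matches(post, {"author": "john", "tags": "nosql"})

apply_update(post, {
    "$inc": {"comments_count": 2},
    "$pushAll": {"comments": [
        {"text": "Comment 1"},
        {"text": "Comment 2", "author": "Mr. T"},
    ]},
})
```

The post, its tags and its comments stay together in one document, so both the lookup by tag and the comment update touch a single record instead of three joined tables.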
  • 58. Taxonomy Graph Databases
  • 59. Taxonomy Graph databases » Inspired by mathematical graph theory G=(V,E) » Models the structure of data » Navigational data model » Scalability / data complexity » Data model: Key-Value pairs on Edges / Nodes » Relationships: Edges between Nodes » E.g. ▪ Neo4j ▪ Pregel (Google’s PageRank) ▪ AllegroGraph
  • 60. Taxonomy Use Case: Connected data - deep relationship links between users in a social network » RDBMS ▪ Complex recursive algorithm ▪ Multiple self joins ▪ Round trips to DB / bulk read and resolve in RAM » NoSQL: ▪ Graph storage ▪ Network traversal
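
The "network traversal" side of this use case boils down to a breadth-first walk over relationships. The sketch below uses an in-memory adjacency list with invented users; it is not a graph database API, only an illustration of the access pattern such a store is built around.

```python
from collections import deque

# Hypothetical social network: user -> direct friends.
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}

def within_depth(start, max_depth):
    """Breadth-first traversal returning {user: distance} for users within max_depth hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if distance[user] == max_depth:
            continue  # do not expand beyond the requested depth
        for friend in friends.get(user, []):
            if friend not in distance:
                distance[friend] = distance[user] + 1
                queue.append(friend)
    del distance[start]
    return distance

friends_of_friends = within_depth("ann", 2)
```

In a relational schema each extra hop typically means another self join or another round trip. Here increasing `max_depth` only extends the same loop.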
  • 61. Taxonomy Graph e.g.: Neo4j » High-performance graph engine » Embedded / disk based » Work with OO model: nodes, relationships, properties » ACID Transactions ▪ JTA support – participate in 2PC with your RDBMS » Developed by: Neo Technology » Written in: Java » Clients: Java, client libraries in other platforms
  • 62. Graph e.g.: Neo4j http://neo4j.org/
  • 64. Comparing Apples to Oranges Comparing Data Structures » RDBMS ▪ Databases contain tables, columns and rows ▪ All rows share the same structure ▪ Inherent ORM mismatch » NoSQL ▪ Choose your data structure ▪ Data is stored in its natural structure (e.g. Documents, Graphs, Objects)
  • 65. Comparing Apples to Oranges Comparing Schema Flexibility » RDBMS ▪ Strict schema, difficult to evolve ▪ Maintains relations and forces data integrity » NoSQL ▪ Structure of data can be changed dynamically • e.g. Column stores – Cassandra ▪ Data can sometimes be completely opaque • e.g. Key/Value – Project Voldemort
  • 66. Comparing Apples to Oranges Comparing Normalization & Relations » RDBMS ▪ The data model is normalized to remove data duplication ▪ Normalization establishes table relations » NoSQL ▪ Denormalization is not a dirty word ▪ Relations are not explicitly defined ▪ Related data is usually grouped and stored as one unit • E.g. document, column
  • 67. Comparing Apples to Oranges Comparing Data Access » RDBMS ▪ CRUD operations using SQL ▪ Access data from multiple tables using SQL joins ▪ Generic API such as JDBC » NoSQL ▪ Proprietary API and DSLs (e.g. Pig / Hive / Gremlin) ▪ MapReduce, graph traversals ▪ REST APIs, portable serialization formats • BSON, JSON, Apache Thrift, Memcached
  • 68. Comparing Apples to Oranges Comparing Reporting Capabilities » RDBMS ▪ Slice and dice data, then reassemble any way you like » NoSQL ▪ Hard to repurpose data for ad-hoc usage • Plan ahead ▪ Think in advance • How and what you store • Data access patterns
  • 70. Summary Why NOSQL / BASE » ACID ruled exclusively for the last 40 years ▪ Doesn’t compromise on consistency » Database industry neglected distributed DBs w/ availability » Vacuum was filled with “NoSQL” BASE architectures ▪ Strict A and P, minimize C compromise » Relational databases are now trying to catch up
  • 71. Summary NoSQL Limitations » Missing some query capabilities ▪ joins / composite transactions » Eventual consistency -- not for every problem » Not a drop-in replacement for RDBMS “on ACID” » No standardization -> product lock-in » Relatively immature (support, bugs, community)
  • 72. Summary Choose the right tool for the job » Relational databases and NoSQL databases are designed to meet different needs » RDBMS-only should not be a default » NOSQL databases outperform RDBMSs in their particular niche » No one-size-fits-all / silver bullet ...but you don’t have to choose one
  • 73. Summary Polyglot Persistence » Poly: many Glot: language » Meshing up persistence mechanisms to best meet requirements » Good integration stories: ▪ E.g. Neo4j + JDBC using JTA