Seminar.2010.NoSql

Apples, Oranges and NOSQL

Roi Aldaag Architect & Consultant
Nadav Wiener Architect & Consultant

Agenda

Introduction
» What is NoSQL?
» What’s “wrong” with RDBMS?
» Why now?


RDBMS vs. NoSQL
» Scaling
» CAP Theorem
» ACID vs. BASE


NoSQL Taxonomy
» Key / Value
» Column
» Document
» Graph


How to choose?
» Comparing Apples to Oranges
» Polyglot Persistence

Introduction

Question: What do they all have in common?


Before we answer – some facts:


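[Table: five sites, shown as logos on the original slide]

                    Site 1      Site 2      Site 3      Site 4      Site 5
Daily page views    7.8 x 10^9  7.1 x 10^9  550 x 10^6  350 x 10^6  82 x 10^6
Daily visitors      620 x 10^6  500 x 10^6  56 x 10^6   37 x 10^6   12 x 10^6
Data size           Petabytes   Petabytes   Petabytes   Terabytes   Terabytes

Source: http://www.alexa.com, July 2010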


Answer: They use NoSQL data stores


Why!?


Relational DBs Have Scaling Limitations
» ACID doesn’t scale well horizontally
 Sharding breaks relations
 Joins are inefficient
» Transaction overhead
» Schema is not flexible
 Predefined
 Hard to evolve


What is NoSQL?
» NO SQL / Not Only SQL
» A collective term for open source, non-relational data stores
 Highly distributed
 Highly scalable
 Not ACID, and they don’t use SQL
» Term popularized in 2009 as the name of a meetup on distributed data stores (“NoSQL”, suggested by Eric Evans)
» Started a movement that is gaining momentum


Why now?
» NoSQL data stores predate RDBMS (1970)
 But remained a niche
» RDBMS – most popular and generic option
» Web 2.0 introduced new requirements:
 Exponential increase in data
 Information connectivity
 Semi-structured data
» NoSQL data stores had answers
 When time was right
 When RDBMSs didn’t


It’s theory time:

Scaling

Scaling Up
» Adding resources to a single node in a system
» Add more CPUs or memory
» Move system to a larger machine
» Pros:
 Quick and Simple
» Cons:
 Can outgrow the capacity of the largest system available (Moore’s law)
 Expensive
 Creates vendor lock-in


Scaling Out
» Add more nodes to a system
» Functional Scaling (vertical)
 Grouping data by function and spreading
functional groups across databases
» Sharding (horizontal)
 Splitting same functional data across
multiple databases
» Pros: More flexible
» Cons: More complex
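
Horizontal sharding can be sketched as routing each key to one of several databases. The following is a minimal illustration, assuming a fixed list of nodes and a stable hash of the key; the node names and the `shard_for` and `place` helpers are hypothetical.

```python
import hashlib

NODES = ["db-0", "db-1", "db-2"]  # hypothetical database nodes


def shard_for(key: str, nodes=NODES) -> str:
    """Pick the node that owns `key` using a stable hash modulo the node count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def place(keys, nodes=NODES):
    """Group keys by the node they are routed to."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[shard_for(key, nodes)].append(key)
    return placement
```

With modulo placement, changing the number of nodes remaps most keys; consistent hashing is a common way to reduce that cost.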

Distributed Databases

» Many nodes
» Same database


What are the requirements from distributed databases?
» Consistency
 All clients can see the same data
» Availability
 All clients can always access data
» Partition tolerance
 The ability to continue working when the network topology is
broken
 The ability to recover once the network is healed


CAP Theorem (E. Brewer, N. Lynch)
» You can fully satisfy at most 2 out of 3
 Compromise on 3rd
» Not “all or nothing”
 Choose various levels of consistency, availability or partition
tolerance
» Recognize which of the CAP rules your business needs for the
task


CA: Consistency & Availability
» Partition Tolerance is compromised
» Single site clusters (easier to ensure all nodes are always in
contact)
» When a network partition occurs, the system blocks
» e.g. Two Phase Commit (2PC)


CP: Consistency & Partition Tolerance
» Availability is compromised
» Access to some data may be temporarily limited
» The rest is still consistent/accurate
» e.g. Sharded database
» TBD sample


AP: Availability & Partition Tolerance
» Consistency is compromised
» System is still available under partitioning
» Some data returned may be temporarily not up-to-date
» Requires conflict resolution strategy
» e.g. DNS, caches, Master/Slave replication
» TBD sample

ACID vs. BASE

ACID – a quick recap
» Atomicity
 If any part of a transaction fails, the entire transaction fails and the database state is left unchanged
» Consistency
 A transaction takes database from one consistent state to another
» Isolation
 A transaction can't see dirty state from other transactions
» Durability
 Commit means commit.
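
Atomicity can be illustrated with Python's standard `sqlite3` module. This is a minimal sketch: the `accounts` table, the account names, and the `transfer` and `balances` helpers are assumptions made up for the example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between two rows; undo everything if any step fails."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if row is None or row[0] < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()
```

A failed transfer leaves both balances as they were before the call, even though the first `UPDATE` had already run inside the transaction.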


BASE
» The CAP complement of ACID
 Just had to be called BASE
 Backronym:
» Basically Available
» Soft State
» Eventually Consistent


RDBMS & ACID / NoSQL & BASE
» RDBMSs strive to provide ACID guarantees
 ACID forces consistency

» NoSQL solutions often scale through BASE
 BASE accepts that conflicts will happen

Taxonomy

» Key / Value
» Column
» Document (XML, TXT, BIN)
» Graph


Key / Value Databases


Key/Value Stores
» Simple Key / Value lookups (DHT)
» Value is opaque
» Focus on scaling to huge amounts of data
» Designed to handle massive load
» E.g.
 Riak (based on Amazon’s Dynamo paper)
 Project Voldemort (based on Amazon’s Dynamo paper)
 Redis


Key/Value e.g.: Riak
» No single point of failure
» No machines are special or central
» MapReduce queries (Erlang / JavaScript)
» HTTP/JSON API
» Ring cluster with automatic replication
» Elastic / partition rebalancing
» Written in: Erlang, C, JavaScript
» Developed by: Basho Technologies
» Java client: (jonjlee / riak-java-client)
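
The idea of a ring cluster with automatic replication can be sketched with a simplified consistent-hash ring. This is an illustration only, not Riak's actual implementation; the `Ring` class, the `vnodes` count and the `preference_list` method are assumptions chosen for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")


class Ring:
    def __init__(self, nodes, vnodes=16):
        # Each node owns several points ("virtual nodes") on the ring,
        # so no machine is special or central.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def preference_list(self, key: str, replicas=3):
        """Walk clockwise from the key's position and collect distinct nodes."""
        start = bisect.bisect_right(self._hashes, _hash(key))
        owners = []
        for offset in range(len(self._points)):
            node = self._points[(start + offset) % len(self._points)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == replicas:
                break
        return owners
```

For example, `Ring(["a", "b", "c", "d"]).preference_list("bucket/key")` returns the three distinct nodes that would hold replicas of that key in this toy model.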


Data Model
» Key / Value pairs are stored in a Bucket
» A Bucket ~ a namespace

Versioning
» Each update is tracked by a Vector Clock
 An algorithm for determining ordering and detecting conflicts
» When in conflict
 Last wins / manual resolution
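
How a vector clock orders updates and detects conflicts can be sketched as follows. This is a minimal model in which a clock is a dict of per-node counters; the function names are hypothetical and not part of any Riak client.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: dict, b: dict) -> bool:
    """True if `a` has seen every event recorded in `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    """Classify two versions: ordered one way or the other, or in conflict."""
    if a == b:
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict"  # concurrent updates: needs last-wins or manual resolution
```

Two clients that both update the same base version through different nodes produce clocks where neither descends from the other, which is exactly the conflict case.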


Example: REST API
» Read an object

GET /riak/bucket/key

» Store a new object

POST /riak/bucket

» Store an object with existing key (update)
PUT /riak/bucket/key
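
The three calls above can be built with Python's standard `urllib`. This sketch only constructs the requests and does not send them; the base URL (a local node on port 8098) and the helper names are assumptions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8098/riak"  # assumed local node; adjust as needed


def read_request(bucket, key):
    """GET /riak/bucket/key"""
    return urllib.request.Request(f"{BASE_URL}/{bucket}/{key}", method="GET")


def store_request(bucket, value, key=None):
    """POST to the bucket when no key is given, PUT to bucket/key otherwise."""
    url = f"{BASE_URL}/{bucket}" if key is None else f"{BASE_URL}/{bucket}/{key}"
    return urllib.request.Request(
        url,
        data=json.dumps(value).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST" if key is None else "PUT",
    )
```

Passing one of these objects to `urllib.request.urlopen(...)` would send it to a running node.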


MapReduce
» A framework supporting distributed computing on large data
sets on clusters of machines
» Leverage parallel processing power
» Introduced by Google
» Inspired by map / reduce functions in functional programming
» Map step
» Reduce step


MapReduce example: Inverted Index
» Map
 Parse each document
 Emit a sequence of <word, doc_id> pairs

Node    Input <doc_id, doc_text>    Output <word, doc_id>
1       <100, TXT1>                 <word1, 100>, <word2, 100>
2       <200, TXT2>                 <word2, 200>
3       <300, TXT3>                 <word3, 300>


MapReduce example: Inverted Index
» Reduce
 Accept all pairs for a given word
 Sort the corresponding document IDs
 Emit a <word, list(document ID)> pair

<word, list(document_id)>
<word1, (100)>,
<word2, (100, 200)>,
<word3, (300)>
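
The two steps can be sketched on a single machine. The document texts below are made up so that the result matches the reduce output listed above; a real MapReduce framework would run the map and reduce calls in parallel across nodes.

```python
from collections import defaultdict


def map_step(doc_id, text):
    """Emit one (word, doc_id) pair per distinct word in the document."""
    return [(word, doc_id) for word in sorted(set(text.split()))]


def reduce_step(word, doc_ids):
    """Collapse all pairs for one word into (word, sorted document IDs)."""
    return word, sorted(doc_ids)


def inverted_index(documents):
    grouped = defaultdict(list)  # the "shuffle": group emitted pairs by word
    for doc_id, text in documents.items():
        for word, emitted_id in map_step(doc_id, text):
            grouped[word].append(emitted_id)
    return dict(reduce_step(word, ids) for word, ids in grouped.items())


documents = {100: "word1 word2", 200: "word2", 300: "word3"}  # assumed contents
index = inverted_index(documents)
```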


BigTable and
Column Oriented Databases


Use Case: Manage products with diverse attributes
» RDBMS:
 Create a central table with common attributes
 Create a table per product with unique attributes
 Use a join query
 Alternatively create a table that holds meta data on products
» NoSQL:
 Column oriented database
 Use arbitrary columns


Column Store e.g.: Cassandra
» Data model: Google’s BigTable
» Infrastructure: Amazon Dynamo
» Incremental scalability
» Flexible schema
» No single point of failure (Distributed P2P)
» Optimistic replication (Gossip protocol)
» Written in: Java
» Developed by: Facebook (now an Apache project)
» Java client: e.g. Hector / Thrift

Column e.g.: Cassandra

Data Model
» Column
 Smallest increment of data: tuple of name, value, timestamp

{
  name: "emailAddress",
  value: "nosql@alphacsp.com",
  timestamp: 123456789
}


» SuperColumn
 A sorted, associative, unbounded
array of columns

{ // this is a SuperColumn
  name: "homeAddress",
  // with an unbounded array of Columns
  value: {
    // each key is the name of its Column
    street: {name: "street", value: "s", timestamp:...},
    city: {name: "city", value: "c", timestamp:...},
    zip: {name: "zip", value: "z", timestamp:...}
  }
}


» ColumnFamily
 A container (~Table) for columns sorted by their names
 Rows in a ColumnFamily are referenced and sorted by row key
Users = { // ColumnFamily
  john: { // key to row in CF
    "role" : "admin",
    "status" : "offline",
    "nick" : "dude1934"
  }, // end row
  fred: { // another row
    "nick" : "freddy",
    "email" : "fred@example.com",
    "age" : "25",
    "gender" : "male",…
  },… // more rows
}

» Keyspace
 The outer most grouping of data (~DB Schema)
 Contains ColumnFamilies
 There is no imposed relationship between ColumnFamilies
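
The nesting of Keyspace, ColumnFamily, row and Column can be modelled with plain dictionaries. This is a toy in-memory sketch, not Cassandra's API: the `insert` and `get_slice` helpers are hypothetical, and resolving a name clash by the higher timestamp is a simplification.

```python
import time

keyspace = {"Users": {}}  # Keyspace -> ColumnFamily -> row key -> columns


def insert(cf, row_key, name, value, timestamp=None):
    """Write one column; on a name clash the higher timestamp wins."""
    ts = time.time_ns() if timestamp is None else timestamp
    row = keyspace[cf].setdefault(row_key, {})
    current = row.get(name)
    if current is None or ts >= current["timestamp"]:
        row[name] = {"name": name, "value": value, "timestamp": ts}


def get_slice(cf, row_key):
    """Return a row's columns as (name, value) pairs sorted by column name."""
    row = keyspace[cf].get(row_key, {})
    return [(name, row[name]["value"]) for name in sorted(row)]
```

Rows in the same ColumnFamily are free to carry different columns, which is what makes the schema flexible.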


» Example: a Keyspace containing a Tweets CF and a Timeline CF


Document Oriented Databases


Document Store
» Store semi-structured documents (think JSON)
» Document versioning
» Map/Reduce based queries, sorting, aggregation, etc.
» DB is aware of internal structure
» E.g.
 MongoDB
 CouchDB
 Jackrabbit (JCR, JSR 170)


Use Case: Blog with tagged posts and comments
» RDBMS:
 Table for each: posts, comments, tags
 Foreign key relations
» NoSQL:
 Document storage
 Store post + tags + comments as a document


Document Store e.g.: MongoDB
» MongoDB (from "humongous")
» Manages collections of JSON-like documents (BSON)
» Queries can return specific fields of documents
» Supports secondary indexes
» Atomic operations on single documents

» Developed by: 10gen
» Written in: C++
» Clients: Java, Scala and more

Document e.g.: MongoDB

Example: Blog posts
» Suppose you host a blog, where each post is tagged:

db.posts.save({
  _id : 3,
  author : "john",
  title : "Apples, Oranges and NOSQL",
  text : "This article will…",
  tags : [ "database", "nosql" ]
});

» Notice how posts have an array of tags


» MongoDB supports secondary indexes and a query optimizer
 Compound indexes are also supported

db.posts.ensureIndex({ tags: 1 });
db.posts.ensureIndex({ author: 1});

db.posts.find({ author: "john", tags: "nosql" });

// Result:
{
  "_id" : 3,
  "author" : "john",
  "title" : "Apples, Oranges and NOSQL",
  "text" : "This article will…",
  "tags" : [ "database", "nosql" ]
}
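
The way a query on `tags` matches a document whose `tags` field is an array can be sketched in a few lines. This is a simplified model of the matching rule for illustration only, not MongoDB's query engine; `matches` and `find` are hypothetical helpers.

```python
def matches(document: dict, query: dict) -> bool:
    """True if every queried field equals the value or, for arrays, contains it."""
    for field, wanted in query.items():
        actual = document.get(field)
        if isinstance(actual, list):
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True


def find(collection, query):
    """Return every document in the list that satisfies the query."""
    return [doc for doc in collection if matches(doc, query)]


posts = [
    {"_id": 3, "author": "john", "tags": ["database", "nosql"]},
    {"_id": 4, "author": "john", "tags": ["java"]},
    {"_id": 5, "author": "fred", "tags": ["nosql"]},
]
```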


» Let's update our posts to include some comments:

db.posts.update({ _id: 3 }, {
  $inc: { comments_count: 4 },
  $pushAll : {
    comments: [
      { text: "Comment 1" },
      { text: "Comment 2", author: "Mr. T" },
      { text: "Comment 3" },
      { text: "Comment 4" }
    ]
  }
});
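
The effect of the two update operators on a single document can be sketched on a plain dictionary. This models only the semantics of `$inc` and `$pushAll` as used above; `apply_update` is a hypothetical helper, not a MongoDB driver call.

```python
import copy


def apply_update(document: dict, update: dict) -> dict:
    """Return a new document with `$inc` and `$pushAll` applied."""
    result = copy.deepcopy(document)
    for field, amount in update.get("$inc", {}).items():
        result[field] = result.get(field, 0) + amount  # missing field starts at 0
    for field, values in update.get("$pushAll", {}).items():
        result.setdefault(field, []).extend(values)    # missing field starts empty
    return result


post = {"_id": 3, "author": "john"}
updated = apply_update(post, {
    "$inc": {"comments_count": 2},
    "$pushAll": {"comments": [{"text": "Comment 1"}, {"text": "Comment 2"}]},
})
```

Both changes land in one document, which is why a counter and the array it counts can be kept in step with a single-document operation.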


Graph Databases


Graph databases
» Inspired by mathematical graph theory, G = (V, E)
» Models the structure of data
» Navigational data model
» Scalability / data complexity
» Data model: Key-Value pairs on Edges / Nodes
» Relationships: Edges between Nodes
» E.g.
 Neo4j
 Pregel (Google’s graph processing framework, used e.g. for PageRank)
 AllegroGraph


Use Case: Connected data - deep relationship links
between users in a social network

» RDBMS
 Complex recursive algorithm
 Multiple Self joins
 Round trips to DB / bulk read and resolve in RAM
» NoSQL:
 Graph Storage
 Network traversal
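
A network traversal over friendship links can be sketched with a breadth-first search on an adjacency list. The graph, the user names and the `within_depth` helper are made up for the example; a graph database would run this kind of traversal natively over stored nodes and relationships.

```python
from collections import deque


def within_depth(graph: dict, start: str, max_depth: int) -> dict:
    """Breadth-first traversal returning {user: distance} up to `max_depth` hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if distances[user] == max_depth:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(user, ()):
            if friend not in distances:
                distances[friend] = distances[user] + 1
                queue.append(friend)
    del distances[start]
    return distances


friends = {
    "ann": ["bob", "cid"],
    "bob": ["ann", "dan"],
    "cid": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
}
```

Each extra level of depth is one more loop iteration here, whereas the relational approach needs another self join or another round trip.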


Graph e.g.: Neo4j
» High-performance graph engine
» Embedded / disk based
» Work with OO model: nodes, relationships, properties
» ACID Transactions
 JTA support – participate in 2PC with your RDBMS
» Developed by: Neo Technology
» Written in: Java
» Clients: Java, client libraries in other platforms

Graph e.g.: Neo4j

http://neo4j.org/

Comparing Apples to Oranges

Comparing Data Structures
» RDBMS
 Databases contain tables, columns and rows
 All rows share the same structure
 Inherent object-relational (ORM) mismatch
» NoSQL
 Choose your data structure
 Data is stored in natural structure (e.g. Documents, Graphs,
Objects)


Comparing Schema Flexibility
» RDBMS
 Strict schema, difficult to evolve
 Maintains relations and forces data integrity
» NoSQL
 Structure of data can be changed dynamically
• e.g. Column stores – Cassandra
 Data can sometimes be completely opaque
• e.g. Key/Value – Project Voldemort


Comparing Normalization & Relations
» RDBMS
 The data model is normalized to remove data duplication
 Normalization establishes table relations
» NoSQL
 Denormalization is not a dirty word
 Relations are not explicitly defined
 Related data is usually grouped and stored as one unit
• E.g. document, column


Comparing Data Access
» RDBMS
 CRUD operations using SQL
 Access data from multiple tables using SQL joins
 Generic API such as JDBC
» NoSQL
 Proprietary APIs and DSLs (e.g. Pig / Hive / Gremlin)
 MapReduce, graph traversals
 REST APIs, portable serialization formats
• BSON, JSON, Apache Thrift, Memcached


Comparing Reporting Capabilities
» RDBMS
 Slice and Dice data, then reassemble any way you like
» NoSQL
 Hard to repurpose data for ad-hoc usage
• Plan ahead
 Think in advance
• How and what you store
• Data access patterns

Summary

Why NOSQL / BASE
» ACID ruled exclusively for the last 40 years
 Doesn’t compromise on consistency
» The database industry neglected distributed databases that favor availability
» Vacuum was filled with “NoSQL” BASE architectures
 Strict A and P, minimize C compromise
» Relational databases are now trying to catch up


NoSQL Limitations
» Missing some query capabilities
 Joins / composite transactions
» Eventual consistency -- not for every problem
» Not a drop-in replacement for RDBMS “on ACID”
» No standardization -> product lock-in
» Relatively immature (support, bugs, community)


Choose the right tool for the job
» Relational databases and NoSQL databases are designed to
meet different needs
» RDBMS-only should not be a default
» NOSQL databases outperform RDBMSs in their particular niche
» No one-size-fits-all solution / silver bullet

...but you don’t have to choose one


Polyglot Persistence
» Poly: many; Glot: language
» Mixing persistence mechanisms to best meet requirements
» Good integration stories:
 E.g. Neo4j + JDBC using JTA
