Seminar.2010.NoSql

© All Rights Reserved

Seminar.2010.NoSql Presentation Transcript

  • July 11th, 2010
  • Apples, Oranges and NOSQL
    Roi Aldaag, Architect & Consultant
    Nadav Wiener, Architect & Consultant
  • Agenda
    Introduction
    » What is NoSQL?
    » What’s “wrong” with RDBMS?
    » Why now?
  • Agenda
    RDBMS vs. NoSQL
    » Scaling
    » CAP Theorem
    » ACID vs. BASE
  • Agenda
    NoSQL Taxonomy
    » Key / Value
    » Column
    » Document
    » Graph
  • Agenda
    How to choose?
    » Comparing Apples to Oranges
    » Polyglot Persistence
  • Introduction
  • Introduction
    Question: What do they all have in common?
  • Introduction
    Before we answer – some facts (one column per site; the site logos are not captured in this transcript):
    Daily Page Views   7.8 x 10^9   7.1 x 10^9   550 x 10^6   350 x 10^6   82 x 10^6
    Daily Visitors     620 x 10^6   500 x 10^6   56 x 10^6    37 x 10^6    12 x 10^6
    Data size          Petabytes    Petabytes    Petabytes    Terabytes    Terabytes
    Source: http://www.alexa.com, July 2010
  • Introduction
    Answer: They use NoSQL data stores
  • Introduction
    Why!?
  • Introduction
    Relational DBs Have Scaling Limitations
    » ACID doesn’t scale well horizontally
      - Sharding breaks relations
      - Joins are inefficient
    » Transaction overhead
    » Schema is not flexible
      - Predefined
      - Hard to evolve
  • Introduction
    What is NoSQL?
    » NO SQL / Not Only SQL
    » A collective description of open source, non-relational data stores
      - Highly distributed
      - Highly scalable
      - Not ACID and... doesn’t use SQL
    » Term popularized in 2009 as the name of a meetup called “NoSQL” (name suggested by Eric Evans)
    » Started a movement that is gaining momentum
  • Introduction
    Why now?
    » NoSQL data stores predate RDBMS (1970)
      - But remained a niche
    » RDBMS – most popular and generic option
    » Web 2.0 introduced new requirements:
      - Exponential increase in data
      - Information connectivity
      - Semi-structured data
    » NoSQL data stores had answers
      - When the time was right
      - When RDBMSs didn’t
  • Introduction
    It’s theory time:
  • Scaling
  • Scaling
    Scaling Up
    » Adding resources to a single node in a system
    » Add more CPUs or memory
    » Move system to a larger machine
    » Pros:
      - Quick and simple
    » Cons:
      - Outgrowing the capacity of the largest system available (Moore’s law)
      - Expensive
      - Creates vendor lock-in
  • Scaling
    Scaling Out
    » Add more nodes to a system
    » Functional scaling (vertical)
      - Grouping data by function and spreading functional groups across databases
    » Sharding (horizontal)
      - Splitting the same functional data across multiple databases
    » Pros: More flexible
    » Cons: More complex
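
The two scale-out styles on this slide can be sketched as routing functions. This is a minimal illustration, not any product's routing logic: the shard and cluster names, the hash function and the modulo placement are all assumptions made for the example.

```javascript
// Hypothetical shard names; a real deployment would hold connection handles.
const SHARDS = ["users-db-0", "users-db-1", "users-db-2"];

// Simple deterministic string hash (djb2 variant), kept unsigned with >>> 0.
function hashKey(key) {
  let h = 5381;
  for (const ch of String(key)) {
    h = ((h * 33) ^ ch.codePointAt(0)) >>> 0;
  }
  return h;
}

// Sharding (horizontal): rows of the same functional group are split by key.
function shardFor(key, shards = SHARDS) {
  return shards[hashKey(key) % shards.length];
}

// Functional scaling (vertical): each functional group has its own database.
// The group and cluster names are hypothetical.
const FUNCTIONAL_GROUPS = { users: "users-cluster", orders: "orders-cluster" };

function databaseFor(group) {
  const db = FUNCTIONAL_GROUPS[group];
  if (db === undefined) throw new Error(`unknown functional group: ${group}`);
  return db;
}
```

Plain modulo placement moves most keys when a shard is added, which is one reason the Dynamo-style stores discussed later use a ring with partition rebalancing instead.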
  • Distributed Databases
  • Distributed Databases
    » Many nodes (diagram: Node 1, Node 2, Node 3)
    » Same database
  • Distributed Databases
    What are the requirements of distributed databases?
    » Consistency
      - All clients can see the same data
    » Availability
      - All clients can always access data
    » Partition tolerance
      - The ability to continue working when the network topology is broken
      - The ability to recover once the network is healed
  • Distributed Databases
    CAP Theorem (E. Brewer, N. Lynch)
    » You can fully satisfy at most 2 out of 3
      - Compromise on the 3rd
    » Not “all or nothing”
      - Choose various levels of consistency, availability or partition tolerance
    » Recognize which of the CAP properties your business needs for the task
  • Distributed Databases
    CA: Consistency & Availability
    » Partition tolerance is compromised
    » Single-site clusters (easier to ensure all nodes are always in contact)
    » When a network partition occurs, the system blocks
    » e.g. Two Phase Commit (2PC)
  • Distributed Databases
    CP: Consistency & Partition tolerance
    » Availability is compromised
    » Access to some data may be temporarily limited
    » The rest is still consistent/accurate
    » e.g. Sharded database
  • Distributed Databases
    AP: Availability & Partition tolerance
    » Consistency is compromised
    » System is still available under partitioning
    » Some data returned may be temporarily not up-to-date
    » Requires a conflict resolution strategy
    » e.g. DNS, caches, Master/Slave replication
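
One common way to "choose various levels" of consistency and availability is quorum tuning, as used by Dynamo-style stores. The slides do not spell this out, so treat the sketch below as an assumption-laden illustration: with N replicas, W write acknowledgements and R read responses, a read is guaranteed to overlap the latest acknowledged write when R + W > N. The function name and return shape are made up for the example.

```javascript
// Describes the trade-off for a replica group of size n that waits for
// r replicas on reads and w replicas on writes (hypothetical helper).
function describeQuorum(n, r, w) {
  const valid = [n, r, w].every(Number.isInteger) && r >= 1 && w >= 1 && r <= n && w <= n;
  if (!valid) {
    throw new RangeError("require integers with 1 <= r <= n and 1 <= w <= n");
  }
  return {
    // Read and write sets must intersect for reads to see the latest write.
    readsSeeLatestWrite: r + w > n,
    // Replicas that may be unreachable while the operation still succeeds.
    readFaultTolerance: n - r,
    writeFaultTolerance: n - w,
  };
}
```

For example, `describeQuorum(3, 2, 2)` reports overlapping reads and writes with one replica allowed to be down, while `describeQuorum(3, 1, 1)` favors availability and accepts stale reads.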
  • ACID vs. BASE
  • ACID vs. BASE
    ACID – a quick recap
    » Atomicity
      - When a part of the transaction fails, the entire transaction fails; database state is left unchanged
    » Consistency
      - A transaction takes the database from one consistent state to another
    » Isolation
      - A transaction can't see dirty state from other transactions
    » Durability
      - Commit means commit.
  • ACID vs. BASE
    BASE
    » The CAP complement of ACID
      - Just had to be called BASE
      - Backronym:
        • Basically Available
        • Soft State
        • Eventually Consistent
  • ACID vs. BASE
    RDBMS & ACID / NoSQL & BASE
    » RDBMSs strive to provide ACID guarantees
      - ACID forces consistency
    » NoSQL solutions often scale through BASE
      - BASE accepts that conflicts will happen
  • Taxonomy
  • Taxonomy
    Key / Value, Column, Document (XML, TXT, BIN), Graph
  • Taxonomy
    Key / Value Databases
  • Taxonomy
    Key/Value Stores
    » Simple Key / Value lookups (DHT)
    » Value is opaque
    » Focus on scaling to huge amounts of data
    » Designed to handle massive load
    » E.g.
      - Riak (based on Amazon’s Dynamo paper)
      - Project Voldemort (based on Amazon’s Dynamo paper)
      - Redis
  • Taxonomy
    Key/Value e.g.: Riak
    » No single point of failure
    » No machines are special or central
    » MapReduce queries (Erlang / JavaScript)
    » HTTP/JSON API
    » Ring cluster with automatic replication
    » Elastic / partition rebalancing
    » Written in: Erlang, C, JavaScript
    » Developed by: Basho Technologies
    » Java client: (jonjlee / riak-java-client)
  • Key/Value e.g.: Riak
    Data Model
    » Key / Value pairs are stored in a Bucket
    » A Bucket ~ a namespace
    Versioning
    » Each update is tracked by a Vector Clock
      - An algorithm for determining ordering and detecting conflicts
    » When in conflict
      - Last write wins / manual resolution
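
How a vector clock orders updates and detects conflicts can be sketched in a few lines. This is a generic textbook version, not Riak's implementation: the clock is modeled as a plain object of per-node counters, and the node names are hypothetical.

```javascript
// A vector clock is modeled as a plain object: { nodeId: counter }.
// increment returns a new clock; the input is not mutated.
function increment(clock, nodeId) {
  return { ...clock, [nodeId]: (clock[nodeId] || 0) + 1 };
}

// Returns "equal", "before", "after", or "concurrent" (a conflict).
function compare(a, b) {
  let aAhead = false;
  let bAhead = false;
  for (const node of new Set([...Object.keys(a), ...Object.keys(b)])) {
    const av = a[node] || 0;
    const bv = b[node] || 0;
    if (av > bv) aAhead = true;
    if (bv > av) bAhead = true;
  }
  if (aAhead && bAhead) return "concurrent";
  if (aAhead) return "after";
  if (bAhead) return "before";
  return "equal";
}

const v1 = increment({}, "nodeA"); // first write, seen by nodeA
const v2 = increment(v1, "nodeA"); // nodeA updates again
const v3 = increment(v1, "nodeB"); // nodeB updates from the same ancestor
```

Here `compare(v1, v2)` is `"before"` (v2 simply supersedes v1), while `compare(v2, v3)` is `"concurrent"`: neither update saw the other, so the store must fall back to last-write-wins or hand both versions to the application.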
  • Key/Value e.g.: Riak
    Example: REST API
    » Read an object
        GET /riak/bucket/key
    » Store a new object
        POST /riak/bucket
    » Store an object with existing key (update)
        PUT /riak/bucket/key
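
The three REST operations map naturally onto a small request builder. The sketch below only constructs the method and URL and performs no network call; the function name, the operation labels and the base URL are assumptions for illustration, and only the `/riak/bucket/key` path shapes come from the slide.

```javascript
// Builds a request descriptor for the three operations on the slide.
// baseUrl is a placeholder; point it at your own node.
function riakRequest(operation, bucket, key, baseUrl = "http://localhost:8098") {
  const bucketPath = `${baseUrl}/riak/${encodeURIComponent(bucket)}`;
  const needsKey = operation === "read" || operation === "update";
  if (needsKey && (key === undefined || key === null)) {
    throw new Error(`operation "${operation}" requires a key`);
  }
  switch (operation) {
    case "read": // GET /riak/bucket/key
      return { method: "GET", url: `${bucketPath}/${encodeURIComponent(key)}` };
    case "create": // POST /riak/bucket (the server assigns the key)
      return { method: "POST", url: bucketPath };
    case "update": // PUT /riak/bucket/key
      return { method: "PUT", url: `${bucketPath}/${encodeURIComponent(key)}` };
    default:
      throw new Error(`unsupported operation: ${operation}`);
  }
}
```

The returned descriptor can be handed to any HTTP client; keeping URL construction separate from the transport makes the mapping easy to check in isolation.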
  • Key/Value e.g.: Riak
    MapReduce
    » A framework supporting distributed computing on large data sets on clusters of machines
    » Leverages parallel processing power
    » Introduced by Google
    » Inspired by the map / reduce functions in functional programming
    » Map step
    » Reduce step
  • Key/Value e.g.: Riak
    MapReduce example: Inverted Index
    » Map
      - Parse each document
      - Emit a sequence of <word, doc_id> pairs
        Node     Input <doc_id, doc_text>   Output <word, doc_id>
        Node 1   <100, TXT1>                <word1, 100>, <word2, 100>
        Node 2   <200, TXT2>                <word2, 200>
        Node 3   <300, TXT3>                <word3, 300>
  • Key/Value e.g.: Riak
    MapReduce example: Inverted Index
    » Reduce
      - Accept all pairs for a given word
      - Sort the corresponding document IDs
      - Emit a <word, list(document ID)> pair
        Output <word, list(document_id)>
        <word1, (100)>
        <word2, (100, 200)>
        <word3, (300)>
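
The inverted-index example can be run end to end in a single process. This is a sketch of the map and reduce steps only, not Riak's MapReduce API: the function names and the in-memory grouping step (which a real framework performs across nodes) are assumptions, and the sample texts stand in for TXT1..TXT3.

```javascript
// Map step: parse one document and emit [word, docId] pairs.
function mapDocument(docId, text) {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  return [...new Set(words)].map((word) => [word, docId]);
}

// Grouping step: collect emitted pairs by word (normally done by the framework).
function groupByWord(pairs) {
  const groups = new Map();
  for (const [word, docId] of pairs) {
    if (!groups.has(word)) groups.set(word, []);
    groups.get(word).push(docId);
  }
  return groups;
}

// Reduce step: sort the document IDs for one word and emit [word, sortedIds].
function reduceWord(word, docIds) {
  return [word, [...docIds].sort((a, b) => a - b)];
}

function invertedIndex(documents) {
  const pairs = documents.flatMap(([docId, text]) => mapDocument(docId, text));
  return new Map([...groupByWord(pairs)].map(([word, ids]) => reduceWord(word, ids)));
}

// Documents deliberately listed out of order to exercise the sort.
const index = invertedIndex([
  [300, "word3"],
  [100, "word1 word2"],
  [200, "word2"],
]);
```

With this input, `index.get("word2")` is `[100, 200]` and `index.get("word3")` is `[300]`, matching the reduce output on the slide.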
  • Taxonomy
    BigTable and Column Oriented Databases
  • Taxonomy
    Column Stores – BigTable derivatives
    » Conceptually a single, infinitely large table
    » Each row can have a different number of columns
    » Table is sparse: |rows| * |columns| > |values|
    » Based on Google’s BigTable paper
    » E.g.
      - Cassandra
      - HBase
      - Hypertable
  • Taxonomy
    Use Case: Manage products with diverse attributes
    » RDBMS:
      - Create a central table with common attributes
      - Create a table per product with unique attributes
      - Use a join query
      - Alternatively create a table that holds metadata on products
    » NoSQL:
      - Column oriented database
      - Use arbitrary columns
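
The product use case can be illustrated with rows that each carry their own set of columns. This is a minimal in-memory sketch of the idea, not a column store client; the product keys and attribute names are hypothetical.

```javascript
// Each row is a Map of column name -> value; rows need not share columns.
const products = new Map();

function putColumn(rowKey, column, value) {
  if (!products.has(rowKey)) products.set(rowKey, new Map());
  products.get(rowKey).set(column, value);
}

function getColumn(rowKey, column) {
  const row = products.get(rowKey);
  return row ? row.get(column) : undefined;
}

putColumn("book-1", "name", "NoSQL Primer");
putColumn("book-1", "pages", 250);
putColumn("tv-7", "name", "42-inch TV");
putColumn("tv-7", "screenInches", 42);
putColumn("tv-7", "hdmiPorts", 3);

// Compares stored values with the cells of an equivalent rectangular table,
// i.e. the |rows| * |columns| > |values| relation from the previous slide.
function sparsity() {
  const columns = new Set();
  let values = 0;
  for (const row of products.values()) {
    for (const column of row.keys()) columns.add(column);
    values += row.size;
  }
  return { cells: products.size * columns.size, values };
}
```

No join or metadata table is needed: a missing attribute is simply an absent column, so `getColumn("book-1", "hdmiPorts")` is `undefined`.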
  • Taxonomy
    Column Store e.g.: Cassandra
    » Data model: Google’s BigTable
    » Infrastructure: Amazon Dynamo
    » Incremental scalability
    » Flexible schema
    » No single point of failure (Distributed P2P)
    » Optimistic replication (Gossip protocol)
    » Written in: Java
    » Developed by: Facebook
    » Java client: e.g. Hector / Thrift
  • Column e.g.: Cassandra
    Data Model
    » Column
      - Smallest increment of data: tuple of name, value, timestamp
        {
          name: "emailAddress",
          value: "nosql@alphacsp.com",
          timestamp: 123456789
        }
  • Column e.g.: Cassandra
    » SuperColumn
      - A sorted, associative, unbounded array of columns
        { // this is a SuperColumn
          name: "homeAddress",
          // with an unbounded array of Columns
          value: {
            // the key is the name of the Column
            street: {name: "street", value: "s", timestamp:...},
            city: {name: "city", value: "c", timestamp:...},
            zip: {name: "zip", value: "z", timestamp:...}
          }
        }
  • Column e.g.: Cassandra
    » ColumnFamily
      - A container (~Table) for columns sorted by their names
      - Column Families are referenced and sorted by row keys
        Users = { // ColumnFamily
          john: { // key to row in CF
            "role" : "admin",
            "status" : "offline",
            "nick" : "dude1934"
          }, // end row
          fred: { // another row
            "nick" : "freddy",
            "email" : "fred@example.com",
            "age" : "25",
            "gender" : "male",…
          },… // more rows
        }
  • Column e.g.: Cassandra
    » Keyspace
      - The outermost grouping of data (~DB Schema)
      - Contains ColumnFamilies
      - There is no imposed relationship between ColumnFamilies
  • Column e.g.: Cassandra
    » Example (diagram): a Keyspace containing a Tweets CF and a Timeline CF
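
The nesting described on these slides (Keyspace, ColumnFamily, row key, Column) can be modeled with nested maps. This is an in-memory sketch and not the Hector or Thrift client API. One assumption goes beyond the slides: the column timestamp is used here to resolve competing writes, with the newest timestamp winning.

```javascript
// Keyspace -> ColumnFamily -> row key -> column name -> {name, value, timestamp}.
// The ColumnFamily names follow the slides; the helper names are hypothetical.
const keyspace = { Users: new Map(), Tweets: new Map(), Timeline: new Map() };

function insert(cfName, rowKey, name, value, timestamp) {
  const cf = keyspace[cfName];
  if (!cf) throw new Error(`unknown column family: ${cfName}`);
  if (!cf.has(rowKey)) cf.set(rowKey, new Map());
  const row = cf.get(rowKey);
  const existing = row.get(name);
  // Assumed conflict rule: keep the column with the newest timestamp.
  if (!existing || timestamp >= existing.timestamp) {
    row.set(name, { name, value, timestamp });
  }
}

// Returns the columns of one row sorted by column name.
function getRow(cfName, rowKey) {
  const cf = keyspace[cfName];
  const row = (cf && cf.get(rowKey)) || new Map();
  return [...row.values()].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
}

insert("Users", "john", "status", "offline", 1);
insert("Users", "john", "role", "admin", 1);
insert("Users", "john", "status", "online", 2); // newer write replaces the old value
insert("Users", "john", "status", "stale", 1);  // older write is ignored
```

After these calls `getRow("Users", "john")` lists `role` before `status`, and `status` holds `"online"`.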
  • Taxonomy
    Document Oriented Databases (XML, TXT, BIN)
  • Taxonomy
    Document Store
    » Store semi-structured documents (think JSON)
    » Document versioning
    » Map/Reduce based queries, sorting, aggregation, etc.
    » DB is aware of internal structure
    » E.g.
      - MongoDB
      - CouchDB
      - Jackrabbit (JCR, JSR 170)
  • Taxonomy
    Use Case: Blog with tagged posts and comments
    » RDBMS:
      - Table for each: posts, comments, tags
      - Foreign key relations
    » NoSQL:
      - Document storage
      - Store post + tags + comments as a document
  • Taxonomy
    Document Store e.g.: MongoDB
    » MongoDB (from "humongous")
    » Manages collections of JSON-like documents (BSON)
    » Queries can return specific fields of documents
    » Supports secondary indexes
    » Atomic operations on single documents
    » Developed by: 10gen
    » Written in: C++
    » Clients: Java, Scala and more
  • Document e.g.: MongoDB
    Example: Blog posts
    » Suppose you host a blog, where each post is tagged:
        db.posts.save({
          _id : 3,
          author : "john",
          title : "Apples, Oranges and NOSQL",
          text : "This article will…",
          tags : [ "database", "nosql" ]
        });
    » Notice how posts have an array of tags
  • Document e.g.: MongoDB
    » MongoDB supports secondary indexes and a query optimizer
      - Compound indexes are also supported
        db.posts.ensureIndex({ tags: 1 });
        db.posts.ensureIndex({ author: 1 });
        db.posts.find({ author: "john", tags: "nosql" });
        // Result:
        { "_id" : 3, "author" : "john", "title" : "Apples, Oranges and NOSQL",
          "text" : "This article will…", "tags" : [ "database", "nosql" ] }
  • Document e.g.: MongoDB
    » Let's update our posts to include some comments:
        // $pushAll is the 2010-era operator; later MongoDB releases use $push with $each
        db.posts.update({ _id: 3 }, {
          $inc: { comments_count: 4 },
          $pushAll: { comments: [
            { text: "Comment 1" },
            { text: "Comment 2", author: "Mr. T" },
            { text: "Comment 3" },
            { text: "Comment 4" }
          ] }
        });
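
The query and update semantics used on these slides can be emulated in memory, which makes them easy to try without a server. This is a sketch of the behavior, not the MongoDB driver or shell: `matches` and `applyUpdate` are hypothetical helpers that cover only equality matching and the `$inc` / `$pushAll` operators shown above.

```javascript
// Equality match in the style of the find() call on the slide: a scalar
// condition matches an array field when the array contains that value.
function matches(doc, query) {
  return Object.entries(query).every(([field, expected]) => {
    const actual = doc[field];
    return Array.isArray(actual) ? actual.includes(expected) : actual === expected;
  });
}

// Applies $inc and $pushAll to a copy of the document; the input is not mutated.
function applyUpdate(doc, update) {
  const result = { ...doc };
  for (const [field, amount] of Object.entries(update.$inc || {})) {
    result[field] = (result[field] || 0) + amount;
  }
  for (const [field, items] of Object.entries(update.$pushAll || {})) {
    result[field] = [...(result[field] || []), ...items];
  }
  return result;
}

const post = {
  _id: 3,
  author: "john",
  title: "Apples, Oranges and NOSQL",
  tags: ["database", "nosql"],
};

const updated = applyUpdate(post, {
  $inc: { comments_count: 4 },
  $pushAll: { comments: [{ text: "Comment 1" }, { text: "Comment 2", author: "Mr. T" }] },
});
```

`matches(post, { author: "john", tags: "nosql" })` is `true` because `tags` contains `"nosql"`, which is why a single document can hold the post and its tags without a join table.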
  • Taxonomy
    Graph Databases
  • Taxonomy
    Graph databases
    » Inspired by mathematical graph theory: G = (V, E)
    » Models the structure of data
    » Navigational data model
    » Scalability / data complexity
    » Data model: Key-Value pairs on Edges / Nodes
    » Relationships: Edges between Nodes
    » E.g.
      - Neo4j
      - Pregel (Google; used e.g. to compute PageRank)
      - AllegroGraph
  • Taxonomy
    Use Case: Connected data - deep relationship links between users in a social network
    » RDBMS:
      - Complex recursive algorithm
      - Multiple self joins
      - Round trips to DB / bulk read and resolve in RAM
    » NoSQL:
      - Graph storage
      - Network traversal
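
The "network traversal" side of this use case is, at its core, a breadth-first walk over relationships. The sketch below uses a plain adjacency list rather than a graph database API; the user names and the `withinHops` helper are hypothetical.

```javascript
// Adjacency list: user -> direct friends (hypothetical sample data).
const friends = new Map([
  ["ann", ["bob", "cid"]],
  ["bob", ["ann", "dee"]],
  ["cid", ["ann"]],
  ["dee", ["bob", "eve"]],
  ["eve", ["dee"]],
]);

// Breadth-first traversal: everyone within maxDepth hops of start,
// mapped to their distance in hops.
function withinHops(graph, start, maxDepth) {
  const distance = new Map([[start, 0]]);
  let frontier = [start];
  for (let depth = 1; depth <= maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const user of frontier) {
      for (const friend of graph.get(user) || []) {
        if (!distance.has(friend)) {
          distance.set(friend, depth);
          next.push(friend);
        }
      }
    }
    frontier = next;
  }
  distance.delete(start); // the start user is not their own contact
  return distance;
}
```

`withinHops(friends, "ann", 2)` yields bob and cid at distance 1 and dee at distance 2; in a relational schema each extra hop would typically cost another self join or round trip.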
  • Taxonomy
    Graph e.g.: Neo4j
    » High-performance graph engine
    » Embedded / disk based
    » Work with OO model: nodes, relationships, properties
    » ACID Transactions
      - JTA support – participate in 2PC with your RDBMS
    » Developed by: Neo Technology
    » Written in: Java
    » Clients: Java, client libraries in other platforms
  • Graph e.g.: Neo4j
    http://neo4j.org/
  • Comparing Apples to Oranges
  • Comparing Apples to Oranges
    Comparing Data Structures
    » RDBMS
      - Databases contain tables, columns and rows
      - All rows have the same structure
      - Inherent ORM mismatch
    » NoSQL
      - Choose your data structure
      - Data is stored in its natural structure (e.g. Documents, Graphs, Objects)
  • Comparing Apples to Oranges
    Comparing Schema Flexibility
    » RDBMS
      - Strict schema, difficult to evolve
      - Maintains relations and enforces data integrity
    » NoSQL
      - Structure of data can be changed dynamically
        • e.g. Column stores – Cassandra
      - Data can sometimes be completely opaque
        • e.g. Key/Value – Project Voldemort
  • Comparing Apples to Oranges
    Comparing Normalization & Relations
    » RDBMS
      - The data model is normalized to remove data duplication
      - Normalization establishes table relations
    » NoSQL
      - Denormalization is not a dirty word
      - Relations are not explicitly defined
      - Related data is usually grouped and stored as one unit
        • e.g. document, column
  • Comparing Apples to Oranges
    Comparing Data Access
    » RDBMS
      - CRUD operations using SQL
      - Access data from multiple tables using SQL joins
      - Generic API such as JDBC
    » NoSQL
      - Proprietary APIs and DSLs (e.g. Pig / Hive / Gremlin)
      - MapReduce, graph traversals
      - REST APIs, portable serialization formats
        • BSON, JSON, Apache Thrift, Memcached
  • Comparing Apples to Oranges
    Comparing Reporting Capabilities
    » RDBMS
      - Slice and dice data, then reassemble any way you like
    » NoSQL
      - Hard to repurpose data for ad-hoc usage
        • Plan ahead
      - Think in advance
        • How and what you store
        • Data access patterns
  • Summary
  • Summary
    Why NOSQL / BASE
    » ACID ruled exclusively for the last 40 years
      - doesn’t compromise on consistency
    » Database industry neglected distributed DBs with availability
    » Vacuum was filled with “NoSQL” BASE architectures
      - Strict A and P, minimize C compromise
    » Relational databases are now trying to catch up
  • Summary
    NoSQL Limitations
    » Missing some query capabilities
      - joins / composite transactions
    » Eventual consistency -- not for every problem
    » Not a drop-in replacement for RDBMS “on ACID”
    » No standardization -> product lock-in
    » Relatively immature (support, bugs, community)
  • Summary
    Choose the right tool for the job
    » Relational databases and NoSQL databases are designed to meet different needs
    » RDBMS-only should not be a default
    » NOSQL databases outperform RDBMSs in their particular niche
    » No one-size-fits-all / silver bullet
    ...but you don’t have to choose one
  • Summary
    Polyglot Persistence
    » Poly: many; Glot: language
    » Meshing up persistence mechanisms to best meet requirements
    » Good integration stories:
      - E.g. Neo4j + JDBC using JTA