NOSQL
Agenda
 Introduction to NOSQL
 Objective
 Examples of NOSQL databases
 NOSQL vs SQL
 Conclusion
Basic Concepts

 Database – an organized collection of data.
 Database Management System (DBMS) – a software
  package of computer programs that controls the
  creation, maintenance & use of a database.
     For a DBMS, we use a structured language to interact with it.
     Ex. Oracle, IBM DB2, MS Access, MySQL, FoxPro, etc.
 Relational DBMS - A relational database is a
  collection of data items organized as a set of formally
  described tables from which data can be accessed easily.
  A relational database is created using the relational
  model. The software used in a relational database is
  called a relational database management
  system (RDBMS).
SQL

 Structured Query Language
 Special-purpose programming language designed for
    managing data in an RDBMS.
   Originally based upon relational algebra & tuple relational
    calculus.
   SQL’s scope includes data insert, update & delete, schema
    creation and modification, and data access control.
   It is statically and strongly typed.
   The most widely used database language.
   The query is the most important operation in SQL.
   Ex. SELECT *
         FROM Book
         WHERE price > 100.00
         ORDER BY title;
NOSQL

 Stands for Not Only SQL
 Class of non-relational data storage systems
 Usually do not require a fixed table schema nor do
  they use the concept of joins
 All NOSQL offerings relax one or more of the ACID
  properties.
    Atomicity, Consistency, Isolation, Durability (ACID)
 “NOSQL” = “Not Only SQL” =
       Not Only using traditional relational DBMS
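Because most NoSQL stores have no JOIN operator (as noted above), related records are combined in application code instead. A minimal sketch, using hypothetical in-memory collections standing in for two stores:

```python
# Application-side "join": with no JOIN in the store, related records
# are combined in application code, one key lookup per related record.
# (The users/orders data below is made up for illustration.)
users = {"u1": {"name": "Alice"}, "u2": {"name": "Bob"}}
orders = [
    {"user_id": "u1", "total": 30},
    {"user_id": "u1", "total": 70},
    {"user_id": "u2", "total": 15},
]

def orders_with_names(users, orders):
    # For each order, fetch the owning user by key and merge the fields.
    return [
        {"name": users[o["user_id"]]["name"], "total": o["total"]}
        for o in orders
    ]

joined = orders_with_names(users, orders)
```

This is why document stores favour embedding related data inside one document: it avoids these extra per-record fetches.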
NOSQL

•   Alternative to traditional relational DBMS
    •   Flexible schema
    •   Quicker/cheaper to set up
    •   Massive scalability
    •   Relaxed consistency → higher performance &
        availability

    * No declarative query language → more programming
    * Relaxed consistency → fewer guarantees
Why NOSQL?


 Not every problem can be solved by a traditional
    relational database system alone.
   Handles huge databases.
   Redundancy, data is pretty safe on commodity
    hardware
   Super flexible queries using map/reduce
   Rapid development (no fixed schema, yeah!)
   Very fast for common use cases
Contd..


 Inspired by Distributed Data Storage problems
 Scale easily by adding servers
 Not suited to all problem types, but super-suited to
  certain large problem types
 High-write situations (e.g. activity tracking or timeline
  rendering for millions of users)
 A lot of relational uses are really dumbed down (e.g.
  fetch by PK, then update)
Architecture
How does it work?

 Clients know how to:
  Send items to servers (consistent hashing)
  What to do when a server fails
  How to fetch keys from servers
  Weight requests according to server capacities

 Servers know how to:
  Store items they receive
  Expire them from the cache
  No inter-server communication – each server is unaware of the others
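The client-side routing above can be sketched with consistent hashing: each server is placed at many points on a hash ring, and a key goes to the first server clockwise from the key's hash, so adding or removing a server remaps only a small fraction of keys. A minimal sketch (server names are made up):

```python
import hashlib
from bisect import bisect

def _hash(value: str) -> int:
    # Stable hash of a string to a large integer (ring position).
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, replicas=100):
        # Place each server at `replicas` virtual points on the ring
        # to spread load more evenly.
        self._ring = sorted(
            (_hash(f"{s}#{i}"), s) for s in servers for i in range(replicas)
        )
        self._points = [p for p, _ in self._ring]

    def server_for(self, key: str) -> str:
        # First server clockwise from the key's hash (wrap around at the end).
        idx = bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
target = ring.server_for("user:42")  # the same key always maps to the same server
```

Since routing is a pure function of the key and the server list, clients need no coordination with the servers, matching the "no inter-server communication" design above.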
Performance

 An RDBMS uses buffering and logging to ensure the ACID properties
 NoSQL does not guarantee ACID and is therefore
  much faster
 We don’t need ACID everywhere!
 Ex. Data processing (every minute) is 4x faster with
  MongoDB, despite being a lot more detailed (due to
  much simpler development)
Why is NOSQL faster than SQL? – Scaling

 Simple web application with not much traffic
   Application server and database server all on one machine
Scaling contd..

 More traffic comes in
   Application server

   Database server




 Even more traffic comes in
   Load balancer

   Application server x2

   Database server
Scaling contd..


 Even more traffic comes in
     Load balancer x N
       easy
     Application server x N
       easy
     Database server x N
       hard for SQL databases
SQL Slowdown




 Not linear!
Scaling contd..


 NoSQL scaling:
 Need more storage?
   Add more servers!

 Need higher performance?
   Add more servers!

 Need better reliability?
   Add more servers!
Scaling Summary

 You can scale SQL databases (Oracle, MySQL, SQL
  Server…)
     This will cost you dearly
     If you don’t have a lot of money, you will reach limits quickly
 You can scale NoSQL databases
   Very easy horizontal scaling

   Lots of open-source solutions

   Scaling is one of the basic incentives of the design, so it is well
    handled
   Scaling forces trade-offs, such as having to use
    map/reduce instead of ad-hoc queries
Characteristics

 Almost infinite horizontal scaling
 Very fast
 Performance doesn’t deteriorate with growth (much)
 No fixed table schemas
 No join operations
 Ad-hoc queries difficult or impossible
 Structured storage
 Almost everything happens in RAM
NOSQL Types


 Wide Column Store / Column Families
 Document Store
 Key Value / Tuple Store
 Graph Databases
 Object Databases
 XML Databases
 Multivalue Databases
Main types -

 Key-Value Stores
 Map Reduce Framework
 Document Databases
 Graph Databases
Key Value Stores

 Lineage: Amazon's Dynamo paper and Distributed
  HashTables.
 Data model: A global collection of key-value pairs
 Example systems
   Google BigTable, Amazon Dynamo, Cassandra,
     Voldemort, HBase, …
 Implementation: efficiency, scalability, fault-tolerance
   Records distributed to nodes based on key
   Replication

   Single-record transactions, “eventual consistency”
Document Databases

 Lineage: Inspired by Lotus Notes.
 Data model: Collections of documents; each document
  is itself a collection of key-value pairs.
 Example: CouchDB, MongoDB, Riak
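As a concrete illustration of the data model (all field names below are hypothetical), a document is a self-contained, nested key-value structure with no fixed schema, which serializes naturally to JSON:

```python
import json

# A hypothetical document as it might be stored in a document database:
# nested key-value pairs, related data embedded rather than joined.
book = {
    "_id": "book-1",
    "title": "Some NoSQL Title",
    "price": 120.0,
    "tags": ["databases", "nosql"],           # arrays are fine
    "publisher": {"name": "Acme", "city": "Pune"},  # embedded sub-document
}

# Documents round-trip through JSON for storage and transport.
stored = json.dumps(book)
restored = json.loads(stored)
```

Two documents in the same collection need not share fields, which is what "no fixed table schema" means in practice.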
Graph Database

 Lineage: Draws from Euler and graph theory.
 Data model: Nodes & relationships, both of which can
  hold key-value pairs
 Example: AllegroGraph, InfoGrid, Neo4j
Map Reduce Framework

 Google’s framework for processing highly
  distributable problems across huge datasets
  using a large number of computers
 Let’s define “a large number of computers”:
    Cluster if all of them have the same hardware
    Grid otherwise (if !Cluster, for old-style programmers)
 Process split into two phases
   Map
      Take the input, partition it, delegate parts to other machines
      Other machines can repeat the process, leading to a tree structure
      Each machine returns its results to the machine that gave it the task
Map Reduce Framework contd..

   Reduce
     Collect results from the machines you gave tasks to
     Combine the results and return them to the requester

   Slower than sequential data processing, but massively parallel
   Can sort a petabyte of data in a few hours
   Input, Map, Shuffle, Reduce, Output
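The Input → Map → Shuffle → Reduce → Output pipeline above can be sketched in a single process with the classic word count; real frameworks run the map tasks on many machines and shuffle over the network, but the phases are the same:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit (key, value) pairs; for word count, (word, 1).
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a final result.
    return key, sum(values)

documents = ["to be or not to be", "to do"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# counts["to"] == 3, counts["be"] == 2
```

Because each map call and each reduce call touches independent data, every phase parallelizes across machines, which is the point of the framework.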
Popular NoSQL


 Hadoop / HBase
 Cassandra
 Amazon SimpleDB
 MongoDB
 CouchDB
 Redis
 MemcacheDB
 Voldemort
 Hypertable
 Cloudata
 IBM Lotus/Domino
Real World Use

 Cassandra
   Facebook (original developer, used it till late 2010)
   Twitter
   Digg
   Reddit
   Rackspace
   Cisco

 BigTable
   Google (open-source version is HBase)

 MongoDB
   Foursquare
   Craigslist
   Bit.ly
   SourceForge
   GitHub
MONGODB

  Document store
  Basic support for dynamic (ad hoc) queries
  Query by example (nice!)




 Conditional Operators
    <, <=, >, >=
    $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $and,
     $size, $type
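To show how these operators behave, here is a tiny in-memory matcher that mimics the query-by-example shape (`{field: {"$op": value}}`). This is an illustrative sketch only, not MongoDB's real query engine:

```python
# Toy matcher for a few MongoDB-style conditional operators.
OPS = {
    "$gt":  lambda a, b: a > b,
    "$gte": lambda a, b: a >= b,
    "$lt":  lambda a, b: a < b,
    "$lte": lambda a, b: a <= b,
    "$ne":  lambda a, b: a != b,
    "$in":  lambda a, b: a in b,
    "$nin": lambda a, b: a not in b,
}

def matches(doc, query):
    for field, cond in query.items():
        if isinstance(cond, dict):
            # Operator form, e.g. {"price": {"$gt": 100}}
            if not all(OPS[op](doc.get(field), arg) for op, arg in cond.items()):
                return False
        elif doc.get(field) != cond:
            # Plain query-by-example: field must equal the given value.
            return False
    return True

books = [{"title": "A", "price": 50}, {"title": "B", "price": 150}]
expensive = [b for b in books if matches(b, {"price": {"$gt": 100}})]
```

The query itself is just a document, which is why query-by-example feels natural in a document store.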
MONGODB

 Data is stored as BSON (binary JSON)
     Makes it very well suited for languages with native JSON support
 Map/Reduce written in JavaScript
     Slow! There is only a single thread of execution in JavaScript
 Master/slave replication (auto failover with replica sets)
 Sharding built-in
 Uses memory mapped files for data storage
 Performance over features
 On 32-bit systems, limited to ~2.5 GB
 An empty database takes up 192 MB
 GridFS to store big data + metadata (not actually an FS)
CASSANDRA

 Written in: Java
 Protocol: Custom, binary (Thrift)
 Tunable trade-offs for distribution and replication
  (N, R, W)
 Querying by column, range of keys
 BigTable-like features: columns, column families
 Writes are much faster than reads (!)
    Constant write time regardless of database size
 Map/reduce possible with Apache Hadoop
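The (N, R, W) trade-off above can be checked arithmetically: with N replicas, writes acknowledged by W of them and reads consulting R of them, a read is guaranteed to see the latest write exactly when R + W > N, because the read set must then overlap the write set. A small sketch of that rule:

```python
# Tunable replication: N replicas, W acks per write, R replicas per read.
# If R + W > N, every read quorum overlaps every write quorum, so a read
# always includes at least one up-to-date replica; otherwise reads may
# be stale (eventual consistency).

def read_overlaps_write(n: int, r: int, w: int) -> bool:
    return r + w > n

# Typical settings for N = 3:
quorum_consistent = read_overlaps_write(3, 2, 2)  # True: strong reads
one_one_fast = read_overlaps_write(3, 1, 1)       # False: fast, eventual
```

Lowering R or W buys latency and availability at the cost of consistency, which is exactly the "tunable trade-off" the slide describes.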
More about Cassandra at Facebook

 Cassandra is an open-source DBMS from the Apache
  Software Foundation.
 Cassandra provides a structured key-value
  store with tunable consistency
 Cassandra is a distributed storage system for
  managing structured data that is designed to scale to
  a very large size across many commodity
  servers, with no single point of failure
 It is a NoSQL solution that was initially developed
  by Facebook and powered their Inbox Search feature
  until late 2010
HBASE

 Written in: Java
 Main point: Billions of rows X millions of columns
 Modeled after BigTable
 Map/reduce with Hadoop
 Query predicate push down via server side scan and get filters
 Optimizations for real time queries
 A high performance Thrift gateway
 HTTP gateway supports XML, Protobuf, and binary
 Cascading, Hive, and Pig source and sink modules
 No single point of failure
 While Hadoop streams data efficiently, it has overhead for
  starting map/reduce jobs. HBase is a column-oriented
  key/value store and allows for low-latency reads and writes.
 Random access performance is like MySQL
COUCHDB

 Written in: Erlang
 Main point: DB consistency, ease of use
 Bi-directional (!) replication, continuous or ad-hoc, with conflict
    detection, thus, master-master replication. (!)
   MVCC - write operations do not block reads
   Previous versions of documents are available
   Crash-only (reliable) design
   Needs compacting from time to time
   Views: embedded map/reduce
   Formatting views: lists & shows
   Server-side document validation possible
   Authentication possible
   Real-time updates via _changes (!)
   Attachment handling
   CouchApps (standalone JS apps)
HADOOP

 Apache project
 A framework that allows for the distributed processing of
    large data sets across clusters of computers
   Designed to scale up from single servers to thousands of
    machines
   Designed to detect and handle failures at the application
    layer, instead of relying on hardware for it
   Created by Doug Cutting, who named it after his son's toy
    elephant
   Hadoop subprojects
       Cassandra
       HBase
       Pig
   Hive was a Hadoop subproject, but is now a top-level Apache project
HADOOP contd..

 Scales to hundreds or thousands of computers, each with several
    processor cores
   Designed to efficiently distribute large amounts of work across a
    set of machines
   Hundreds of gigabytes of data constitute the low end of
    Hadoop-scale
   Built to process "web-scale" data on the order of hundreds of
    gigabytes to terabytes or petabytes
   Uses Java, but allows streaming so other languages can easily
    send and accept data items to/from Hadoop
HADOOP contd..

 Uses distributed file system (HDFS)
   Designed to hold very large amounts of data (terabytes or even
    petabytes)
   Files are stored in a redundant fashion across multiple
    machines to ensure their durability to failure and high
    availability to very parallel applications
   Data organized into directories and files

   Files are divided into blocks (64 MB by default) and distributed
    across nodes
 Design of HDFS is based on the design of the Google
  File System
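The fixed block size above makes the block layout of a file simple arithmetic; a minimal sketch of how many blocks a file occupies (the sizes chosen are just examples):

```python
import math

# HDFS-style splitting: a file is stored as fixed-size blocks
# (64 MB by default, per the slide), with a partial final block.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB in bytes

def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    # Number of blocks needed to hold file_size bytes.
    return math.ceil(file_size / block_size) if file_size else 0

blocks_1gb = block_count(1024 * 1024 * 1024)   # 1 GB -> 16 full blocks
blocks_200mb = block_count(200 * 1024 * 1024)  # 200 MB -> 3 full + 1 partial
```

Each of these blocks is then replicated across several machines, which is what gives HDFS its durability and parallel read bandwidth.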
HIVE

 A petabyte-scale data warehouse system for Hadoop
 Easy data summarization, ad-hoc queries
 Query the data using a SQL-like language called
  HiveQL
 Hive compiler generates map-reduce jobs for most
  queries
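To illustrate the compilation idea (a sketch, not Hive's actual planner), a HiveQL aggregate such as `SELECT key, SUM(val) ... GROUP BY key` maps directly onto map and reduce phases:

```python
from collections import defaultdict

# How a SQL-like "GROUP BY ... SUM" becomes a map/reduce job:
# map emits (group key, value), the shuffle groups by key,
# and reduce computes the aggregate per group.
rows = [("books", 2), ("toys", 5), ("books", 3)]  # made-up input table

def map_task(row):
    key, val = row
    return key, val           # map: emit (group key, value)

def reduce_task(key, values):
    return key, sum(values)   # reduce: SUM(val) per group

groups = defaultdict(list)
for key, val in (map_task(r) for r in rows):  # shuffle: group by key
    groups[key].append(val)

result = dict(reduce_task(k, v) for k, v in groups.items())
# result == {"books": 5, "toys": 5}
```

This is why a declarative HiveQL query can run over petabytes: the generated map and reduce tasks parallelize across the whole Hadoop cluster.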
Conclusion

 NoSQL is a great problem solver if you need it
 Choose your NoSQL platform carefully, as each is
  designed for a specific purpose
 Get used to Map/Reduce
 It’s not a sin to use NoSQL alongside (yes)SQL
  database
References

 http://www.facebook.com/note.php?note_id=24413138919
   http://en.wikipedia.org/wiki/Apache_Cassandra
   http://en.wikipedia.org/wiki/SQL
   http://en.wikipedia.org/wiki/NoSQL
   www.slideshare.com
THANK
YOU..!!
