NOSQL
Agenda
 Introduction to NOSQL
 Objective
 Examples of NOSQL databases
 NOSQL vs SQL
 Conclusion
Basic Concepts

 Database – an organized collection of data.
 Database Management System (DBMS) – a software
  package of computer programs that controls the
  creation, maintenance & use of a database.
     For a DBMS, we use a structured language to interact with it.
     Ex. Oracle, IBM DB2, MS Access, MySQL, FoxPro, etc.
 Relational DBMS - A relational database is a
  collection of data items organized as a set of formally
  described tables from which data can be accessed easily.
  A relational database is created using the relational
  model. The software used in a relational database is
  called a relational database management
  system (RDBMS).
SQL

 Structured Query Language
 Special-purpose programming language designed for
    managing data in an RDBMS.
   Originally based upon relational algebra & tuple relational
    calculus.
   SQL’s scope includes data insert, update & delete, schema
    creation and modification, and data access control.
   It is statically and strongly typed.
   The most widely used database language.
   The query is the most important operation in SQL.
   Ex. SELECT *
         FROM Book
         WHERE price > 100.00
         ORDER BY title;
NOSQL

 Stands for Not Only SQL
 Class of non-relational data storage systems
 Usually do not require a fixed table schema nor do
  they use the concept of joins
 All NOSQL offerings relax one or more of the ACID
  properties.
    Atomicity, Consistency, Isolation, Durability (ACID)
 “NOSQL” = “Not Only SQL” =
       Not Only using traditional relational DBMS
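Because most NoSQL stores have no JOIN operator (as noted above), related records are combined in application code instead. A minimal sketch, using hypothetical in-memory collections standing in for two stores:

```python
# Application-side "join": with no JOIN in the store, related records
# are combined in application code, one key lookup per related record.
# (The users/orders data below is made up for illustration.)
users = {"u1": {"name": "Alice"}, "u2": {"name": "Bob"}}
orders = [
    {"user_id": "u1", "total": 30},
    {"user_id": "u1", "total": 70},
    {"user_id": "u2", "total": 15},
]

def orders_with_names(users, orders):
    # For each order, fetch the owning user by key and merge the fields.
    return [
        {"name": users[o["user_id"]]["name"], "total": o["total"]}
        for o in orders
    ]

joined = orders_with_names(users, orders)
```

This is why document stores favour embedding related data inside one document: it avoids these extra per-record fetches.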
NOSQL

•   Alternative to traditional relational DBMS
    •   Flexible schema
    •   Quicker/cheaper to set up
    •   Massive scalability
    •   Relaxed consistency → higher performance &
        availability

    * No declarative query language → more programming
    * Relaxed consistency → fewer guarantees
Why NOSQL?


 Not every problem can be solved by a traditional
    relational database system alone.
   Handles huge databases.
   Redundancy, data is pretty safe on commodity
    hardware
   Super flexible queries using map/reduce
   Rapid development (no fixed schema, yeah!)
   Very fast for common use cases
Contd..


 Inspired by Distributed Data Storage problems
 Scale easily by adding servers
 Not suited to all problem types, but super-suited to
  certain large problem types
 High-write situations (e.g. activity tracking or timeline
  rendering for millions of users)
 A lot of relational uses are really dumbed down (e.g.
  fetch by PK, then update)
Architecture
How does it work?

 Clients know how to:
  Send items to servers (consistent hashing)
  What to do when a server fails
  How to fetch keys from servers
  Weight requests according to server capacities

 Servers know how to:
  Store items they receive
  Expire them from the cache
  No inter-server communication – each server is unaware of the others
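The client-side routing above can be sketched with consistent hashing: each server is placed at many points on a hash ring, and a key goes to the first server clockwise from the key's hash, so adding or removing a server remaps only a small fraction of keys. A minimal sketch (server names are made up):

```python
import hashlib
from bisect import bisect

def _hash(value: str) -> int:
    # Stable hash of a string to a large integer (ring position).
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, replicas=100):
        # Place each server at `replicas` virtual points on the ring
        # to spread load more evenly.
        self._ring = sorted(
            (_hash(f"{s}#{i}"), s) for s in servers for i in range(replicas)
        )
        self._points = [p for p, _ in self._ring]

    def server_for(self, key: str) -> str:
        # First server clockwise from the key's hash (wrap around at the end).
        idx = bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
target = ring.server_for("user:42")  # the same key always maps to the same server
```

Since routing is a pure function of the key and the server list, clients need no coordination with the servers, matching the "no inter-server communication" design above.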
Performance

 An RDBMS uses buffering and logging to ensure the ACID properties
 NoSQL does not guarantee ACID and is therefore
  much faster
 We don’t need ACID everywhere!
 Ex. Data processing (every minute) is 4x faster with
  MongoDB, despite being a lot more detailed (due to
  much simpler development)
Why is NOSQL faster than SQL? – Scaling

 Simple web application with not much traffic
   Application server and database server all on one machine
Scaling contd..

 More traffic comes in
   Application server

   Database server




 Even more traffic comes in
   Load balancer

   Application server x2

   Database server
Scaling contd..


 Even more traffic comes in
     Load balancer x N
       easy
     Application server x N
       easy
     Database server x N
       hard for SQL databases
SQL Slowdown




 Not linear!
Scaling contd..


 NoSQL scaling:
 Need more storage?
   Add more servers!

 Need higher performance?
   Add more servers!

 Need better reliability?
   Add more servers!
Scaling Summary

 You can scale SQL databases (Oracle, MySQL, SQL
  Server…)
     This will cost you dearly
     If you don’t have a lot of money, you will reach limits quickly
 You can scale NoSQL databases
   Very easy horizontal scaling

   Lots of open-source solutions

   Scaling is one of the basic incentives of the design, so it is well
    handled
   Scaling forces trade-offs, such as having to use
    map/reduce instead of ad-hoc queries
Characteristics

 Almost infinite horizontal scaling
 Very fast
 Performance doesn’t deteriorate with growth (much)
 No fixed table schemas
 No join operations
 Ad-hoc queries difficult or impossible
 Structured storage
 Almost everything happens in RAM
NOSQL Types


 Wide Column Store / Column Families
 Document Store
 Key Value / Tuple Store
 Graph Databases
 Object Databases
 XML Databases
 Multivalue Databases
Main types -

 Key-Value Stores
 Map Reduce Framework
 Document Databases
 Graph Databases
Key Value Stores

 Lineage: Amazon's Dynamo paper and Distributed
  HashTables.
 Data model: A global collection of key-value pairs
 Example systems
   Google BigTable, Amazon Dynamo, Cassandra,
     Voldemort, HBase, …
 Implementation: efficiency, scalability, fault-tolerance
   Records distributed to nodes based on key
   Replication

   Single-record transactions, “eventual consistency”
Document Databases

 Lineage: Inspired by Lotus Notes.
 Data model: Collections of documents; each document
  is itself a collection of key-value pairs.
 Example: CouchDB, MongoDB, Riak
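As a concrete illustration of the data model (all field names below are hypothetical), a document is a self-contained, nested key-value structure with no fixed schema, which serializes naturally to JSON:

```python
import json

# A hypothetical document as it might be stored in a document database:
# nested key-value pairs, related data embedded rather than joined.
book = {
    "_id": "book-1",
    "title": "Some NoSQL Title",
    "price": 120.0,
    "tags": ["databases", "nosql"],           # arrays are fine
    "publisher": {"name": "Acme", "city": "Pune"},  # embedded sub-document
}

# Documents round-trip through JSON for storage and transport.
stored = json.dumps(book)
restored = json.loads(stored)
```

Two documents in the same collection need not share fields, which is what "no fixed table schema" means in practice.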
Graph Database

 Lineage: Draws from Euler and graph theory.
 Data model: Nodes & relationships, both of which can
  hold key-value pairs
 Example: AllegroGraph, InfoGrid, Neo4j
Map Reduce Framework

 Google’s framework for processing highly
  distributable problems across huge datasets
  using a large number of computers
 Let’s define “a large number of computers”:
    Cluster if all of them have the same hardware
    Grid otherwise (if !Cluster, for old-style programmers)
 Process split into two phases
   Map
      Take the input, partition it, delegate parts to other machines
      Other machines can repeat the process, leading to a tree structure
      Each machine returns its results to the machine that gave it the task
Map Reduce Framework contd..

   Reduce
     Collect results from the machines you gave tasks to
     Combine the results and return them to the requester

   Slower than sequential data processing, but massively parallel
   Can sort a petabyte of data in a few hours
   Input, Map, Shuffle, Reduce, Output
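The Input → Map → Shuffle → Reduce → Output pipeline above can be sketched in a single process with the classic word count; real frameworks run the map tasks on many machines and shuffle over the network, but the phases are the same:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit (key, value) pairs; for word count, (word, 1).
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a final result.
    return key, sum(values)

documents = ["to be or not to be", "to do"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# counts["to"] == 3, counts["be"] == 2
```

Because each map call and each reduce call touches independent data, every phase parallelizes across machines, which is the point of the framework.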
Popular NoSQL


 Hadoop / HBase
 Cassandra
 Amazon SimpleDB
 MongoDB
 CouchDB
 Redis
 MemcacheDB
 Voldemort
 Hypertable
 Cloudata
 IBM Lotus/Domino
Real World Use

 Cassandra
   Facebook (original developer, used it till late 2010)
   Twitter
   Digg
   Reddit
   Rackspace
   Cisco

 BigTable
   Google (open-source version is HBase)

 MongoDB
   Foursquare
   Craigslist
   Bit.ly
   SourceForge
   GitHub
MONGODB

  Document store
  Basic support for dynamic (ad hoc) queries
  Query by example (nice!)




 Conditional Operators
    <, <=, >, >=
    $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $and,
     $size, $type
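To show how these operators behave, here is a tiny in-memory matcher that mimics the query-by-example shape (`{field: {"$op": value}}`). This is an illustrative sketch only, not MongoDB's real query engine:

```python
# Toy matcher for a few MongoDB-style conditional operators.
OPS = {
    "$gt":  lambda a, b: a > b,
    "$gte": lambda a, b: a >= b,
    "$lt":  lambda a, b: a < b,
    "$lte": lambda a, b: a <= b,
    "$ne":  lambda a, b: a != b,
    "$in":  lambda a, b: a in b,
    "$nin": lambda a, b: a not in b,
}

def matches(doc, query):
    for field, cond in query.items():
        if isinstance(cond, dict):
            # Operator form, e.g. {"price": {"$gt": 100}}
            if not all(OPS[op](doc.get(field), arg) for op, arg in cond.items()):
                return False
        elif doc.get(field) != cond:
            # Plain query-by-example: field must equal the given value.
            return False
    return True

books = [{"title": "A", "price": 50}, {"title": "B", "price": 150}]
expensive = [b for b in books if matches(b, {"price": {"$gt": 100}})]
```

The query itself is just a document, which is why query-by-example feels natural in a document store.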
MONGODB

 Data is stored as BSON (binary JSON)
     Makes it very well suited for languages with native JSON support
 Map/Reduce written in JavaScript
     Slow! There is only a single thread of execution in JavaScript
 Master/slave replication (auto failover with replica sets)
 Sharding built-in
 Uses memory mapped files for data storage
 Performance over features
 On 32-bit systems, limited to ~2.5 GB
 An empty database takes up 192 MB
 GridFS to store big data + metadata (not actually an FS)
CASSANDRA

 Written in: Java
 Protocol: Custom, binary (Thrift)
 Tunable trade-offs for distribution and replication
  (N, R, W)
 Querying by column, range of keys
 BigTable-like features: columns, column families
 Writes are much faster than reads (!)
    Constant write time regardless of database size
 Map/reduce possible with Apache Hadoop
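The (N, R, W) trade-off above can be checked arithmetically: with N replicas, writes acknowledged by W of them and reads consulting R of them, a read is guaranteed to see the latest write exactly when R + W > N, because the read set must then overlap the write set. A small sketch of that rule:

```python
# Tunable replication: N replicas, W acks per write, R replicas per read.
# If R + W > N, every read quorum overlaps every write quorum, so a read
# always includes at least one up-to-date replica; otherwise reads may
# be stale (eventual consistency).

def read_overlaps_write(n: int, r: int, w: int) -> bool:
    return r + w > n

# Typical settings for N = 3:
quorum_consistent = read_overlaps_write(3, 2, 2)  # True: strong reads
one_one_fast = read_overlaps_write(3, 1, 1)       # False: fast, eventual
```

Lowering R or W buys latency and availability at the cost of consistency, which is exactly the "tunable trade-off" the slide describes.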
More about Cassandra at Facebook

 Cassandra is an open-source DBMS from the Apache
  Software Foundation.
 Cassandra provides a structured key-value
  store with tunable consistency
 Cassandra is a distributed storage system for
  managing structured data that is designed to scale to
  a very large size across many commodity
  servers, with no single point of failure
 It is a NoSQL solution that was initially developed
  by Facebook and powered their Inbox Search feature
  until late 2010
HBASE

 Written in: Java
 Main point: Billions of rows X millions of columns
 Modeled after BigTable
 Map/reduce with Hadoop
 Query predicate push down via server side scan and get filters
 Optimizations for real time queries
 A high performance Thrift gateway
 HTTP gateway supports XML, Protobuf, and binary
 Cascading, Hive, and Pig source and sink modules
 No single point of failure
 While Hadoop streams data efficiently, it has overhead for
  starting map/reduce jobs. HBase is a column-oriented
  key/value store and allows for low-latency reads and writes.
 Random access performance is like MySQL
COUCHDB

 Written in: Erlang
 Main point: DB consistency, ease of use
 Bi-directional (!) replication, continuous or ad-hoc, with conflict
    detection, thus, master-master replication. (!)
   MVCC - write operations do not block reads
   Previous versions of documents are available
   Crash-only (reliable) design
   Needs compacting from time to time
   Views: embedded map/reduce
   Formatting views: lists & shows
   Server-side document validation possible
   Authentication possible
   Real-time updates via _changes (!)
   Attachment handling
   CouchApps (standalone JS apps)
HADOOP

 Apache project
 A framework that allows for the distributed processing of
    large data sets across clusters of computers
   Designed to scale up from single servers to thousands of
    machines
   Designed to detect and handle failures at the application
    layer, instead of relying on hardware for it
   Created by Doug Cutting, who named it after his son's toy
    elephant
   Hadoop subprojects
       Cassandra
       HBase
       Pig
   Hive was a Hadoop subproject, but is now a top-level Apache project
HADOOP contd..

 Scales to hundreds or thousands of computers, each with several
    processor cores
   Designed to efficiently distribute large amounts of work across a
    set of machines
   Hundreds of gigabytes of data constitute the low end of
    Hadoop-scale
   Built to process "web-scale" data on the order of hundreds of
    gigabytes to terabytes or petabytes
   Uses Java, but allows streaming so other languages can easily
    send and accept data items to/from Hadoop
HADOOP contd..

 Uses distributed file system (HDFS)
   Designed to hold very large amounts of data (terabytes or even
    petabytes)
   Files are stored in a redundant fashion across multiple
    machines to ensure their durability to failure and high
    availability to very parallel applications
   Data organized into directories and files

   Files are divided into blocks (64 MB by default) and distributed
    across nodes
 Design of HDFS is based on the design of the Google
  File System
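The fixed block size above makes the block layout of a file simple arithmetic; a minimal sketch of how many blocks a file occupies (the sizes chosen are just examples):

```python
import math

# HDFS-style splitting: a file is stored as fixed-size blocks
# (64 MB by default, per the slide), with a partial final block.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB in bytes

def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    # Number of blocks needed to hold file_size bytes.
    return math.ceil(file_size / block_size) if file_size else 0

blocks_1gb = block_count(1024 * 1024 * 1024)   # 1 GB -> 16 full blocks
blocks_200mb = block_count(200 * 1024 * 1024)  # 200 MB -> 3 full + 1 partial
```

Each of these blocks is then replicated across several machines, which is what gives HDFS its durability and parallel read bandwidth.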
HIVE

 A petabyte-scale data warehouse system for Hadoop
 Easy data summarization, ad-hoc queries
 Query the data using a SQL-like language called
  HiveQL
 Hive compiler generates map-reduce jobs for most
  queries
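To illustrate the compilation idea (a sketch, not Hive's actual planner), a HiveQL aggregate such as `SELECT key, SUM(val) ... GROUP BY key` maps directly onto map and reduce phases:

```python
from collections import defaultdict

# How a SQL-like "GROUP BY ... SUM" becomes a map/reduce job:
# map emits (group key, value), the shuffle groups by key,
# and reduce computes the aggregate per group.
rows = [("books", 2), ("toys", 5), ("books", 3)]  # made-up input table

def map_task(row):
    key, val = row
    return key, val           # map: emit (group key, value)

def reduce_task(key, values):
    return key, sum(values)   # reduce: SUM(val) per group

groups = defaultdict(list)
for key, val in (map_task(r) for r in rows):  # shuffle: group by key
    groups[key].append(val)

result = dict(reduce_task(k, v) for k, v in groups.items())
# result == {"books": 5, "toys": 5}
```

This is why a declarative HiveQL query can run over petabytes: the generated map and reduce tasks parallelize across the whole Hadoop cluster.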
Conclusion

 NoSQL is a great problem solver if you need it
 Choose your NoSQL platform carefully, as each is
  designed for a specific purpose
 Get used to Map/Reduce
 It’s not a sin to use NoSQL alongside (yes)SQL
  database
References

 http://www.facebook.com/note.php?note_id=24413138919
   http://en.wikipedia.org/wiki/Apache_Cassandra
   http://en.wikipedia.org/wiki/SQL
   http://en.wikipedia.org/wiki/NoSQL
   www.slideshare.com
THANK
YOU..!!
