3. Agenda
• Before SQL and After SQL
• NoSQL universe
• Trend of NoSQL
• Characteristics of Big Data: 3V
• Where to use NoSQL
• What NoSQL must deliver
• Classification of NoSQL databases
• Size vs. Complexity
• Visual Guide to the CAP Theorem
• Overview of Key/Value Store
• Overview of Document Store
• Overview of Column Family Store
• Overview of Graph Store
• Use Case of Twitter
PolicyBazaar.com
4. Three Eras of Databases
Note: The era of using RDBMSes for all problems is over. Instead
we should use the database most suited for the problem at hand.
6. Big Data Definition
• Volumes & volumes of data
• Unstructured
• Semi-structured
• Not suited for Relational Databases
• Often utilizes MapReduce frameworks
12. RDBMS vs. NoSQL
Source: http://www.google.com/trends/explore#q=nosql%2C%20rdbms&date=1%2F2009%2051m&cmpt=q
13. NoSQL or SQL?
• Wrong question
• What is your problem?
– Transactions
– Amount of data
– Data structure
– Scale-out vs. scale-up
– OLTP or OLAP
14. What is your problem…
• Key Evaluation Requirements
– Transactions, durability, and consistency
– Response time
– Functionality
– Data characteristics
– Scalability, Clustering
– Failover
– Maintenance, Online changes, Node Management
– Maturity
– Community, Support
– Hosted or Managed
– Cost, open source
16. Characteristics of Big Data: 3V
• Volume: large volumes of data
– Today, Facebook ingests 500 terabytes of new data every day; a Boeing 737 will generate 240 terabytes of flight data during a single flight across the US
• Velocity: the rate at which data moves
– E.g. clickstreams and ad impressions capture user behavior at millions of events per second
• Variety: structured, semi-structured, unstructured, images, etc.
– Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media
Source: http://www-01.ibm.com/software/data/bigdata/
17. Many Uses of Data
• Transactions (OLTP)
• Analysis (OLAP)
• Search and Findability
• Enterprise Agility
• Speed and Reliability
• Consistency and Availability
• Or anything else…
18. Where to use NoSQL?
• Social data
• Data processing (Hadoop)
• Search (Lucene)
• Caching (Memcached, ...)
• Data Warehousing
• Logging
• ...
19. What NoSQL must deliver
• Massive scalability
– No application-level sharding
• Performance
• High Availability/Fault Tolerance
• Ease of use
– Simple operations/administration
– No application-level sharding
– Simple APIs
– Quickly evolve application & schema
20. Classification of NoSQL Databases
• Key-Value
– Very popular for simple key-value lookup, on disk or in memory, e.g. Dynamo, Redis, Riak, Voldemort, MemcacheDB, Berkeley DB, Hazelcast
• Document
– Popular for document-type storage, e.g. MongoDB, OrientDB, CouchDB
• Column Family
– Key-value with fixed column families; allows dynamic columns within a column family, e.g. Cassandra, Bigtable, HBase, Hypertable
• Graph
– Connected graph of entities and relationships, e.g. Titan, Neo4j, InfiniteGraph
22. NoSQL: Size vs. Complexity
Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
23. Visual Guide to NoSQL
Source: http://blog.nahurst.com/visual-guide-to-nosql-systems
24. Key-Value Store
• Focus on scaling to huge amounts of data
• Designed to handle massive load
• Based on Amazon’s Dynamo paper
• Data model: (global) collection of key-value pairs
• Dynamo ring partitioning and replication
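The ring partitioning and replication named in the last bullet can be illustrated with a small consistent-hashing sketch. This is not code from the Dynamo paper or from any particular store; the class name `HashRing`, the node names, the use of MD5, and the replica count of 2 are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a string to a position on the ring (the hash choice is an assumption)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring: a key belongs to the next nodes clockwise."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        # Sorted (position, node) pairs form the ring.
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def nodes_for(self, key: str):
        """Return the nodes that should hold `key`: the owner plus its successors."""
        positions = [pos for pos, _ in self.ring]
        start = bisect.bisect_right(positions, ring_position(key))
        count = min(self.replicas, len(self.ring))
        # Wrap around the end of the sorted list to close the ring.
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c"], replicas=2)
owners = ring.nodes_for("user:42")
```

Because a key's position depends only on its hash, adding or removing one node moves only the keys adjacent to that node on the ring, which is the property that makes this layout attractive for scale-out.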
25. Types of Key-Value Stores
• Eventually-consistent key-value store
• Hierarchical key-value stores
• Key-Value stores in RAM
• Key-Value stores on disk
• High availability key-value store
• Ordered key-value stores
• Values that allow simple list operations
26. Key / value stores (Opaque)
• Keys are mapped to values
• Values are treated as BLOBs (opaque data)
• No type information is stored
• Values can be heterogeneous
• Example values:
{ name: "ranjeet", age: 35, city: "DL" } => JSON, but the store will not care about it
\xde\xad\xb0\x0b => binary, but the store will not care about it
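A minimal sketch of the "opaque value" idea, using the two example values from this slide. The class `OpaqueKVStore` and the key names are hypothetical; the point is only that the store accepts bytes and never looks inside them.

```python
import json


class OpaqueKVStore:
    """Toy key-value store: values are bytes and are never interpreted."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        if not isinstance(value, bytes):
            raise TypeError("values must be bytes; the store keeps no type information")
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]


store = OpaqueKVStore()
# A JSON document and a raw binary value sit side by side; both are just blobs.
store.put("user:1", json.dumps({"name": "ranjeet", "age": 35, "city": "DL"}).encode("utf-8"))
store.put("blob:1", b"\xde\xad\xb0\x0b")

# Only the client knows that "user:1" holds JSON, so the client decodes it.
user = json.loads(store.get("user:1").decode("utf-8"))
```

The trade-off is visible here: the store stays simple and fast, but it cannot answer a question such as "which users live in DL" without the client reading and decoding every value.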
27. Redis
• Open source in-memory key-value store with optional durability
• Focus on high-speed reads and writes of common data structures in RAM
• Allows simple lists, sets and hashes to be stored within the value and manipulated
• Many features that developers like
– Expiration, transactions, pub/sub, partitioning
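Two of the features listed on this slide, list values that are manipulated in place and key expiration, can be sketched in plain Python. This is a toy illustration, not a client for any real store and not the Redis API; the class `MiniStore`, its method names, and the injected fake clock are assumptions.

```python
import time


class MiniStore:
    """Toy in-memory store with list values and key expiration."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}      # key -> value (a string or a list)
        self._expires = {}   # key -> absolute expiry time

    def _purge(self, key):
        # Lazily drop a key once its time-to-live has passed.
        if key in self._expires and self._clock() >= self._expires[key]:
            self._data.pop(key, None)
            self._expires.pop(key, None)

    def set(self, key, value, ttl=None):
        self._data[key] = value
        if ttl is None:
            self._expires.pop(key, None)
        else:
            self._expires[key] = self._clock() + ttl

    def get(self, key):
        self._purge(key)
        return self._data.get(key)

    def push(self, key, item):
        # The store knows the value is a list and appends to it in place.
        self._purge(key)
        self._data.setdefault(key, []).append(item)
        return len(self._data[key])


now = [0.0]                      # fake clock so the example is deterministic
store = MiniStore(clock=lambda: now[0])
store.set("session:1", "alice", ttl=30)
store.push("recent:alice", "page-1")
store.push("recent:alice", "page-2")
before = store.get("session:1")  # read while the key is still live
now[0] = 31.0                    # advance the fake clock past the TTL
after = store.get("session:1")   # read after the key has expired
```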
28. BigTable clones
• Like column-oriented relational databases, but with a twist
• Tables similar to an RDBMS, but they handle semi-structured data
• Based on Google’s BigTable paper
29. Document Store
• Data stored in nested hierarchies
• Logical data remains stored together as a unit
• Any item in the document can be queried
• Similar to key-value stores, but the DB knows what the value is
• Inspired by Lotus Notes
• Documents are often versioned
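The claim that any item in a nested document can be queried can be sketched as follows. This is an illustrative toy, not the query API of any document database; `DocumentCollection`, the dotted-path convention, and the sample records are assumptions.

```python
class DocumentCollection:
    """Toy document collection: nested dicts queried by dotted field path."""

    def __init__(self):
        self._docs = []

    def insert(self, doc: dict) -> None:
        self._docs.append(doc)

    @staticmethod
    def _lookup(doc, path):
        # Walk a dotted path such as "address.city" through nested dicts.
        current = doc
        for part in path.split("."):
            if not isinstance(current, dict) or part not in current:
                return None
            current = current[part]
        return current

    def find(self, path: str, value):
        """Return every document whose field at `path` equals `value`."""
        return [d for d in self._docs if self._lookup(d, path) == value]


people = DocumentCollection()
# Documents in one collection need not share the same fields (made-up data).
people.insert({"name": "ranjeet", "address": {"city": "DL", "pin": "110001"}})
people.insert({"name": "asha", "address": {"city": "MUM"}, "tags": ["admin"]})

in_delhi = people.find("address.city", "DL")
```

Unlike an opaque key-value store, the collection understands the structure of each value, so it can filter on a nested field while each document still stays together as one unit.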
30. Document Store …
• Data model: collections of key-value collections
• Pros: no object-relational mapping layer, ideal for search, schemaless
• Cons: complex to implement, incompatible with SQL
• Examples: MongoDB, Couchbase, CouchDB
31. MongoDB (Document Store)
• Open source JSON data store created by 10gen
• Master-slave scale-out model
• Strong developer community
• Sharding built in, automatic
• Implemented in C++ with many APIs (C++, JavaScript, Java, .NET, Perl, Python, etc.)
32. Column-Family
• Key includes a row, column family and column name
• Stores versioned blobs in one large table
• Queries can be done on rows, column families and column names
• Pros: great scale-out, performant, versioning
• Cons: cannot query blob content; row and column designs are critical
• Examples: Cassandra, Bigtable, HBase, Hypertable, Apache Accumulo
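The data model described on this slide, a key made of row, column family and column name that points at versioned blobs, can be sketched like this. It is a single-process toy, not the API of Cassandra, HBase or Bigtable; `ColumnFamilyTable`, the integer timestamps, and the sample rows are assumptions.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy table keyed by (row, column family, column), keeping versioned values."""

    def __init__(self):
        # (row, family, column) -> list of (timestamp, value)
        self._cells = defaultdict(list)

    def put(self, row, family, column, value, timestamp):
        self._cells[(row, family, column)].append((timestamp, value))

    def get(self, row, family, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        versions = [
            (ts, v) for ts, v in self._cells.get((row, family, column), [])
            if timestamp is None or ts <= timestamp
        ]
        return max(versions, key=lambda tv: tv[0])[1] if versions else None

    def columns(self, row, family):
        """List the column names present for one row within one column family."""
        return sorted({c for (r, f, c) in self._cells if r == row and f == family})


table = ColumnFamilyTable()
table.put("user1", "profile", "city", b"DL", timestamp=1)
table.put("user1", "profile", "city", b"MUM", timestamp=2)
table.put("user1", "profile", "email", b"a@example.com", timestamp=1)

latest_city = table.get("user1", "profile", "city")
city_at_t1 = table.get("user1", "profile", "city", timestamp=1)
```

Lookups go through the row, family and column names only; the values are blobs, which matches the slide's point that blob content cannot be queried and that key design is critical.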
34. Cassandra
• Apache open source column-family database supported by DataStax
• Peer-to-peer distribution model
• Strong reputation for linear scale-out (millions of writes/second)
• Database-side security
• Written in Java and works well with HDFS and MapReduce
35. Cassandra: Feature Headlines
• Elastic
– Read and write throughput increases linearly as new machines are added
• Decentralized
– Fault tolerant with no single point of failure; no "master" node
• Rich data model
– Column based, range slices, column slices, secondary indexes, counters, expiring columns
Source: http://cassandra.apache.org/
36. Hadoop
• Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each providing computation and storage.
• Hadoop is an open-source implementation of Google's MapReduce and GFS (a distributed file system).
• Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library.
• Hadoop fulfills the need for a common infrastructure
– Efficient, reliable, easy to use
– Open source, Apache License
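The "simple programming model" mentioned on this slide is MapReduce. The sketch below runs the map, shuffle and reduce steps in a single Python process to show the shape of the model; it is not the Hadoop API, and the function names and sample documents are assumptions. In a real cluster the framework would run the map and reduce functions on many machines and perform the shuffle over the network.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = word_count(["big data big clusters", "data on commodity clusters"])
```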
37. HBase / Hadoop
• Open source implementation of the MapReduce programming model, written in Java
• Early development driven largely by Yahoo
• Column-oriented data store
• Java interface
• HBase designed specifically to work with Hadoop
• High-level query language (Pig)
• Strong support by many vendors
38. Graph Store
• Focus on modeling the structure of data: interconnectivity
• Scales to the complexity of the data
• Inspired by mathematical graph theory, G = (V, E)
• Queries are really graph traversals
• Data is stored in a series of nodes, relationships and properties
• Ideal when relationships between data are key:
– e.g. social networks
39. Graph Store (cont.)
• Ideal when relationships between data are key:
– e.g. social networks
• Data model: "Property Graph"
– Nodes
– Relationships/edges between nodes
– Key-value pairs on both
– Possibly edge labels and/or node/edge types
• Pros: fast network search, works with public linked data sets
• Cons: specialized query languages (SPARQL for RDF, Gremlin, Cypher)
• Examples: Neo4j, Titan, AllegroGraph, InfiniteGraph, ...
40. Graph Stores (cont.)
• Used when the relationships and relationship types between items are critical
• Used for
– Social networking queries: "friends of my friends"
– Inference and rules engines
– Pattern recognition
– Working with open linked data
• Automate "joins" of public data
41. Property Graph model
• Nodes, i.e. vertices
• Relationships between nodes, i.e. edges
• Relationships have labels
• Relationships are directed, but traversed at equal speed in both directions
• The semantics of the direction is up to the application (LIVES WITH is symmetric, LOVES is not)
• Nodes have key-value properties
• Relationships have key-value properties
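The property graph model listed above can be sketched in a few lines of Python, together with a "friends of my friends" traversal. This is a toy, not the API of Neo4j or any other graph database; `PropertyGraph`, the `FOLLOWS` label, and the sample users are assumptions. Edges are indexed in both directions, which is one simple way to get equal traversal cost either way.

```python
from collections import defaultdict


class PropertyGraph:
    """Toy property graph: nodes and directed, labeled edges carry properties."""

    def __init__(self):
        self.nodes = {}                 # node id -> properties
        self._out = defaultdict(list)   # node id -> [(label, target, props)]
        self._in = defaultdict(list)    # node id -> [(label, source, props)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, label, target, **props):
        # Index both directions so traversal costs the same either way.
        self._out[source].append((label, target, props))
        self._in[target].append((label, source, props))

    def neighbors(self, node_id, label, direction="out"):
        index = self._out if direction == "out" else self._in
        return [other for lbl, other, _ in index.get(node_id, []) if lbl == label]

    def friends_of_friends(self, node_id, label="FOLLOWS"):
        """Two-hop traversal, excluding the start node and its direct friends."""
        direct = set(self.neighbors(node_id, label))
        second = {fof for f in direct for fof in self.neighbors(f, label)}
        return second - direct - {node_id}


g = PropertyGraph()
for name in ("ann", "bob", "cat", "dan"):
    g.add_node(name, kind="user")
g.add_edge("ann", "FOLLOWS", "bob", since=2012)
g.add_edge("bob", "FOLLOWS", "cat", since=2013)
g.add_edge("bob", "FOLLOWS", "ann", since=2013)
g.add_edge("cat", "FOLLOWS", "dan", since=2013)

suggestions = g.friends_of_friends("ann")
followers_of_bob = g.neighbors("bob", "FOLLOWS", direction="in")
```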
42. Neo4j
• Graph database designed to be easy to use by Java developers
• Dual license (community edition is GPL)
• Works as an embedded Java library in your application
• Disk-based (not just RAM)
• Full ACID
43. Decide what you need
• SQL
– Relational, transactional processing
• NoSQL
– Non-relational, distributed, high performance and highly scalable
• Analytics, Warehouse, Big Data
– Data warehousing, analytics, data science, and reporting
• Combination of all 3
– Begin with SQL and NoSQL; eventually you need a Big Data/analytics platform
44. Finally… in One Line…
• SQL
– Works great, can't easily scale
• NoSQL
– Works great, but doesn't fit every problem
• Analytics, Big Data
– Every business needs it
45. Use Case: Twitter
• Twitter challenges
– Needs to store many graphs
• Who you are following
• Who’s following you
• Who you receive phone notifications from, etc.
– Delivering a tweet requires rapid paging of followers
– Heavy write load as followers are added and removed
– Set arithmetic for @mentions (intersection of users)
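The set arithmetic in the last bullet can be sketched with Python sets. The delivery rule used here, that a tweet mentioning another user goes to people who follow both accounts, is an assumption made for illustration; the slide only states that an intersection of users is needed. The follower data and the function name `mention_audience` are made up.

```python
# Hypothetical follower sets keyed by user name (made-up data).
followers = {
    "alice": {"carol", "dave", "erin"},
    "bob": {"dave", "erin", "frank"},
}


def mention_audience(author: str, mentioned: str) -> set:
    """Users who follow both the author and the mentioned user."""
    return followers[author] & followers[mentioned]


audience = mention_audience("alice", "bob")
```

With millions of followers per account the intersection itself is cheap to express but expensive to serve, which is why the slide pairs it with rapid paging and a heavy write load as the hard requirements.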
46. Use Case: Twitter …
• What did they try?
– Started with relational databases
– Tried key-value storage of denormalized lists
• Did it work?
– Nope
– Each was good at either handling the write load or paging large amounts of data, but not both
47. Open source implementations to play with!
• MongoDB - http://www.mongodb.org/
• Cassandra - http://cassandra.apache.org/
• Neo4j - http://neo4j.org/
• Hadoop + HBase - http://hadoop.apache.org/
• Redis - http://code.google.com/p/redis/
• Oracle Berkeley DB - http://www.oracle.com/database/berkeley-db/
• … and many more…
48. Thank You
For any query or feedback, write to me:
ranjeet@policyBazaar.com
ranjeet.kr@gmail.com