Big Data and the growing relevance of NoSQL
Published in: Technology

  • 1. Big Data trends and the rising importance of NOSQL
    Abhijit Sharma, Architect,
    Innovation & Incubation Lab, BMC Software
  • 2. Trends in cloud, web, and even enterprise scale apps
    Unprecedented growth in -
    Data set sizes which need to be stored, analyzed
    Big Data - cloud-scale services generate terabytes to petabytes of data – Facebook, eBay, Digg, Foursquare
    Connectedness and democratization of data
    social networks, feeds, blogs, wiki, tags, semantic web
    Data APIs - mash up data using the Twitter, Facebook, Flickr APIs
    Semi structured or unstructured data
    Performance requirements of these apps
    Humongous R/W Scalability
    High Availability
    Trading consistency for availability – ACID not mandatory
  • 3. RDBMS woes
    Challenge - Storing and scaling humongous amounts of data and remaining highly available
    Vertical scaling mostly - upper limit & expensive
    Horizontal scaling – no automatic sharding, no rebalancing – no infrastructure
    Distributed transactions & joins due to normalization inhibit performance, availability
    Schema-less data fits poorly – rigid schema forces ALTER TABLE and null columns
    Deeply connected data – not designed for this
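The rigid-schema point can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `postings` table and its columns are hypothetical names chosen for illustration; the sketch only shows that a new attribute requires a schema change and leaves NULLs in rows written earlier.

```python
import sqlite3

# In-memory database; the "postings" table is a made-up example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO postings (title) VALUES ('old bike')")

# A new attribute cannot simply be written: the schema must change first.
conn.execute("ALTER TABLE postings ADD COLUMN neighborhood TEXT")
conn.execute(
    "INSERT INTO postings (title, neighborhood) VALUES ('sofa', 'Mission')"
)

# Rows written before the change carry NULL (None) in the new column.
rows = conn.execute(
    "SELECT id, title, neighborhood FROM postings ORDER BY id"
).fetchall()
print(rows)
```

On a small in-memory table the change is instant; the cost referred to in the slides arises on very large production tables.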
  • 4. NOSQL is
    No SQL
    The NOSQL Alternative
  • 5. NOSQL is
    Not only SQL
    The NOSQL Alternative
  • 6. NOSQL – So what else is it?
    “One size fits all” RDBMS is not working
    NOSQL alternatives are polyglot solutions that better fit the new requirements thrown up by the trends.
    They can be categorized along these axes -
    Data Model - simple to complex
    Scalability – single to horizontal
  • 7. NOSQL categories
    Graph Databases
    Based on Graph theory
    Data model – graph, nodes, edges, properties
    Scalability – single node – high performance
    Persistence – On disk data structures
    Examples – Neo4J, AllegroGraph
    Document Databases
    Based loosely on documents/Lotus Notes
    Data model – collections of documents
    Scalability – horizontal, auto-sharding & replication
    Persistence – B-Tree
    Examples – mongoDB, CouchDB
  • 8. NOSQL categories
    Column Stores
    Based on Google’s BigTable design
    Data model - big table, column families
    Scalability – horizontal, auto-sharding & replication
    Persistence – Memory + File (on DFS)
    Examples – HBase, Cassandra
    Key Value Stores
    Based on DHT, Amazon’s Dynamo design
    Data model – collection of key value pairs
    Scalability – horizontal, auto-sharding & replication
    Persistence – Memory or File
    Examples – Redis, Amazon Dynamo, Voldemort
  • 9. Graph Databases
  • 10. Graph oriented data
    Graphs are ubiquitous – social networks, wikis, the web, recommendation engines, etc.
    Deep trees, complex networks
    Graph traversal - apt for expressing graph related problems (shortest path, network size etc.)
  • 11. LinkedIn Social Graph
  • 12. Why not RDBMS for large scale graphs?
    Difficult to model and traverse graphs in RDBMS
    recursive approaches - slow SQL queries that span many table joins
    Hacks like storing paths for trees
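To make the join problem concrete, here is a small sketch using Python's built-in `sqlite3` module. The `knows` table, the sample names, and the `reachable_in` helper are assumptions made for illustration. The point it shows is that every additional hop in the traversal adds another self-join to the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per directed "knows" relationship (hypothetical sample data).
conn.execute("CREATE TABLE knows (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO knows VALUES (?, ?)",
    [("ann", "bob"), ("bob", "cat"), ("cat", "dan"),
     ("ann", "eve"), ("eve", "cat")],
)

def reachable_in(start, hops):
    """People reached by following exactly `hops` edges from `start`.

    The query is built as a chain of self-joins: one extra JOIN per hop.
    """
    joins = "".join(
        f" JOIN knows k{i} ON k{i}.src = k{i - 1}.dst"
        for i in range(2, hops + 1)
    )
    sql = (
        f"SELECT DISTINCT k{hops}.dst FROM knows k1{joins} "
        "WHERE k1.src = ? ORDER BY 1"
    )
    return [row[0] for row in conn.execute(sql, (start,))]

print(reachable_in("ann", 1))
print(reachable_in("ann", 2))
print(reachable_in("ann", 3))
```

A traversal of unknown depth cannot be written this way at all without generating SQL dynamically or using recursive queries, which is the difficulty the slide describes.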
  • 13. Graph Databases
    Designed for efficient storage & traversal of large scale graphs
    Natural modeling of graph network - nodes, relationships and their properties
    Neo4J is a leading graph db
    Supports billions of nodes/edges, traverses depths of 1000 levels in milliseconds, 1000x faster than an RDBMS
    Handle large graphs that don't fit in memory - persistent transactional store optimized for graphs
    REST API and various language bindings
    Graph pattern matching, Cypher Query language, Indexer – Lucene
  • 14. Graph basics
  • 15. All Paths & My Network size
  • 16. Shortest path between …
  • 17. Is connected to?
  • 18. You may know…
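The traversal questions named in the slide titles above (network size, shortest path, is connected to, you may know) can be sketched over a plain adjacency map. This is an in-memory illustration in Python, not Neo4J or Cypher code; the graph data and function names are hypothetical.

```python
from collections import deque

# Undirected friendship graph as an adjacency map (made-up sample data).
graph = {
    "ann": {"bob", "eve"},
    "bob": {"ann", "cat"},
    "cat": {"bob", "dan", "eve"},
    "dan": {"cat"},
    "eve": {"ann", "cat"},
    "zed": set(),
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list or None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

def is_connected(graph, a, b):
    return shortest_path(graph, a, b) is not None

def network_size(graph, start, depth):
    """Number of distinct people within `depth` hops, excluding `start`."""
    seen = {start}
    frontier = {start}
    for _ in range(depth):
        frontier = {nb for n in frontier for nb in graph[n]} - seen
        seen |= frontier
    return len(seen) - 1

def you_may_know(graph, person):
    """Friends of friends who are not already friends."""
    friends = graph[person]
    friends_of_friends = {nb for f in friends for nb in graph[f]}
    return sorted(friends_of_friends - friends - {person})

print(shortest_path(graph, "ann", "dan"))
print(is_connected(graph, "ann", "zed"))
print(network_size(graph, "ann", 2))
print(you_may_know(graph, "ann"))
```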
  • 19. Mining your network
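    Centrality Algorithms
    Degree – who has the most followers on Twitter
    Closeness – who can reach everyone else in the fewest hops
    Betweenness – who sits on the most shortest paths between other people
    Eigenvector – who is followed by influential people (PageRank is a variant)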
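As a rough illustration of the PageRank idea mentioned above, the sketch below runs a simple power iteration in plain Python. The `follows` data, the damping factor of 0.85, and the fixed iteration count are assumptions for the example, not values from the slides.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    `follows` maps each user to the set of users they follow (out-links).
    A user's rank flows to the users they follow, so being followed by
    highly ranked users raises your own rank.
    """
    nodes = sorted(set(follows) | {v for vs in follows.values() for v in vs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = follows.get(u, set())
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # A user who follows nobody spreads rank evenly to everyone.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank

# Hypothetical data: a, b and c all follow d; d follows a.
follows = {"a": {"d"}, "b": {"d"}, "c": {"d"}, "d": {"a"}}
ranks = pagerank(follows)
for user in sorted(ranks, key=ranks.get, reverse=True):
    print(user, round(ranks[user], 3))
```

In this sample, `a` has only one follower but still outranks `b` and `c`, because that follower is the highly ranked `d`.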
  • 20. Document Databases
  • 21. Flexible document oriented data
    Document style unstructured data - schema less – e.g. JSON documents
    No alter table needed like in an RDBMS, de-normalized data
    Useful for iterative/agile development
    Humongous scale - billions of documents, R/W traffic – millions/sec, horizontal scalability, availability
    mongoDB is a leading document database
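The schema-less idea can be sketched with plain Python dictionaries standing in for JSON documents. This is an in-memory illustration, not mongoDB driver code; `insert`, `find`, and the sample documents are hypothetical.

```python
import json

collection = []  # a "collection" is just a list of documents here

def insert(doc):
    # Round-trip through JSON to mimic storing a JSON document.
    collection.append(json.loads(json.dumps(doc)))

def find(**criteria):
    """Return documents whose attributes match all the given values."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]

# Documents with different shapes coexist; no ALTER TABLE step is needed
# when a later document introduces new attributes.
insert({"title": "old bike", "price": 40})
insert({"title": "sofa", "price": 120,
        "neighborhood": "Mission", "tags": ["furniture"]})

print(find(neighborhood="Mission"))
print(find(price=40))
```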
  • 22. Document Database – Use cases
    Use cases :
    Archiving of historic data which has undergone many schema changes
    Flexible set of performance metrics – web site page views, unique visitors etc. - change over time – no need to update existing JSON documents
    Track near real time metrics - optimized increment of perf counters
    Geo Loc based mobile and gaming apps (Geospatial indices can be key here)
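The geo-location use case boils down to a "nearest matching venue" query. The sketch below answers it with a linear scan and the haversine great-circle distance in plain Python. It is a stand-in for what a geospatial index does more efficiently; the venue data, coordinates, and function names are made up for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

# Hypothetical venue documents with a (latitude, longitude) location.
venues = [
    {"name": "Starbucks", "loc": (37.7890, -122.4010)},
    {"name": "Starbucks", "loc": (37.8044, -122.2712)},
    {"name": "Blue Bottle", "loc": (37.7825, -122.4078)},
]

def near(venues, here, name, max_km):
    """Venues with the given name within max_km of `here`, nearest first."""
    hits = [
        (haversine_km(*here, *venue["loc"]), venue)
        for venue in venues
        if venue["name"] == name
    ]
    return [venue for dist, venue in sorted(hits, key=lambda h: h[0])
            if dist <= max_km]

here = (37.7880, -122.4075)
print(near(venues, here, "Starbucks", max_km=2.0))
```

A scan like this touches every document, which is why an index on the location attribute matters at scale.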
  • 23. Craigslist Archival Database
    Premium service to customers allowed search over their historical postings
    Archival (no purging) of 10 years of postings - billions of documents
    Schema changes across versions
    MySQL based archival database
    ALTER TABLE took a month to complete
  • 24. Foursquare
    Find a venue whose name is Starbucks and mayor is Abhijit
    Geo: Optimized for geo location queries – Find Starbucks near my current GPS location
  • 25. mongoDB Architecture
    Mongo Router (x2)
    Mongo Configuration Server (x2)
  • 26. mongoDB Features
    JSON documents, collection oriented storage
    Rich, document-based queries
    Indexes on document attributes
    Fast in-place updates
    Scalability features
    Horizontal scalability
    Configurable replication and high-availability
    Auto-sharding & rebalancing
    Language specific drivers – Java, Scala, Ruby etc.
  • 27. Column Stores
  • 28. Column Store
    Reasonably rich data model –
    sparse, distributed, persistent multi-dimensional sorted map
    Sorted row keys, columns
    Use cases - Large scale data storage and analysis like -
    Time series data along with associated dimension data
    Row keys are timestamps and thus sorted – helps time range queries
    Google analytics
    Provides aggregate statistics, # unique visitors/day, page views/URL/day
    Raw click table (~200 TB) has a row per user session, keyed by URL + session time – keeps sessions for the same URL contiguous and chronologically sorted
    Data Cube - CPU
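Why sorted row keys help range queries can be sketched with a tiny sorted map in plain Python. The `SortedTable` class, the (URL, session time) key layout, and the sample rows are assumptions for illustration; a real column store keeps the keys sorted on disk and across servers.

```python
import bisect

class SortedTable:
    """Toy table that keeps row keys sorted so range scans are cheap."""

    def __init__(self):
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> row data

    def put(self, key, row):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def scan(self, start, stop):
        """Return (key, row) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(key, self._rows[key]) for key in self._keys[lo:hi]]

clicks = SortedTable()
# Row key = (URL, session start time): sessions for one URL sit together,
# ordered by time.
clicks.put(("example.com/b", 1010), {"pages": 1})
clicks.put(("example.com/a", 1060), {"pages": 5})
clicks.put(("example.com/a", 1000), {"pages": 3})

# All sessions for one URL in a time window: a single contiguous slice.
window = clicks.scan(("example.com/a", 1000), ("example.com/a", 1100))
print(window)
```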
  • 29. Column Store
    Excellent R/W performance – large storage – PB’s
    High scalability - horizontal scaling, auto-sharding
    High Availability - transparent replication of data
    HBase is a leading column store – built on Hadoop, with HDFS as the underlying persistence
  • 30. Column Store - HBase
    Table defines Column Families - groups of similar attributes, vertical partitioning
    (Table, Row, ColumnFamily:Column, Timestamp) tuple maps to a cell value
    Table is split into multiple, roughly equal regions distributed across servers, each of which is a range of sorted keys (partitioned automatically by the key)
    Ordered Rows by key, Ordered columns in a Column Family
    Rows can have different number of columns
    Columns have value and versions (any number)
    Row range, column range, and key range queries
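The cell model described above can be sketched as a nested map in plain Python. This is an illustration of the data model only, not the HBase client API; the `Table` class, the `max_versions` cap, and the sample data are assumptions.

```python
from collections import defaultdict

class Table:
    """Toy model of (row, family:column, timestamp) -> value."""

    def __init__(self, families, max_versions=3):
        self.families = set(families)      # fixed when the table is defined
        self.max_versions = max_versions   # assumed cap on kept versions
        # row -> "family:column" -> list of (timestamp, value), newest first
        self.rows = defaultdict(dict)

    def put(self, row, column, value, ts):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        versions = self.rows[row].setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]   # drop the oldest versions

    def get(self, row, column, ts=None):
        """Newest value at or before `ts` (latest value if ts is None)."""
        for stamp, value in self.rows.get(row, {}).get(column, []):
            if ts is None or stamp <= ts:
                return value
        return None

users = Table(families=["info", "stats"])
users.put("row1", "info:name", "Abhijit", ts=1)
users.put("row1", "info:name", "Abhijit S.", ts=2)
users.put("row2", "stats:logins", 7, ts=1)  # rows may have different columns

print(users.get("row1", "info:name"))
print(users.get("row1", "info:name", ts=1))
print(users.get("row2", "info:name"))
```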
  • 31. HBase Architecture
  • 32. Key Value Stores
  • 33. Key Value Stores
    Simplest possible data model
    Caching a user’s personalized, rendered page – avoid DB
    S3 bucket storage for blob data against a unique id
    Range of KV stores
    Distributed, scalable persistent key-value storage – Dynamo, Voldemort
    Auto-Partitioned key space
    Replicated KV
    Highly Available
    Largely in-memory KV stores – Redis, memcached
    Redis blazing fast for cache and other interesting operations
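An auto-partitioned, replicated key space is commonly built on consistent hashing, the approach described in Amazon's Dynamo design. The sketch below is a simplified Python illustration of that approach; the `Ring` class, the virtual-node count, the replica count, and the node names are assumptions, not the actual implementation of any of the stores named above.

```python
import bisect
import hashlib

def _hash(key):
    """Map a string to a position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class Ring:
    """Toy consistent-hash ring with virtual nodes and replication."""

    def __init__(self, nodes, vnodes=64, replicas=2):
        self.replicas = replicas
        # Each node is placed on the ring at `vnodes` positions.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]
        self._node_count = len(set(nodes))

    def nodes_for(self, key):
        """Walk clockwise from the key, collecting distinct owner nodes."""
        want = min(self.replicas, self._node_count)
        i = bisect.bisect(self._hashes, _hash(key))
        owners = []
        while len(owners) < want:
            node = self._points[i % len(self._points)][1]
            if node not in owners:
                owners.append(node)
            i += 1
        return owners

ring = Ring(["n1", "n2", "n3"])
print(ring.nodes_for("user:42"))

# Adding a node only moves keys onto the new node; other keys stay put.
bigger = Ring(["n1", "n2", "n3", "n4"])
keys = [f"user:{i}" for i in range(500)]
moved = sum(ring.nodes_for(k)[0] != bigger.nodes_for(k)[0] for k in keys)
print(f"{moved} of {len(keys)} keys changed primary owner")
```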
  • 34. Redis
    In memory KV store
    Blazing fast – ~100K reads/writes per second
    Async snapshot to disk
    More than KV store – a data structure store –
    Supports lists, queues, sets and operations on them
    Sorted set range operations
    Set operations UNION, INTERSECTION, DIFF
  • 35. Redis – Use Cases
    Web session caching with EXPIRE set for session expiry
    Live real time URL stats like clicks etc – fast increments of counters
    Auto Complete – typing the first few characters maps to a sorted set, against which a range query is fired
    Publish / Subscribe – fan out a message to subscribers
    Set operations – My Twitter <Followees DIFF Followers> - tells me who all I follow but they don’t follow me back
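Two of the use cases above can be mimicked with Python built-ins to show the underlying idea. This is not Redis client code: a sorted Python list stands in for a sorted set, and Python sets stand in for Redis sets. The terms and user names are hypothetical.

```python
import bisect

# Auto complete: keep terms sorted, then answer a prefix with a range query.
terms = sorted(["redis", "redo", "reduce", "ruby", "rust"])

def autocomplete(prefix, limit=10):
    """All terms starting with `prefix`, found as one contiguous range."""
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix + "\uffff")  # just past the prefix
    return terms[lo:hi][:limit]

print(autocomplete("red"))
print(autocomplete("ru"))

# Set operations on a social graph.
followers = {"bob", "cat"}          # people who follow me
followees = {"bob", "dan", "eve"}   # people I follow

# "I follow them but they do not follow me back" is a set difference.
not_following_back = followees - followers
# "We follow each other" is a set intersection.
mutual = followees & followers

print(sorted(not_following_back))
print(sorted(mutual))
```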
  • 36. Thanks
    Email : abhijit.sharma@gmail.com
    Twitter : sharmaabhijit
    Blog :