Big Data and the growing relevance of NoSQL


Published in: Technology

  1. 1. Big Data trends and the rising importance of NOSQL<br />Abhijit Sharma, Architect, <br />Innovation & Incubation Lab, BMC Software<br />
  2. 2. Trends in cloud, web, and even enterprise scale apps<br />Unprecedented growth in:<br />Data set sizes which need to be stored and analyzed<br />Big Data - cloud scale services generate TBs to PBs of data – FB, eBay, Digg, Foursquare<br />Connectedness and democratization of data<br />Social networks, feeds, blogs, wikis, tags, semantic web<br />Data APIs - mash up data using the Twitter, FB, Flickr APIs<br />Semi-structured or unstructured data<br />Performance requirements of these apps<br />Humongous R/W scalability<br />High availability<br />Trading consistency for availability – ACID not mandatory<br />
  3. 3. RDBMS woes<br />Challenge - storing and scaling humongous amounts of data while remaining highly available<br />Mostly vertical scaling - has an upper limit and is expensive<br />Horizontal scaling – no automatic sharding, no rebalancing, no built-in infrastructure<br />Distributed transactions and joins (a consequence of normalization) inhibit performance and availability<br />Schema-less data fits poorly – rigid schema, ALTER TABLE, null columns<br />Deeply connected data – not designed for this<br />
  4. 4. NOSQL is <br />NOT <br />No SQL<br />The NOSQL Alternative<br />
  5. 5. NOSQL is <br />simply<br />Not only SQL<br />The NOSQL Alternative<br />
  6. 6. NOSQL – So what else is it?<br />“One size fits all” RDBMS is not working<br />NOSQL alternatives are polyglot solutions that better fit the new requirements thrown up by these trends.<br />They can be categorized along these axes -<br />Data model - simple to complex<br />Scalability – single node to horizontal<br />Persistence<br />
  7. 7. NOSQL categories<br />Graph Databases<br />Based on Graph theory<br />Data model – graph, nodes, edges, properties<br />Scalability – single node – high performance<br />Persistence – On disk data structures<br />Examples – Neo4J, AllegroGraph<br />Document Databases<br />Based loosely on documents/Lotus Notes<br />Data model – collections of documents<br />Scalability – horizontal, auto-sharding & replication<br />Persistence – B-Tree<br />Examples – mongoDB, CouchDB<br />
  8. 8. NOSQL categories<br />Column Stores<br />Based on Google’s BigTable design<br />Data model - big table, column families<br />Scalability – horizontal, auto-sharding & replication<br />Persistence – Memory + File (on DFS)<br />Examples – HBase, Cassandra<br />Key Value Stores<br />Based on DHT, Amazon’s Dynamo design<br />Data model – collection of key value pairs<br />Scalability – horizontal, auto-sharding & replication<br />Persistence – Memory or File <br />Examples – Redis, Amazon Dynamo, Voldemort<br />
  9. 9. Graph Databases<br />
  10. 10. Graph oriented data<br />Graphs are ubiquitous – social networks, wikis, the web, recommendation engines etc.<br />Deep trees, complex networks<br />Graph traversal - apt for expressing graph related problems (shortest path, network size etc.)<br />
  11. 11. LinkedIn Social Graph<br />
  12. 12. Why not RDBMS for large scale graphs?<br />Difficult to model and traverse graphs in RDBMS<br />recursive approaches - slow SQL queries that span many table joins<br />Hacks like storing paths for trees <br />
  13. 13. Graph Databases<br />Designed for efficient storage & traversal of large scale graphs<br />Natural modeling of a graph network - nodes, relationships and their properties<br />Neo4J is a leading graph db<br />Supports billions of nodes/edges; traverses to depths of 1000 levels in milliseconds, reportedly up to 1000x faster than an RDBMS on such traversals<br />Handles large graphs that don't fit in memory - persistent transactional store optimized for graphs<br />REST API and various language bindings<br />Graph pattern matching, Cypher query language, indexing via Lucene<br />
  14. 14. Graph basics<br />
  15. 15. All Paths & My Network size<br />
  16. 16. Shortest path between …<br />
  17. 17. Is connected to?<br />
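The traversal questions on the last few slides (network size, shortest path, "is connected to?") can be sketched with a plain breadth-first search. This is a minimal in-memory illustration, not how a graph database implements traversal; the `SOCIAL` graph and the function names are made up for the example.

```python
from collections import deque

# Hypothetical social graph: person -> list of direct connections.
SOCIAL = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}


def shortest_path(graph, start, goal):
    """Return one shortest path from start to goal as a list, or None."""
    if start == goal:
        return [start]
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in parents:
                continue
            parents[neighbour] = node
            if neighbour == goal:
                # Walk the parent links back to the start.
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


def is_connected(graph, a, b):
    """True if any path exists between a and b."""
    return shortest_path(graph, a, b) is not None


def network_size(graph, start, max_depth):
    """Count distinct people reachable within max_depth hops of start."""
    seen = {start}
    frontier = [start]
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            for neighbour in graph.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return len(seen) - 1  # exclude the starting person
```

In an RDBMS each extra hop typically costs another self-join, whereas here each hop is just one more pass over the frontier.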
  18. 18. You may know…<br />
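A "you may know" suggestion is commonly computed as a friends-of-friends query ranked by the number of mutual connections. The sketch below assumes that approach (the slide does not say which one was used) and runs it over a small hypothetical graph.

```python
from collections import Counter

# Hypothetical social graph: person -> list of direct connections.
SOCIAL = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave", "erin"],
    "dave": ["bob", "carol"],
    "erin": ["carol"],
}


def people_you_may_know(graph, person, limit=3):
    """Rank non-friends at distance two by number of mutual friends."""
    friends = set(graph.get(person, ()))
    mutual_counts = Counter()
    for friend in friends:
        for candidate in graph.get(friend, ()):
            if candidate != person and candidate not in friends:
                mutual_counts[candidate] += 1
    # Most mutual friends first; ties broken alphabetically for stability.
    ranked = sorted(mutual_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]
```

For `alice`, `dave` is suggested first because both of her friends know him, while `erin` shares only one mutual friend.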
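  19. 19. Mining your network<br />Centrality Algorithms<br />Degree – who has the most followers on Twitter<br />Closeness – who can reach everyone else in the fewest hops<br />Betweenness – who sits on the most shortest paths between other people<br />Eigenvector – who has more influential people following them (PageRank is a variant)<br />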
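Two of these ideas are easy to contrast in code: simply counting followers, versus a PageRank-style score in which a follower's own importance counts. The sketch below uses a textbook power iteration with a damping factor of 0.85; the `FOLLOWS` data and function names are invented for illustration.

```python
# Hypothetical "who follows whom" graph: user -> users they follow.
FOLLOWS = {
    "a": ["c"],
    "b": ["c"],
    "c": ["d"],
    "d": [],
}


def follower_counts(follows):
    """In-degree: how many followers each user has."""
    counts = {user: 0 for user in follows}
    for targets in follows.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(follows, damping=0.85, iterations=50):
    """Power-iteration PageRank; users following nobody spread rank evenly."""
    nodes = set(follows) | {t for ts in follows.values() for t in ts}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = follows.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: distribute its rank over everyone.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

Here `c` has the most followers, yet `d` ends up with the highest PageRank because its single follower is the well-followed `c`.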
  20. 20. Document Databases<br />
  21. 21. Flexible document oriented data<br />Document style unstructured data - schema-less – e.g. JSON documents<br />No ALTER TABLE needed as in an RDBMS; data is de-normalized<br />Useful for iterative/agile development<br />Humongous scale - billions of documents, R/W traffic of millions/sec, horizontal scalability, availability<br />mongoDB is a leading document database<br />
  22. 22. Document Database – Use cases<br />Use cases:<br />Archiving historic data which has undergone many schema changes<br />Flexible set of performance metrics – web site page views, unique visitors etc. – the set changes over time, with no need to update existing JSON documents<br />Tracking near-real-time metrics – optimized increments of perf counters<br />Geolocation-based mobile and gaming apps (geospatial indices can be key here)<br />
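The "flexible metrics" and "counter increment" use cases can be illustrated with plain Python dictionaries standing in for JSON documents. This is only a sketch of the data-model idea; it does not use a real database driver, and the field names and helper functions are hypothetical.

```python
# A "collection" is just a list of JSON-like documents.
metrics = [
    # Older document: only page views were tracked.
    {"url": "/home", "date": "2011-01-01", "page_views": 120},
    # Newer document: a metric was added later, older documents are untouched.
    {"url": "/home", "date": "2011-06-01", "page_views": 300,
     "unique_visitors": 95},
]


def find(collection, **criteria):
    """Return documents whose fields match every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


def increment(doc, field, amount=1):
    """In-place counter update; a missing field starts from zero."""
    doc[field] = doc.get(field, 0) + amount
    return doc[field]
```

Documents with different sets of fields live side by side, which is the property that avoids an ALTER TABLE when a new metric is introduced.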
  23. 23. Craigslist Archival Database<br />Premium service to customers allowed search over their historical postings<br />Archival (no purging) of 10 years of postings - billions of documents<br />Schema changes across versions<br />MySQL based archival database <br />ALTER TABLE took a month to complete<br />
  24. 24. Foursquare<br />Find a venue whose name is Starbucks and whose mayor is Abhijit<br />Geo: optimized for geolocation queries – find a Starbucks near my current GPS location<br />
  25. 25. mongoDB Architecture<br />Client<br />Mongo Router (x2)<br />Shard (x3)<br />Mongo Configuration Server (x2)<br />
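The components named in the architecture diagram fit together roughly like this: a client talks to a router, the router asks the configuration metadata which shard owns a key range, and the request is forwarded to that shard. The sketch below models only that routing idea with range-based chunks; the class names, shard names and split points are assumptions for illustration, not mongoDB's actual implementation.

```python
import bisect


class ConfigServer:
    """Holds chunk metadata: keys below bounds[i] live on shards[i]."""

    def __init__(self, bounds, shards):
        if len(shards) != len(bounds) + 1:
            raise ValueError("need exactly one more shard than split points")
        self.bounds = list(bounds)
        self.shards = list(shards)

    def shard_for(self, shard_key):
        # bisect_right sends a key equal to a split point to the upper chunk.
        return self.shards[bisect.bisect_right(self.bounds, shard_key)]


class Router:
    """Forwards each read/write to the shard that owns the key."""

    def __init__(self, config):
        self.config = config
        # Each shard is modelled as an in-memory dict.
        self.storage = {name: {} for name in config.shards}

    def insert(self, shard_key, document):
        self.storage[self.config.shard_for(shard_key)][shard_key] = document

    def find(self, shard_key):
        return self.storage[self.config.shard_for(shard_key)].get(shard_key)


# Hypothetical split points: [..h) -> shard1, [h..p) -> shard2, [p..] -> shard3
config = ConfigServer(bounds=["h", "p"],
                      shards=["shard1", "shard2", "shard3"])
router = Router(config)
router.insert("alice", {"venue": "Starbucks"})
router.insert("mike", {"venue": "Cafe"})
router.insert("zoe", {"venue": "Diner"})
```

Because clients only see the router, chunks can be moved between shards by updating the configuration metadata, which is what makes rebalancing transparent to the application.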
  26. 26. mongoDB Features<br />JSON documents, collection oriented storage<br />Rich, document-based queries<br />Indexes on document attributes<br />Fast in-place updates<br />Scalability features <br />Horizontal scalability<br />Configurable replication and high-availability<br />Auto-sharding & rebalancing<br />Language specific drivers – Java, Scala, Ruby etc.<br />
  27. 27. Column Stores<br />
  28. 28. Column Store<br />Reasonably rich data model – <br />sparse, distributed, persistent multi-dimensional sorted map<br />Sorted row keys, columns<br />Use cases - large scale data storage and analysis, such as:<br />Time series data along with associated dimension data<br />Row keys are timestamps and thus sorted – helps time range queries<br />Google Analytics<br />Provides aggregate statistics – # unique visitors/day, page views/URL/day<br />Raw click table (~200 TB) has a row for each user session, keyed by URL + session time – keeps sessions for the same URL contiguous and chronologically sorted<br />Data Cube (figure labels) - CPU, OS, Time, DC<br />
  29. 29. Column Store<br />Performance<br />Excellent R/W performance – large storage – PBs<br />High scalability - horizontal scaling, auto-sharding<br />High availability - transparent replication of data<br />HBase is a leading column store, built on Hadoop HDFS as the underlying persistence layer<br />
  30. 30. Column Store - HBase<br />Table defines Column Families - each groups similar attributes (vertical partitioning)<br />(Table, Row, ColumnFamily:Column, Timestamp) tuple maps to a cell value<br />Table is split into multiple, roughly equal regions distributed across servers; each region is a range of sorted row keys (partitioned automatically by key)<br />Rows ordered by key, columns ordered within a Column Family<br />Rows can have different numbers of columns<br />Columns have a value and versions (any number)<br />Row range, column range and key range queries<br />
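The data model described here, a sorted map from (row, column family:column, timestamp) to a value, can be mimicked in a few lines. The sketch below is an in-memory toy with made-up names (`TinyTable`, `put`, `get`, `scan`); it is not the HBase client API and ignores regions, persistence and distribution.

```python
import bisect


class TinyTable:
    """Toy sorted map: (row, "family:column", timestamp) -> value."""

    def __init__(self):
        self._row_keys = []  # kept sorted so range scans are cheap
        self._rows = {}      # row -> {"family:column": {timestamp: value}}

    def put(self, row, column, timestamp, value):
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        # Rows are sparse: each row stores only the columns it actually has.
        self._rows[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the requested version, or the latest if none is given."""
        versions = self._rows.get(row, {}).get(column, {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def scan(self, start_row, stop_row):
        """Return (row, columns) pairs with start_row <= row < stop_row."""
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, stop_row)
        return [(key, self._rows[key]) for key in self._row_keys[lo:hi]]


# Time-series style usage: the row key starts with a sortable timestamp.
table = TinyTable()
table.put("2011-03-01T10:00", "metrics:cpu", 1, 0.42)
table.put("2011-03-01T10:05", "metrics:cpu", 1, 0.57)
table.put("2011-03-01T10:05", "metrics:cpu", 2, 0.61)  # newer version
table.put("2011-03-02T09:00", "meta:os", 1, "linux")
```

Because row keys are kept sorted, a time-range query is a contiguous slice rather than a full scan, which is the property the time-series use case relies on.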
  31. 31. HBase Architecture<br />
  32. 32. Key Value Stores<br />
  33. 33. Key Value Stores<br />Simplest possible data model<br />Caching a user’s personalized, rendered page – avoids hitting the DB<br />S3 bucket storage for blob data against a unique id<br />Range of KV stores<br />Distributed, scalable, persistent key-value storage – Dynamo, Voldemort<br />Auto-partitioned key space<br />Replicated KV<br />Highly available<br />Largely in-memory KV stores – Redis, memcached<br />Redis is blazing fast for caching and supports other interesting operations<br />
  34. 34. Redis<br />In-memory KV store<br />Blazing fast – on the order of 100K reads/writes per second<br />Async snapshot to disk<br />More than a KV store – a data structure store<br />Supports lists, queues, sets and operations on them<br />Sorted set range operations<br />Set operations – UNION, INTERSECTION, DIFF<br />
  35. 35. Redis – Use Cases<br />Web session caching, with EXPIRE set for session expiry<br />Live real-time URL stats such as clicks – fast increments of counters<br />Auto-complete – the first few typed characters map to a sorted set and a range query is fired<br />Publish / Subscribe – fan out a message to subscribers<br />Set operations – My Twitter <Followees DIFF Followers> - tells me who I follow but who doesn’t follow me back<br />
  36. 36. Thanks<br />Email : abhijit.sharma@gmail.com<br />Twitter : sharmaabhijit<br />Blog :<br />
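An auto-partitioned, replicated key space of the kind Dynamo-style stores use is often built on consistent hashing. The sketch below assumes that technique with virtual nodes; the class name, node names and parameters are hypothetical, and real systems add failure detection, hinted handoff and more.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes and N-way replication."""

    def __init__(self, nodes, replicas=2, vnodes=16):
        self.replicas = replicas
        self._node_count = len(set(nodes))
        # Each physical node appears at several points on the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def nodes_for(self, key):
        """Return the distinct nodes that should hold copies of key."""
        index = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        wanted = min(self.replicas, self._node_count)
        owners = []
        while len(owners) < wanted:
            node = self._ring[index][1]
            if node not in owners:
                owners.append(node)
            index = (index + 1) % len(self._ring)  # walk clockwise
        return owners


ring = HashRing(["node-a", "node-b", "node-c"], replicas=2)
owners = ring.nodes_for("user:42:rendered-page")
```

Each key is stored on the first `replicas` distinct nodes found walking clockwise from its hash, so adding or removing a node only moves the keys adjacent to it on the ring.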
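Two of these use cases can be illustrated without a running server, using Python's built-in `set` and a sorted list as stand-ins for Redis sets and sorted sets. This sketch shows the logic only and is not Redis client code; the sample names and words are invented.

```python
import bisect

# Stand-ins for two Redis sets.
followers = {"bob", "carol", "dave"}           # people who follow me
followees = {"carol", "dave", "erin", "frank"}  # people I follow

mutual = followers & followees              # INTERSECTION: we follow each other
not_following_back = followees - followers  # DIFF: I follow them, they don't follow me
fans_only = followers - followees           # DIFF: they follow me, I don't follow them


def autocomplete(sorted_words, prefix, limit=5):
    """Range query over a sorted list: all words starting with prefix."""
    start = bisect.bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix) or len(matches) >= limit:
            break
        matches.append(word)
    return matches


# Stand-in for a sorted set of candidate completions.
WORDS = sorted(["redis", "redo", "reduce", "react", "riak", "ruby"])
suggestions = autocomplete(WORDS, "red")
```

Note that the intersection yields mutual follows, while "who I follow but who doesn't follow me back" is a set difference.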