Big Data and the growing relevance of NoSQL


Published in: Technology

  1. 1. Big Data trends and the rising importance of NOSQL<br />Abhijit Sharma, Architect, <br />Innovation & Incubation Lab, BMC Software<br />
  2. 2. Trends in cloud, web, and even enterprise scale apps<br />Unprecedented growth in -<br />Data set sizes which need to be stored and analyzed<br />Big Data - cloud scale services generate TBs to PBs of data – FB, eBay, Digg, Foursquare<br />Connectedness and democratization of data<br />Social networks, feeds, blogs, wikis, tags, semantic web <br />Data APIs - mash up data - use Twitter, FB, Flickr APIs<br />Semi-structured or unstructured data<br />Performance requirements of these apps<br />Humongous R/W scalability <br />High availability <br />Trading consistency for availability – ACID not mandatory<br />
  3. 3. RDBMS woes<br />Challenge - storing and scaling humongous amounts of data while remaining highly available<br />Vertical scaling mostly - has an upper limit & is expensive<br />Horizontal scaling – no automatic sharding, no rebalancing – no built-in infrastructure<br />Distributed transactions & joins (a consequence of normalization) inhibit performance and availability<br />Schema-less data – rigid schema forces ALTER TABLE and null columns <br />Deeply connected data – not designed for this<br />
  4. 4. NOSQL is <br />NOT <br />No SQL<br />The NOSQL Alternative<br />
  5. 5. NOSQL is <br />simply<br />Not only SQL<br />The NOSQL Alternative<br />
  6. 6. NOSQL – So what else is it?<br />“One size fits all” RDBMS is not working <br />NOSQL alternatives are polyglot solutions that better fit the new requirements thrown up by the trends.<br />They can be categorized along these axes -<br />Data Model - simple to complex<br />Scalability – single to horizontal<br />Persistence <br />
  7. 7. NOSQL categories<br />Graph Databases<br />Based on graph theory<br />Data model – graph, nodes, edges, properties<br />Scalability – single node – high performance<br />Persistence – on-disk data structures<br />Examples – Neo4j, AllegroGraph<br />Document Databases<br />Based loosely on documents/Lotus Notes<br />Data model – collections of documents<br />Scalability – horizontal, auto-sharding & replication<br />Persistence – B-Tree<br />Examples – mongoDB, CouchDB<br />
  8. 8. NOSQL categories<br />Column Stores<br />Based on Google’s BigTable design<br />Data model - big table, column families<br />Scalability – horizontal, auto-sharding & replication<br />Persistence – Memory + File (on DFS)<br />Examples – HBase, Cassandra<br />Key Value Stores<br />Based on DHT, Amazon’s Dynamo design<br />Data model – collection of key value pairs<br />Scalability – horizontal, auto-sharding & replication<br />Persistence – Memory or File <br />Examples – Redis, Amazon Dynamo, Voldemort<br />
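The "auto-partitioned key space" of Dynamo-style key value stores is commonly built on consistent hashing. The block below is a minimal sketch of that idea, not the implementation of any store named on the slides; the `HashRing` class, the node names, the `vnodes` parameter, and the choice of MD5 are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring: a key is owned by the first node point clockwise."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node is placed on the ring at `vnodes` virtual points.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        # MD5 is used only to spread points evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        # Wrap around to the first point when the key hashes past the last one.
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical keys and cluster: add a fourth node and see which keys move.
keys = [f"user:{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to node-d")
```

The point of the design is that adding a node only reassigns the keys that land on the new node; everything else stays where it was, which is what makes rebalancing cheap.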
  9. 9. Graph Databases<br />
  10. 10. Graph oriented data<br />Graphs are ubiquitous – social networks, wikis, the web, recommendation engines etc.<br />Deep trees, complex networks<br />Graph traversal - apt for expressing graph related problems (shortest path, network size etc.)<br />
  11. 11. LinkedIn Social Graph<br />
  12. 12. Why not RDBMS for large scale graphs?<br />Difficult to model and traverse graphs in RDBMS<br />recursive approaches - slow SQL queries that span many table joins<br />Hacks like storing paths for trees <br />
  13. 13. Graph Databases<br />Designed for efficient storage & traversal of large scale graphs<br />Natural modeling of graph networks - nodes, relationships and their properties<br />Neo4j is a leading graph db<br />Supports billions of nodes/edges, traverses depths of 1000 levels in milliseconds, ~1000x faster than an RDBMS<br />Handles large graphs that don't fit in memory - persistent transactional store optimized for graphs<br />REST API and various language bindings<br />Graph pattern matching, Cypher query language, indexer – Lucene<br />
  14. 14. Graph basics<br />
  15. 15. All Paths & My Network size<br />
  16. 16. Shortest path between …<br />
  17. 17. Is connected to?<br />
  18. 18. You may know…<br />
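The traversal questions named on the slides above (network size, shortest path, "is connected to", "you may know") can be sketched with plain breadth-first search. This is an in-memory illustration only, not Neo4j or Cypher; the graph, the people in it, and the function names are hypothetical.

```python
from collections import deque

# Hypothetical undirected social graph as an adjacency map.
graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"dave"},
    "frank": set(),
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path, or None if unconnected."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):  # sorted for a deterministic result
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


def network_size(graph, start, max_depth):
    """Number of people reachable within max_depth hops, excluding start."""
    seen = {start}
    frontier = {start}
    for _ in range(max_depth):
        frontier = {n for node in frontier for n in graph[node]} - seen
        seen |= frontier
    return len(seen) - 1


def you_may_know(graph, person):
    """Friends of friends who are not already direct connections."""
    friends = graph[person]
    candidates = {fof for friend in friends for fof in graph[friend]}
    return candidates - friends - {person}


print(shortest_path(graph, "alice", "erin"))
print(network_size(graph, "alice", 2))
print(you_may_know(graph, "alice"))
```

"Is connected to?" falls out of the same search: two people are connected exactly when `shortest_path` does not return `None`.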
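  19. 19. Mining your network<br />Centrality Algorithms <br />Degree – who has the most followers on Twitter <br />Closeness – who can reach everyone else in the fewest hops<br />Betweenness – who sits on the most shortest paths between other people<br />Eigenvector – who has more influential people following them (PageRank is a variant)<br />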
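Two of the ideas on this slide, counting followers and PageRank-style scoring, can be made concrete on a tiny directed "follows" graph. This is a sketch under assumptions: the graph is hypothetical, and the damping factor of 0.85 and the fixed 50 iterations are conventional illustrative values, not anything stated in the deck.

```python
# Hypothetical "follows" graph: user -> set of accounts that user follows.
follows = {
    "a": {"c"},
    "b": {"c"},
    "c": {"d"},
    "d": {"c"},
    "e": {"d"},
}


def follower_counts(follows):
    """In-degree: how many accounts follow each user."""
    counts = {user: 0 for user in follows}
    for followees in follows.values():
        for user in followees:
            counts[user] += 1
    return counts


def pagerank(follows, damping=0.85, iterations=50):
    """Power iteration: a user's score is fed by the scores of their followers."""
    users = list(follows)
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in users}
        for user, followees in follows.items():
            # A user who follows nobody spreads their score evenly over everyone.
            targets = followees or users
            share = damping * rank[user] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


counts = follower_counts(follows)
ranks = pagerank(follows)
for user in sorted(ranks, key=ranks.get, reverse=True):
    print(user, counts[user], round(ranks[user], 3))
```

Follower count treats every follower equally, while the PageRank-style score weights a follower by that follower's own score, which is the "influential people following them" idea.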
  20. 20. Document Databases<br />
  21. 21. Flexible document oriented data<br />Document style unstructured data - schema less – e.g. JSON documents<br />No alter table needed like in an RDBMS, de-normalized data<br />Useful for iterative/agile development<br />Humongous scale - billions of documents, R/W traffic – millions/sec, horizontal scalability, availability<br />mongoDB is a leading document database<br />
  22. 22. Document Database – Use cases<br />Use cases :<br />Archiving of historic data which has undergone many schema changes<br />Flexible set of performance metrics – web site page views, unique visitors etc. - change over time – no need to update existing JSON documents<br />Track near real time metrics - optimized increment of perf counters<br />Geo Loc based mobile and gaming apps (Geospatial indices can be key here)<br />
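The "flexible metrics" and "optimized increment" use cases can be illustrated with JSON-like documents held in a plain Python list. This is only an in-memory sketch of the schema-less idea and does not use the mongoDB driver; the `find` and `increment` helpers, the field names, and the sample dates are hypothetical.

```python
# A "collection" here is just a list of JSON-like documents with no shared schema.
page_metrics = [
    {"url": "/home", "day": "2011-05-01", "page_views": 120},
    # A newer document adds a metric; the older document is left untouched.
    {"url": "/home", "day": "2011-05-02", "page_views": 98, "unique_visitors": 41},
]


def find(collection, **criteria):
    """Query by example: documents whose fields match every given value."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


def increment(collection, key, field, amount=1):
    """Upsert-style counter: create the document if missing, then add to the field."""
    matches = find(collection, **key)
    if matches:
        doc = matches[0]
    else:
        doc = dict(key)
        collection.append(doc)
    doc[field] = doc.get(field, 0) + amount
    return doc


increment(page_metrics, {"url": "/about", "day": "2011-05-02"}, "page_views")
increment(page_metrics, {"url": "/about", "day": "2011-05-02"}, "page_views")
print(find(page_metrics, url="/about"))
```

No migration step is needed when `unique_visitors` appears: documents that predate the new field simply do not have it, and queries on it skip them.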
  23. 23. Craigslist Archival Database<br />Premium service to customers allowed search over their historical postings<br />Archival (no purging) of 10 years of postings - billions of documents<br />Schema changes across versions<br />MySQL based archival database <br />ALTER TABLE took a month to complete<br />
  24. 24. Foursquare<br />Find a venue whose name is Starbucks and mayor is Abhijit<br />Geo: optimized for geo location queries – find Starbucks near my current GPS location<br />
  25. 25. mongoDB Architecture<br />Client<br />Shard<br />Shard<br />Shard<br />Mongo Router<br />Mongo Router<br />Mongo Configuration Server<br />Mongo Configuration Server<br />
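The two Foursquare-style queries (match on document attributes, and "find Starbucks near me") can be sketched as follows. This is a linear scan with a great-circle distance, which is exactly what a geospatial index exists to avoid; it is shown only to make the query semantics concrete. The venue data, coordinates, and function names are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# Hypothetical venue documents; "loc" is (latitude, longitude).
venues = [
    {"name": "Starbucks", "mayor": "Abhijit", "loc": (18.5204, 73.8567)},
    {"name": "Starbucks", "mayor": "Priya", "loc": (18.5600, 73.9100)},
    {"name": "Cafe Coffee Day", "mayor": "Abhijit", "loc": (18.5210, 73.8570)},
]


def find_by(venues, **criteria):
    """Attribute query, e.g. name is Starbucks and mayor is Abhijit."""
    return [v for v in venues if all(v.get(k) == val for k, val in criteria.items())]


def find_near(venues, here, name, max_km):
    """Venues with the given name within max_km of `here`, nearest first."""
    hits = [(haversine_km(*here, *v["loc"]), v) for v in venues if v["name"] == name]
    return [v for dist, v in sorted(hits, key=lambda hit: hit[0]) if dist <= max_km]


me = (18.5204, 73.8567)
print(find_by(venues, name="Starbucks", mayor="Abhijit"))
print(find_near(venues, me, "Starbucks", max_km=2.0))
```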
  26. 26. mongoDB Features<br />JSON documents, collection oriented storage<br />Rich, document-based queries<br />Indexes on document attributes<br />Fast in-place updates<br />Scalability features <br />Horizontal scalability<br />Configurable replication and high-availability<br />Auto-sharding & rebalancing<br />Language specific drivers – Java, Scala, Ruby etc.<br />
  27. 27. Column Stores<br />
  28. 28. Column Store<br />Reasonably rich data model – <br />sparse, distributed, persistent multi-dimensional sorted map<br />Sorted row keys, columns<br />Use cases - large scale data storage and analysis like - <br />Time series data along with associated dimension data <br />Row keys are timestamps and thus sorted – helps time range queries<br />Google Analytics<br />Provides aggregate statistics, # unique visitors/day, page views/URL/day<br />Raw click table (~200 TB) has a row for each URL + user session time – keeps rows for the same URL contiguous and chronologically sorted <br />Data cube dimensions (figure) - CPU, OS, Time, DC<br />
  29. 29. Column Store<br />Performance<br />Excellent R/W performance – large storage – PBs<br />High scalability - horizontal scaling, auto-sharding<br />High availability - transparent replication of data<br />HBase is a leading column store, built on Hadoop HDFS as the underlying persistence layer <br />
  30. 30. Column Store - HBase<br />Table defines Column Families - groups of similar attributes, vertical partitioning <br />(Table, Row, ColumnFamily:Column, Timestamp) tuple maps to a cell value <br />Table is split into multiple evenly distributed regions, each of which is a range of sorted keys (partitioned automatically by the key)<br />Rows ordered by key, columns ordered within a Column Family<br />Rows can have different numbers of columns <br />Columns have a value and any number of versions<br />Row range, column range and key range queries<br />
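The data model described above, a sparse sorted map from (row, column family:column, timestamp) to a value, can be mimicked in a few lines. The `TinyTable` class below is a toy in-memory model written for illustration; it is not the HBase client API, and the row keys and column names are made up.

```python
import bisect


class TinyTable:
    """Toy sorted, sparse, versioned table (an illustration, not the HBase API)."""

    def __init__(self):
        self._row_keys = []  # kept sorted so row range scans are cheap
        self._rows = {}      # row key -> {"family:column": [(timestamp, value), ...]}

    def put(self, row, column, timestamp, value):
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        versions = self._rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0], reverse=True)  # newest first

    def get(self, row, column, timestamp=None):
        """Newest value at or before `timestamp` (newest overall if omitted)."""
        for ts, value in self._rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, stop_row):
        """Yield (row, columns) for start_row <= row < stop_row, in key order."""
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, stop_row)
        for row in self._row_keys[lo:hi]:
            yield row, self._rows[row]


# Time-series rows keyed by timestamp string, so a day is one contiguous key range.
table = TinyTable()
table.put("2011-05-01T10:00", "metrics:cpu", 1, 0.42)
table.put("2011-05-01T10:05", "metrics:cpu", 1, 0.57)
table.put("2011-05-01T10:05", "host:os", 1, "linux")   # rows need not share columns
table.put("2011-05-01T10:05", "metrics:cpu", 2, 0.61)  # a second version of the cell
table.put("2011-05-02T09:00", "metrics:cpu", 1, 0.30)

print([row for row, _ in table.scan("2011-05-01", "2011-05-02")])
```

Because row keys are sorted, a time range query is a single contiguous scan, which is why timestamp-based row keys suit time series data.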
  31. 31. HBase Architecture<br />
  32. 32. Key Value Stores<br />
  33. 33. Key Value Stores<br />Simplest possible data model<br />Caching a user’s personalized, rendered page – avoid DB<br />S3 bucket storage for blob data against a unique id<br />Range of KV stores<br />Distributed, scalable persistent key-value storage – Dynamo, Voldemort<br />Auto-partitioned key space <br />Replicated KV<br />Highly available<br />Largely in-memory KV stores – Redis, memcached<br />Redis is blazing fast for caching and other interesting operations<br />
  34. 34. Redis<br />In-memory KV store <br />Blazing fast – ~100K ops/sec R/W<br />Async snapshot to disk<br />More than a KV store – a data structure store – <br />Supports lists, queues, sets and operations on them<br />Sorted set range operations<br />Set operations UNION, INTERSECTION, DIFF<br />
  35. 35. Redis – Use Cases<br />Web session caching with EXPIRE set for session expiry<br />Live real time URL stats like clicks etc. – fast increments of counters<br />Auto complete – type the first few characters – maps to a sorted set and a range query is fired<br />Publish / Subscribe – fan out a message to subscribers<br />Set operations – My Twitter <Followees DIFF Followers> - tells me who I follow but who doesn’t follow me back<br />
  36. 36. Thanks<br />Email : abhijit.sharma@gmail.com<br />Twitter : sharmaabhijit<br />Blog :<br />
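The set-operation and auto-complete use cases can be tried out with Python's built-in sets and a sorted list, as an in-memory analogue of what a data structure store offers. Nothing here talks to Redis; the account names, search terms, and the `autocomplete` helper are hypothetical.

```python
import bisect

followers = {"ann", "bob", "cat"}          # accounts that follow me
followees = {"bob", "cat", "dan", "eve"}   # accounts I follow

mutual = followers & followees               # INTERSECTION: we follow each other
not_following_back = followees - followers   # DIFF: I follow them, they don't follow me
everyone = followers | followees             # UNION: anyone connected to me


def autocomplete(sorted_terms, prefix, limit=5):
    """Range query over a sorted list: jump to the prefix, read until it stops matching."""
    start = bisect.bisect_left(sorted_terms, prefix)
    results = []
    for term in sorted_terms[start:start + limit]:
        if not term.startswith(prefix):
            break
        results.append(term)
    return results


terms = sorted(["redis", "redo", "reduce", "read", "ruby", "rust"])

print(sorted(mutual))
print(sorted(not_following_back))
print(autocomplete(terms, "red"))
```

The intersection yields mutual follows, while the difference answers the "who doesn't follow me back" question; auto-complete works because all terms sharing a prefix sit next to each other in sorted order.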