Big Data trends and the rising importance of NOSQL
Abhijit Sharma, Architect, Innovation & Incubation Lab, BMC Software
Trends in cloud, web, and even enterprise scale apps
  • Unprecedented growth in:
    • Data set sizes that need to be stored and analyzed. Big Data: cloud scale services generate TBs to PBs (FB, eBay, Digg, Foursquare)
    • Connectedness and democratization of data: social networks, feeds, blogs, wikis, tags, semantic web
    • Data APIs: mash up data using the Twitter, FB, Flickr APIs
    • Semi-structured or unstructured data
  • Performance requirements of these apps:
    • Humongous R/W scalability
    • High availability
    • Trading consistency for availability: ACID not mandatory
RDBMS woes
  • Challenge: storing and scaling humongous amounts of data while remaining highly available
  • Vertical scaling mostly: it has an upper limit and is expensive
  • Horizontal scaling: no automatic sharding, no rebalancing, no built-in infrastructure
  • Distributed transactions and joins (a consequence of normalization) inhibit performance and availability
  • Schema-less data models do not fit a rigid schema: ALTER TABLE, null columns
  • Deeply connected data: not designed for this
The NOSQL Alternative
  • NOSQL is NOT "No SQL"
The NOSQL Alternative
  • NOSQL is simply "Not only SQL"
NOSQL – So what else is it?
  • "One size fits all" RDBMS is not working
  • NOSQL alternatives are polyglot solutions that better fit the new requirements thrown up by the trends
  • They can be categorized along these axes:
    • Data model: simple to complex
    • Scalability: single node to horizontal
    • Persistence
NOSQL categories
  • Graph Databases
    • Based on graph theory
    • Data model: graph with nodes, edges, properties
    • Scalability: single node, high performance
    • Persistence: on-disk data structures
    • Examples: Neo4J, AllegroGraph
  • Document Databases
    • Based loosely on documents/Lotus Notes
    • Data model: collections of documents
    • Scalability: horizontal, auto-sharding & replication
    • Persistence: B-Tree
    • Examples: mongoDB, CouchDB
NOSQL categories
  • Column Stores
    • Based on Google's BigTable design
    • Data model: big table, column families
    • Scalability: horizontal, auto-sharding & replication
    • Persistence: memory + file (on DFS)
    • Examples: HBase, Cassandra
  • Key Value Stores
    • Based on DHT, Amazon's Dynamo design
    • Data model: collection of key value pairs
    • Scalability: horizontal, auto-sharding & replication
    • Persistence: memory or file
    • Examples: Redis, Amazon Dynamo, Voldemort
Graph Databases
Graph oriented data
  • Graphs are ubiquitous: social networks, wikis, the web, recommendation engines, et al.
  • Deep trees, complex networks
  • Graph traversal: apt for expressing graph related problems (shortest path, network size, etc.)
LinkedIn Social Graph
Why not RDBMS for large scale graphs?
  • Difficult to model and traverse graphs in an RDBMS
  • Recursive approaches: slow SQL queries that span many table joins
  • Hacks like storing paths for trees
Graph Databases
  • Designed for efficient storage & traversal of large scale graphs
  • Natural modeling of a graph network: nodes, relationships and their properties
  • Neo4J is a leading graph db
    • Supports billions of nodes/edges, traverses to depths of 1000 levels in milliseconds, up to 1000x faster than an RDBMS
    • Handles large graphs that don't fit in memory: persistent transactional store optimized for graphs
    • REST API and various language bindings
    • Graph pattern matching, Cypher query language, indexer (Lucene)
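The node/relationship/property model on this slide can be sketched in plain Python. This is only an in-memory illustration under assumed sample data (the names, cities and KNOWS relationships are hypothetical); it is not Neo4J's API or storage format. It also shows a "network size" count: how many people are reachable within a given number of hops.

```python
from collections import deque

# Hypothetical sample data, not taken from the slides.
nodes = {
    "alice": {"city": "Pune"},
    "bob": {"city": "Pune"},
    "carol": {"city": "Mumbai"},
    "dave": {"city": "Delhi"},
    "erin": {"city": "Pune"},
}
relationships = [
    ("alice", "bob", {"type": "KNOWS", "since": 2009}),
    ("bob", "carol", {"type": "KNOWS", "since": 2010}),
    ("carol", "dave", {"type": "KNOWS", "since": 2011}),
    ("alice", "erin", {"type": "KNOWS", "since": 2008}),
]


def build_adjacency(relationships):
    """Index relationships by node so neighbors can be looked up directly."""
    adjacency = {}
    for source, target, _properties in relationships:
        # KNOWS is treated as undirected in this sketch.
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


def network_size(adjacency, start, max_depth):
    """Count distinct nodes reachable from start within max_depth hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return len(seen) - 1  # exclude the start node itself


adjacency = build_adjacency(relationships)
print(network_size(adjacency, "alice", 2))
```

The traversal follows neighbor references hop by hop instead of joining tables, which is the property the slide attributes to graph databases.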
Graph basics
All Paths & My Network size
Shortest path between …
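A shortest-path question like the one on this slide is commonly answered with a breadth-first search on an unweighted graph. The sketch below assumes a small hypothetical adjacency map; it illustrates the technique and is not how any particular graph database implements it.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return one shortest path from start to goal as a list, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # node -> node we reached it from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor in previous:
                continue  # already reached by an equally short or shorter route
            previous[neighbor] = node
            if neighbor == goal:
                # Walk the back-pointers to rebuild the path.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical graph; lists keep the neighbor order deterministic.
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
    "f": [],
}
print(shortest_path(graph, "a", "e"))
```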
Is connected to?
You may know…
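One simple way to produce "you may know" suggestions is to look at friends of friends and rank them by the number of mutual friends. The sketch below assumes a hypothetical friendship map; real recommendation features may weigh many more signals.

```python
from collections import Counter


def people_you_may_know(friends, person):
    """Rank friends-of-friends by how many mutual friends they share."""
    direct = friends.get(person, set())
    mutual_counts = Counter()
    for friend in direct:
        for candidate in friends.get(friend, set()):
            # Skip the person themselves and people they already know.
            if candidate != person and candidate not in direct:
                mutual_counts[candidate] += 1
    # Most mutual friends first; names break ties for a stable order.
    return sorted(mutual_counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical friendship data.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}
print(people_you_may_know(friends, "ann"))
```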
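Mining your network
  • Centrality algorithms
    • Degree: who has the most followers on Twitter
    • Closeness: who can reach everyone else in the fewest hops
    • Betweenness: who sits on the most shortest paths between other people
    • Eigenvector: who has more influential people following them (PageRank)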
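The PageRank entry on this slide can be illustrated with a short power-iteration sketch. The follower graph, the damping factor of 0.85 and the iteration count are assumptions chosen for the example; this is a textbook-style simplification, not a production ranking algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> list of nodes it follows."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node gets a small baseline share of rank.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            # A node that follows nobody spreads its rank over all nodes.
            targets = links.get(node) or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical "who follows whom" data: a follows c, b follows c, and so on.
follows = {"a": ["c"], "b": ["c"], "c": ["d"], "d": ["c"], "e": ["a"]}
ranks = pagerank(follows)
print(max(ranks, key=ranks.get))
```

Here a node ranks highly when it is followed by nodes that themselves rank highly, rather than simply by having many followers.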
Document Databases
Flexible document oriented data
  • Document style unstructured data: schema-less, e.g. JSON documents
  • No ALTER TABLE needed as in an RDBMS; data is de-normalized
  • Useful for iterative/agile development
  • Humongous scale: billions of documents, R/W traffic of millions/sec, horizontal scalability, availability
  • mongoDB is a leading document database
Document Database – Use cases
  • Archiving of historic data which has undergone many schema changes
  • Flexible set of performance metrics (web site page views, unique visitors, etc.) that changes over time, with no need to update existing JSON documents
  • Tracking near real time metrics: optimized increment of perf counters
  • Geolocation based mobile and gaming apps (geospatial indices can be key here)
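The metrics use case can be sketched with schema-less documents held in a plain dictionary. The class name, document ids and field names below are hypothetical, and the in-memory increment only stands in for the kind of in-place counter update a document database offers (for example MongoDB's `$inc` update operator).

```python
class MetricsCollection:
    """Toy in-memory collection of schema-less metric documents."""

    def __init__(self):
        self.documents = {}

    def increment(self, doc_id, field, amount=1):
        # Create the document on first use; no schema has to be declared.
        document = self.documents.setdefault(doc_id, {"_id": doc_id})
        # A metric that did not exist before simply appears as a new field.
        document[field] = document.get(field, 0) + amount
        return document[field]


metrics = MetricsCollection()
metrics.increment("/home:2011-06-01", "page_views")
metrics.increment("/home:2011-06-01", "page_views")
# A new metric is introduced later; older documents are left untouched.
metrics.increment("/home:2011-06-02", "unique_visitors", 5)
print(metrics.documents)
```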
Craigslist Archival Database
  • Premium service to customers allowed search over their historical postings
  • Archival (no purging) of 10 years of postings: billions of documents
  • Schema changes across versions
  • MySQL based archival database: ALTER TABLE took a month to complete
Foursquare
  • Find a venue whose name is Starbucks and whose mayor is Abhijit
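The venue lookup on this slide is an equality match on two document attributes. The sketch below emulates that style of query-by-example over a list of dictionaries; the `find` and `matches` helpers and the venue data are hypothetical and only mimic the idea, they are not a database driver API.

```python
def matches(document, query):
    """True when every field in the query equals the document's value."""
    return all(document.get(field) == expected for field, expected in query.items())


def find(collection, query):
    """Return all documents that match the query-by-example dict."""
    return [document for document in collection if matches(document, query)]


# Hypothetical venue documents; note that they do not share one fixed schema.
venues = [
    {"name": "Starbucks", "mayor": "Abhijit", "city": "Pune"},
    {"name": "Starbucks", "mayor": "Priya", "city": "Mumbai", "wifi": True},
    {"name": "Cafe Goodluck", "mayor": "Abhijit"},
]
print(find(venues, {"name": "Starbucks", "mayor": "Abhijit"}))
```

A document database would typically answer this from indexes on the queried attributes rather than scanning every document as this sketch does.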
mongoDB Architecture
  • Geo: optimized for geolocation queries, e.g. find Starbucks near my current GPS location
  • [Architecture diagram: Client, Mongo Routers, Shards, Mongo Configuration Servers]
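The "find Starbucks near my current GPS location" query can be sketched as a distance filter. The example below uses the haversine great-circle formula with a linear scan over hypothetical venues and coordinates; a geospatial index exists precisely to avoid such a full scan, so treat this only as an illustration of what the query computes.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def find_near(venues, lat, lon, name, max_km):
    """Return venues with the given name within max_km, nearest first."""
    hits = []
    for venue in venues:
        if venue["name"] != name:
            continue
        venue_lat, venue_lon = venue["loc"]
        distance = haversine_km(lat, lon, venue_lat, venue_lon)
        if distance <= max_km:
            hits.append((distance, venue))
    hits.sort(key=lambda hit: hit[0])
    return [venue for _distance, venue in hits]


# Hypothetical venues with (latitude, longitude) pairs.
venues = [
    {"id": "far", "name": "Starbucks", "loc": (19.07, 72.87)},
    {"id": "mid", "name": "Starbucks", "loc": (18.60, 73.90)},
    {"id": "close", "name": "Starbucks", "loc": (18.53, 73.86)},
    {"id": "other", "name": "Cafe Goodluck", "loc": (18.521, 73.851)},
]
nearby = find_near(venues, 18.52, 73.85, "Starbucks", max_km=15)
print([venue["id"] for venue in nearby])
```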
mongoDB Features
  • JSON documents, collection oriented storage
  • Rich, document-based queries
  • Indexes on document attributes
  • Fast in-place updates
  • Scalability features
    • Horizontal scalability
    • Configurable replication and high availability
    • Auto-sharding & rebalancing
  • Language specific drivers: Java, Scala, Ruby, etc.
Column Stores
Column Store
  • Reasonably rich data model: sparse, distributed, persistent multi-dimensional sorted map
  • Sorted row keys, columns
  • Use cases: large scale data storage and analysis, such as:
    • Time series data along with associated dimension data. Row keys are timestamps and thus sorted, which helps time range queries
    • Google Analytics
      • Provides aggregate statistics: # unique visitors/day, page views/URL/day
      • Raw click table (~200 TB) has a row for each URL + user session time, which keeps rows for the same URL contiguous and chronologically sorted
  • [Figure: data cube with dimensions CPU, OS, Time, DC]
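The benefit of sorted row keys for time range queries can be shown with a small in-memory sorted map. The `SortedTable` class, the timestamp key format and the column names are assumptions for illustration; the point is only that a range scan becomes two binary searches plus a contiguous read.

```python
import bisect


class SortedTable:
    """Toy sparse table whose row keys are kept in sorted order."""

    def __init__(self):
        self._keys = []   # sorted row keys
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)  # keep keys sorted on insert
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def scan(self, start_key, stop_key):
        """Return rows with start_key <= key < stop_key, in key order."""
        low = bisect.bisect_left(self._keys, start_key)
        high = bisect.bisect_left(self._keys, stop_key)
        return [(key, self._rows[key]) for key in self._keys[low:high]]


table = SortedTable()
# ISO-style timestamps sort chronologically as plain strings.
table.put("2011-06-01T10:05", "cpu", 71)
table.put("2011-06-01T09:55", "cpu", 42)
table.put("2011-06-01T10:00", "cpu", 55)
table.put("2011-06-01T10:00", "os", "linux")  # sparse: only this row has "os"
print(table.scan("2011-06-01T10:00", "2011-06-01T10:10"))
```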
Column Store Performance
  • Excellent R/W performance with large storage, up to PBs
  • High scalability: horizontal scaling, auto-sharding
  • High availability: transparent replication of data
  • HBase is a leading column store, built on Hadoop HDFS as the underlying persistence layer
Column Store - HBase
  • Table defines Column Families, which group similar attributes (vertical partitioning)
  • A (Table, Row, ColumnFamily:Column, Timestamp) tuple maps to a cell value
  • Table is split into multiple, evenly distributed regions, each of which is a range of sorted keys (partitioned automatically by the key)
  • Rows are ordered by key; columns are ordered within a Column Family
  • Rows can have different numbers of columns
  • Columns have a value and versions (any number)
  • Row range, column range and key range queries
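The (Row, ColumnFamily:Column, Timestamp) to value mapping can be sketched as a nested dictionary with versioned cells. The `VersionedTable` class and the sample rows are hypothetical and do not use the HBase client API; returning the newest version at or before a requested timestamp is a design choice made for this sketch.

```python
class VersionedTable:
    """Toy table mapping (row, 'family:column', timestamp) to a cell value."""

    def __init__(self, column_families):
        self.column_families = set(column_families)
        self._cells = {}  # (row, "family:column") -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            # Families are fixed per table; columns inside them are not.
            raise KeyError("unknown column family: " + family)
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version, or the newest at or before timestamp."""
        versions = self._cells.get((row, column), {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]


table = VersionedTable(column_families=["info", "stats"])
table.put("com.example/index", "stats:views", 100, 10)
table.put("com.example/index", "stats:views", 200, 25)
table.put("com.example/index", "info:title", 100, "Home")
print(table.get("com.example/index", "stats:views"))
```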
HBase Architecture
Key Value Stores
Key Value Stores
  • Simplest possible data model
    • Caching a user's personalized, rendered page to avoid hitting the DB
    • S3 bucket storage for blob data against a unique id
  • Range of KV stores
    • Distributed, scalable, persistent key-value storage: Dynamo, Voldemort
      • Auto-partitioned key space
      • Replicated KV
      • Highly available
    • Largely in-memory KV stores: Redis, memcached
      • Redis is blazing fast for caching and other interesting operations
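An auto-partitioned, replicated key space is often built on a hash ring, the approach associated with Dynamo-style stores. The sketch below is a deliberately simplified version (no virtual nodes, no rebalancing), with hypothetical node names and a replica count of 2 chosen for the example.

```python
import bisect
import hashlib


def ring_position(value):
    """Map a string to a large integer position on the hash ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring that places each key on several nodes."""

    def __init__(self, node_names, replicas=2):
        self.replicas = replicas
        self._ring = sorted((ring_position(name), name) for name in node_names)
        self._positions = [position for position, _name in self._ring]

    def nodes_for(self, key):
        """Return the nodes holding the key: the owner plus its successors."""
        # First node clockwise from the key's position, wrapping around.
        start = bisect.bisect_right(self._positions, ring_position(key)) % len(self._ring)
        count = min(self.replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c"], replicas=2)
print(ring.nodes_for("user:42"))
```

Because placement depends only on the key's hash and the node positions, any client can compute where a key lives without consulting a central directory.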
Redis
  • In-memory KV store
  • Blazing fast: 100K R/W operations per second
  • Async snapshot to disk
  • More than a KV store: a data structure store that supports lists, queues, sets and operations on them
  • Sorted list range operations
  • Set operations: UNION, INTERSECTION, DIFF
Redis – Use Cases
  • Web session caching, with EXPIRE set for session expiry
  • Live real time bit.ly URL stats such as clicks: fast increments of counters
  • Auto complete: typing the first few characters maps to a sorted list, on which a range query is fired
  • Publish / Subscribe: fan out a message to subscribers
  • Set operations: my Twitter <Followees DIFF Followers> tells me whom I follow but who don't follow me back; <Followers INTERSECTION Followees> gives mutual follows
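Two of these use cases can be illustrated with Python's built-in sets and the `bisect` module. This is a plain in-memory sketch with hypothetical user names and terms, not the Redis client API. In set terms, the intersection of followers and followees gives mutual follows, while followees minus followers gives the people I follow who do not follow me back.

```python
import bisect

# Hypothetical Twitter data.
followers = {"ann", "bob", "cat"}          # people who follow me
followees = {"bob", "cat", "dan", "eve"}   # people I follow

mutual = followers & followees               # INTERSECTION
not_following_back = followees - followers   # DIFF
everyone = followers | followees             # UNION


def autocomplete(sorted_terms, prefix, limit=5):
    """Range query on a sorted list: terms starting with prefix, in order."""
    start = bisect.bisect_left(sorted_terms, prefix)  # first term >= prefix
    results = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(results) == limit:
            break  # matches are contiguous, so stop at the first miss
        results.append(term)
    return results


terms = sorted(["redis", "redo", "red", "reduce", "hbase", "neo4j", "replica"])
print(sorted(not_following_back))
print(autocomplete(terms, "red"))
```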
Thanks
  • Email: abhijit.sharma@gmail.com
  • Twitter: sharmaabhijit
  • Blog: abhijitsharma.blogspot.com
