Demystifying datastores

My attempt to demystify datastores.
How do you choose a store that fits your needs? What questions do you need to ask?
HBase, Hadoop, MySQL, Cassandra, Vertica, etc.

Published in: Technology
Demystifying datastores

  1. 1. Vishnu Rao MySQL Enthusiast Doodle maker Senior Data Engineer @ DataSpark Formerly @ flipkart.com
  2. 2. The comma-separated list ... ● Hadoop, HBase, RocksDB ● MySQL, MariaDB, Postgres ● Cassandra, MongoDB ● Druid, Redis, MemSQL ● Elasticsearch, Solr ● CockroachDB, CouchDB ● Vertica, Infobright ● Redshift, DynamoDB ● S3, OpenStack Swift ...
  3. 3. The FUN-damental Qns:
  4. 4. The FUN-damental Qns: Which one should I use ?
  5. 5. Demystifying Datastores
  6. 6. Let's try to look at the problem from the point of view of the database
  7. 7. First, let's play some baseball ...
  8. 8. Base 0 : The Data itself
  9. 9. Base 0 : The Data itself ● Row having columns
  10. 10. Base 0 : The Data itself ● Row having columns ● Key - Value
  11. 11. Base 0 : The Data itself ● Row having columns ● Key - Value ○ Key - Blob (think object)
  12. 12. Base 0 : The Data itself ● Row having columns ● Key - Value ○ Key - Blob (think object) ○ Key - Document (think JSON / XML)
  13. 13. Base 0 : The Data itself ● Row having columns ● Key - Value ○ Key - Blob (think object) ○ Key - Document (think JSON / XML) ● Graph (nodes/edges, kind of like key-value)
  14. 14. Base 1 : How is the Data Stored ?
  15. 15. Base 1 : How is the Data Stored ? Let's consider a sample data record/row: order-id-123 | customer-1 | $5 bill amount | Bugis Street | $1 tax | 3 items
  16. 16. Base 1 : How is the Data Stored ? Let's consider a sample data record/row: order-id-123 | customer-1 | $5 bill amount | Bugis Street | $1 tax | 3 items (Diagram labels: Columns / Attributes; Possible Primary Key Column)
  17. 17. Base 1 : How is the Data Stored ? Approach 1 ● Store all columns of the Row side by side (i.e. TOGETHER ) on disk.
  18. 18. Base 1 : How is the Data Stored ? Approach 1 ● Store all columns of the Row side by side (i.e. TOGETHER ) on disk. ● This is generally referred to as a ROW based DataStore.
  19. 19. Base 1 : How is the Data Stored ? Approach 1 ● Useful for use cases like “showing the ENTIRE order on the UI”. Sample row: order-id-123 | customer-1 | $5 bill amount | Bugis Street | $1 tax | 3 items
  20. 20. Base 1 : How is the Data Stored ? Approach 1 ● Useful for use cases like “showing the ENTIRE order on the UI” ● The entire row is fetched in one disk access. Sample row: order-id-123 | customer-1 | $5 bill amount | Bugis Street | $1 tax | 3 items
  21. 21. Base 1 : How is the Data Stored ? Approach 2 ● Store Columns SEPARATELY, so that they can be accessed independently.
  22. 22. Base 1 : How is the Data Stored ? Approach 2 ● Store Columns SEPARATELY, so that they can be accessed independently. ● This is generally referred to as a COLUMN based DataStore.
  23. 23. Base 1 : How is the Data Stored ? Approach 2 ● Useful for aggregates like Avg(billing_amount) or Sum(Items). Sample rows: order-id-123 | customer-1 | $5 bill amount | Bugis Street | $1 tax | 3 items; order-id-121 | customer-1 | $2 bill amount | Bugis Street | $2 tax | 1 item
  24. 24. Base 1 : How is the Data Stored ? Approach 2 ● Useful for aggregates like Avg(billing_amount) or Sum(Items) ● Instead of fetching the entire row, fetch only the columns needed for the computation ○ i.e. less data fetched from disk = REDUCED IO. Sample rows: order-id-123 | customer-1 | $5 bill amount | Bugis Street | $1 tax | 3 items; order-id-121 | customer-1 | $2 bill amount | Bugis Street | $2 tax | 1 item
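The row-versus-column contrast on the slides above can be sketched in a few lines of Python. This is only an in-memory illustration, not how any particular datastore lays out bytes on disk; the field names (`bill_amount`, `items`, and so on) and the helper `to_columns` are assumptions made for the example, and the two rows are the sample orders from the slides.

```python
# Row layout: each record keeps all of its columns together.
# Field names are illustrative assumptions, values come from the slides.
rows = [
    {"order_id": "order-id-123", "customer": "customer-1", "bill_amount": 5,
     "location": "Bugis Street", "tax": 1, "items": 3},
    {"order_id": "order-id-121", "customer": "customer-1", "bill_amount": 2,
     "location": "Bugis Street", "tax": 2, "items": 1},
]


def to_columns(rows):
    """Pivot row dicts into one list per column (a column layout)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)


def avg_bill_from_rows(rows):
    # Row layout: every full row is visited just to read one field.
    return sum(row["bill_amount"] for row in rows) / len(rows)


def avg_bill_from_columns(columns):
    # Column layout: only the 'bill_amount' column is read.
    bill = columns["bill_amount"]
    return sum(bill) / len(bill)


def total_items_from_columns(columns):
    # Sum(Items) likewise needs only the 'items' column.
    return sum(columns["items"])
```

Both averages agree; the difference is how much data has to be read to compute them, which is the reduced IO the slide refers to.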
  25. 25. Base 1 : How is the Data Stored ? Approach 2 ● What other optimisations does a column store allow? ○ Imagine 4 rows with a column, say ‘age’ ■ Row 1 - 28 ■ Row 2 - 30 ■ Row 3 - 28 ■ Row 4 - 28
  26. 26. Base 1 : How is the Data Stored ? Approach 2 ● If you SORT the values before storing them on disk, you can also compress them: 28,28,28,30 (sorted, so good for search) becomes 28(3),30 (compressed, so 28 is stored once)
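The sort-then-compress idea above is essentially run-length encoding. A minimal sketch, assuming the four `age` values from the previous slide; the function names are made up for the example, and real column stores use more elaborate encodings.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the full list."""
    return [value for value, count in pairs for _ in range(count)]


ages = [28, 30, 28, 28]          # Row 1..4 from the slide
sorted_ages = sorted(ages)       # 28, 28, 28, 30
encoded = run_length_encode(sorted_ages)   # [(28, 3), (30, 1)]
```

Sorting first matters: on the unsorted column the same encoder would produce three runs instead of two, so the value 28 would be stored twice.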
  27. 27. Base 1 : How is the Data Stored ? Typically : ● MySQL / Postgres = ROW based ● Vertica / Infobright / Druid = COLUMN based
  28. 28. Base 1 : How is the Data Stored ? Approach 2.5 ● Store Group of Columns TOGETHER but store each group separately.
  29. 29. Base 1 : How is the Data Stored ? Approach 2.5 ● Store Group of Columns TOGETHER but store each group separately. ● This is generally referred to as a COLUMN-family based DataStore.
  30. 30. Base 1 : How is the Data Stored ? Approach 2.5 ● Logically group the columns. Sample row: order-id-123 | customer-1 | $5 bill amount | Bugis Street | $1 tax | 3 items
  31. 31. Base 1 : How is the Data Stored ? Approach 2.5 ● Logically group the columns ● Typically: HBase / Cassandra. Sample row: order-id-123 | customer-1 | $5 bill amount | Bugis Street | $1 tax | 3 items
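A rough sketch of the column-family idea, under stated assumptions: the transcript does not say which columns the slide grouped together, so the two families below (`customer_info` and `billing`) are hypothetical, as are the `put` and `get` helpers. This is not the HBase or Cassandra API, only an illustration of "group together, store groups separately".

```python
# Hypothetical grouping of the sample row's columns into two families.
FAMILIES = {
    "customer_info": ["customer", "location"],
    "billing": ["bill_amount", "tax", "items"],
}

# One separate store per family, each keyed by the row key.
store = {family: {} for family in FAMILIES}


def put(row_key, row):
    """Split a row by family and write each slice to its own store."""
    for family, column_names in FAMILIES.items():
        store[family][row_key] = {
            name: row[name] for name in column_names if name in row
        }


def get(row_key, family):
    """Read one family of a row without touching the other families."""
    return store[family][row_key]


put("order-id-123", {
    "customer": "customer-1", "location": "Bugis Street",
    "bill_amount": 5, "tax": 1, "items": 3,
})
billing = get("order-id-123", "billing")
```

A billing report reads only the `billing` family, while a page that shows who ordered from where reads only `customer_info`.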
  32. 32. Base 2 : The Indexing ● What kind of Data Structure is used ?
  33. 33. Base 2 : The Indexing ● What kind of Data Structure is used ? ○ B-tree, Inverted Index , Fractal Tree, Clustered Key , BitMap, No Index ?
  34. 34. Base 2 : The Indexing ● What kind of Data Structure is used ? ○ B-tree, Inverted Index, Fractal Tree, Clustered Key, BitMap, No Index ? ● Certain types of queries favour certain indexes
  35. 35. Base 2 : The Indexing ● What kind of Data Structure is used ? ○ B-tree, Inverted Index, Fractal Tree, Clustered Key, BitMap, No Index ? ● Certain types of queries favour certain indexes ○ Range queries suit a B-tree; insert-heavy workloads suit a Fractal Tree.
  36. 36. Base 2 : The Indexing ● What kind of Data Structure is used ? ○ B-tree, Inverted Index, Fractal Tree, Clustered Key, BitMap, No Index ? ● Certain types of queries favour certain indexes ○ Range queries suit a B-tree; insert-heavy workloads suit a Fractal Tree. ● What's the index loading mechanism ? ○ Redis is memory-bound.
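To make "certain queries favour certain indexes" concrete, here is a small sketch of two index shapes. A sorted list searched with `bisect` stands in for an ordered index such as a B-tree (it is not a real B-tree), and a dictionary of word to record ids stands in for an inverted index. The four records, including the street name "Orchard Road", are made-up sample data.

```python
import bisect
from collections import defaultdict

# Made-up records keyed by record id.
records = {
    1: {"age": 28, "street": "Bugis Street"},
    2: {"age": 30, "street": "Orchard Road"},
    3: {"age": 28, "street": "Bugis Street"},
    4: {"age": 35, "street": "Orchard Road"},
}

# Ordered index on 'age': (age, record id) pairs kept sorted.
age_index = sorted((rec["age"], rid) for rid, rec in records.items())
age_keys = [age for age, _ in age_index]


def range_query(low, high):
    """Record ids with low <= age <= high, found by binary search."""
    start = bisect.bisect_left(age_keys, low)
    end = bisect.bisect_right(age_keys, high)
    return [rid for _, rid in age_index[start:end]]


# Inverted index on 'street': word -> set of record ids.
word_index = defaultdict(set)
for rid, rec in records.items():
    for word in rec["street"].lower().split():
        word_index[word].add(rid)


def search(word):
    """Record ids whose street contains the given word."""
    return sorted(word_index.get(word.lower(), set()))
```

The ordered index answers "ages between 28 and 30" by locating two positions and reading the slice between them; the inverted index answers "which records mention Bugis" with a single lookup. Neither structure is good at the other's query.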
  37. 37. Base 3 : The Theorem ● Most Datastores do ○ Horizontal scaling ○ Sharding
  38. 38. Base 3 : The Theorem ● Most Datastores do ○ Horizontal scaling ○ Sharding ● So Here is the Catch - In event of Network Partition, ○ How is Consistency / Availability Handled ?
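The consistency-versus-availability question above can be illustrated with a toy two-replica cluster. Everything here is hypothetical (the `TinyCluster` class, the `"CP"` and `"AP"` mode names, the `partitioned` flag); real datastores handle partitions with far more machinery, so treat this only as a sketch of the trade-off.

```python
class TinyCluster:
    """Two replicas, 'a' and 'b'; clients write through replica 'a'."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = {}
        self.b = {}
        self.partitioned = False  # True when 'a' cannot reach 'b'

    def write(self, key, value):
        """Return True if the write was accepted."""
        if not self.partitioned:
            # Healthy network: both replicas get the write.
            self.a[key] = value
            self.b[key] = value
            return True
        if self.mode == "CP":
            # Favour consistency: refuse the write rather than diverge.
            return False
        # Favour availability: accept locally, replicas now disagree.
        self.a[key] = value
        return True


cp = TinyCluster("CP")
cp.partitioned = True
cp_accepted = cp.write("order-id-123", "paid")

ap = TinyCluster("AP")
ap.partitioned = True
ap_accepted = ap.write("order-id-123", "paid")
```

During the partition the CP cluster stays consistent but rejects the write, while the AP cluster accepts it and has to reconcile the two replicas later.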
  39. 39. Base 4 : Apart from CAP theorem
  40. 40. Base 4 : Apart from CAP theorem ● ACID ? ○ Transaction commit/Rollback support
  41. 41. Base 4 : Apart from CAP theorem ● ACID ? ○ Transaction commit/Rollback support ● BASE ? ○ Basically Available , Soft State, Eventual Consistency ?
  42. 42. Base 4 : Apart from CAP theorem ● ACID ? ○ Transaction commit/Rollback support ● BASE ? ○ Basically Available , Soft State, Eventual Consistency ? ● Can I do joins if data is sharded ? ○ What about Distribution awareness ?
  43. 43. Base 4 : Apart from CAP theorem ● ACID ? ○ Transaction commit/Rollback support ● BASE ? ○ Basically Available , Soft State, Eventual Consistency ? ● Can I do joins if data is sharded ? ○ What about Distribution awareness ? ● The Query Interface (major concern ?)
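On the question of joins over sharded data: one common answer is to shard both tables on the join key, so that matching rows land on the same shard and each shard can join locally. The sketch below assumes exactly that; the shard count, the `shard_for` helper, and the extra customer and order (`customer-2`, `order-id-200`) are made up for the example.

```python
import hashlib

NUM_SHARDS = 2


def shard_for(key):
    """Stable hash of the key, mapped to a shard number."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


customers = [("customer-1", "Bugis Street"), ("customer-2", "Orchard Road")]
orders = [
    ("order-id-123", "customer-1", 5),
    ("order-id-121", "customer-1", 2),
    ("order-id-200", "customer-2", 7),
]

# Both tables are distributed by customer id (the join key).
shards = [{"customers": {}, "orders": []} for _ in range(NUM_SHARDS)]
for customer_id, street in customers:
    shards[shard_for(customer_id)]["customers"][customer_id] = street
for order_id, customer_id, amount in orders:
    shards[shard_for(customer_id)]["orders"].append(
        (order_id, customer_id, amount)
    )


def local_join(shard):
    """Join orders to customers using only data on this shard."""
    return [
        (order_id, customer_id, shard["customers"][customer_id], amount)
        for order_id, customer_id, amount in shard["orders"]
    ]


def distributed_join():
    """Run the local join on every shard and merge the results."""
    result = []
    for shard in shards:
        result.extend(local_join(shard))
    return sorted(result)


joined = distributed_join()
```

If orders were sharded by order id instead, `local_join` could hit a customer stored on another shard, and the store would have to move data across the network to finish the join. That is the "distribution awareness" the slide asks about.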
  44. 44. The bases...
  45. 45. So, try to cover the bases and decide which ones you need. PS: There is no silver bullet.
  46. 46. Thank you. Vishnu Rao jaihind213 sweetweet213 mash213.wordpress.com linkedin.com/in/213vishnu