Indic threads pune12-nosql now and path ahead


The 7th Annual IndicThreads Pune Conference was held on 14-15 December 2012.

Published in: Technology

  1. 1. NoSQL: Now and Path Ahead. Shubham Kumar Srivastava, MakeMyTrip
  2. 2. Who am I
  3. 3. Abstract: What and Why: NoSQL; Fundamentals; Use Case; Challenges; Path Ahead
  4. 4. What is NoSQL: A database that does not adhere to the traditional relational database management system (RDBMS) structure.
  5. 5. Why NoSQL: Scalability and Performance; Cost; Data Modeling
  6. 6. Why NoSQL: Motives and Drivers. Scalability and Performance:
      • Horizontal scalability is better than vertical scalability.
      • Hardware is getting cheaper and processing power is increasing.
      • Less operational complexity compared with RDBMS solutions; most NoSQL solutions provide automatic sharding and similar features by default.
  7. 7. Why NoSQL: Motives and Drivers contd..
  8. 8. Why NoSQL: Motives and Drivers contd..
  9. 9. Why NoSQL: Motives and Drivers contd.. Cost: Scale (as with NoSQL) comes with a hefty cost: commodity hardware, software versions, upgrades, maintenance. This led organizations to look for alternatives and created the need for a cost-effective scale-out option.
  10. 10. Why NoSQL: Motives and Drivers contd.. Data Modeling. SQL has been for:
      • Concurrency, consistency, integrity
      • Summations, aggregations, groupings
      Schema says: What all do I answer?
  11. 11. Why NoSQL: Motives and Drivers contd.. Data Modeling:
      • A plain key-value store is very powerful and fits most use cases for a NoSQL solution.
      • Hierarchical or graph-like data modeling and processing.
      • Values like maps of maps of maps.
      • Document databases, which can even store arbitrarily complex objects.
      • Document-based indexing data stores are a huge success.
  12. 12. Why NoSQL: Motives and Drivers contd.. At times software applications are not limited to these constraints. This led to data models like:
      • Key/Value Store: Redis, MemcacheDB, Voldemort, etc.
      • Wide Column Store / Column Families: Cassandra, Hadoop (HBase), Hypertable, Cloudera, etc.
      • Document-Based Stores: Solr/Lucene, MongoDB, CouchDB, Terrastore, etc.
      • Graph Data Store: Neo4j, GraphBase, FlockDB, etc.
  13. 13. Why NoSQL: Motives and Drivers contd..
  14. 14. Why NoSQL: Motives and Drivers contd.. Schema says: What are the questions?
      • Data modeling is based on the set of queries.
      • Exploit denormalization.
      • Duplication.
      • Use aggregates.
      • Manage joins with app + aggregation + denormalization, etc.
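As a rough illustration of query-driven modeling, the sketch below keeps one denormalized counter table per question the application asks ("bookings per city per day", "bookings per hotel per day") and updates both on every write. The class name `BookingCounters`, its methods, and the `#` key delimiter are assumptions made for this example, and the in-memory maps merely stand in for a key-value store.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one pre-aggregated "table" per query, filled at write time.
class BookingCounters {
    // Key "city#day" -> number of bookings (answers "bookings per city per day").
    private final Map<String, Integer> byCityAndDay = new HashMap<>();
    // Key "hotelId#day" -> number of bookings (answers "bookings per hotel per day").
    private final Map<String, Integer> byHotelAndDay = new HashMap<>();

    // A single booking is deliberately written twice (duplication),
    // so that each query becomes a single key lookup with no join.
    void recordBooking(String city, String hotelId, LocalDate day) {
        byCityAndDay.merge(city + "#" + day, 1, Integer::sum);
        byHotelAndDay.merge(hotelId + "#" + day, 1, Integer::sum);
    }

    int bookingsForCity(String city, LocalDate day) {
        return byCityAndDay.getOrDefault(city + "#" + day, 0);
    }

    int bookingsForHotel(String hotelId, LocalDate day) {
        return byHotelAndDay.getOrDefault(hotelId + "#" + day, 0);
    }

    public static void main(String[] args) {
        BookingCounters counters = new BookingCounters();
        LocalDate day = LocalDate.of(2012, 12, 14);
        counters.recordBooking("BOM", "hotel-1", day);
        counters.recordBooking("BOM", "hotel-2", day);
        System.out.println(counters.bookingsForCity("BOM", day));
    }
}
```

The trade-off is extra write work and duplicated data in exchange for reads that need no join or scan.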
  15. 15. Some Fundamentals: CAP Theorem. At most two of the following three properties can be satisfied in a shared/distributed system:
      • Consistency
      • Availability
      • Tolerance to network partitions
  16. 16. CAP: Pictorially
  17. 17. Explanation. Use case: Scaling Web Apps
      Critical facts:
      • Network outages are common.
      • Customer shopping carts, email search, and social network queries can tolerate stale data.
      How: Compromise on consistency in order to remain available, rather than disrupt user service during outages.
  18. 18. Explanation Rather than requiring consistency after every transaction, it is enough for the database to eventually be in a consistent state. Brewer’s CAP theorem says you have no choice if you want to scale up.
  19. 19. Explanation contd.. Sharp contrast: a high-speed financial application is highly transactional, consistent, and automated; it can't live with eventual consistency.
  20. 20. ACID vs BASE. ACID:
      • Atomic: Everything in a transaction succeeds or the entire transaction is rolled back.
      • Consistent: A transaction cannot leave the database in an inconsistent state.
      • Isolated: Transactions cannot interfere with each other.
      • Durable: Completed transactions persist, even when servers restart, etc.
  21. 21. Some Fundamentals contd.. BASE:
      • Basic Availability
      • Soft state
      • Eventual consistency
  22. 22. Consistent Hashing. A common way to load balance: the machine chosen to cache object o will be hash(o) mod n, where n is the total number of machines.
  23. 23. Consistent Hashing contd.. Adding a machine to the cache means hash(o) mod (n + 1). Removing a machine from the cache means hash(o) mod (n - 1). Result in either case: disaster. Machines are swamped with redistribution.
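To make the redistribution problem concrete, the sketch below counts how many keys land on a different machine when the machine count changes under the plain hash(o) mod n scheme. It assumes integer keys and uses the key itself as its hash, purely to keep the arithmetic easy to check; the class and method names are hypothetical.

```java
import java.util.stream.IntStream;

// Hypothetical sketch of the naive "hash(o) mod n" placement described above.
class ModuloRemapDemo {
    // Machine index for a key; the key is used as its own hash (an assumption).
    static int machineFor(int key, int machines) {
        return Math.floorMod(key, machines);
    }

    // Number of keys in [0, keyCount) whose machine changes when the
    // cluster size goes from oldMachines to newMachines.
    static long movedKeys(int keyCount, int oldMachines, int newMachines) {
        return IntStream.range(0, keyCount)
                .filter(k -> machineFor(k, oldMachines) != machineFor(k, newMachines))
                .count();
    }

    public static void main(String[] args) {
        System.out.println("4 -> 5 machines: " + movedKeys(10_000, 4, 5) + " of 10000 keys move");
        System.out.println("5 -> 4 machines: " + movedKeys(10_000, 5, 4) + " of 10000 keys move");
    }
}
```

With these assumptions, going from 4 to 5 machines moves 8,000 of 10,000 keys, because a key stays put only when its remainders modulo 4 and modulo 5 coincide.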
  24. 24. Consistent Hashing contd.. Commonly, a hash function (e.g. MD5) maps a value into a 128-bit key in the range 0 to 2^128 - 1 (or even a 32-bit key, as shown next).
  25. 25. Consistent Hashing contd..
  26. 26. Consistent Hashing contd.. Both the key and the machine are hashed with the same function.
  27. 27. Consistent Hashing contd.. Adding a Node
  28. 28. Consistent Hashing contd.. Removing a Node
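The ring idea from these slides can be sketched as follows: nodes and keys are hashed with the same function, and a key belongs to the first node at or after its position, wrapping around at the end. This is a minimal sketch under assumptions, not any particular product's implementation: the class name `ConsistentHashRing` is hypothetical, the hash is the first 8 bytes of MD5, and virtual nodes (commonly used to even out load) are omitted.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical minimal hash ring without virtual nodes.
class ConsistentHashRing {
    // Ring position -> node name, kept sorted by position.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Same hash for keys and nodes: first 8 bytes of the MD5 digest as a long.
    static long hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    void addNode(String node) {
        ring.put(hash(node), node);
    }

    void removeNode(String node) {
        ring.remove(hash(node));
    }

    // Owner of a key: first node at or after the key's position, else wrap to the first node.
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("node-A");
        ring.addNode("node-B");
        ring.addNode("node-C");
        System.out.println("hotel:42 -> " + ring.nodeFor("hotel:42"));
    }
}
```

When a node is added, the only keys that change owner are those that now fall to the new node; when a node is removed, only the keys it owned move. That is the contrast with the modulo scheme, where most keys are reshuffled.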
  29. 29. Use Case and NoSQL Solution. Problem: Need to store bookings per day for all hotels. Queries are centered around city and region. Hotel count: 1 million. Date range: now to the next 365 * 2 days.
  30. 30. NoSQL: Path Ahead. ACID equivalence (Neo4j, CouchDB, etc.): transaction support; atomicity; MVCC.
  31. 31. NoSQL: Path Ahead contd.. Possible solution: Work with a SQL DB for creation, updates, etc. Archive the data in NoSQL for query, analysis, etc.
  32. 32. NoSQL: Path Ahead contd.. Enterprise Adoption and Challenges: NoSQL looks good largely for unstructured data. SQL is the best choice for a broad range of traditional workloads.
  33. 33. NoSQL: Path Ahead contd..
  34. 34. NoSQL: Path Ahead contd.. Shout out loud Hybrid ACID + BASE They are not alternatives but supplements
  35. 35. NoSQL: Path Ahead contd.. Maturity; Support; Skillset and Administration/Operation; Analytics and BI support
  36. 36. NoSQL: Path Ahead contd..
  37. 37. Q&A
  38. 38. References
      • Nancy Lynch and Seth Gilbert, "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services", ACM SIGACT News, Volume 33 Issue 2 (2002), pp. 51-59.
      • "Brewer's CAP Theorem", retrieved 02-Mar-2010.
      • "Brewer's CAP theorem on distributed systems".
      • "CAP Twelve Years Later: How the 'Rules' Have Changed", on-line resource.
      • E. Brewer, "Towards Robust Distributed Systems," Proc. 19th Ann. ACM Symp. Principles of Distributed Computing (PODC 00), ACM, 2000, pp. 7-10; on-line resource.
      • D. Abadi, "Problems with CAP, and Yahoo's Little Known NoSQL System," DBMS Musings, blog, 23 Apr. 2010; on-line resource.
      • C. Hale, "You Can't Sacrifice Partition Tolerance," 7 Oct. 2010; on-line resource.
      • Facebook: Scaling Out, on-line resource.
      • Gemstone: The Hardest Problems In Data Management, on-line resource.
      • The Log-Structured Merge-Tree (Research Paper).
      • CodeProject: Consistent Hashing, on-line resource.
      • HighlyScalable: NoSQL Data Modeling Techniques, on-line resource.
      • eBay Tech Blog: Cassandra Data Modeling Best Practices, on-line resource.
      • John D Cook: ACID vs BASE, on-line resource.
      • Merkle Trees.
      • Phi Accrual Failure Detection (Research Paper).
  39. 39. Backup Slides Better than the Original 1 
  40. 40. Document Based DataStore
      {
        _id : ObjectId("4e77bb3b8a3e000000004f7a"),
        when : Date("2011-09-19T02:10:11.3Z"),
        author : "alex",
        title : "No Free Lunch",
        text : "This is the text of the post. It could be very long.",
        tags : [ "business", "ramblings" ],
        votes : 5,
        voters : [ "jane", "joe", "spencer", "phyllis", "li" ],
        comments : [
          { who : "jane", when : Date("2011-09-19T04:00:10.112Z"), comment : "I agree." },
          { who : "meghan", when : Date("2011-09-20T14:36:06.958Z"), comment : "You must be joking. etc etc ..." }
        ]
      }
  41. 41. User and Items
  42. 42. User and Items : Option 1
  43. 43. User and Items : Option 2
  44. 44. User and Items : Option 3
  45. 45. User and Items : Option 4
  46. 46. Cassandra CF
  47. 47. Cassandra SuperCF
  48. 48. Use Case 1: Ecommerce Site
      Problem: Record user preferences, e.g. location, IP, currency selected, source of traffic, and multiple other dynamic values.
      Solution: In a CF-based structure, keep it simple:
      UserId_Key: Pref1_Name:Value1, Pref2_Name:Value2, …, PrefN_Name:ValueN
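The row layout above (one row per user, one column per preference) can be pictured as a map of sorted maps. The sketch below is an in-memory stand-in written for illustration only; it does not talk to Cassandra, and the class name `UserPreferenceStore` and its methods are assumptions.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of a column family: row key -> (column name -> value).
class UserPreferenceStore {
    // Columns inside a row are kept sorted by name, as in a column family.
    private final Map<String, TreeMap<String, String>> rows = new HashMap<>();

    // Insert or overwrite one preference column in the user's row.
    void save(String userId, String preferenceName, String value) {
        rows.computeIfAbsent(userId, k -> new TreeMap<>()).put(preferenceName, value);
    }

    // Read only the requested columns; columns that do not exist are skipped.
    Map<String, String> get(String userId, String... preferenceNames) {
        Map<String, String> row = rows.getOrDefault(userId, new TreeMap<>());
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : preferenceNames) {
            String value = row.get(name);
            if (value != null) {
                result.put(name, value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        UserPreferenceStore store = new UserPreferenceStore();
        store.save("user-1", "currency", "INR");
        store.save("user-1", "location", "Pune");
        System.out.println(store.get("user-1", "currency", "location"));
    }
}
```

Because each row carries its own set of columns, a new kind of preference needs no schema change: it is just one more column name in that user's row.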
  49. 49. Use Case 1
      RowKey: 1350136093705_6501082438199894
      => (column=1350136093764, value=-3242432#911167901131523, timestamp=1350136093766000)
      => (column=1350283322499, value=GOI#200701231712126570, timestamp=1350283322502001)
      => (column=1350283566051, value=GOI#200703221605283033, timestamp=1350283566054001)
      => (column=1350749595676, value=GOI#200805261514037199, timestamp=1350749595677001)
      => (column=1350785230322, value=BOM#200701251747233158, timestamp=1350785230324001)
      RowKey: 1354499614310_10861558002828044
      => (column=1354499614368, value=TRV#201104071059204768, timestamp=1354499614370000, ttl=1728000)
      -------------------
      RowKey: 1349760150553_6114662943774777
      => (column=1349760152066, value=BLR#200802111324575807, timestamp=1349760152068001)
      -------------------
      RowKey: 1349805109805_6167423558533191
      => (column=1349805111833, value=TRV#312254274337517, timestamp=1349805111835001)
      -------------------
      RowKey: 1354435656227_7908056941568359
      => (column=1354435656367, value=IDR#200701211254519381, timestamp=1354435656369000, ttl=1728000)
      -------------------
      RowKey: 1347648097261_15570089270962881
      => (column=1347648097304, value=DEL#201101192008115545, timestamp=1347648097307000)
  50. 50. Use Case 1: Get
      // Reads the named preference columns of one user row (Hector client API).
      private Map<String, String> getPreferences(Keyspace keySpace, String userId, String... preferenceNames)
              throws IOException, CharacterCodingException {
          SliceQuery<String, String, String> rsq = HFactory.createSliceQuery(keySpace,
                  StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
          rsq.setColumnFamily(USER_PREFERENCE);
          rsq.setKey(userId);
          rsq.setColumnNames(preferenceNames);
          QueryResult<ColumnSlice<String, String>> orows = rsq.execute();
          // Copy the returned columns into an insertion-ordered map of name -> value.
          Map<String, String> preferenceMap = new LinkedHashMap<String, String>();
          for (HColumn<String, String> column : orows.get().getColumns()) {
              preferenceMap.put(column.getName(), column.getValue());
          }
          return preferenceMap;
      }
  51. 51. Use Case 1: Save
      // Writes one preference column, with a TTL, into the user's row.
      Mutator<String> m = HFactory.createMutator(keySpace, StringSerializer.get());
      HColumn<String, String> userPreference = HFactory.createColumn(colkey, colvalue,
              StringSerializer.get(), StringSerializer.get());
      userPreference.setTtl(ttlUserPrefrences);
      m.addInsertion(rowkey, USER_PREFERENCE, userPreference);
      m.execute();
  52. 52. Use Case 2: Online Travel Site. Problem: Need to know different metrics for a city's hotels, e.g.:
      • Hotels booked in the last X time
      • Hotels last viewed in Y time
      • Hotels left with Z inventory
  53. 53. Use Case 2
      RowKey: 2d323436353731
      => (super_column=911167901297486,
           (column=6c6173747669657765646d657373616765, value=VIEWED#Last viewed 23 hour(s) ago., timestamp=1354962852610000)
           column=6c6173747669657765646d657373616762, value=Inventory#20 , timestamp=1354962852610000,
           column=6c6173747669657765646d657373616769, value=Bookings#8 , timestamp=135496282610000)
      -------------------
      RowKey: 58524f
      => (super_column=200903041759196196,
           (column=6c617374626f6f6b65646d657373616765, value=Booked#Last booked 1 day(s) ago., timestamp=1347781187842000)
           (column=6c6173747669657765646d657373616765, value=VIEWED#Last viewed 2 hours ago., timestamp=1347707080147000))
      => (super_column=200903041848352230,
           (column=6c6173747669657765646d657373616765, value=VIEWED#Last viewed 1 day(s) ago., timestamp=1347266107708000))
  54. 54. Use Case 2
      // Reads all super columns (one per hotel) stored under a city row.
      SuperSliceQuery<String, String, String, String> superQuery = HFactory.createSuperSliceQuery(getKeySpace(),
              StringSerializer.get(), StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
      superQuery.setColumnFamily(SUPER_SOCIAL_MESSAGE).setKey(cityCode);
      QueryResult<SuperSlice<String, String, String>> result = superQuery.execute();
      List<HSuperColumn<String, String, String>> superColumns = result.get().getSuperColumns();
      if (superColumns != null) {
          for (HSuperColumn<String, String, String> superColumn : superColumns) {
              // Collect this hotel's message columns as name -> value.
              Map<String, String> messages = new HashMap<String, String>();
              List<HColumn<String, String>> columns = superColumn.getColumns();
              if (columns != null) {
                  for (HColumn<String, String> column : columns) {
                      messages.put(column.getName(), column.getValue());
                  }
              }
              /* The equivalent doc */
              document.addField(superColumn.getName(), messages);
              documents.add(document);
          }
      }
  55. 55. Pig Script: MR
      <document>
      <pigscript start="-16" end="-43200" start1="-1441" end1="-10080" start2="0" end2="-15" start3="0" end3="-1440">
      <comment>Delete All Messages</comment>
      <query><![CDATA[rows0 = LOAD cassandra://LH/HotelMessage USING as (key:chararray, cols:bag{T:tuple(name:chararray, value:chararray) } );]]></query>
      <query><![CDATA[cols0 = FOREACH rows0 GENERATE key as key,flatten($1) as (name:chararray, value:chararray);]]></query>
      <query><![CDATA[cols0 = FOREACH rows0 GENERATE key as key,flatten($1) as (name:chararray, value:chararray);]]></query>
      <query><![CDATA[userhotel0 = FOREACH cols0 GENERATE key as key,$1) as name,$2) as value;]]></query>
      <query><![CDATA[uriCounts0 = FOREACH userhotel0 GENERATE key as citycode,,null));]]></query>
      <comment>Last Viewed start 15 minutes to 30 days ago</comment>
      <query><![CDATA[rows = LOAD cassandra://LH/LastViewedHotels?slice_start=#start&slice_end=#end&limit=1024&reversed=true USING as (key:chararray, cols:bag{T:tuple(name:long, value:chararray) } );]]></query>
      <query><![CDATA[cols = FOREACH rows GENERATE key as key,flatten($1) as (name:long, value:chararray);]]></query>
      <query><![CDATA[userhotel = FOREACH cols GENERATE key as key,$1) as name,$2) as value;]]></query>
      <query><![CDATA[userhotelByCity = FOREACH userhotel GENERATE key as key,flatten($1) as name,flatten(org.apache.pig.piggybank.evaluation.string.Split(value,#,2)) as (citycode:chararray,hotelid:chararray);]]></query>
      <query><![CDATA[groupByhotels = GROUP userhotelByCity BY hotelid;]]></query>
      <query><![CDATA[uriCounts = FOREACH groupByhotels { D = LIMIT userhotelByCity 1; GENERATE flatten(D.citycode) as citycode, TOTUPLE(group, viewed ,, ago.))); };]]></query>
      <comment>Last Booked 1 to 8 days ago</comment>
      <query><![CDATA[rows1 = LOAD cassandra://LH/BookedHotels?slice_start=#startA&slice_end=#endA&limit=1024&reversed=true USING as (key:chararray, cols:bag{T:tuple(name:long, value:chararray) } );]]></query>
      <query><![CDATA[cols1 = FOREACH rows1 GENERATE key as key,flatten($1) as (name:long, value:chararray);]]></query>
      <query><![CDATA[userhotel1 = FOREACH cols1 GENERATE key as key,$1) as name,$2) as value;]]></query>
      <query><![CDATA[userhotelByCity1 = FOREACH userhotel1 GENERATE key as key,flatten($1) as name,flatten(org.apache.pig.piggybank.evaluation.string.Split(value,#,2)) as (citycode:chararray,hotelid:chararray);]]></query>
      <query><![CDATA[groupByhotels1 = GROUP userhotelByCity1 BY hotelid;]]></query>
      <query><![CDATA[uriCounts1 = FOREACH groupByhotels1 { D = LIMIT userhotelByCity1 1;GENERATE flatten(D.citycode) as citycode,, booked ,, ago.)));};]]></query>
  56. 56. Criteria to Evaluate NoSQL Solutions:
      • Internal partitioning
      • Automated flexible data distribution
      • Hot-swappable nodes
      • Replication style
      • Automated failover strategy