Indic threads pune12-nosql now and path ahead
The 7th Annual IndicThreads Pune Conference was held on 14-15 December 2012.


Published in: Technology
  • 1. NoSQL: Now and Path Ahead. Shubham Kumar Srivastava, MakeMyTrip
  • 2. Who am I
  • 3. Abstract: What and Why: NoSql; Fundamentals; Use Case; Challenges; Path Ahead
  • 4. What is NoSql: A database which does not adhere to the traditional relational database management system (RDBMS) structure.
  • 5. Why NoSql: Scalability and Performance; Cost; Data Modeling
  • 6. Why NoSql : Motives and Drivers. Scalability and Performance: Horizontal scalability is better than vertical. Hardware is getting cheaper and processing power is increasing. Less operational complexity compared with RDBMS solutions; most of the solutions give you automatic sharding etc. by default.
  • 7. Why NoSql : Motives and Drivers contd..
  • 8. Why NoSql : Motives and Drivers contd..
  • 9. Why NoSql : Motives and Drivers contd.. Cost: Scale (as with NoSql) comes at a hefty cost: commodity hardware, software versions, upgrades, maintenance. This made organizations look for alternatives and created the need for a cost-effective scale-out option.
  • 10. Why NoSql : Motives and Drivers contd.. Data Modeling: SQL has been for concurrency, consistency, integrity; for summations, aggregations, groupings. Schema says: what all can I answer?
  • 11. Why NoSql : Motives and Drivers contd.. Data Modeling: A plain key-value store is very powerful and fits most use cases for a NoSQL solution. Hierarchical or graph-like data modelling and processing. Values like maps of maps of maps. Document databases which store even arbitrarily complex objects. Document-based indexing data stores are a huge success.
  • 12. Why NoSql : Motives and Drivers contd.. At times software apps are not limited to these constraints. This led to data models like: Key/Value Store: Redis, MemcacheDb/Voldemort etc. Wide Column Store / Column Families: Cassandra/Hadoop (Hbase)/Hypertable/Cloudera etc. Document Based Stores: Solr/Lucene/MongoDb/CouchDb/TerraStore etc. Graph Data Store: Neo4J/GraphBase/FlockDb etc.
  • 13. Why NoSql : Motives and Drivers contd..
  • 14. Why NoSql : Motives and Drivers contd.. Schema says: what are the questions? Data modeling is based on the set of queries. Exploit de-normalization and duplication. Use aggregates. Manage joins with app + aggregation + de-normalization etc.
  • 15. Some Funda-mentals. CAP Theorem: At most two of the following three properties can be satisfied in a shared/distributed system: Consistency, Availability, Tolerance to Network Partitions.
  • 16. CAP : Pictorially
  • 17. Explanation. Use case: Scaling Web Apps. Critical facts: • Network outages are common • Customer shopping carts, email search, social network queries can tolerate stale data. How: Compromise on consistency in order to remain available, rather than disrupt user service during outages.
  • 18. Explanation: Rather than requiring consistency after every transaction, it is enough for the database to eventually be in a consistent state. Brewer’s CAP theorem says you have no choice if you want to scale up.
  • 19. Explanation contd.. Sharp Contrast: High Speed Financial Application. Highly transactional, consistent, automated. Can’t live with eventual consistency.
  • 20. ACID vs BASE. ACID: Atomic: Everything in a transaction succeeds or the entire transaction is rolled back. Consistent: A transaction cannot leave the database in an inconsistent state. Isolated: Transactions cannot interfere with each other. Durable: Completed transactions persist, even when servers restart etc.
  • 21. Some Funda-mentals contd.. BASE: Basic Availability, Soft-state, Eventual consistency
  • 22. Consistent Hashing. Common way to load balance: the machine chosen to cache object o will be hash(o) mod n, where n is the total number of machines.
  • 23. Consistent Hashing contd.. Adding a machine to the cache means hash(o) mod (n + 1). Removing a machine from the cache means hash(o) mod (n - 1). Result of either: disaster. Machines get swamped with redistribution.
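The redistribution problem described on slides 22 and 23 can be made concrete with a small experiment. The following is a minimal sketch, not code from the talk: the class name `ModuloPlacement`, the helper names and the `object-<i>` sample keys are all hypothetical, and the hash simply folds the first 8 bytes of an MD5 digest into a non-negative long.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical demo of modulo placement: machine = hash(o) mod n. */
final class ModuloPlacement {

    /** Folds the first 8 bytes of an MD5 digest into a non-negative long. */
    static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xffL);
            }
            return h & Long.MAX_VALUE; // clear the sign bit
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 unavailable", e);
        }
    }

    /** The placement rule from slide 22: hash(o) mod n. */
    static int machineFor(String key, int machines) {
        return (int) (hash(key) % machines);
    }

    /** Fraction of sample keys whose machine changes when the pool is resized. */
    static double movedFraction(int sampleKeys, int machinesBefore, int machinesAfter) {
        int moved = 0;
        for (int i = 0; i < sampleKeys; i++) {
            String key = "object-" + i; // assumed sample keys
            if (machineFor(key, machinesBefore) != machineFor(key, machinesAfter)) {
                moved++;
            }
        }
        return (double) moved / sampleKeys;
    }

    public static void main(String[] args) {
        System.out.printf("10 -> 11 machines: %.1f%% of keys move%n",
                100 * movedFraction(10_000, 10, 11));
    }
}
```

For a uniformly distributed hash, growing from n to n + 1 machines leaves only about 1 in (n + 1) keys on the machine it had before, so nearly the whole cache is reshuffled. That is the "disaster" the slide refers to.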
  • 24. Consistent Hashing contd.. Commonly, a hash function (e.g. MD5) will map a value into a 128-bit key in the range 0 to 2^128-1 (or even a 32-bit key, as shown next).
  • 25. Consistent Hashing contd..
  • 26. Consistent Hashing contd.. Both Key and Machine hashed with the same function
  • 27. Consistent Hashing contd.. Adding a Node
  • 28. Consistent Hashing contd.. Removing a Node
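The ring described on slides 24 to 28 (keys and machines hashed with the same function, a key owned by the next machine clockwise) can be sketched with a sorted map. This is an illustrative sketch under assumptions, not the speaker's implementation: `HashRing` and its method names are hypothetical, each machine gets a single point on the ring (production systems usually add several virtual points per machine), and the hash is the first 8 bytes of MD5.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical consistent-hash ring: one ring position per machine. */
final class HashRing {

    // Ring position -> machine name, kept sorted so we can walk "clockwise".
    private final TreeMap<Long, String> ring = new TreeMap<>();

    /** Same hash for keys and machines, as on slide 26. */
    static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xffL);
            }
            return h & Long.MAX_VALUE;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 unavailable", e);
        }
    }

    void addMachine(String name) {
        ring.put(hash(name), name);
    }

    void removeMachine(String name) {
        ring.remove(hash(name));
    }

    /** First machine at or after the key's position, wrapping to the start. */
    String machineFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no machines");
        }
        Map.Entry<Long, String> owner = ring.ceilingEntry(hash(key));
        return (owner != null ? owner : ring.firstEntry()).getValue();
    }
}
```

With this layout, adding a machine only takes over the keys that fall between it and its predecessor on the ring, and removing it hands exactly those keys back to its successor. All other keys stay where they were, in contrast to the modulo scheme.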
  • 29. Use Case and NoSQL Solution. Problem: Need to store bookings per day for all hotels. Queries are centered around city and region. Hotel count: 1 million. Date range: now to the next 365 * 2 days (730 days, so up to about 730 million hotel-day entries).
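One way to read the problem on slide 29 through the query-first modeling advice of slide 14 is a wide row per city whose sorted column names start with the date. The sketch below only illustrates that idea with in-memory maps; it is not the solution presented in the talk. The class `BookingsByCity`, the `<ISO date>#<hotelId>` column naming and the sample identifiers are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical query-driven layout: row key = city, sorted columns = "<ISO date>#<hotelId>". */
final class BookingsByCity {

    // Stand-in for a column family: row key -> sorted (column name -> value).
    private final Map<String, TreeMap<String, Integer>> rows = new HashMap<>();

    /** Adds bookings for one hotel on one day to the city's row. */
    void record(String city, String isoDate, String hotelId, int bookings) {
        rows.computeIfAbsent(city, c -> new TreeMap<>())
            .merge(isoDate + "#" + hotelId, bookings, Integer::sum);
    }

    /** Column slice for one city and one day, without scanning other cities or days. */
    SortedMap<String, Integer> bookingsOn(String city, String isoDate) {
        TreeMap<String, Integer> row = rows.get(city);
        if (row == null) {
            return new TreeMap<>();
        }
        // '$' is the character right after '#', so this range covers every "<date>#..." column.
        return row.subMap(isoDate + "#", isoDate + "$");
    }
}
```

The design choice is that the question "bookings in city X on day D" becomes a single-row range read, at the cost of duplicating data if the same bookings must also be queried by hotel or region.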
  • 30. NoSQL: Path Ahead. ACID equivalence (Neo4J, CouchDb etc.): transaction support, atomicity, MVCC
  • 31. NoSQL: Path Ahead contd.. Possible Solution: Work with a SQL DB for creation, updates etc. Archive the data in NoSQL for query, analysis etc.
  • 32. NoSQL: Path Ahead contd.. Enterprise Adoption and Challenges: NoSQL looks good largely for unstructured data. SQL is the best choice for a broad range of traditional workloads.
  • 33. NoSQL: Path Ahead contd..
  • 34. NoSQL: Path Ahead contd.. Shout out loud: Hybrid. ACID + BASE. They are not alternatives but supplements.
  • 35. NoSQL: Path Ahead contd.. Maturity; Support; Skillset and Administration/Operation; Analytics and BI support
  • 36. NoSQL: Path Ahead contd..
  • 37. Q&A
  • 38. References: Nancy Lynch and Seth Gilbert, "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services", ACM SIGACT News, Volume 33 Issue 2 (2002), pp. 51-59; "Brewer's CAP Theorem", retrieved 02-Mar-2010; "Brewer's CAP theorem on distributed systems"; "CAP Twelve Years Later: How the "Rules" Have Changed", on-line resource; E. Brewer, "Towards Robust Distributed Systems," Proc. 19th Ann. ACM Symp. Principles of Distributed Computing (PODC 00), ACM, 2000, pp. 7-10, on-line resource; D. Abadi, "Problems with CAP, and Yahoo’s Little Known NoSQL System," DBMS Musings, blog, 23 Apr. 2010, on-line resource; C. Hale, "You Can’t Sacrifice Partition Tolerance," 7 Oct. 2010, on-line resource; Facebook: Scaling Out, on-line resource; Gemstone: The Hardest Problems In Data Management, on-line resource; The Log-Structured Merge-Tree (Research Paper); CodeProject: Consistent Hashing, on-line resource; HighlyScalable: NoSQL Data Modeling Techniques, on-line resource; eBay Tech Blog: Cassandra Data Modeling Best Practices, on-line resource; John D Cook: ACID vs BASE, on-line resource; Merkle Trees; Phi Accrual Failure Detection (Research Paper)
  • 39. Backup Slides: Better than the Original
  • 40. Document Based DataStore
      {
        _id : ObjectId("4e77bb3b8a3e000000004f7a"),
        when : Date("2011-09-19T02:10:11.3Z"),
        author : "alex",
        title : "No Free Lunch",
        text : "This is the text of the post. It could be very long.",
        tags : [ "business", "ramblings" ],
        votes : 5,
        voters : [ "jane", "joe", "spencer", "phyllis", "li" ],
        comments : [
          { who : "jane", when : Date("2011-09-19T04:00:10.112Z"), comment : "I agree." },
          { who : "meghan", when : Date("2011-09-20T14:36:06.958Z"), comment : "You must be joking. etc etc ..." }
        ]
      }
  • 41. User and Items
  • 42. User and Items : Option 1
  • 43. User and Items : Option 2
  • 44. User and Items : Option 3
  • 45. User and Items : Option 4
  • 46. Cassandra CF
  • 47. Cassandra SuperCF
  • 48. Use Case 1: Ecommerce Site. Problem: Record user preferences, e.g. location, IP, currency selected, source of traffic, multiple other dynamic values. Solution: In a CF based structure keep it simple: UserId_Key: Pref1_Name:Value1, Pref2_Name:Value2, …, PrefN_Name:ValueN
  • 49. Use Case 1
      RowKey: 1350136093705_6501082438199894
      => (column=1350136093764, value=-3242432#911167901131523, timestamp=1350136093766000)
      => (column=1350283322499, value=GOI#200701231712126570, timestamp=1350283322502001)
      => (column=1350283566051, value=GOI#200703221605283033, timestamp=1350283566054001)
      => (column=1350749595676, value=GOI#200805261514037199, timestamp=1350749595677001)
      => (column=1350785230322, value=BOM#200701251747233158, timestamp=1350785230324001)
      -------------------
      RowKey: 1354499614310_10861558002828044
      => (column=1354499614368, value=TRV#201104071059204768, timestamp=1354499614370000, ttl=1728000)
      -------------------
      RowKey: 1349760150553_6114662943774777
      => (column=1349760152066, value=BLR#200802111324575807, timestamp=1349760152068001)
      -------------------
      RowKey: 1349805109805_6167423558533191
      => (column=1349805111833, value=TRV#312254274337517, timestamp=1349805111835001)
      -------------------
      RowKey: 1354435656227_7908056941568359
      => (column=1354435656367, value=IDR#200701211254519381, timestamp=1354435656369000, ttl=1728000)
      -------------------
      RowKey: 1347648097261_15570089270962881
      => (column=1347648097304, value=DEL#201101192008115545, timestamp=1347648097307000)
  • 50. Use Case 1: Get
      // Reads the named preference columns of one user row (Hector client API).
      private Map<String, String> getPreferences(Keyspace keySpace, String userId, String... preferenceNames) throws IOException, CharacterCodingException {
          // Key, column name and column value are all serialized as strings.
          SliceQuery<String, String, String> rsq = HFactory.createSliceQuery(keySpace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
          rsq.setColumnFamily(USER_PREFERENCE);
          rsq.setKey(userId);
          rsq.setColumnNames(preferenceNames);
          QueryResult<ColumnSlice<String, String>> orows = rsq.execute();
          // Copy the returned columns into a map, preserving their order.
          Map<String, String> preferenceMap = new LinkedHashMap<String, String>();
          for (HColumn<String, String> column : orows.get().getColumns()) {
              preferenceMap.put(column.getName(), column.getValue());
          }
          return preferenceMap;
      }
  • 51. Use Case 1: Save
      // Writes one preference column with a TTL into the user's row.
      Mutator<String> m = HFactory.createMutator(keySpace, StringSerializer.get());
      HColumn<String, String> userPreferences = HFactory.createColumn(colkey, colvalue, StringSerializer.get(), StringSerializer.get());
      userPreferences.setTtl(ttlUserPreferences);
      m.addInsertion(rowkey, USER_PREFERENCE, userPreferences);
      m.execute();
  • 52. Use Case 2: Online Travel Site. Problem: Need to know different metrics for a city's hotels, e.g.: hotels booked in last X time; hotels last viewed in Y time; hotels left with Z inventory.
  • 53. Use Case 2
      RowKey: 2d323436353731
      => (super_column=911167901297486,
           (column=6c6173747669657765646d657373616765, value=VIEWED#Last viewed 23 hour(s) ago., timestamp=1354962852610000)
           (column=6c6173747669657765646d657373616762, value=Inventory#20 , timestamp=1354962852610000)
           (column=6c6173747669657765646d657373616769, value=Bookings#8 , timestamp=135496282610000))
      -------------------
      RowKey: 58524f
      => (super_column=200903041759196196,
           (column=6c617374626f6f6b65646d657373616765, value=Booked#Last booked 1 day(s) ago., timestamp=1347781187842000)
           (column=6c6173747669657765646d657373616765, value=VIEWED#Last viewed 2 hours ago., timestamp=1347707080147000))
      => (super_column=200903041848352230,
           (column=6c6173747669657765646d657373616765, value=VIEWED#Last viewed 1 day(s) ago., timestamp=1347266107708000))
  • 54. Use Case 2
      // Reads every super column of one city row and flattens each into a name -> value map.
      SuperSliceQuery<String, String, String, String> superQuery = HFactory.createSuperSliceQuery(getKeySpace(), StringSerializer.get(), StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
      superQuery.setColumnFamily(SUPER_SOCIAL_MESSAGE).setKey(cityCode);
      QueryResult<SuperSlice<String, String, String>> result = superQuery.execute();
      List<HSuperColumn<String, String, String>> superColumns = result.get().getSuperColumns();
      if (superColumns != null) {
          for (HSuperColumn<String, String, String> superColumn : superColumns) {
              Map<String, String> messages = new HashMap<String, String>();
              List<HColumn<String, String>> columns = superColumn.getColumns();
              if (columns != null) {
                  for (HColumn<String, String> column : columns) {
                      messages.put(column.getName(), column.getValue());
                  }
              }
              /* The equivalent doc */
              document.addField(superColumn.getName(), messages);
              documents.add(document);
          }
      }
  • 55. Pig Script : MR
      <document>
      <pigscript start="-16" end="-43200" start1="-1441" end1="-10080" start2="0" end2="-15" start3="0" end3="-1440">
      <comment>Delete All Messages</comment>
      <query><![CDATA[rows0 = LOAD cassandra://LH/HotelMessage USING as (key:chararray, cols:bag{T:tuple(name:chararray, value:chararray) } );]]></query>
      <query><![CDATA[cols0 = FOREACH rows0 GENERATE key as key,flatten($1) as (name:chararray, value:chararray);]]></query>
      <query><![CDATA[cols0 = FOREACH rows0 GENERATE key as key,flatten($1) as (name:chararray, value:chararray);]]></query>
      <query><![CDATA[userhotel0 = FOREACH cols0 GENERATE key as key,$1) as name,$2) as value;]]></query>
      <query><![CDATA[uriCounts0 = FOREACH userhotel0 GENERATE key as citycode,,null));]]></query>
      <comment>Last Viewed start 15 minutes to 30 days ago</comment>
      <query><![CDATA[rows = LOAD cassandra://LH/LastViewedHotels?slice_start=#start&slice_end=#end&limit=1024&reversed=true USING as (key:chararray, cols:bag{T:tuple(name:long, value:chararray) } );]]></query>
      <query><![CDATA[cols = FOREACH rows GENERATE key as key,flatten($1) as (name:long, value:chararray);]]></query>
      <query><![CDATA[userhotel = FOREACH cols GENERATE key as key,$1) as name,$2) as value;]]></query>
      <query><![CDATA[userhotelByCity = FOREACH userhotel GENERATE key as key,flatten($1) as name,flatten(org.apache.pig.piggybank.evaluation.string.Split(value,#,2)) as (citycode:chararray,hotelid:chararray);]]></query>
      <query><![CDATA[groupByhotels = GROUP userhotelByCity BY hotelid;]]></query>
      <query><![CDATA[uriCounts = FOREACH groupByhotels { D = LIMIT userhotelByCity 1; GENERATE flatten(D.citycode) as citycode, TOTUPLE(group, viewed ,, ago.))); };]]></query>
      <comment>Last Booked 1 to 8 days ago</comment>
      <query><![CDATA[rows1 = LOAD cassandra://LH/BookedHotels?slice_start=#startA&slice_end=#endA&limit=1024&reversed=true USING as (key:chararray, cols:bag{T:tuple(name:long, value:chararray) } );]]></query>
      <query><![CDATA[cols1 = FOREACH rows1 GENERATE key as key,flatten($1) as (name:long, value:chararray);]]></query>
      <query><![CDATA[userhotel1 = FOREACH cols1 GENERATE key as key,$1) as name,$2) as value;]]></query>
      <query><![CDATA[userhotelByCity1 = FOREACH userhotel1 GENERATE key as key,flatten($1) as name,flatten(org.apache.pig.piggybank.evaluation.string.Split(value,#,2)) as (citycode:chararray,hotelid:chararray);]]></query>
      <query><![CDATA[groupByhotels1 = GROUP userhotelByCity1 BY hotelid;]]></query>
      <query><![CDATA[uriCounts1 = FOREACH groupByhotels1 { D = LIMIT userhotelByCity1 1;GENERATE flatten(D.citycode) as citycode,, booked ,, ago.)));};]]></query>
  • 56. Criteria to Evaluate NoSQL Solutions: Internal partitioning; Automated flexible data distribution; Hot swappable nodes; Replication style; Automated failover strategy