High order bits from cassandra & hadoop
Published in: Technology
Transcript of "High order bits from cassandra & hadoop"

  1. High-order bits from Cassandra & Hadoop
     Srisatish Ambati
     @srisatish
  2. Thank You!
     svccg in first page of search results for “cloud” on Google!
  3. NoSQL:
     Know your queries.
  4. Points
     Use cases
     Why Cassandra?
     Use case: Hadoop, Brisk
     FUD: Consistency
     Why Facebook is not using Cassandra?
     Anti-patterns
     Community, Code, Tools
     Q&A
  5. Users. Netflix.
     Key by Customer, read-heavy
     Key by Customer:Movie, write-heavy
  6. TimeSeries: (several customers)
     periodic readings: dev0, dev1… deviceID:metric:timestamp -> value
     Metrics are typically a much larger dataset than users.
  7. Why Cassandra?
  8. Operational simplicity
     peer-to-peer
  9. Operational simplicity
     peer-to-peer
  10. Replication:
      Multi-datacenter
      Multi-region EC2
      Multi-availability zones
  11. reads local
      dc1
      dc2
      Replication:
      Multi-datacenter
      Multi-region EC2, AWS
      Multi-availability zones
  12. 4.21.2011, Amazon Web Services outage:
      “Movie marathons on Netflix awaiting AWS to come back up.” #ec2disabled
  13. 4.21.2011, Amazon Web Services outage:
      Netflix was running on AWS.
  14. fast durable writes.
      fast reads.
  15. Writes
      Sequential, append-only.
      ~1-5 ms
  16. Writes
      Sequential, append-only.
      ~1-5 ms
      On cloud: ephemeral disks rock!
  17. Reads
      Local
      Key & row caches (also, JNA-based 0xffheap)
      indexes, materialized
  18. Reads
      Local
      Key & row caches (also, JNA-based 0xffheap)
      indexes, materialized
      SSDs: improved read performance!
  19. Distribution between nodes
      Gossip
      Anti-entropy
      Failure-detector
      L i g h t w e i g h t
  20. Clients: CQL, Thrift
      pycassa, phpcassa
      Hector, Pelops
      (Scala, Ruby, Clojure)
  21. Use case #3: Hadoop
      HDFS    Cassandra    Hive
      Logs    stats        analytics
  22. Brisk
      Truly peer-to-peer Hadoop.
  23. mv computation
      not data
  24. [no transcript text]
  25. Parallel Execution View
  26. [no transcript text]
  27. JobTracker, TaskTracker
      HDFS: NameNode, DataNode
  28. Cloudera
      Amazon: Elastic MapReduce
      Hortonworks
      MapR
      Brisk
  29. Tools & Analytics
      Hive, Pig, R
      Karmasphere
      Datameer
      … dozens of stealth startups!
  30. NameNode decomposition, explained.
  31. [no transcript text]
  32. [no transcript text]
  33. Use column families (tables)
      inode
      sblock
  34. near-real-time Hadoop
      Low latency: cassandra_dc nodes
      Batch Analytics: brisk_dc nodes
  35. FUD,
      acronym: fear, uncertainty, doubt.
  36. Consistency: R + W > N
      Oracle, 2-node: R=1, W=2, N=2 (T=2)
      DNS
      * N is the replication factor. Not to be confused with T = total # of nodes
  37. Tunable, flexible.
      For High Consistency:
      read: quorum, write: quorum
      For High Availability:
      high W, low R.
  38. [no transcript text]
  39. Inbox Search:
      600+ cores, 120+ TB (2008)
      Went from 100m to 500m users.
      Average NoSQL deployment size: ~6-12 nodes.
  40. Use case #5: search
      Apache Solr + Cassandra = Solandra
      Other inbox/file searches:
      xobni, c3
      github.com/tjake/solandra
  41. “Eventual consistency is harder to program.”
      mostly immutable data.
      complex systems at scale.
  42. Miscellaneous,
      Myth: data-loss, partial rows.
      writes are durable.
  43. Anti-Patterns
      Transactions
      Joins
      Read before write
  44. Anti-Patterns for cloud
      EBS
      JVM, virtualized
      single region
  45. Three good reasons for Cassandra...
  46. Tools
      AMIs, OpsCenter, DataStax
      AppDynamics
      Netflix just builds AMIs for deployment!
  47. B e a u t i f u l   C 0 d e
      = new code(); //less is more
      ~90k.java.concurrent.@annotate.
      bloomfilters, merkletrees.
      non-blocking, staged-event-driven.
      bigtable, dynamo.
  48. Current & Future Focus:
      Distributed Counters, CQL.
      Simple client.
      operational smoothing.
      compaction.
  49. Community
      Robust. Rapid. #
      Professional support from DataStax.
      Filesystem innovation from Acunu
      engineers: independent, startups, large companies, Rackspace, Twitter, Netflix...
      Come join the efforts!
  50. [no transcript text]
  51. Use case #4: first NoSQL, then scale!
      SimpleDB -> Cassandra
      MongoDB -> Cassandra
  52. [no transcript text]
  53. [no transcript text]
  54. [no transcript text]
  55. Copyright: xkcd
  56. Copyright: plantoys
      … more than one way to do it!
  57. Summary:
      high-scale peer-to-peer datastore
      best friend for multi-region, multi-zone availability.
      Hadoop – HDFS engulfing the DataWorld
  58. Q&A
      @srisatish
  59. NoSQL:
      Know your queries.
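
Slides 36 and 37 state the overlap rule R + W > N and suggest quorum reads and writes for high consistency. The arithmetic can be sketched in a few lines of Python. The function names `quorum` and `is_strongly_consistent` are illustrative assumptions, not part of any Cassandra client API, and the quorum size of floor(N/2) + 1 is the usual majority definition rather than something the slides spell out.

```python
def quorum(n: int) -> int:
    """Majority of n replicas (assumed definition: floor(n / 2) + 1)."""
    if n < 1:
        raise ValueError("replication factor must be at least 1")
    return n // 2 + 1


def is_strongly_consistent(r: int, w: int, n: int) -> bool:
    """True when every read set must overlap every write set: R + W > N.

    r: replicas contacted per read
    w: replicas that must acknowledge a write
    n: replication factor (not the total number of nodes)
    """
    return r + w > n


if __name__ == "__main__":
    # The 2-node example from slide 36: R=1, W=2, N=2.
    print("R=1, W=2, N=2:", is_strongly_consistent(1, 2, 2))

    # Quorum reads and quorum writes for a few replication factors.
    for n in (1, 2, 3, 5):
        q = quorum(n)
        print(f"N={n}: quorum={q}, overlap={is_strongly_consistent(q, q, n)}")
```

With quorum on both sides, R + W is at least N + 1, so at least one replica in every read has seen the latest acknowledged write.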
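
Slide 6 describes time-series data keyed as deviceID:metric:timestamp -> value. The sketch below shows why that key layout suits range scans, using a sorted in-memory list as a stand-in for a sorted store. It does not use Cassandra or any of its clients; `make_key`, `TimeSeriesStore`, and the 10-digit zero padding of the timestamp are assumptions made for illustration.

```python
import bisect


def make_key(device_id: str, metric: str, timestamp: int) -> str:
    """Build a deviceID:metric:timestamp key.

    The timestamp is zero-padded (assumed width: 10 digits) so that
    lexicographic order of keys matches numeric order of timestamps.
    """
    return f"{device_id}:{metric}:{timestamp:010d}"


class TimeSeriesStore:
    """In-memory stand-in for a store that keeps keys sorted."""

    def __init__(self) -> None:
        self._keys: list[str] = []          # sorted keys
        self._values: dict[str, float] = {}  # key -> reading

    def put(self, device_id: str, metric: str, timestamp: int, value: float) -> None:
        key = make_key(device_id, metric, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, device_id: str, metric: str, start: int, end: int) -> list[float]:
        """Return readings with start <= timestamp <= end for one device and metric."""
        lo = bisect.bisect_left(self._keys, make_key(device_id, metric, start))
        hi = bisect.bisect_right(self._keys, make_key(device_id, metric, end))
        return [self._values[k] for k in self._keys[lo:hi]]


if __name__ == "__main__":
    store = TimeSeriesStore()
    store.put("dev0", "temp", 100, 20.5)
    store.put("dev0", "temp", 200, 21.0)
    store.put("dev0", "temp", 300, 21.5)
    store.put("dev1", "temp", 150, 19.0)
    print(store.scan("dev0", "temp", 100, 200))
```

Because all readings for one device and metric share a key prefix, a time window becomes one contiguous slice of the sorted keys, which is the "know your queries" point of slide 3 applied to key design.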