M6d cassandra summit

  1. Increasing Your Prospects: Cassandra in Online Advertising. Let 'em know: #cassandra12 © 2012 Media6Degrees. All Rights Reserved. Proprietary and Confidential
  2. A little about what we do
  3. Impressions look like…
  4. A high-level look at RTB: 1. Browsers visit Publishers and create impressions. 2. Publishers sell impressions via Exchanges. 3. Exchanges serve as auction houses for the impressions. 4. M6d bids on the impression; if we win, we display an ad.
  5. Key Cassandra features • Horizontal scalability ● More nodes = more storage ● More nodes = more throughput • Cassandra is a high-availability solution ● Almost all changes can be made at run time ● Rolling updates ● Survives node failures • One configuration file
  6. Key storage model features • Type validation gives us creature comforts ● Helps prevent insertion of bad data – A column named age should be a number ● Makes data easier to read and write for end users ● Encourages/enforces storage in a terse format – Store 478 as 478, not “478” • Rows do not need to have fixed columns • Writes do not require reads • Optimal for set/get/slice operations
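In Cassandra this validation is declared per column family; as a rough illustration of the idea only (the `VALIDATORS` table and `validated_insert` helper are our own toy names, not Cassandra's API), a validating write might look like:

```python
# Toy sketch of per-column type validation, in the spirit of Cassandra's
# column validators -- NOT the real Thrift/CQL API.
VALIDATORS = {
    "age": int,   # columns named "age" must be numbers
    "city": str,
}

def validated_insert(row, name, value):
    # Reject values whose Python type does not match the declared validator.
    expected = VALIDATORS.get(name)
    if expected is not None and not isinstance(value, expected):
        raise TypeError(f"column {name!r} expects {expected.__name__}")
    row[name] = value  # store 478 as the number 478, not the string "478"

row = {}
validated_insert(row, "age", 478)        # accepted: terse numeric storage
try:
    validated_insert(row, "age", "478")  # rejected: bad data kept out
except TypeError:
    pass
```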
  7. Things I have learned on the presentation circuit • Gratuitous use of Meme Generator (tx Nathan) • Gratuitous buzzwords for maximum tweet-ability ● Big Data ● Real-time analytics ● Cloud ● Web scale • Make prolific statements that contradict current software trends (tx Dean) • Attempted prolific statement: transactions and locking are highly overrated
  8. Signal de-duplication and frequency capping • Solution must be “web-scale” ● Billions of users ● One to thousands of events per user • Solution must record events • Do not store the same event N times a minute ● Controls data growth – Spiders, Nagios, pathological cases ● Small statistical difference in signal – An action 10 times a day vs. 1 time a minute
  9. What this would look like
  10. Solution with transactions and locking? ● Likely needs a scalable, redundant lock layer ● Built-in locks are not free ● Lots of code ● Lots of sockets ● Likely needs to read before writing ● Results in more nodes, or a caching layer for disk I/O
  11. Remember with Cassandra... • Rows have one to many columns • A column is composed of { name, value, timestamp } ● If two columns have the same name, the higher timestamp wins • Memtables absorb overwrites • Writes are fast ● Sorted structure in memory ● Commit log to disk • Log-structured storage prunes old values and deletes • No reads on the write path
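A minimal sketch of the { name, value, timestamp } triplet and the highest-timestamp-wins rule (plain Python as a toy model, not Cassandra code):

```python
# Toy model of a Cassandra column as a (name, value, timestamp) tuple.
def reconcile(a, b):
    """Reconcile two versions of the same column: higher timestamp wins.

    Both arguments must share the same column name; ties keep the first.
    """
    return a if a[2] >= b[2] else b

old = ("last_seen", "homepage", 1000)
new = ("last_seen", "checkout", 2000)
# Order of arrival does not matter -- the newer write always survives.
assert reconcile(old, new) == ("last_seen", "checkout", 2000)
assert reconcile(new, old) == ("last_seen", "checkout", 2000)
```

This reconciliation is what lets overwrites replace old values without any read on the write path.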
  12. Cassandrified solution
  13. Consistent hashing distributes data ● With the Random Partitioner, row keys are MD5-hashed to locate their node – Results in an even distribution of rows across nodes – Limits/removes hot spots ● Big Data is not so big when you have N nodes attacking it * My wife asked me if the diagram above was a flag. Pledge your allegiance to the United Nodes of Big Data
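Roughly how MD5 hashing spreads keys evenly (an illustrative sketch of the idea, not Cassandra's actual token-ring math):

```python
import hashlib
from collections import Counter

def node_for(row_key, num_nodes):
    # MD5 the row key and map the digest onto one of N nodes.
    digest = hashlib.md5(row_key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# 100,000 distinct user keys land almost evenly across 4 nodes,
# which is what limits/removes hot spots.
counts = Counter(node_for(f"user-{i}", 4) for i in range(100_000))
```

Because MD5 output is effectively uniform, each node ends up with close to a quarter of the keys regardless of any pattern in the key names.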
  14. Memtables absorb overwrites ● Memtables give de-duplication for free – A larger memtable has a greater chance of absorbing a write ● This solves our original requirement: – Do not store the same event N times per interval ● Worst case, data is written to disk N times and compacted away ● Automatically de-duplicated on read with the last-update-wins rule
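One way to get "same event at most once per interval" purely from overwrites is to fold a time bucket into the column name; a toy sketch (the `store` dict stands in for a memtable, and the bucketing scheme is our own illustration of the idea, not taken from the slides):

```python
store = {}  # toy memtable: {row_key: {column_name: value}}

def record_event(user_id, event, ts, interval=60):
    # Bucket the column name by time interval so repeated writes of the
    # same event inside one interval overwrite a single column instead
    # of piling up N copies -- overwrite is free de-duplication.
    bucket = int(ts) // interval
    column = f"{event}:{bucket}"
    store.setdefault(user_id, {})[column] = ts

# A spider hitting the same page 100 times inside one minute bucket
# produces exactly one stored column, not 100.
for i in range(100):
    record_event("user-42", "pageview", 1_000_020 + i * 0.1)
```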
  15. Cassandra & stream processing as an alternative to ETL ● ETL (Extract, Transform, Load) is a useful paradigm ● Batch processing can be obtuse – Processes with long startup times – Little support for appends, inserts, updates – Throughput issues for small files ● Difficult for small windows of time ● Overhead from MapReduce ● Sample scenario: breakdown of state, city, and count
  16. City, State, count(1) in an ETL system ● Several phases/copies ● Stores the entire log to build/rebuild the aggregation ● Difficult to do on small intervals ● Needs scheduling and a log-push system
  17. City, State, count(1) in a stream system ● Could use Cassandra's counter feature directly ● Added an Apache Kafka layer ● Decouples producers and consumers ● Allows message replay ● Allows backlog and recovery from failures (never happens, btw) ● Near real time
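The stream side can be as simple as a consumer bumping a counter keyed by (state, city) for each event; a toy version with the Kafka and Cassandra pieces stubbed out by a dict (names here are our own illustration):

```python
from collections import defaultdict

# Stands in for a Cassandra counter column family keyed by (state, city).
counts = defaultdict(int)

def consume(event):
    # Each log event applies one increment -- the same idempotent-looking
    # "add 1" that Cassandra's counter feature would perform, with no
    # read-modify-write cycle in application code.
    counts[(event["state"], event["city"])] += 1

stream = [
    {"state": "NY", "city": "New York"},
    {"state": "NY", "city": "New York"},
    {"state": "CA", "city": "Oakland"},
]
for event in stream:
    consume(event)
```

Compare this with the ETL version on the previous slide: no full-log copy, no scheduling window, and the aggregate is current after each message.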
  18. An application to search logs ● In 2008 this article sold me on MapReduce ● Take logs from all servers ● Put them into Hadoop ● Generate Lucene indexes ● Load into a sharded SOLR cluster on an interval
  19. Pseudo-diagram of the solution ● Process to get files from servers into Hadoop ● MapReduce process to build indexes ● Embedded SOLR on Hadoop DataNodes * Go here for the real story: http://www.slideshare.net/schubertzhang/case-study-how-rackspace-query-terabytes-of-data-2400928
  20. But now it's the future! ● Every component or layer of an architecture is another thing to document and manage ● DataStax has built SOLR into Cassandra ● Applications can write to solr/cassandra directly ● Applications can read solr/cassandra directly
  21. Ah-ha! moment ● Determined the Rackspace log application could be done with simple pieces ● Someone called it Taco Bell Programming: “The more I write code and design systems, the more I understand that many times, you can achieve the desired functionality simply with clever reconfigurations of the basic Unix tool set. After all, functionality is an asset, but code is a liability.” ● Cassandra is my main taco ingredient
  22. Prolific statement: design stuff with fewer arrows ● More layers/components = batch driven ● Fewer layers/components = low latency
  23. Solr has wide adoption ● Clients for many programming languages ● Many hip jQuery Ajax widgets and such ● The open-source Reuters Ajax Solr demo worked seamlessly with cassandra/solr ● Implemented a Rackspace-like solution with very little code
  24. Game changer: compression ● Main memory reference: 100 ns (20x L2 cache, 200x L1 cache) ● Compress 1 KB with Zippy: 3,000 ns ● Send 1 KB over a 1 Gbps network: 10,000 ns (0.01 ms) ● Read 4 KB randomly from SSD*: 150,000 ns (0.15 ms) ● Read 1 MB sequentially from memory: 250,000 ns (0.25 ms) ● Round trip within the same datacenter: 500,000 ns (0.5 ms) ● Read 1 MB sequentially from SSD*: 1,000,000 ns (1 ms, 4x memory) ● Disk seek: 10,000,000 ns (10 ms, 20x datacenter round trip) ● Read 1 MB sequentially from disk: 20,000,000 ns (20 ms, 80x memory, 20x SSD) Source: https://gist.github.com/2841832
  25. Why compression helps ● Compressed data is smaller on disk ● If we compress, more data fits in RAM and is cached ● Rotational disks have very slow seeks ● RAM not used by the process is used by the OS to cache disk ● Solid-state disks do seek faster than rotational disks ● But they are more expensive than rotational disks
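Back-of-the-envelope arithmetic with the numbers from the previous slide: decompressing a cached chunk costs microseconds of CPU, while a cache miss costs at least one 10 ms seek, so trading CPU for cache hits is a large win. (The 64 KB chunk matches the compression chunk size used later; using the 1 KB *compression* figure for decompression overstates the cost, since decompression is typically faster.)

```python
# Latency figures (ns) from the "numbers everyone should know" slide.
COMPRESS_1KB_NS = 3_000      # Snappy/Zippy on 1 KB
DISK_SEEK_NS = 10_000_000    # one rotational disk seek

# Upper bound on decompressing a 64 KB chunk already in the page cache.
chunk_decompress_ns = 64 * COMPRESS_1KB_NS   # 192,000 ns, about 0.2 ms

# A cache miss pays at least one seek before any data arrives,
# so cache-plus-CPU wins by roughly 50x even before transfer time.
speedup = DISK_SEEK_NS / chunk_decompress_ns
```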
  26. Enabling compression ● Rolling update to Cassandra ● update column family my_stuff with compression_options={sstable_compression:SnappyCompressor, chunk_length_kb:64}; ● bin/nodetool -h cdbla120 -p 8585 rebuildsstables my_stuff ● 68 GB of data shrinks to 36 GB
  27. Compression in action ● Disk activity reduced drastically as more/all data fit in the cache ● Better performance ● Disks that spin less should last longer
  28. Compression lessons ● Creates extra CPU usage (but not really much) ● Creates more young-gen garbage (some) ● Anecdotal experimentation with chunk_length_kb: ● 64 KB is good for sparse, less frequently used tables ● 16 KB had the same compression ratio and made less garbage ● Found 4 KB to be less effective than 16 KB ● This is easy to experiment with
  29. We have reached the point of the presentation where we...
  30. Hate on everything not Cassandra
  31. Cassandra's uptime story ● Main cluster in continuous operation since 8/6/11 ● Doubled the physical nodes in the cluster ● Upgraded Cassandra twice: 0.7.7 -> 0.8.6 -> 1.0.7 ● Rolling reboots: one for a kernel update, one for the leap second ● No maintenance windows ● Let's compare Cassandra with other things I use/used
  32. Cassandra vs. MySQL master/slave... ● Replication: MySQL = single thread, binlogs, manual recovery; Cassandra = per operation ● Scaling: MySQL = add more nodes, initial sync, set up replication, configure applications; Cassandra = bootstrap a new node, re-balance off-peak ● Consistency: MySQL = applications that care read the master, or check the status of replication; Cassandra = per operation ● Backup: MySQL = mysqldump/LVM snapshot; Cassandra = sstabletojson / snapshot ● Restore: MySQL = re-insert everything / restore snapshot; Cassandra = copy files into place
  33. So with MySQL... ● Replication breaking often ● Requiring manual intervention for many fixes ● Blocking writes for 30 minutes to add a column to a table ● Scale up to big iron, then... ● Restart takes 30 minutes to fsck all the disks ● Applications needing to be coded with state-aware logic ● Which node should I query? ● Is replication behind? ● Is there some merge-table trickery going on?
  34. Cassandra vs. Memcache ● Replication: Memcache = none (client managed); Cassandra = per operation ● Scaling: Memcache = none (client managed); Cassandra = grow or shrink without bad reads ● Consistency: Memcache = yes (and really no); Cassandra = per operation ● Backup: Memcache = no persistence; Cassandra = sstabletojson / snapshot ● Restore: Memcache = no persistence; Cassandra = cache warming
  35. So memcache is... ● Not persistent ● Not clear on sharding ● Not clear on failure modes ● Actual experiences with memcache: ● The memcache client was not sharding requests evenly; 60% were going to node 1 ● We lost a rack with 40% of the memcache nodes – The site slowed to a crawl as the DBs were overloaded – Took 1 hour to warm up again
  36. Cassandra vs. DRBD ● Replication: DRBD = 1 or 2 nodes per block; Cassandra = per operation ● Scaling: DRBD = no scaling, just more availability; Cassandra = grow or shrink dynamically ● Consistency: DRBD = sync modes change failure consistency, dead time between flip-flops; Cassandra = per operation ● Backup: DRBD = like a disk; Cassandra = sstabletojson / snapshot ● Restore: DRBD = like a disk; Cassandra = like a disk
  37. So DRBD is... ● A 30-second to 1-minute failover/outage ● An alert that might wake you up ● But hopefully lets you sleep again ● Handcuffed to linux-ha/keepalived, etc. ● Making it an involved setup ● Making it involved to troubleshoot ● Might need a crossover cable or a dedicated network ● CPU/network intensive with very active disks ● Can successfully fail over a data file in an inconsistent state
  38. Cassandra vs. HDFS ● Replication: HDFS = per file; Cassandra = per operation ● Scaling: HDFS = add nodes; Cassandra = add nodes ● Consistency: HDFS = very, to the point that getting data in becomes difficult; Cassandra = per operation ● Backup: HDFS = distcp; Cassandra = sstabletojson / snapshot ● Restore: HDFS = distcp; Cassandra = like a disk
  39. So HDFS... ● Comes up with about 4 or 5 reasons a year for a master node / full cluster restart ● Grow the NameNode heap ● Enable the JobTracker setting to stop 100,000-task jobs ● Enabled/updated the trash feature (off by default) ● Forced to fail over by a hardware fault ● Random DRBD/kernel brain fart ● Need to update a JVM/kernel eventually ● New versions now finally have an HA NameNode ● Running jobs lose progress and will not automatically restart
  40. Questions?
