
From 100s to 100s of Millions


Slides from Cassandra SF 2011

http://www.datastax.com/events/cassandrasf2011


  1. From 100s to 100s of Millions - July 2011 - Erik Onnen
  2. About Me
     • Director of Platform Engineering at Urban Airship (.75 years)
     • Previously Principal Engineer at Jive Software (3 years)
     • 12 years of large-scale, distributed systems experience going back to CORBA
     • Cassandra, HBase, Kafka and ZooKeeper contributor - most recently CASSANDRA-2463
  3. In this Talk
     • About Urban Airship
     • Systems Overview
     • A Tale of Storage Engines
     • Our Cassandra Deployment
     • Battle Scars
       • Development Lessons Learned
       • Operations Lessons Learned
     • Looking Forward
  4. What is an Urban Airship?
     • Hosting for mobile services that developers should not build themselves
     • Unified API for services across platforms
     • SLAs for throughput, latency
  11. By The Numbers
      • Over 160 million active application installs use our system across over 80 million unique devices
      • Freemium API peaks at 700 requests/second, dedicated customer API at 10K requests/second
        • Over half of those are device check-ins
        • Transactions - send push, check status, get content
      • At any given point in time, we have ~1.1 million secure socket connections into our transactional core
      • It took 6 months for the company to deliver 1M messages; we just broke 4.2B
  14. Transactional System
      • Edge Systems:
        • API - Apache/Python/django+piston+pycassa
        • Device negotiation - Java NIO + Hector
        • Message Delivery - Python, Java NIO + Hector
        • Device data - Java HTTPS endpoint
      • Persistence:
        • Sharded PostgreSQL
        • Cassandra 0.7
        • MongoDB 1.7
  23. A Tale of Storage Engines
      • “Is there a NoSQL system you guys don’t use?”
        • Riak :)
      • We do use:
        • Cassandra
        • HBase
        • Redis
        • MongoDB
      • We’re converging on Cassandra + PostgreSQL for transactional data and HBase for the long haul
  31. A Tale of Storage Engines
      • PostgreSQL
        • Bootstrapped the company on PostgreSQL in EC2
        • Highly relational, large index model
        • Layered in memcached (see the sketch below)
        • Writes weren’t scaling after ~6 months
        • Continued to use it for several silos of data, but needed a way to grow more easily
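To make “layered in memcached” concrete, here is a minimal read-through cache sketch in Python. It assumes python-memcached and psycopg2, and the database, table, column, and cache-key names are invented; the pattern takes read pressure off PostgreSQL but does nothing for the write-scaling problem noted above.

```python
# Hedged sketch of a read-through memcached layer over PostgreSQL.
# Table, column, and cache key names are hypothetical.
import memcache
import psycopg2

mc = memcache.Client(["127.0.0.1:11211"])
pg = psycopg2.connect("dbname=airship")

def get_device(device_id):
    cache_key = "device:%s" % device_id
    device = mc.get(cache_key)
    if device is None:
        cur = pg.cursor()
        cur.execute("SELECT device_id, platform FROM devices WHERE device_id = %s",
                    (device_id,))
        device = cur.fetchone()
        cur.close()
        if device is not None:
            mc.set(cache_key, device, time=300)  # serve from cache for 5 minutes
    return device
```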
  39. A Tale of Storage Engines
      • MongoDB
        • Initially, we loved Mongo
        • Document databases are cool
        • BSON is nice
        • As the data set grew, we learned a lot about MongoDB
        • “MongoDB does not wait for a response by default when writing to the database.” (see the sketch below)
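For readers who have not hit this: a small sketch of what that default meant with the pymongo driver of that era (pre-2.x “safe” writes). The database, collection, and field names are hypothetical.

```python
# Sketch of MongoDB's write-acknowledgement behavior circa 2010/2011,
# using the pymongo API of that era (Connection + the `safe` flag).
# Database, collection, and field names are hypothetical.
from pymongo import Connection

db = Connection("localhost", 27017).airship

# Default: the driver sends the insert and returns immediately without
# calling getLastError, so a failed write can go completely unnoticed.
db.devices.insert({"device_id": "abc123", "platform": "android"})

# Opting in to acknowledgement: safe=True issues getLastError after the
# write and raises OperationFailure if the server reports an error.
db.devices.insert({"device_id": "abc123", "platform": "android"}, safe=True)
```

Later driver generations made acknowledged writes the default; at the time you had to opt in per call or per connection.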
  48. A Tale of Storage Engines
      • MongoDB - Read/Write Problems
        • Early days (1.2): one global lock (reads block writes and vice versa)
        • Later, one read lock and one write lock per server
        • Long-running queries were often devastating
        • Replication would fall too far behind and stop
          • No writes or updates
          • Effectively a failure for most clients
        • With replication, queries for anything other than the shard key talk to every node in the cluster
  55. A Tale of Storage Engines
      • MongoDB - Update Problems
        • Simple updates (e.g. counters) were fine
        • Bigger updates commonly resulted in large scans of the collection depending on position == heavy disk I/O
        • Updates frequently spill to the end of the collection datafile, leaving “holes” but not sparse files
        • Those “holes” get MMap’d even though they’re not used
        • Updates moving data acquire multiple locks, commonly blocking other read/write operations (see the sketch below)
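A hypothetical pymongo illustration of why counter-style updates behaved so differently from document-growing ones; the locking and relocation behavior described in the comments is the server’s, and all names are invented.

```python
# Pre-MongoDB-2.x era pymongo sketch; collection and field names are hypothetical.
from pymongo import Connection

db = Connection("localhost", 27017).airship

# In-place update: $inc on a numeric field does not grow the document,
# so it stays where it is on disk. These were fine.
db.devices.update({"device_id": "abc123"}, {"$inc": {"checkin_count": 1}})

# Growing update: $push makes the document larger; if it no longer fits in
# its slot, the server rewrites it at the end of the data file, leaving a
# "hole" behind and doing far more I/O while holding locks.
db.devices.update({"device_id": "abc123"}, {"$push": {"tags": "sports"}})
```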
  61. A Tale of Storage Engines
      • MongoDB - Optimization Problems
        • Compacting a collection locks the entire collection
        • The read slave was too busy to be a backup; needed moar RAMs, but we were already on High-Memory EC2 instances with nowhere else to go
        • Mongo MMaps everything - when your data set is bigger than RAM, you had better have fast disks
        • Until 1.8, no support for sparse indexes (see the sketch below)
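A small sketch of the sparse-index point, using pymongo; the collection and field names are hypothetical.

```python
# Illustration of the sparse-index feature that only arrived in MongoDB 1.8.
# Collection and field names are hypothetical.
from pymongo import Connection

db = Connection("localhost", 27017).airship

# Non-sparse (pre-1.8 behavior): documents missing "ios_token" still get
# index entries, so the index grows with the entire collection.
db.devices.ensure_index("ios_token")

# Sparse (1.8+): only documents that actually carry "ios_token" are indexed,
# keeping the index small for optional fields.
db.devices.ensure_index("ios_token", sparse=True)
```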
  67. A Tale of Storage Engines
      • MongoDB - Ops Issues
        • Lots of good information in mongostat
        • Recovering a crashed system was effectively impossible without disabling indexes first (not the default)
        • Replica sets never worked for us in testing; lots of inconsistencies in failure scenarios
        • Scattered records led to lots of I/O, which hurt on bad disks (EC2)
  75. Cassandra at Urban Airship
      • Summer of 2010 - with no faith left in MongoDB, we started a migration to Cassandra
      • Lots of L&P testing, client analysis, etc.
      • December 2010 - Cassandra backed 85% of our Android stack’s persistence
        • Six EC2 XLs, each serving:
          • 30GB of data
          • ~1000 reads/second/node
          • ~750 writes/second/node
  84. Cassandra at Urban Airship
      • Why Cassandra?
        • Well suited for most of our data model (simple DAGs)
        • Lots of UUIDs and hashes that partition well
        • Retrievals don’t need ordering beyond keys or TSD
        • Rolling upgrades FTW
        • Dynamic rebalancing and node addition
        • Column TTLs are huge for us (see the sketch below)
        • Awesome community :)
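A minimal pycassa sketch of the column-TTL point, under invented keyspace, column family, and key names: the column expires server-side, so no reaper job is needed for transient data such as check-ins.

```python
# Column TTLs with pycassa: Cassandra expires the column on its own.
# Keyspace, column family, and key names are hypothetical.
import pycassa

pool = pycassa.ConnectionPool("Airship", ["localhost:9160"])
checkins = pycassa.ColumnFamily(pool, "DeviceCheckins")

# Write a check-in that Cassandra will expire automatically after 30 days.
checkins.insert("device:abc123",
                {"last_seen": "2011-07-11T18:00:00Z"},
                ttl=30 * 86400)
```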
  92. Cassandra at Urban Airship
      • Why Cassandra, cont’d?
        • Particularly well suited to working around EC2 availability
        • Needed a cross-AZ strategy - we had seen EBS issues in the past and didn’t trust fault containment within a zone
        • Didn’t want locality of replication, so we needed to stripe across AZs
        • Read repair and handoff generally did the right thing when a node would flap (Ubuntu #708920)
        • No SPoF
        • Ability to alter CLs (consistency levels) on a per-operation basis (see the sketch below)
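A sketch of per-operation consistency levels with pycassa (all names are hypothetical): pay for QUORUM only where correctness matters, and take ONE where latency wins and an occasional stale read is acceptable.

```python
# Tuning consistency level per operation with pycassa.
# Keyspace, column family, and key names are hypothetical.
import pycassa
from pycassa.cassandra.ttypes import ConsistencyLevel

pool = pycassa.ConnectionPool("Airship", ["localhost:9160"])
devices = pycassa.ColumnFamily(pool, "Devices")

# Durable write: wait for a quorum of replicas to acknowledge.
devices.insert("device:abc123", {"platform": "android"},
               write_consistency_level=ConsistencyLevel.QUORUM)

# Low-latency read: a single replica is good enough for this lookup.
row = devices.get("device:abc123",
                  read_consistency_level=ConsistencyLevel.ONE)
```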
  102. Battle Scars - Development
       • Know your data model
         • Creating indexes after the fact is a PITA
         • Design around wide rows (see the sketch below)
         • I/O problems
         • Thrift problems
         • Count problems
         • Favor JSON over packed binaries if possible
       • Careful with Thrift in the stack
       • Don’t fear the StorageProxy
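One way to read “design around wide rows”, sketched with pycassa under invented names: a hand-maintained index is a single wide row whose column names are the indexed values, and reads of that row are bounded so the I/O, Thrift message size, and count problems called out above stay manageable.

```python
# Sketch of a wide-row, hand-maintained index in pycassa.
# Keyspace, column family, and key names are invented.
import pycassa

pool = pycassa.ConnectionPool("Airship", ["localhost:9160"])
app_devices = pycassa.ColumnFamily(pool, "AppDeviceIndex")

# One row per application; each registered device becomes a column,
# so the row grows wide instead of requiring a secondary index.
app_devices.insert("app:weather-pro", {"device:abc123": ""})

# Read a bounded slice of the row rather than the whole thing; unbounded
# reads of very wide rows are where I/O, Thrift, and count problems bite.
first_page = app_devices.get("app:weather-pro", column_count=1000)
```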
  112. Battle Scars - Development
       • Assume failure in the client (see the sketch below)
         • Read timeout vs. connection refused
         • When maintaining your own indexes, try to clean up after failures
         • Be ready to clean up inconsistencies anyway
         • Verify client library assumptions and exception handling
         • Retry now vs. retry later?
         • Compensating action during failures?
       • Don’t avoid the Cassandra code
       • Embed for testing
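A hedged sketch of “assume failure in the client” when maintaining your own indexes, with invented names: the data write and the index write are separate operations, so the client has to decide whether to retry now, retry later, or compensate when the second one fails.

```python
# Illustrative only: a data write followed by a hand-maintained index write,
# with a compensating path when the index write fails. All names are invented.
import logging
import pycassa

pool = pycassa.ConnectionPool("Airship", ["localhost:9160"])
devices = pycassa.ColumnFamily(pool, "Devices")
app_index = pycassa.ColumnFamily(pool, "AppDeviceIndex")

def register_device(app_id, device_id, columns):
    devices.insert(device_id, columns)
    try:
        app_index.insert(app_id, {device_id: ""})
    except Exception:
        # Retry later / compensate: record the orphaned row so a background
        # job can either re-insert the index entry or remove the device row.
        logging.exception("index write failed for %s; queueing repair", device_id)
        raise
```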
  121. Battle Scars - Ops
       • Cassandra in EC2:
         • Ensure the Dynamic Snitch is enabled
         • Disk I/O
           • Avoid EBS except for snapshot backups, or use S3
           • Stripe ephemerals, not EBS volumes
         • Avoid smaller instances altogether
         • Don’t always assume traversing close-proximity AZs is more expensive
         • Balance RAM cost vs. the cost of additional hosts and time spent with GC logs
  130. Battle Scars - Ops
       • Java Best Practices:
         • All Java services are managed via the same set of scripts
         • In most cases, operators don’t treat Cassandra differently from HBase
         • Simple mechanism to take a thread or heap dump
         • All logging is consistent - GC, application, stdx
         • Init scripts use the same scripts operators do
         • Bare metal will rock your world
         • +UseLargePages will rock your world too
  131. Battle Scars - Ops (charts): ParNew GC on bare metal vs. an EC2 XL - ParNew GC effectiveness (MB collected), mean ParNew GC time (collection time in ms), and ParNew collection count (number of collections).
  140. Battle Scars - Ops
       • Java Best Practices, cont’d:
         • Get familiar with GC logs (-XX:+PrintGCDetails)
         • Understand what degenerate CMS collection looks like
         • We settled at -XX:CMSInitiatingOccupancyFraction=60
         • Possibly experiment with the tenuring threshold
         • When in doubt, take a thread dump
           • TDA (http://java.net/projects/tda/)
           • Eclipse MAT (http://www.eclipse.org/mat/)
  145. Battle Scars - Ops
       • Understand when to compact
       • Understand upgrade implications for datafiles
       • Watch hinted handoff closely
       • Monitor JMX religiously
  154. Looking Forward
       • Cassandra is a great hammer, but not everything is a nail
       • Coprocessors would be awesome (hint hint)
       • We still spend too much time worrying about GC
       • Glad to see the ecosystem around the product evolving
         • CQL
         • Pig
         • Brisk
       • Guardedly optimistic about off-heap data management
  155. Thanks to
       • jbellis, driftx
       • Datastax
       • Whoever wrote TDA
       • SAP
  156. Thanks!
       • Urban Airship: http://urbanairship.com/
       • We’re hiring! http://urbanairship.com/company/jobs/
       • Me: @eonnen or erik at
