From 100s to 100s of Millions

Slides from Cassandra SF 2011

http://www.datastax.com/events/cassandrasf2011

Presentation Transcript

  • From 100s to 100s of Millions, July 2011, Erik Onnen
  • About Me
    • Director of Platform Engineering at Urban Airship (.75 years)
    • Previously Principal Engineer at Jive Software (3 years)
    • 12 years large scale, distributed systems experience going back to CORBA
    • Cassandra, HBase, Kafka and ZooKeeper contributor - most recently CASSANDRA-2463
  • In this Talk
    • About Urban Airship
    • Systems Overview
    • A Tale of Storage Engines
    • Our Cassandra Deployment
    • Battle Scars
      • Development Lessons Learned
      • Operations Lessons Learned
    • Looking Forward
  • What is an Urban Airship?
    • Hosting for mobile services that developers should not build themselves
    • Unified API for services across platforms
    • SLAs for throughput, latency
  • By The Numbers
    • Over 160 million active application installs use our system across over 80 million unique devices
    • Freemium API peaks at 700 requests/second, dedicated customer API 10K requests/second
      • Over half of those are device check-ins
      • Transactions - send push, check status, get content
    • At any given point in time, we have ~ 1.1 million secure socket connections into our transactional core
    • 6 months for the company to deliver 1M messages, just broke 4.2B
  • Transactional System
    • Edge Systems:
      • API - Apache/Python/django+piston+pycassa
      • Device negotiation - Java NIO + Hector
      • Message Delivery - Python, Java NIO + Hector
      • Device data - Java HTTPS endpoint
    • Persistence
      • Sharded PostgreSQL
      • Cassandra 0.7
      • MongoDB 1.7
  • A Tale of Storage Engines
    • “Is there a NoSQL system you guys don’t use?”
      • Riak :)
    • We do use:
      • Cassandra
      • HBase
      • Redis
      • MongoDB
    • We’re converging on Cassandra + PostgreSQL for transactional and HBase for long haul
  • A Tale of Storage Engines
    • PostgreSQL
      • Bootstrapped the company on PostgreSQL in EC2
      • Highly relational, large index model
      • Layered in memcached
      • Writes weren’t scaling after ~ 6 months
      • Continued to use for several silos of data but needed a way to grow more easily
  • A Tale of Storage Engines
    • MongoDB
      • Initially, we loved Mongo
      • Document databases are cool
      • BSON is nice
      • As the data set grew, we learned a lot about MongoDB
      • “MongoDB does not wait for a response by default when writing to the database.” (see the sketch below)
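
That quoted default is worth seeing in code. A minimal sketch of opting into acknowledged writes, assuming a pymongo 1.x/2.x-style client (Connection and safe= were removed in later driver versions); the database, collection, and field names are made up:

    # Sketch: ask MongoDB to acknowledge writes instead of the fire-and-forget default.
    # Assumes an old (Connection-based) pymongo; database/collection/fields are hypothetical.
    from pymongo import Connection

    db = Connection('localhost', 27017)['airship']   # hypothetical database
    devices = db['devices']                          # hypothetical collection

    # Default at the time: the driver returns before the server confirms anything.
    devices.insert({'device_id': 'abc123', 'platform': 'android'})

    # safe=True makes the driver issue getLastError and raise on failure,
    # trading a round trip for knowing the write actually landed.
    devices.insert({'device_id': 'abc123', 'platform': 'android'}, safe=True)
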
  • A Tale of Storage Engines
    • MongoDB - Read/Write Problems
      • Early days (1.2) one global lock (reads block writes and vice versa)
      • Later, one read lock, one write lock per server
      • Long running queries were often devastating
        • Replication would fall too far behind and stop
        • No writes or updates
        • Effectively a failure for most clients
      • With replication, queries for anything other than the shard key talk to every node in the cluster
  • A Tale of Storage Engines
    • MongoDB - Update Problems
      • Simple updates (e.g. counters) were fine
      • Bigger updates commonly resulted in large scans of the collection depending on position == heavy disk I/O
      • Frequently spill to the end of the collection datafile leaving “holes” but not sparse files
      • Those “holes” get MMap’d even though they’re not used
      • Updates moving data acquire multiple locks, commonly blocking other read/write operations
  • A Tale of Storage Engines
    • MongoDB - Optimization Problems
      • Compacting a collection locks the entire collection
      • Read slave was too busy to be a backup, needed moar RAMs but were already on High-Memory EC2, nowhere else to go
      • Mongo MMaps everything - when your data set is bigger than RAM, you better have fast disks
      • Until 1.8, no support for sparse indexes
  • A Tale of Storage Engines
    • MongoDB - Ops Issues
      • Lots of good information in mongostat
      • Recovering a crashed system was effectively impossible without disabling indexes first (not the default)
      • Replica sets never worked for us in testing, lots of inconsistencies in failure scenarios
      • Scattered records led to lots of I/O that hurt on bad disks (EC2)
  • Cassandra at Urban Airship
    • Summer of 2010 - no faith left in MongoDB, started a migration to Cassandra
    • Lots of L&P testing, client analysis, etc.
    • December 2010 - Cassandra backed 85% of our Android stack’s persistence
      • Six EC2 XLs, each serving:
        • 30GB data
        • ~1000 reads/second/node
        • ~750 writes/second/node
  • Cassandra at Urban Airship
    • Why Cassandra?
      • Well suited for most of our data model (simple DAGs)
      • Lots of UUIDs and hashes partition well
      • Retrievals don’t need ordering beyond keys or TSD
      • Rolling upgrades FTW
      • Dynamic rebalancing and node addition
      • Column TTLs huge for us (see the sketch below)
      • Awesome community :)
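
Column TTLs deserve a concrete example. A minimal pycassa sketch of a write that expires on its own; the keyspace, column family, key, and column names are hypothetical:

    # Sketch: per-column TTLs with pycassa; all names below are made up.
    import pycassa

    pool = pycassa.ConnectionPool('Airship', ['cass1:9160', 'cass2:9160'])
    checkins = pycassa.ColumnFamily(pool, 'DeviceCheckins')

    # Each column disappears 30 days after the write - no cleanup job required.
    checkins.insert('device-uuid-1234',
                    {'last_seen': '2011-07-11T17:00:00Z', 'app_version': '1.4.2'},
                    ttl=30 * 24 * 3600)
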
  • Cassandra at Urban Airship
    • Why Cassandra cont’d?
      • Particularly well suited to working around EC2 availability
      • Needed a cross-AZ strategy - we had seen EBS issues in the past, didn’t trust fault containment within a zone
      • Didn’t want locality of replication so needed to stripe across AZs
      • Read repair and handoff generally did the right thing when a node would flap (Ubuntu #708920)
      • No SPoF
      • Ability to alter CLs on a per-operation basis (see the sketch below)
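
Per-operation consistency levels look like this in pycassa; a minimal sketch where the column family and keys are hypothetical - quorum where correctness matters, ONE where latency matters:

    # Sketch: choosing a consistency level per operation; names are made up.
    import pycassa
    from pycassa import ConsistencyLevel

    pool = pycassa.ConnectionPool('Airship', ['cass1:9160'])
    devices = pycassa.ColumnFamily(pool, 'Devices')

    # Critical write: wait for a quorum of replicas to acknowledge it.
    devices.insert('device-uuid-1234', {'push_enabled': 'true'},
                   write_consistency_level=ConsistencyLevel.QUORUM)

    # Latency-sensitive read: a single replica is good enough here.
    row = devices.get('device-uuid-1234',
                      read_consistency_level=ConsistencyLevel.ONE)
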
  • Battle Scars - Development
    • Know your data model
      • Creating indexes after the fact is a PITA (a hand-rolled index sketch follows this slide)
      • Design around wide rows
        • I/O problems
        • Thrift problems
        • Count problems
      • Favor JSON over packed binaries if possible
    • Careful with Thrift in the stack
    • Don’t fear the StorageProxy
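
The talk mentions maintaining your own indexes; a minimal pycassa sketch of one way that pattern can look, with hypothetical column families and keys. Note that the index row is exactly the kind of wide row the I/O, Thrift, and count warnings above are about, so bucket it if it can grow without bound:

    # Sketch: a hand-rolled index column family kept alongside the data; names are made up.
    import pycassa

    pool = pycassa.ConnectionPool('Airship', ['cass1:9160'])
    devices = pycassa.ColumnFamily(pool, 'Devices')              # data rows, keyed by device id
    devices_by_app = pycassa.ColumnFamily(pool, 'DevicesByApp')  # index rows, keyed by app id

    def register_device(app_id, device_id, platform):
        # Write the data row first, then the index entry; if the second write fails,
        # the index simply misses this device and can be repaired later.
        devices.insert(device_id, {'app_id': app_id, 'platform': platform})
        devices_by_app.insert(app_id, {device_id: ''})   # the column name carries the value

    def devices_for_app(app_id, batch=1000):
        # Wide index row: one column per device registered to the app.
        return devices_by_app.get(app_id, column_count=batch).keys()
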
  • Battle Scars - Development
    • Assume failure in the client (see the sketch below)
      • Read timeout vs. connection refused
      • When maintaining your own indexes, try and clean up after failure
      • Be ready to clean up inconsistencies anyway
      • Verify client library assumptions and exception handling
        • Retry now vs. retry later?
        • Compensating action during failures?
    • Don’t avoid the Cassandra code
    • Embed for testing
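
A minimal sketch of the "read timeout vs. connection refused" distinction in pycassa terms: pool-level errors where nothing was reachable are safe to retry immediately, while exhausted retries may mean the write landed anyway. The pool exceptions are pycassa's; the retry-queue helper is a hypothetical stand-in:

    # Sketch: treat "no server reachable" differently from "retries exhausted (maybe written)".
    import time
    import pycassa
    from pycassa.pool import AllServersUnavailable, MaximumRetryException

    pool = pycassa.ConnectionPool('Airship', ['cass1:9160', 'cass2:9160'])
    devices = pycassa.ColumnFamily(pool, 'Devices')   # hypothetical column family

    def enqueue_for_retry(device_id, payload):
        # Placeholder for a durable retry queue (a log file, Kafka, etc.).
        pass

    def record_checkin(device_id, payload, attempts=3):
        for attempt in range(attempts):
            try:
                devices.insert(device_id, payload)
                return True
            except AllServersUnavailable:
                # Nothing accepted the write; back off and retry now.
                time.sleep(2 ** attempt)
            except MaximumRetryException:
                # pycassa already retried (e.g. on timeouts); the write *may* have landed.
                # Our inserts are idempotent, so the compensating action is to retry later.
                enqueue_for_retry(device_id, payload)
                return False
        return False
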
  • Battle Scars - Ops
    • Cassandra in EC2:
      • Ensure Dynamic Snitch is enabled
      • Disk I/O
        • Avoid EBS except for snapshot backups, or use S3
        • Stripe ephemerals, not EBS volumes
        • Avoid smaller instances altogether
      • Don’t always assume traversing close-proximity AZs is more expensive
      • Balance RAM cost vs. the cost of additional hosts and spending time w/ GC logs
  • Battle Scars - Ops
    • Java Best Practices:
      • All Java services are managed via the same set of scripts
      • In most cases, operators don’t treat Cassandra differently from HBase
      • Simple mechanism to take thread or heap dump
      • All logging is consistent - GC, application, stdx
      • Init scripts use the same scripts operators do
      • Bare metal will rock your world
      • +UseLargePages will rock your world too
  • Battle Scars - Ops
    • [Charts comparing bare metal vs. EC2 XL: ParNew GC effectiveness (MB collected), mean ParNew collection time (ms), and ParNew collection count]
  • Battle Scars - Ops
    • Java Best Practices cont’d:
      • Get familiar with GC logs (-XX:+PrintGCDetails); a log-skimming sketch follows this slide
      • Understand what degenerate CMS collection looks like
      • We settled at -XX:CMSInitiatingOccupancyFraction=60
      • Possibly experiment with tenuring threshold
      • When in doubt take a thread dump
        • TDA (http://java.net/projects/tda/)
        • Eclipse MAT (http://www.eclipse.org/mat/)
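
Getting familiar with GC logs mostly means reading them, but pulling ParNew pause times out programmatically makes trends obvious. A rough sketch, assuming typical -XX:+PrintGCDetails output where ParNew entries look like "[ParNew: 471872K->20351K(471872K), 0.0213 secs]"; the log path and threshold are made up:

    # Sketch: skim a GC log for slow ParNew pauses; format assumptions noted above.
    import re
    import sys

    PARNEW_SECS = re.compile(r'ParNew: .*?, ([0-9.]+) secs')

    def slow_parnew_pauses(path, threshold=0.05):
        """Yield (line_number, pause_seconds) for ParNew pauses over the threshold."""
        with open(path) as log:
            for lineno, line in enumerate(log, 1):
                match = PARNEW_SECS.search(line)
                if match and float(match.group(1)) > threshold:
                    yield lineno, float(match.group(1))

    if __name__ == '__main__':
        for lineno, secs in slow_parnew_pauses(sys.argv[1] if len(sys.argv) > 1 else 'gc.log'):
            print('line %d: ParNew pause %.3fs' % (lineno, secs))
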
  • Battle Scars - Ops
    • Understand when to compact
    • Understand upgrade implications for datafiles
    • Watch hinted handoff closely
    • Monitor JMX religiously (see the sketch below)
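
Most of what we watch - thread pool backlogs, hinted handoff, and so on - is exposed over JMX, and nodetool is just a JMX client. A crude, cron-able sketch of flagging pending-task backlogs; the host, threshold, and output parsing are assumptions, and the exact pool names vary by Cassandra version:

    # Sketch: poll thread pool stats (JMX, via nodetool tpstats) and flag backlogs.
    import subprocess

    def tpstats(host='localhost'):
        out = subprocess.check_output(['nodetool', '-h', host, 'tpstats']).decode()
        stats = {}
        for line in out.splitlines():
            parts = line.split()
            # Expect rows shaped like: "<PoolName> <Active> <Pending> <Completed> ..."
            if len(parts) >= 3 and parts[1].isdigit() and parts[2].isdigit():
                stats[parts[0]] = {'active': int(parts[1]), 'pending': int(parts[2])}
        return stats

    if __name__ == '__main__':
        for pool, counts in tpstats().items():
            if counts['pending'] > 100:   # arbitrary alert threshold
                print('WARN %s has %d pending tasks' % (pool, counts['pending']))
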
  • Looking Forward
    • Cassandra is a great hammer but not everything is a nail
    • Coprocessors would be awesome (hint hint)
    • Still spend too much time worrying about GC
    • Glad to see the ecosystem around the product evolving
      • CQL
      • Pig
      • Brisk
    • Guardedly optimistic about off heap data management
  • Thanks to
    • jbellis, driftx
    • Datastax
    • Whoever wrote TDA
    • SAP
  • Thanks!
    • Urban Airship: http://urbanairship.com/
    • We’re hiring! http://urbanairship.com/company/jobs/
    • Me @eonnen or erik at