Having been a user of Cassandra since 0.2 and a production user since 1.0, I’ve seen it, been there, done that and somehow survived. These are tales of learning when to ignore best practices, how to break rules, and knowing when and how to come up with strategies to push the limits of Cassandra.
Making It To Veteran Cassandra Status
1. MAKING IT TO VETERAN CASSANDRA STATUS
Been There, Done That, Survived
2. Eric Lubow @elubow
PERSONAL VANITY
๏ CTO of SimpleReach
๏ Co-Author of Practical Cassandra
๏ Skydiver, Mixed Martial Artist, Motorcyclist, Dog Dad (IG: @charliedognyc), NY Giants fan
3. Eric Lubow @elubow
SIMPLEREACH
๏ Identify the best content
๏ Use engagement metrics
๏ Stream processing ingest
๏ Many metrics, time sliced
๏ Multiple data stores
4. Eric Lubow @elubow
๏ Started using Cassandra at 0.2 in Sep of 2009
๏ First put Cassandra in production at 1.0
๏ Helped in building multiple drivers
๏ Filed lots of Jira tickets (40+)
๏ Beta tested features
๏ Large counter deployment (largest?)
AM I QUALIFIED TO BE A VETERAN
8. Eric Lubow @elubow
๏ Use Cassandra
๏ Dig in to the code from time to time (server and drivers)
๏ Know strengths and weaknesses and understand why
๏ Follow the changelogs and mailing lists
๏ Stress Cassandra in unconventional ways
๏ Learn the failure scenarios and how to fix them (hang out on IRC)
๏ Break the rules from time to time to see what happens
๏ “Those who cannot remember the past are condemned to repeat it.” - George Santayana
HOW DO I LEVEL UP?
10. Eric Lubow @elubow
๏ What’s the latest cool technology?
CHOOSING A DATABASE IS EASY, #AMIRITE
๏ What is my data volume?
๏ What are my query patterns?
๏ Is my data (un)structured?
๏ Will data remain consistent?
๏ Am I read heavy or write heavy?
๏ Am I batch loading data?
๏ Is eventually consistent data ok?
๏ Can I have a DR plan?
๏ Legal/compliance requirements?
๏ Are there experts/enterprise support?
๏ What’s the community like?
๏ Easy to administer?
๏ Tooling, monitoring, language support?
๏ Cloud or iron?
๏ High volume ingestion or batch loading?
๏ Fault tolerance?
๏ Open source vs enterprise system?
๏ Employee learning curve vs. learning cost?
15. Eric Lubow @elubow
USE-CASE: ADMINISTRATION
Cassandra
๏ Every node is the same base
๏ No master node
๏ All monitoring through JMX
๏ One step to add/remove nodes
๏ Tunables, lots of 'em
๏ Easily wrote our own Chef cookbook
๏ Goals
Mongo
๏ Config nodes, shard nodes, replica nodes
๏ Master/slave nodes, leader election
๏ Monitoring via mongostat sometimes
๏ Two steps to add/remove nodes
๏ No tunables
๏ Many poorly working Chef cookbooks
๏ Goals
BASICALLY JUST ME
16. Eric Lubow @elubow
๏ Primarily DataStax
๏ Community contributions
๏ Who is the community?
CASSANDRA IS OPEN SOURCE
17. Eric Lubow @elubow
SERIOUSLY 40+ JIRA TICKETS?
SPARK-6949 Pyspark and datetime
OPSC-6186 Rebalance - while calling decorator (IndexError): list index out of range
CASSANDRA-9871 Cannot replace token does not exist - DN node removed as Fat Client
OPSC-6045 Agent CPU on startup 800 Seconds
OPSC-5346 Opsc Repair service system_traces system_auth
CASSANDRA-7409 LCS improvement
CASSANDRA-8611 Socket timeout shitty default
CASSANDRA-9279 Gossip (and mutations) lock up on Startup
OPSC-4879 OpsC Agent JMX Connections and Cassandra Operations Fail Incessantly
CASSANDRA-8086 Too many connections - Cassandra Defense
CASSANDRA-7122 System peers
CASSANDRA-6506 Counters++ Final Performance
CASSANDRA-7510 Up node gossip messages -- affects drivers
PYTHON-202 More control for metadata updates
PYTHON-201 Optionally randomize contact points
OPSC-3672 OpsC - Repair Service Restarts on Node Flopping
DSP-3059 / SOLR-5463 Solr 4.10 - and Deep Paging
CASSANDRA-8548 Cleanup Dump
DSP-4560 Possible ticket Upgrade from 4.5.2 to 4.5.3
DSP-3341 In-memory Phase 2 (off heap and remove GB limit)
DSP-3970 Solr indexes even when values don't change
CASSANDRA-8150 Stump's JVM Tuning
18. Eric Lubow @elubow
SIMPLEREACH CONTEXT
๏ 100 million URLs
๏ 350 million Tweets
๏ 50k - 100k events per second (tens of billions of events per day)
๏ 225G new per hour
๏ 700T of total data (10T per month)
๏ 10T of hot data
๏ 72 nodes Cassandra cluster
๏ 52 Realtime Nodes
๏ 9 Search Nodes
๏ 11 Spark Nodes
20. Eric Lubow @elubow
๏ Average over 200k counter writes per second
๏ Pre-aggregate writes (saved us 10x the writes)
๏ Trying to defeat the counter time bomb
๏ Breaking the rules with CASSANDRA-8150
๏ Many many JVM tuning changes
๏ All things possible through monitoring
๏ Upgraded every node in the cluster by hand one at a time
๏ Upgrading to 2.1 definitely sealed the deal
CONQUERING COUNTERS
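The pre-aggregation bullet above can be sketched roughly as follows. This is a minimal illustration of the general idea, not SimpleReach's actual pipeline: the class name `CounterBuffer`, the `(url, metric)` key shape, the `flush_fn` callback, and the flush threshold are all assumptions made for the example. Increments are summed in memory and handed off as one delta per key, so ten events for the same key become a single counter write.

```python
from collections import Counter
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]  # (url, metric): hypothetical key shape for this sketch


class CounterBuffer:
    """Sum counter increments in memory and flush them as one delta per key."""

    def __init__(self, flush_fn: Callable[[Dict[Key, int]], None],
                 max_events: int = 10) -> None:
        self._flush_fn = flush_fn      # assumed to perform the real counter writes
        self._max_events = max_events  # flush after this many buffered events
        self._pending: Counter = Counter()
        self._events = 0

    def incr(self, url: str, metric: str, delta: int = 1) -> None:
        # Merge the increment into the in-memory total for this key.
        self._pending[(url, metric)] += delta
        self._events += 1
        if self._events >= self._max_events:
            self.flush()

    def flush(self) -> None:
        # Hand the aggregated deltas to the writer, then reset the buffer.
        if not self._pending:
            return
        batch = dict(self._pending)
        self._pending.clear()
        self._events = 0
        self._flush_fn(batch)


if __name__ == "__main__":
    writes = []  # stands in for the database in this demo
    buf = CounterBuffer(writes.append, max_events=10)
    for _ in range(10):
        buf.incr("http://example.com/a", "pageviews")
    print(len(writes), writes)
```

In a real deployment `flush_fn` would issue one counter update per key (for example a CQL `UPDATE ... SET value = value + ?` against a hypothetical counter table), and a timer-based flush would bound how long increments sit in memory. The trade-off is that buffered increments are lost if the process dies before flushing.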
21. Eric Lubow @elubow
๏ Nodes might have removed themselves from a cluster because the disk was full
๏ Apps might lose connections to the cluster and then take 45 min to reconnect (or longer on bigger clusters)
๏ A slow node might make the entire cluster unusable
๏ A poorly gossiping node might overwork itself out of the cluster
๏ Adding a node to the cluster might take down all connected apps
๏ Sometimes you just can’t removenode (or bootstrap)
UNDERSTAND FAILURE SCENARIOS
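One way to reason about the slow-reconnect scenario above is to make the application's reconnect schedule explicit. The slides do not say how the 45-minute reconnects were resolved, so the following is only a generic sketch of capped exponential backoff. The function name `backoff_delays` and its parameters are hypothetical and are not part of any Cassandra driver API.

```python
import random
from typing import Iterator


def backoff_delays(base: float = 1.0, cap: float = 60.0,
                   attempts: int = 8, jitter: float = 0.0) -> Iterator[float]:
    """Yield capped exponential delays (in seconds) between reconnect attempts."""
    for attempt in range(attempts):
        # Double the wait on each attempt, but never exceed the cap.
        delay = min(cap, base * (2 ** attempt))
        # Optional jitter (a fraction of the delay) spreads out clients
        # that all lost their connection at the same moment.
        yield delay + random.uniform(0, jitter * delay)


if __name__ == "__main__":
    for wait in backoff_delays(base=1.0, cap=10.0, attempts=6):
        print(f"retry after {wait:.1f}s")
```

Capping the delay keeps the worst-case wait between attempts bounded, which matters on larger clusters where an uncapped schedule can leave an application disconnected long after the nodes are healthy again.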
22. Eric Lubow @elubow
WHAT SHOULD YOU WALK AWAY WITH?
๏ Incredibly important to have a deep understanding of your use cases
๏ Sometimes database tuning has nothing to do with database settings
๏ Understand failure scenarios for your use cases
๏ Give back, it helps everything get better
๏ Ignoring best practices is almost never a good idea