Yes, but it’s a distributed system: tables are organized around wide partitions, and query planning moves into your application.
Joins don’t scale in distributed systems. Denormalize your data and create a table for each query, using wide partitions.
Yes, but secondary indexes are for niche use cases and aren’t performant on high-cardinality columns. Instead, denormalize: disk is cheap, and writes ARE performant.
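A minimal sketch of the table-per-query approach, assuming the DataStax Python driver (cassandra-driver); the keyspace and table names are hypothetical:

```python
# Table-per-query denormalization: the same video is written into one table
# per access pattern instead of relying on a join or a secondary index.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical keyspace

# Query: "videos added by a given user, newest first" -> partition by user_id.
session.execute("""
    CREATE TABLE IF NOT EXISTS videos_by_user (
        user_id  uuid,
        added    timestamp,
        video_id uuid,
        title    text,
        PRIMARY KEY ((user_id), added, video_id)
    ) WITH CLUSTERING ORDER BY (added DESC, video_id ASC)
""")

# Query: "videos with a given tag, newest first" -> partition by tag.
session.execute("""
    CREATE TABLE IF NOT EXISTS videos_by_tag (
        tag      text,
        added    timestamp,
        video_id uuid,
        title    text,
        PRIMARY KEY ((tag), added, video_id)
    ) WITH CLUSTERING ORDER BY (added DESC, video_id ASC)
""")
```

Each read then hits a single partition; the extra write amplification is the cheap part in Cassandra.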
Spark is a tool for running batch or streaming analytics; it is meant for analytical processing, not tight customer-facing SLAs.
TL;DR: don’t treat Cassandra like a relational database in your application design and query planning.
How are you going to deploy your app? Cloud or hardware? Test with this in mind; don’t just use machines because you have them. Think about how you can best reproduce your real-world conditions.
Choose options that best reflect the systems you will actually run on and that are compatible with C*. On your own hardware, choose commodity servers with killer SSDs; in the cloud, interrogate your hardware and make sure you’re on machines with decent CPUs and SSDs as well.
Up your memory game for search and analytics.
Cassandra wasn’t built to run on one node; test with at least 5 nodes and real replication.
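As a hedged sketch (contact points, DC name, and keyspace are hypothetical), a test keyspace that actually exercises replication on a five-node cluster:

```python
# RF=3 on a 5-node cluster means every operation touches real replica sets,
# which a single-node test never does.
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS load_test
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")
```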
Try 10 clients instead of one. cassandra-stress is a good start, but it’s for basic exploration. You should load test with real data and real clients!
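One way past single-client stress runs, sketched with the Python driver’s concurrent helper; the table schema and the load_production_sample() helper are hypothetical stand-ins for your real data:

```python
# Drive load through the same driver your app uses, with many in-flight
# requests, instead of one synchronous client.
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

session = Cluster(["10.0.0.1"]).connect("load_test")
insert = session.prepare(
    "INSERT INTO events (user_id, ts, payload) VALUES (?, ?, ?)")

rows = load_production_sample()  # hypothetical: yields production-shaped tuples
results = execute_concurrent_with_args(
    session, insert, rows, concurrency=100)  # 100 concurrent requests
```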
1. Days, at a minimum; performance can only be evaluated over time.
2. What happens when compactions and repairs kick in?
3. What happens when you add/remove nodes?
Don’t use meagre datasets. Exceed RAM. Put 1TB+ on each node.
Beware apples-to-oranges comparisons. Are your settings the same?
2. Batches in Cassandra are rarely a performance optimization. Unlogged batches that span partitions create lots of work for your coordinator node (first sketch below).
3. Load balancers in front of Cassandra are a bottleneck and a single point of failure. DataStax drivers load balance for you! They also handle retries and failover (second sketch below).
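On the batch point, a minimal sketch of the alternative: individual asynchronous writes through the Python driver instead of an unlogged batch spanning many partitions (table and data are hypothetical):

```python
# Each async write is routed straight to its replicas; no single coordinator
# has to fan a multi-partition batch out across the cluster.
from uuid import uuid4
from datetime import datetime, timezone
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect("my_keyspace")
insert = session.prepare(
    "INSERT INTO readings (sensor_id, ts, value) VALUES (?, ?, ?)")

readings = [(uuid4(), datetime.now(timezone.utc), 21.5),
            (uuid4(), datetime.now(timezone.utc), 19.8)]
futures = [session.execute_async(insert, row) for row in readings]
for f in futures:
    f.result()  # block until done, surfacing any failures
```

Batches to a single partition remain a legitimate use; it’s the multi-partition ones that hurt.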
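And on the load-balancing point, a sketch of letting the driver do the work; contact points and the DC name are hypothetical:

```python
# Token-aware routing on top of DC-aware round robin sends each request to a
# replica for its partition key; the driver also applies a retry policy.
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import (DCAwareRoundRobinPolicy, RetryPolicy,
                                TokenAwarePolicy)

profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1")),
    retry_policy=RetryPolicy(),
)
cluster = Cluster(["10.0.0.1", "10.0.0.2"],
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()
```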
Backing up Cassandra sounds hard. Do we really have to run snapshots on every node and then copy the snapshots to S3?
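Yes, and it’s scriptable. A hedged sketch of the snapshot-then-upload flow on one node (paths, bucket, and keyspace names are hypothetical; a real setup runs this on every node):

```python
# Take a tagged snapshot with nodetool, then ship the snapshot files to S3.
import subprocess
from pathlib import Path
import boto3

subprocess.run(["nodetool", "snapshot", "-t", "nightly", "my_keyspace"],
               check=True)

s3 = boto3.client("s3")
data_dir = Path("/var/lib/cassandra/data/my_keyspace")
for f in data_dir.glob("*/snapshots/nightly/*"):
    if f.is_file():
        s3.upload_file(str(f), "my-backup-bucket",
                       f"node1/{f.relative_to(data_dir)}")
```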
We have to run repair on every node, every 10 days? Really? Let’s start off doing it monthly and see how things go…
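The counterargument: repair has to complete within gc_grace_seconds (10 days by default), or deleted data can resurrect. A rolling schedule is straightforward to script; the host list and ssh mechanism here are hypothetical:

```python
# Primary-range repair (-pr) on each node in turn repairs every range exactly
# once across the cluster.
import subprocess

nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"]
for host in nodes:
    subprocess.run(["ssh", host, "nodetool", "repair", "-pr"], check=True)
```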
What’s causing the high latency on that slowest 1% of requests? Doesn’t matter. It’s only 1%.
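It does matter: at scale the slowest 1% is every user, many times a day. A quick client-side sketch for seeing your own tail latency (the query and key are hypothetical, reusing the driver session from the earlier sketches):

```python
# Time each query and look at the 99th percentile, not the average.
import time
import statistics
from uuid import uuid4

some_id = uuid4()  # hypothetical partition key; use real keys from your data
latencies = []
for _ in range(10_000):
    start = time.perf_counter()
    session.execute("SELECT * FROM readings WHERE sensor_id = %s", [some_id])
    latencies.append(time.perf_counter() - start)

p99 = statistics.quantiles(latencies, n=100)[98]  # 99th percentile
print(f"p99 latency: {p99 * 1000:.1f} ms")
```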
We need to plan for future growth. Let’s just purchase 5 extra nodes and hope that’s enough…
- Repair service – automatically keeps data consistent across a cluster.
- Backup service – smart and simple backup and restore management for all managed clusters.
- Capacity service – enables historical trend analysis and forecasts future resource needs.
- Proactive Alerts & External Notifications.
- Best Practices: slow query logs.
Bad Habits Die Hard
Staying on the Right Cassandra Path