Presenter: Gurashish Brar, Member of Technical Staff at Bloomreach
Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastructure
1. Cassandra Compute Cloud
An Elastic Cassandra Infrastructure
Gurashish Singh Brar
Member of Technical Staff @ BloomReach
2. Abstract
Dynamically scaling Cassandra to serve hundreds of map-reduce jobs that arrive at an unpredictable rate, while at
the same time giving front-end applications real-time access to the data under strict TP95 latency
guarantees, is a hard problem.
We present a system for managing Cassandra clusters that provides the following functionality:
1) Dynamic scaling of capacity to serve high-throughput map-reduce jobs
2) Real-time access for front-end applications to data generated by map-reduce jobs, with latency
SLAs for TP95
3) Low cost, achieved by leveraging Amazon Spot Instances and demand-based scaling.
At the heart of this infrastructure lies a custom data replication service that makes it possible to stream data to
new nodes as needed.
3. What is it about?
• Dynamically scaling the infrastructure to support large EMR jobs
• Throughput SLA to backend applications
• TP95 latency SLA to frontend applications
• Cassandra 2.0 using vnodes
4. Agenda
• Application requirements
• Major issues we encountered
• Solutions to the issues
5. Application Requirements
• Backend EMR jobs performing scans, lookups and writes
Heterogeneous applications with varying throughput SLAs
Very high peak loads
Always available (no maintenance periods or planned downtimes)
• Frontend applications performing lookups
Data from backend applications expected in real time
Low latencies
• Developer support
6. How we started
Frontend Applications
Frontend
DC
Backend
DC
Cassandra Cluster
EMR Jobs
11. Backend Issue: Fixed Resource
Cassandra Cluster
Backend
DC
EMR Jobs
EMR Jobs
EMR Jobs
EMR Jobs
EMR Jobs
EMR Jobs
EMR Jobs
EMR Jobs
12. Backend Issue: Starvation
Backend
DC
Cassandra Cluster
Large EMR Jobs
with
relaxed SLA
Small EMR job
with
tighter SLA
13. Summary of Issues
• Frontend isolation is not perfect
• Frontend latencies are impacted by backend write load
• EMR jobs can overwhelm the Cassandra cluster
• Large EMR jobs can starve smaller ones
14. Rate Limiter
Frontend Applications
Frontend
DC
Backend
DC
Cassandra Cluster
EMR Jobs
Token Server
(Redis)
15. Rate Limiter
• QPS is allocated per operation and per application
• Operations include scans, reads, writes, prepare, alter, create, etc.
• Each mapper/reducer obtains permits for 1 minute (configurable)
• The token bucket is periodically refreshed with the allocated capacity
• Quotas are dynamically adjusted to take advantage of other applications' unused quotas
(we do want to maximize cluster usage)
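The permit flow above can be sketched as a token bucket. The real system keeps the counters in Redis so that every mapper/reducer shares one atomic fetch-and-add; the in-memory stand-in below (a lock playing the role of Redis atomicity) is an illustrative sketch, with hypothetical names and parameters:

```python
import threading
import time

class TokenBucket:
    """In-memory stand-in for the Redis-backed token bucket.

    One bucket per (application, operation) pair. The lock stands in for
    Redis's atomic fetch-and-add; all names here are illustrative.
    """

    def __init__(self, qps_quota, refresh_interval_s=60):
        self.qps_quota = qps_quota                  # allocated QPS for this (app, op)
        self.refresh_interval_s = refresh_interval_s
        self.tokens = qps_quota * refresh_interval_s
        self.last_refresh = time.monotonic()
        self.lock = threading.Lock()

    def _maybe_refresh(self):
        # Periodically refill the bucket with the allocated capacity.
        now = time.monotonic()
        if now - self.last_refresh >= self.refresh_interval_s:
            self.tokens = self.qps_quota * self.refresh_interval_s
            self.last_refresh = now

    def acquire(self, permits):
        """Atomically take up to `permits` tokens; returns the number granted."""
        with self.lock:
            self._maybe_refresh()
            granted = min(permits, self.tokens)
            self.tokens -= granted
            return granted

# A mapper asks for one minute's worth of permits up front:
bucket = TokenBucket(qps_quota=100, refresh_interval_s=60)
granted = bucket.acquire(100 * 60)   # all 6000 granted from a full bucket
```

A mapper that receives fewer permits than requested throttles itself until the next refresh, which is what keeps aggregate load on the cluster within the allocation.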
16. Why Redis ?
• Handles high load from all EMR nodes
• Low latency
• Supports a high number of concurrent connections
• Supports atomic fetch-and-add
17. Cost of Rate Limiter
• We converted EMR from an elastic resource to a fixed resource
• To scale EMR we have to scale Cassandra
• Adding capacity to Cassandra cluster is not trivial
• Adding capacity under heavy load is harder
• Automatically scaling up and down under heavy load is even harder
18. Managing capacity - Requirements
• Time to increase capacity should be in minutes
• Programmatic management and not manual
• Minimum load on the production cluster during the operation
23. Custom Replication Service
• The Replication Service (source node) takes a snapshot of the column family
• SSTables in the snapshot are streamed evenly to the destination cluster
• The Replication Service (destination node) splits a single source SSTable into N SSTables
• Splits are computed using the SSTableReader & SSTableWriter classes. A single SSTable can
be split in parallel by multiple threads
24. Custom Replication Service
• Once split, the new SSTables are streamed to the correct destination nodes
• A rolling restart is initiated on the destination cluster (we could have used nodetool refresh,
but it was unreliable)
• The cluster is ready for use
• In parallel, compaction is triggered on the destination cluster to optimize reads
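The split step amounts to routing each row of a source SSTable to the destination node that owns its token. A minimal sketch, assuming a simplified ring where each node owns the range ending at its token (the actual service works on SSTable files via SSTableReader/SSTableWriter, not in-memory rows):

```python
from bisect import bisect_left

def split_by_token_range(rows, ring_tokens):
    """Assign each (token, row) pair from one source SSTable to the
    destination node owning that token.

    `ring_tokens` is a sorted list of (token, node) pairs describing the
    destination ring; a node owns the range ending at its token. This is
    an illustrative model of the split, not the real SSTable code.
    """
    tokens = [t for t, _ in ring_tokens]
    splits = {node: [] for _, node in ring_tokens}
    for token, row in rows:
        # First ring token >= the row token owns the row; wrap past the end.
        i = bisect_left(tokens, token) % len(tokens)
        splits[ring_tokens[i][1]].append(row)
    return splits
```

Because rows within one SSTable are independent under this routing, several threads can each take a slice of the source SSTable and split it concurrently, which is what the service does.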
25. Cluster Provisioning
• Estimate the required cluster size based on the column family's disk size on the source cluster
• Provision machines on AWS (Cassandra is pre-installed on the AMI, so no setup is required)
• Generate yaml and topology files for the new cluster and create a backend datacenter
(application agnostic)
• Copy the schema from the source cluster to the destination cluster
• Call the replication service on the source cluster to replicate the data
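The sizing step reduces to simple arithmetic. The talk only says sizing is based on the column family's disk footprint on the source cluster; the replication factor and fill-target parameters below are illustrative assumptions:

```python
import math

def estimate_cluster_size(cf_disk_bytes, replication_factor,
                          node_disk_bytes, fill_target=0.5):
    """Estimate destination cluster size from the column family's disk
    footprint on the source cluster.

    `fill_target` leaves headroom for compaction; all parameters here
    are illustrative assumptions.
    """
    needed = cf_disk_bytes * replication_factor
    usable_per_node = node_disk_bytes * fill_target
    return max(1, math.ceil(needed / usable_per_node))

# e.g. a 2 TB column family, RF=3, 1.6 TB disks kept at most half full:
nodes = estimate_cluster_size(2e12, 3, 1.6e12)   # -> 8 nodes
```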
26. C* Compute Cloud
Source
Cluster
Cluster Management
service
On-demand cluster
On-demand cluster
On-demand cluster
On-demand cluster
EMR Jobs
27. C* Compute Cloud
• Very high throughput in moving raw data from the source to the destination cluster (10x
increase in network usage compared to normal)
• Little CPU/memory load on the source cluster
• Leverages the size of the destination cluster to compute new SSTables for the new ring
• Time to provision varies between 10 and 40 minutes
• API driven so automatically scales up and down with demand
• Application agnostic
28. C* Compute Cloud - Limitations
• Snapshot model: take a snapshot of production and operate on it
This works really well for some use cases and is good for most, but not all
• Provisioning time is on the order of minutes
This works for EMR jobs, which themselves take a few minutes to provision, but not for
dedicated backend applications
• Writes still need to happen on the production reserved cluster
29. Where we are now
Frontend Applications
Frontend
DC
Backend
DC
Cassandra Cluster
EMR Jobs
On-demand cluster
Token Server (Redis)
On-demand cluster
On-demand cluster
Replication
Cluster Management
service
30. Exploiting the C* compute cloud
• Key feature: easy, automated and fast cluster provisioning with
production data
• Use Spot Instances instead of On-Demand
• Failures of a few nodes are survivable thanks to C* redundancy
• In case of too many failures, just rebuild on retry (it's fast and automatic!)
31. Spot Instances
• The service supports all AWS instance types and all AZs
• It picks the optimal Spot Instance type & AZ: the cheapest one that
satisfies the constraints
• This further reduces cost and improves the reliability of the service
• If the r3.2xlarge spot price spikes, the service might pick c3.8xlarge on retry
• Clusters auto-expire so the system adjusts automatically to cheaper instances
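The selection step can be sketched as picking the minimum-price offer that meets the capacity constraints. The offer structure and constraint parameters below are assumptions for illustration; the real service reads live AWS spot prices:

```python
def pick_spot_instance(offers, min_memory_gb, min_vcpus):
    """Pick the cheapest (instance_type, az) spot offer satisfying the
    capacity constraints.

    `offers` maps (instance_type, az) -> {"price", "memory_gb", "vcpus"};
    this structure is a hypothetical stand-in for live spot pricing data.
    """
    candidates = [
        (spec["price"], key)
        for key, spec in offers.items()
        if spec["memory_gb"] >= min_memory_gb and spec["vcpus"] >= min_vcpus
    ]
    if not candidates:
        raise ValueError("no spot offer satisfies the constraints")
    return min(candidates)[1]   # cheapest qualifying (type, az)

offers = {
    ("r3.2xlarge", "us-east-1a"): {"price": 0.20, "memory_gb": 61, "vcpus": 8},
    ("c3.8xlarge", "us-east-1b"): {"price": 0.55, "memory_gb": 60, "vcpus": 32},
}
# Normally the cheaper r3.2xlarge wins; if its price spiked above
# c3.8xlarge, the same call would fall through to c3.8xlarge on retry.
choice = pick_spot_instance(offers, min_memory_gb=32, min_vcpus=8)
```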
32. Cost or Capacity (take your pick)
Capacity of the C* compute cloud on spot instances
~=
5-10x the capacity of a C* cluster using on-demand instances
for the same $ value
33. Issues Addressed
• Backend read capacity can scale linearly with the C* compute cloud
• Frontend latencies are protected from write load through rate limiting
34. Remaining issues
• Read load on the backend DC can spill over to the frontend DC, causing spikes
• Write capacity is still constrained by frontend latencies
37. Addressing the Write Capacity
• The obvious: only push updates that actually changed
Big improvement: 80-90% of the data did not change
• Add more nodes: with the backend read load off production, it is a lot
easier to expand capacity
• But we are still operating at roughly a third to a fifth of the write capacity to keep read
latencies low
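The "only push what changed" idea can be sketched with a content digest per row: keep the digest of each row from the previous run and skip rows whose digest is unchanged. The digest scheme and helper names here are hypothetical, not from the talk:

```python
import hashlib

def changed_rows(rows, previous_digests):
    """Filter out rows whose content is unchanged since the last push.

    `previous_digests` maps row key -> content digest from the previous
    run. With 80-90% of rows unchanged (per the talk), skipping them cuts
    write load proportionally. The digest scheme is an assumption.
    """
    to_push, new_digests = [], {}
    for key, value in rows:
        digest = hashlib.sha1(repr(value).encode()).hexdigest()
        new_digests[key] = digest
        if previous_digests.get(key) != digest:
            to_push.append((key, value))
    return to_push, new_digests
```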
38. Addressing the Write Capacity
• Experimental changes under evaluation
• Prioritize reads over writes on the frontend
Pause the write stage during a read
• Reduce replication load from the backend DC to the frontend DC
Column-level replication strategy
Most frontend applications operate on a subset view of the backend data
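The column-level idea: since most frontend applications read only a subset of columns, replication to the frontend DC could project each row down to that subset before shipping it, shrinking cross-DC write volume. A minimal, hypothetical sketch (column names invented for illustration):

```python
def project_columns(row, frontend_columns):
    """Strip a backend row down to the column subset the frontend DC
    needs before replicating it cross-DC. A sketch of the column-level
    replication-strategy idea; all names here are hypothetical.
    """
    return {col: val for col, val in row.items() if col in frontend_columns}

# A backend row carrying heavy intermediate columns the frontend never reads:
backend_row = {"id": 1, "title": "widget", "debug_blob": "x" * 1024}
slim_row = project_columns(backend_row, {"id", "title"})
```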
39. Key Takeaways
• Scale Cassandra dynamically for backend load by creating snapshot
clusters
• Use a rate limiter to protect the production cluster from spiky and
unexpected backend traffic
• Build better isolation between the frontend DC and the backend DC
• Write throughput from backend to frontend remains a challenge