C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos

MetricsHub is a monitoring and scalability service for public clouds that lets companies continuously gather data from their systems and auto-scale their deployments to optimize service costs. Taking advantage of Cassandra's rapid ingestion rates, reliable replication model, and ease of deployment, MetricsHub can handle billions of data points per day. This session covers the architecture supporting the service, which combines PaaS and IaaS on the Windows Azure platform.

Published in: Technology, Business
  • All state is maintained in Cassandra or SQL
  • Examples: ping customer endpoint; pull load balancer stats; identify if a VM set is overloaded
      • Huge scale and highly reliable framework (10s of thousands of jobs; no downtime)
      • All jobs are isolated by task (e.g. ping URL) and customer
      • Communicates with Cassandra using FluentCassandra (.NET)
      • Requests round robin balanced over 8 endpoints
      • Data stream is massive (100k writes / sec) and needs to be resilient
  • Integrates with other partner services (e.g. Windows Azure store)
      • Used by MetricsHub client agents (on customer machines)
      • Based on .NET (C#) WebAPIs
      • Persists all customer data (writes) to Cassandra only
  • .NET based using MVC + IIS
      • Heavy use of jQuery / JavaScript on the client side
      • 15+ OSS components are used in the portal
      • Bundled & shipped 1-click deployment
      • Updated our production portal several times a day
  • FluentCassandra
      • All reads / writes for metric data go to this cluster; no need for a cache
      • 40+ VMs connect to this cluster
    1. Optimizing the Public Cloud for Cost and Scalability with Cassandra
        Charles Lamanna, Senior Development Lead (@clamanna)
        Ricardo Villalobos, Senior Cloud Architect (@ricvilla)
    2. MetricsHub: keep services up and running for the lowest possible cost
    3. Live Status; Cost Awareness; Alerts and Notifications; Actions and Scaling (#CASSANDRA13)
    4. Growth: 2000+ customers in 6 months
    5. [Chart: Number of MetricsHub Customers, 10/18/2012 to 6/25/2013; y-axis 0 to 2,500]
    6. [Chart: Number of VMs Monitored by MetricsHub, 10/18/2012 to 6/25/2013; y-axis 0 to 9,000]
    7. [Chart: Number of MetricsHub Employees, 10/18/2012 to 6/25/2013; y-axis 0 to 8]
    8. Storing data: 200M data points per hour
    9. Planning for huge data ingestion rates
        • MetricsHub requires high scale, real-time data:
            • 1,000 data points per minute per VM
            • 12 data points per endpoint per minute
            • 500+ data points per storage account per hour
        • Need to aggregate, analyze and take actions based on this data stream (in near real-time)
        • Must be cheap, scalable and reliable
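To put the rates on slides 8 and 9 on a common scale, the arithmetic can be sketched as below. This is only a back-of-the-envelope conversion of the figures quoted in the deck; the constant and function names are illustrative, and the "VM equivalents" figure is a derived quantity, not a number the speakers state.

```python
# Back-of-the-envelope conversion of the ingest figures quoted in the slides.
POINTS_PER_VM_PER_MINUTE = 1_000       # slide 9
POINTS_PER_ENDPOINT_PER_MINUTE = 12    # slide 9
TOTAL_POINTS_PER_HOUR = 200_000_000    # slide 8


def per_hour(points_per_minute: int) -> int:
    """Convert a per-minute rate to a per-hour rate."""
    return points_per_minute * 60


def per_second(points_per_hour: int) -> float:
    """Convert a per-hour rate to a per-second rate."""
    return points_per_hour / 3600


vm_points_per_hour = per_hour(POINTS_PER_VM_PER_MINUTE)
endpoint_points_per_hour = per_hour(POINTS_PER_ENDPOINT_PER_MINUTE)
total_points_per_second = per_second(TOTAL_POINTS_PER_HOUR)

# Derived, hypothetical figure: how many fully instrumented VMs would
# produce the total hourly volume on their own.
vm_equivalents = TOTAL_POINTS_PER_HOUR / vm_points_per_hour

print(f"{vm_points_per_hour:,} points/hour per VM")
print(f"{endpoint_points_per_hour:,} points/hour per endpoint")
print(f"{total_points_per_second:,.0f} points/second overall")
print(f"{vm_equivalents:,.0f} VM equivalents")
```

The overall rate works out to tens of thousands of data points per second, which is the load any candidate store had to absorb continuously.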
    10. Looked at Redis…
        • Perform aggregation in memory (using INCR and other native operations)
        • Flush aggregate data from Redis to persistent storage at a regular interval
        • Is fast and powerful, with a good OSS community
    11. … but it was fragile, and expensive for this use case
        • RAM/Memory in the public cloud is *expensive* (but storage is *cheap*)
        • Flushing the data requires complex coordination
        • If we did not flush quickly enough: out of memory!
    12. Looked at SQL…
        • Create tables for different time windows and granularities
        • Roll over from table to table (and drop entire tables when the data expires)
        • Update in place (for counters, min, max, etc.) in a reliable way
    13. … but SQL did not fit
        • Higher write than read volume pushed the boundaries of the servers
        • Requires complex sharding after just a few dozen new customers
        • Is possible, but not worth the operational cost
    14. Then we tried Cassandra (and never went back)
        • Scales fluidly
        • Grows horizontally: double the nodes, double the capacity
        • Add / remove capacity / nodes with no downtime
        • Highly available
        • No single point of failure
        • Replication factor (i.e. hot copies) is just a config switch
    15. … and by the way
        • Little to no operations cost
        • New nodes take minutes to set up
        • Nodes just keep running for months on end
        • “Aggregate on write”: no jobs required!
        • Atomic distributed counters make it easy to do aggregates on write
        • …and a nice kicker: has *great* perf / COGS in Azure
    16. Architecture: 68 virtual machines (PAAS and IAAS)
    17. [Architecture diagram]
        • Components: Table Storage; SQL Database; Blob storage; Jobs Worker Role (24 instances); Portal Web Role (3 instances); Web API Web Role (8 instances); Cassandra VM Cluster (32 XL instances)
        • Clients: End User Web Browsers; Monitored Customer Resources (e.g. websites; SQL databases); Monitored Virtual Machines
        • Tier labels: Clients; PaaS; IaaS; Services
        • Other labels: Endpoints; Replicated data in multiple datacenters
    18. Avoiding state
        • Application logic / code all lives on stateless machines
        • Keeps it simple: decreases human operations cost
        • Use Azure PAAS offerings (Web and Worker roles)
    19. Windows Azure Cloud Services (PAAS)
        • Scale horizontally (grew from 1 to 30+ instances)
        • Managed by the platform (patched; coordinated recycling; failover; etc.)
        • 1 click deployment from Visual Studio (with automatic load balancer swaps)
    20. Jobs Worker Role (24 instances)
        • Runs recurring tasks to pull, generate and analyze data
        • Jobs are synchronized and scheduled using Windows Azure Tables and Queues
    21. Web API Web Role (8 instances)
        • RESTful endpoint for saving and reading custom metrics
        • Highly concurrent, secure & scalable
    22. Portal Web Role (3 instances)
        • Interface for our customers: shows trends, charts and issues
    23. Cassandra VM Cluster (32 XL instances)
        • Maintains all state for metrics / time series data
    24. Windows Azure Virtual Machines (IaaS)
        [Diagram labels: Starting; Select Image and VM Size; New Disk Persisted in Storage]
    25. 32 nodes, 8 “pods” of 4 nodes
    26. Exposing the pods
        • Each pod of 4 nodes has a single load balanced endpoint
        • Clients (on our stateless roles) treat the endpoints as a pool
        • Clients blacklist and skip an endpoint if it starts producing a lot of errors
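The pool behaviour described on slide 26 (round robin over the pod endpoints, skipping any that misbehave) could look roughly like the sketch below. It is not MetricsHub's client code: the `EndpointPool` class, the error threshold, and the cooldown after which a blacklisted endpoint is tried again are all assumptions made for illustration, since the slides only say that failing endpoints are blacklisted and skipped.

```python
import itertools
import time


class EndpointPool:
    """Round-robin pool that temporarily skips endpoints with repeated errors.

    max_errors and cooldown_seconds are hypothetical tuning knobs.
    """

    def __init__(self, endpoints, max_errors=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self._endpoints = list(endpoints)
        self._max_errors = max_errors
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._errors = {e: 0 for e in self._endpoints}
        self._blacklisted_until = {}
        self._cycle = itertools.cycle(self._endpoints)

    def next_endpoint(self):
        """Return the next usable endpoint in round-robin order."""
        now = self._clock()
        for _ in range(len(self._endpoints)):
            endpoint = next(self._cycle)
            until = self._blacklisted_until.get(endpoint)
            if until is None:
                return endpoint
            if now >= until:
                # Cooldown elapsed: give the endpoint another chance.
                del self._blacklisted_until[endpoint]
                self._errors[endpoint] = 0
                return endpoint
        raise RuntimeError("all endpoints are blacklisted")

    def report_success(self, endpoint):
        """Reset the consecutive error count after a successful request."""
        self._errors[endpoint] = 0

    def report_error(self, endpoint):
        """Count an error and blacklist the endpoint once the threshold is hit."""
        self._errors[endpoint] += 1
        if self._errors[endpoint] >= self._max_errors:
            self._blacklisted_until[endpoint] = self._clock() + self._cooldown
```

A caller would wrap each Cassandra request in `next_endpoint()` followed by `report_success()` or `report_error()`, so one unhealthy pod stops receiving traffic without any central coordination.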
    27. Where does the data go?
        • Data files are on 8 mounted network-backed disks (*not* ephemeral disks)
        • Data disks are geo-replicated (3 copies local; 1 remote) for “free” DR
        • Azure data disks offer great throughput (VMs end up CPU bound)
    28. Our Column Families (CQL3)
            CREATE TABLE oneminute (
                rk text,
                ck text,
                cnt counter,
                sum counter,
                PRIMARY KEY (rk, ck)
            );
    29. Updating values…
        Realtime “average” values at any granularity, for any time window
            update oneminute/tenminute/oneday
            set sum = sum + {sample_value},
                cnt = cnt + 1
            where rk = {customer_name}
              and ck = {metric_path}
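The "aggregate on write" idea behind this update can be modelled in memory as follows. The sketch stands in for the three counter tables with plain dictionaries and is not a Cassandra client; the `CounterStore` class, the bucket sizes, and the `{metric}:{bucket}` layout of the column key are assumptions, because the slides do not show how the time bucket is encoded in `ck`. Values are integers because Cassandra counter columns hold 64-bit integers.

```python
from collections import defaultdict

# Table names come from the slide; the bucket widths are assumed from them.
BUCKET_SECONDS = {"oneminute": 60, "tenminute": 600, "oneday": 86_400}


class CounterStore:
    """In-memory stand-in for the sum/cnt counter tables."""

    def __init__(self):
        # table -> (rk, ck) -> {"sum": int, "cnt": int}
        self._rows = {
            table: defaultdict(lambda: {"sum": 0, "cnt": 0})
            for table in BUCKET_SECONDS
        }

    @staticmethod
    def _column_key(metric, bucket):
        # Hypothetical key layout: metric path plus zero-padded bucket start.
        return f"{metric}:{bucket:010d}"

    def record(self, customer, metric, timestamp, value):
        """Apply one sample to every granularity, like the update on the slide."""
        for table, width in BUCKET_SECONDS.items():
            bucket = timestamp - timestamp % width
            row = self._rows[table][(customer, self._column_key(metric, bucket))]
            row["sum"] += value
            row["cnt"] += 1

    def average(self, table, customer, metric, bucket):
        """Return sum / cnt for one bucket, or None if nothing was recorded."""
        row = self._rows[table].get((customer, self._column_key(metric, bucket)))
        if row is None or row["cnt"] == 0:
            return None
        return row["sum"] / row["cnt"]


store = CounterStore()
store.record("contoso", "vm1/cpu", 120, 40)
store.record("contoso", "vm1/cpu", 150, 60)
store.record("contoso", "vm1/cpu", 190, 80)
print(store.average("oneminute", "contoso", "vm1/cpu", 120))
print(store.average("tenminute", "contoso", "vm1/cpu", 0))
```

Because every write already increments the sum and count for each granularity, the average for any bucket is a single division at read time and no separate roll-up job is needed.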
    30. Reading values…
        *ONE* round trip to fetch a metric over time (e.g. CPU over past week)
            select * from oneminute
            where rk = {customer_name}
              and ck < {metric_path_start}
              and ck >= {metric_path_end}
            order by ck desc;
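The single-round-trip read works because the column keys within a row are kept in sorted order, so a time window is one contiguous slice. A minimal sketch of that slice, using a sorted Python list in place of a Cassandra partition, is below; the `range_desc` helper and the zero-padded `{metric}:{bucket}` key format are assumptions for illustration, chosen so that string order matches time order.

```python
import bisect


def range_desc(sorted_keys, start_exclusive, end_inclusive):
    """Return keys k with end_inclusive <= k < start_exclusive, newest first.

    Mirrors the bounds of the query on the slide: ck < start and ck >= end,
    ordered by ck descending. sorted_keys must be in ascending order.
    """
    lo = bisect.bisect_left(sorted_keys, end_inclusive)
    hi = bisect.bisect_left(sorted_keys, start_exclusive)
    return sorted_keys[lo:hi][::-1]


# Hypothetical column keys for one customer row, one per minute bucket.
keys = [
    "vm1/cpu:0000000000",
    "vm1/cpu:0000000060",
    "vm1/cpu:0000000120",
    "vm1/cpu:0000000180",
]

window = range_desc(keys, "vm1/cpu:0000000180", "vm1/cpu:0000000060")
print(window)
```

Zero-padding matters here: without it, a bucket such as `900` would sort after `1200` as text and the slice would no longer correspond to a time window.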
    31. What’s next?
        • Windows Azure Virtual Networks to connect / secure all of our resources (PAAS + IAAS + Services)
        • Expand Cassandra cluster across datacenter boundaries for improved availability
        • Integrate with more off-the-shelf Azure components to reduce operational overhead
    32. [Diagram labels: Global Physical Infrastructure (servers / network / datacenters); REST API + other services; compute; data management; networking]
