C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf

The Comcast Silicon Valley Innovation Center has developed a general purpose message bus for the cloud. The service is API compatible with Amazon's SQS/SNS and is built on Cassandra and Redis with the goal of linear horizontal scalability. This presentation offers an in-depth look at the architecture of the system and how it employs Cassandra as a central component to meet key requirements. Latest feature enhancements and performance data will also be covered.

Published in: Technology

Transcript

  • 1. CMB – A Message Bus for the Cloud
  • 2. CMB – A Message Bus for the Cloud
      • CQS – Queuing Service
      • CNS – Topic based Pub Sub Service
  • 3. Why did we build our own?
      • General purpose message bus to replace project driven one-off solutions
      • Smooth data center failover, maybe even “active-active” queues
      • Must scale to millions of queues and 1000s of messages/sec (for example 1 queue per STB)
      • Tight latency requirements (“10ms response time 95th pct”)
      • Evaluated other options to arrive at AWS SQS/SNS
  • 4. AWS SQS Primer
      “Simple Queue Service”
      • Focus on guaranteed delivery
      • Best effort on orderly delivery, duplicates
      • Few simple core APIs: SendMessage(), ReceiveMessage(), DeleteMessage()
      • Do not trust message recipients
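The three core calls and the "do not trust message recipients" rule can be illustrated with a toy in-memory queue. This is a minimal sketch, not CMB or SQS code: the class `ToyQueue`, its method names, and the explicit `now` argument are hypothetical, and it deletes by message ID where SQS uses a receipt handle. A received message is hidden for a visibility timeout and reappears unless the recipient deletes it.

```python
import uuid


class ToyQueue:
    """In-memory sketch of SQS-style send / receive / delete semantics."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self.messages = {}      # message id -> body
        self.order = []         # message ids in arrival order
        self.hidden_until = {}  # message id -> time it becomes visible again

    def send_message(self, body):
        msg_id = str(uuid.uuid4())
        self.messages[msg_id] = body
        self.order.append(msg_id)
        return msg_id

    def receive_message(self, now):
        # Return the oldest visible message and hide it for the timeout.
        for msg_id in self.order:
            if self.hidden_until.get(msg_id, 0.0) <= now:
                self.hidden_until[msg_id] = now + self.visibility_timeout
                return msg_id, self.messages[msg_id]
        return None

    def delete_message(self, msg_id):
        # Only an explicit delete removes a message for good.
        if msg_id in self.messages:
            del self.messages[msg_id]
            self.order.remove(msg_id)
            self.hidden_until.pop(msg_id, None)
```

A recipient that crashes after `receive_message` never calls `delete_message`, so the message becomes visible again once the timeout passes and another recipient can pick it up.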
  • 5. Advantages of adopting an API
      If you do it on your own:
      • API design typically biased towards first use case
      • Almost guaranteed: You won’t get it right the first time (iterations)
      • Difficult for new users to adopt: Documentation, tools, community, …
  • 6. Why did we build our own?
      AWS SQS
      Guaranteed Delivery                       +
      Simple, Robust API                        +
      Scalability                               +
      Active-Active                             ?
      DC Failover                               ?
      Latency & Throughput                      ?
      Limitations (Msg Size, # Artifacts, …)    ?
  • 7. “Build a horizontally scalable queuing service on top of Cassandra (and Redis) which is API compatible with AWS SQS / SNS API”
  • 8. CQS over Cassandra and Redis
      Cassandra
        • Cross-DC persistence and replication
        • Proven horizontal scalability
      Redis
        • Meet latency requirements
        • Help with best effort ordering
        • Handle Visibility Timeout (VTO)
  • 9. Cassandra Data Modeling
      How to represent queued messages in Cassandra?
      • Single Column Queue
      • Single Row Queue
      • Multi-Row Queue
  • 10. Single Column Queue
  • 11. Single Row Queue
  • 12. Multi-Row Queue
  • 13. CQS Data Flow Example
      1. SendMessage(MSG1)
      2. SendMessage(MSG2)
      3. SendMessage(MSG3)
      4. MSG1 = ReceiveMessage()
      5. DeleteMessage(MSG1)
  • 14. CQS Architecture Recap
      Cassandra Persistence Layer
        • Messages sharded across 100 rows per queue
        • Avoid wide rows (> 500K)
        • Minimize churn (Tombstones)
        • Distribute queue among Cassandra nodes
      Redis Caching Layer
        • To meet latency requirements
        • Payload cache (kicks in after first miss, pre-load next 10k)
        • Improve FIFOness by storing Msg IDs in Redis List
        • Handle message visibility entirely in Redis (Hashtable)
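The slide says each queue is spread over 100 Cassandra rows but not how a message is assigned to a row. The sketch below assumes a stable hash of the message ID, which is only one plausible choice; the function `shard_row_key` and the `<queue>_<shard>` row-key format are hypothetical and not taken from the CMB source.

```python
import zlib

NUM_SHARDS = 100  # from the slide: 100 rows per queue


def shard_row_key(queue_id, message_id, num_shards=NUM_SHARDS):
    """Map a message to one of the queue's rows (assumed scheme: CRC32 hash)."""
    shard = zlib.crc32(message_id.encode("utf-8")) % num_shards
    return f"{queue_id}_{shard}"
```

Because different row keys land on different nodes of the ring, sharding of this kind keeps any single row from growing wide and spreads one busy queue across the cluster.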
  • 15. CQS Key Cassandra Features
      Persistence and failover
        • Cross-DC replication in combination with Local Quorum Reads/Writes (tunable consistency)
      Millions of queues, spiky traffic patterns
        • Massive horizontal scalability
      Message order (FIFOness) / future dated messages
        • Wide rows, composite column keys / TimeUUID and column sort order
      Message retention period (expiration)
        • TTL
      Fast lookup of static metadata (Queues, Users etc.)
        • Row Cache, Secondary Indexes
  • 16. CQS Scalability
      • Send(), Receive() and Delete() scale with Cassandra Ring, API Servers (stateless) and Redis Shards
      • Are constant time operations
      • Queues not sharded across Redis servers!
  • 17. CQS Availability
      • Depends on availability of Cassandra
      • Service functions without Redis!
  • 18. CQS DC Failover
  • 19. AWS SNS API
      “Simple Notification Service”
      • Topic based Publish/Subscribe Service
      • Supported protocols: HTTP/CQS/SQS
      • Few simple core APIs:
        CreateTopic() / DeleteTopic()
        Subscribe() / Unsubscribe()
        ConfirmSubscription()
        Publish()
      • Do not trust message recipients (redelivery policy)
  • 20. CNS Data Flow Example
      • Single operation: Publish message MSG1 to a topic T with four Subscribers S1, S2, S5, S6.
      • S1, S2 are HTTP endpoints
      • S5, S6 are CQS queues
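The fan-out in this example can be sketched in a few lines. This is an illustration only: the `Topic` class and the endpoint strings are hypothetical, and the sketch returns the list of deliveries instead of performing them.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    name: str
    subscriptions: list = field(default_factory=list)  # (protocol, endpoint) pairs

    def subscribe(self, protocol, endpoint):
        self.subscriptions.append((protocol, endpoint))

    def publish(self, message):
        # One delivery per subscriber, whatever its protocol.
        return [(protocol, endpoint, message)
                for protocol, endpoint in self.subscriptions]


topic = Topic("T")
topic.subscribe("http", "http://s1.example.com/notify")  # S1 (made-up URL)
topic.subscribe("http", "http://s2.example.com/notify")  # S2 (made-up URL)
topic.subscribe("cqs", "queue-s5")                       # S5 (made-up queue name)
topic.subscribe("cqs", "queue-s6")                       # S6 (made-up queue name)
deliveries = topic.publish("MSG1")
```

A single `publish` call therefore turns into four deliveries, two over HTTP and two into CQS queues.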
  • 21. CNS Architecture Recap
      • CQS Queue preserves messages when Publish Workers are down or overloaded
      • CQS Visibility Timeout takes care of guaranteed delivery
      • Retry policy improves guaranteed delivery for temporarily unavailable endpoints (http)
      • Publish Workers hardened for rogue endpoints (failing endpoints, slow endpoints, …)
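A bounded retry loop is one simple way to picture the retry policy for temporarily unavailable HTTP endpoints. The sketch below is an assumption about the general shape, not CNS code: `deliver_with_retry` and its `send` callback are hypothetical, and it retries immediately where a real redelivery policy would wait between attempts.

```python
def deliver_with_retry(send, message, max_attempts=3):
    """Call send(message) until it succeeds or the attempts run out.

    Returns the number of attempts used on success, or None if every
    attempt raised ConnectionError.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send(message)
            return attempt
        except ConnectionError:
            continue  # endpoint unavailable, try again
    return None
```

Capping the number of attempts is what keeps a permanently failing endpoint from tying up a worker indefinitely.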
  • 22. Differences SQS/SNS and CQS/CNS
      Goal: Full API compatibility
      Current state:
        • All APIs implemented, most parameters supported
        • Can use AWS Java SDK and others
      Limitations:
        • AWS4 signatures not supported (V1 and V2 ok)
        • SMS endpoints not supported, limited email support
      Enhancements:
        • Additional APIs for monitoring and management
        • Unlimited number of queues, topics and subscriptions
        • Adjustable message size and other parameters (MSG <= 64KB, LP <= 20 sec, DS <= 900 sec, RP, …)
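The parameter limits in the last bullet can be expressed as a small validation helper. This sketch reads LP as the long-poll wait time and DS as the delay in seconds, which is an assumption based on the matching SQS limits; the function `check_limits` is hypothetical, and measuring the message size in UTF-8 bytes is also an assumption.

```python
MAX_MESSAGE_BYTES = 64 * 1024  # MSG <= 64KB
MAX_LONG_POLL_SECONDS = 20     # LP <= 20 sec (assumed: long-poll wait time)
MAX_DELAY_SECONDS = 900        # DS <= 900 sec (assumed: delay seconds)


def check_limits(body, delay_seconds=0, wait_seconds=0):
    """Return a list of limit violations; an empty list means the request is fine."""
    problems = []
    if len(body.encode("utf-8")) > MAX_MESSAGE_BYTES:
        problems.append("message too large")
    if not 0 <= delay_seconds <= MAX_DELAY_SECONDS:
        problems.append("delay out of range")
    if not 0 <= wait_seconds <= MAX_LONG_POLL_SECONDS:
        problems.append("long poll wait out of range")
    return problems
```

Since the slide calls these parameters adjustable, the constants would be configuration values in a real deployment, not fixed numbers.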
  • 23. CMB Ready for Production Use?
      • Code of CMB Core is stable
      • Extensive testing done (including throughput scalability testing)
      • In use at Comcast (Sports, DVR, …)
  • 24. Testing Goals
      • Functional testing (unit tests, good code coverage)
      • Stress testing (simulate Redis outage, data center failover)
      • Endurance testing
      • Load testing: Verify linear horizontal scalability (CQS / CNS throughput scalability)
  • 25. CQS Throughput Scalability
      • Throughput as a function of Cassandra Ring size
      • Increase load until throughput (msg/sec) reaches a maximum
      • Increase ring size and re-test
      • Ensure sufficient API and Redis capacity to support largest ring
      • Deployment: 10 API Servers, 5 Redis Shards, 4-16 Node Cassandra Ring
  • 26. CQS Throughput Scalability
      # Load Gen   # API Servers   # Redis Shards   Ring Size   API / Sec   P99
      5            10              5                4           2832        <= 100 ms
      5            10              5                8           6072        <= 100 ms
      5            10              5                12          9472        <= 100 ms
      6            12              6                16          11667       <= 100 ms
      8            15              7                20          13514       <= 100 ms
      8            15              7                24          15365       <= 100 ms
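A quick calculation on the table above shows how close the results come to linear scaling. The figures are copied from the slide; the variable names are illustrative only.

```python
# (ring size, API calls per second) from the slide's table
RESULTS = [(4, 2832), (8, 6072), (12, 9472), (16, 11667), (20, 13514), (24, 15365)]

# Throughput contributed per Cassandra node at each ring size.
per_node = {ring: rate / ring for ring, rate in RESULTS}

# Overall speed-up when growing the ring from 4 to 24 nodes (6x the nodes).
speedup = RESULTS[-1][1] / RESULTS[0][1]

for ring, rate in RESULTS:
    print(f"ring={ring:2d}  api/sec={rate:5d}  per node={per_node[ring]:6.1f}")
print(f"speed-up from 4 to 24 nodes: {speedup:.2f}x")
```

Per-node throughput rises from 708 at 4 nodes to about 789 at 12 nodes and then falls to about 640 at 24 nodes, so six times the nodes yields roughly 5.4 times the throughput.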
  • 27. [Chart: API/sec (y-axis, 0 to 18000) versus Ring Size (x-axis, 4 to 24)]
  • 28. CNS Throughput Scalability
      • Most important metric: End-to-end latency
      • Fixed number of subscribers, gradually increase #msg/sec published until system is “overwhelmed”
      • Increase number of Publish Workers and re-test
      • Deployment: 8 node Cassandra Ring, 2 API Servers, 2 Redis Shards, 3-6 Publish Workers
      • Test setup: Single topic with 100 HTTP subscribers, 10 min test duration
  • 29. CNS Throughput Scalability: 3 Publish Workers (one column per load level)
      #PUB/SEC             5      10      20      40       80
      #MSG/SEC           500    1000    2000    4000     8000
      AVG(LAT)           198     177     160     209    75656
      API AVG(RT)         21      15      16      18       28
      API P95(RT)         44      30      29      37       61
      API AVG(CQS)        10       8       9      10       14
      API P95(CQS)        26      16      17      22       27
      PROD AVG(RT)        77      68      69      78      237
      PROD P95(RT)       150     119     120     138     1020
      PROD AVG(CQS)       77      68      69      78      237
      PROD P95(CQS)      155     118     120     139     1020
      CONS AVG(RT)        47      39      40      41      143
      CONS P95(RT)        96      70      69      74      790
      CONS AVG(CQS)       38      31      31      32      131
      CONS P95(CQS)       85      59      59      62      770
      HTTP AVG(RT)        12      12       9      11       14
      HTTP P95(CQS)       18      17      16      18       21
  • 30. CNS Throughput Scalability: 6 Publish Workers (one column per load level)
      #PUB/SEC             5      10      20      40      80      160
      #MSG/SEC           500    1000    2000    4000    8000    16000
      AVG(LAT)           247     226     199     225     267   145135
      API AVG(RT)         37      41      37      45      48       76
      API P95(RT)        117     130     118     140     126      180
      API AVG(CQS)        19      21      20      25      25       41
      API P95(CQS)        66      79      74     110      80      120
      PROD AVG(RT)       141     163     133     148     149      228
      PROD P95(RT)       290     380     280     320     300      460
      PROD AVG(CQS)      139     162     133     148     149      228
      PROD P95(CQS)      290     370     280     320     300      460
      CONS AVG(RT)        77      79      68      76      77      115
      CONS P95(RT)       200     200     180     210     180      280
      CONS AVG(CQS)       57      58      50      53      58       97
      CONS P95(CQS)      160     160     150     170     150      250
      HTTP AVG(RT)        18      17      11      18      22       28
      HTTP P95(CQS)       34      39      21      25      38       70
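The point at which the system is "overwhelmed" can be read off the AVG(LAT) values in the two tables above. The sketch below copies those values and picks the highest message rate whose average end-to-end latency stays under a cutoff. The 1000 cutoff and the function `max_sustained_rate` are hypothetical, and the latencies are assumed to be in milliseconds, as on the deck's latency chart.

```python
# AVG(LAT) by messages/sec, copied from the 3-worker and 6-worker tables.
AVG_LATENCY = {
    3: {500: 198, 1000: 177, 2000: 160, 4000: 209, 8000: 75656},
    6: {500: 247, 1000: 226, 2000: 199, 4000: 225, 8000: 267, 16000: 145135},
}


def max_sustained_rate(latency_by_rate, limit):
    """Highest tested rate whose average latency is within the limit, or None."""
    ok = [rate for rate, latency in latency_by_rate.items() if latency <= limit]
    return max(ok) if ok else None


for workers, table in AVG_LATENCY.items():
    print(workers, "workers:", max_sustained_rate(table, limit=1000), "msg/sec")
```

With this cutoff, 3 Publish Workers sustain 4000 msg/sec and 6 sustain 8000 msg/sec, so doubling the workers doubles the sustainable rate in these tests.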
  • 31. CNS Throughput Scalability
      [Chart: Latency (ms) versus Throughput (Msgs/Sec, 500 to 8000), one series each for 3 workers and 6 workers]
  • 32. Use Case: X1 Sports App
  • 33. Use Case: X1 Sports App
  • 34. Use Case: X1 Sports App
  • 35. Moving Forward
      • Follow SNS / SQS APIs
      • More load and stress testing
      • Ease of deployment and scale up
      • More in-house production deployments (currently isolated by application)
      • CQS as a Service
  • 36. Thank You!
      http://github.com/Comcast/cmb
      http://groups.google.com/forum/#!forum/cmb-user-forum
      bwolf@sv.comcast.com
  • 37. BACKUP
  • 38. CNS Endurance Test
      • Single topic with 5 HTTP subscribers, 65 msg/sec published
      • 14 million messages published over 12 hours
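The endurance test figures can be cross-checked with simple arithmetic. The calculation below suggests one reading of the slide, namely that the 14 million figure counts deliveries to the 5 subscribers and not publish calls; that interpretation is an assumption.

```python
PUBLISH_RATE = 65    # messages published per second
DURATION_HOURS = 12
SUBSCRIBERS = 5

published = PUBLISH_RATE * DURATION_HOURS * 3600  # publish calls over the test
delivered = published * SUBSCRIBERS               # one delivery per subscriber

print(f"published: {published:,}")
print(f"delivered: {delivered:,}")
```

At 65 msg/sec for 12 hours there are 2,808,000 publish calls, which fan out to 14,040,000 deliveries, in line with the 14 million on the slide.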
  • 39. Use Case: EAS