A talk on how to think about choosing a distributed messaging technology, and some notes on how to avoid locking yourself into a single choice, keeping your platform able to grow as needs change.

Published in: Software

  1. 1. Messaging “Just pick something”
  2. 2. A little about myself ● Sean Kelly ○ Also known as Stabby ● I went from .NET to Ruby to Go ○ But my favorite language is SQL ● Core maintainer of Tapjoy's: ○ Chore ○ Dynamiq ● I love IPAs
  3. 3. Speaking of Tapjoy... We do… ● 1.8 Billion Requests per minute ○ And almost as many messages per day ● ~170 Million Jobs per day ● All on ~750 EC2 instances and private servers ● A stocked double-kegerator ○ Right now: Pinner IPA / Cisco Summer of Lager
  4. 4. What is “Messaging” Like, jobs and stuff?
  5. 5. Messaging is... ● A way to share important events, without needing to know who's listening ● A way to handle processing events and information at a larger scale ● Not all that unlike “background jobs” ○ Jobs: “I’ll do this later” ○ Messaging: “Other people will do this later”
  6. 6. How does this fit into my app? It sure sounds cool
  7. 7. Messaging and You Let’s say you’ve got a great new app, and for a while things are fine Monolith 1.0
  8. 8. Messaging and You Eventually, you need to push work out of band Monolith 1.2 Jobs
  9. 9. Messaging and You Now you have several services, and they all need to share info Monolith 1.5 Jobs One-off which becomes a core part of your business Jobs Failed attempt at Micro Service Reporting System
  10. 10. Sure, but how can you actually use Messaging? Those weren’t even very good drawings They didn’t have lines or anything
  11. 11. What types of Messaging are there? ● 1:1, traditional “Queueing” ○ Basic push / pull model of doing work ○ Common with asynchronous job processing ○ RabbitMQ, ActiveMQ, SQS, Disque, Dynamiq, NSQ ● Fanout ○ Broadcast style publishing, all listeners get a copy ○ Ex: A game pushing out notifications of an update ○ Most technologies with 1:1 queues support this in some way
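To make the difference between the two models concrete, here is a minimal in-memory sketch. The class names `WorkQueue` and `Fanout` are made up for illustration and don't correspond to any of the technologies listed; a real broker adds persistence, acknowledgements, and networking on top.

```python
from collections import deque


class WorkQueue:
    """1:1 queueing: each message is handed to exactly one consumer."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def pull(self):
        # Returns the oldest message, or None when the queue is empty.
        return self._messages.popleft() if self._messages else None


class Fanout:
    """Fanout: every subscriber gets its own copy of each message."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self):
        # Each subscriber reads from its own private inbox.
        inbox = deque()
        self._subscribers.append(inbox)
        return inbox

    def publish(self, message):
        for inbox in self._subscribers:
            inbox.append(message)
```

With `WorkQueue`, two workers calling `pull()` split the work between them. With `Fanout`, two subscribers each see every message.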
  12. 12. What types of Messaging are there? ● Routing ○ Intelligent fanout, routes to listeners based on message metadata ○ Newsgroups: Subscribe to food.charcuterie.*, get bresaola ○ RabbitMQ does this pretty well ● Streaming ○ Persistent connection, constant source of raw bytes ○ Twitter's Firehose is one example ○ Kafka is a current popular choice ○ Really popular with the Scala / Spark crowd
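Routing can be sketched as fanout with a filter in front of each listener. The example below assumes dot-separated routing keys where `*` stands for exactly one word, which is how RabbitMQ's topic exchanges treat it; the `Router` class and `topic_matches` function are hypothetical names, and RabbitMQ's `#` wildcard is deliberately left out.

```python
def topic_matches(pattern, routing_key):
    """True when routing_key matches pattern; '*' stands for exactly one word."""
    pattern_words = pattern.split(".")
    key_words = routing_key.split(".")
    if len(pattern_words) != len(key_words):
        return False
    return all(p == "*" or p == k for p, k in zip(pattern_words, key_words))


class Router:
    """Delivers a message to every subscriber whose pattern matches its key."""

    def __init__(self):
        self._bindings = []  # (pattern, inbox) pairs

    def subscribe(self, pattern):
        inbox = []
        self._bindings.append((pattern, inbox))
        return inbox

    def publish(self, routing_key, message):
        for pattern, inbox in self._bindings:
            if topic_matches(pattern, routing_key):
                inbox.append(message)
```

A subscriber bound to `food.charcuterie.*` receives anything published one level below `food.charcuterie`, and nothing published under `food.cheese`.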
  13. 13. OK, so my Apps and Services need to talk Can’t I just stick it all in a shared database and be done with it?
  14. 14. No You certainly cannot Some things, maybe But not everything, it just doesn’t scale that way
  15. 15. Why not just stick it all in a DB? ● You can share some of your data this way ○ Depends on the use case, type of information ○ This is outside the scope of this talk ● Databases are not designed for delivering messages ○ Any “queue” tables will be ridiculously contended ○ No atomic “pull” options
  16. 16. So, what does Tapjoy do? You guys must have solved this, right?
  17. 17. At Tapjoy, we use... ● RabbitMQ ○ Moves analytics events to reporting endpoints by way of a complex filesystem / S3 approach ○ Single node with sharded queues ○ Rabbit HA cannot handle our scale ● SNS / SQS ○ SNS in some newer projects, mostly for fanout ○ SQS for all traditional background jobs ● Kinesis ○ Pilot integration for a new analytics pipeline ○ Being supplanted by Kafka ● Kafka ○ New analytics pipeline ○ Used to distribute metrics to both the new endpoint and the existing one, for verification ● Dynamiq ○ In-house, open source SNS/SQS-alike built on top of Riak 2.0 ○ Currently used to circumvent a complicated and slow legacy messaging service
  18. 18. But I’m not really here to talk about Tapjoy Not entirely I’m more interested in you
  19. 19. So, what do I pick? There are so many choices, and they all seem like they’d work
  20. 20. I’m not really here to tell you what to pick, either! I’d rather talk to you about how to pick, and how you integrate your choices Distributed Systems are all about tradeoffs
  21. 21. Ask: What are my actual needs? ● Planning for 2 years down the road is smart ○ But solutions right now get shit done ○ Include a cost projection with scale estimates ● Build a prototype (or two) ○ Try to iterate quickly ○ Understand how you’d use whatever you choose ○ Don’t be afraid to move on ○ Look at multiple client libraries ■ Look for: Docs, Active repos, Idiomatic
  22. 22. Ask: What is my latency tolerance? ● Publishing Messages ○ How much time can your app tolerate for publishing? ○ What does publish latency look like during an issue? ○ Consider the worst-case scenario when planning ● Consuming Messages ○ Can you run multiple consumers without impacting the service? ● End to End ○ How fast is the whole experience, round trip?
  23. 23. Ask: What level of durability? ● Client ○ Batched VS Unbatched / Streaming ○ Acknowledged writes ● Server ○ Messages held in memory VS disk ○ Messages highly-available? ○ Recover from network partitions safely? ○ At-Most-Once VS At-Least-Once ■ Exactly-Once is something of a myth
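If you settle for At-Least-Once, the consumer has to cope with seeing the same message twice. One common way to do that is to track message IDs and skip repeats. This is a minimal sketch under the assumption that every message carries a unique ID; `IdempotentConsumer` is a hypothetical name, and a real version would need a durable, bounded store instead of an in-memory set.

```python
class IdempotentConsumer:
    """Runs a handler at most once per message ID, ignoring redeliveries."""

    def __init__(self, handler):
        self._handler = handler
        # Assumption: in-memory only. A real system needs durable, bounded storage.
        self._seen_ids = set()

    def receive(self, message_id, body):
        """Process a delivery; return True if the handler ran, False for a duplicate."""
        if message_id in self._seen_ids:
            return False
        self._handler(body)
        # Record the ID only after the handler succeeds, so a failure
        # mid-handler leads to a retry rather than a silently dropped message.
        self._seen_ids.add(message_id)
        return True
```

Note the tradeoff: if the process dies after the handler runs but before the ID is recorded, the message is processed again on redelivery, so the handler itself should still be safe to repeat.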
  24. 24. Ask: What about throughput? ● How many producing clients do you have? ● How many messages per second will they submit? ○ Does message size impact performance? ● What size should the cluster be? ○ Super cluster VS specialized clusters ● How many consumers does it take to keep pace? ○ With room to grow
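The "how many consumers" question comes down to back-of-envelope arithmetic: arrival rate times time per message gives the number of consumers that would be 100% busy, and you divide by the utilization you are willing to run at. The function below is a rough sketch of that estimate; the name `consumers_needed` and the `headroom` parameter are assumptions for illustration, and it ignores bursts and batching.

```python
import math


def consumers_needed(messages_per_second, seconds_per_message, headroom=0.5):
    """Estimate the consumer count so each one runs at (1 - headroom) of capacity."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    # Number of consumers that would be fully busy just keeping pace.
    busy_consumers = messages_per_second * seconds_per_message
    # Spread that work out so each consumer has room to absorb growth.
    return math.ceil(busy_consumers / (1 - headroom))
```

For example, 1,000 messages per second at 0.25 seconds each keeps 250 consumers fully busy, so with 50% headroom the estimate is 500.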
  25. 25. Ask: What does failure mean? ● What does a message publishing error mean? ● What does a delay in the processing pipeline mean? ● What does a “lost” or failed message mean? ● What does a total failure of the messaging system mean?
  26. 26. Ask: What behavior do I want? Is it… ● CA? ○ Not distributed, will be difficult to scale past 1 box ○ Traditional RDBMS systems are typically CA ● CP? ○ Good if you need strongly consistent data ○ Partitions can cause data unavailability ● AP? ○ Good if you need complete availability ○ Eventual consistency can often be “good enough”
  27. 27. Okay, so I lied a little bit I’ll give you one recommendation
  28. 28. Do you have... ● Relatively small (< 256 KB) message sizes? ● Not so strict (~50 ms) latency requirements? ● Throughput on the order of 100 million messages or fewer per month? ● A tolerance or capability to handle the occasional duplicate message? ● No concern around being locked into a vendor-specific technology?
  29. 29. Go use SNS and SQS immediately Leave here now and just do it It’s easy, it’s cheap (at that scale), and you don’t need to maintain it
  30. 30. Ok, so I picked “something” Anything else to know?
  31. 31. You don’t have to choose just 1 ● It’s a falsehood that you need 1 perfect technology ○ Each has strengths, weaknesses, and ideal use cases ● Don’t be afraid to use something else ○ If you’re lucky, your app lives long enough to see many different infrastructure needs
  32. 32. Avoid direct implementations ● Wrap the notion of Publishing in an interface ○ Most technologies look surprisingly similar to publish ○ You can wrap this in a simple interface, and switch implementations as needed ● Consuming is usually unique per technology ○ Just write a new one ○ Trying to interface this part is probably more trouble than it’s worth ○ Play to the unique strengths of the technology
  33. 33. Interfacing your Messaging choices ● Sending messages is often as simple as a name and a chunk of data ○ Define a simple interface for pushing arbitrary data towards a named endpoint ○ A name and a string of JSON is usually enough to get going ○ At Tapjoy, we use our Chore library to handle abstracting message publishing from messaging technologies ● Destinations are independent from messages ○ You could need to switch sending messages to a new technology ○ You could have 2 or more different systems depending on the information in a given message
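Here is a minimal sketch of what "a name and a string of JSON" could look like as an interface. This is not Chore's API; `Publisher`, `InMemoryPublisher`, `RoutingPublisher`, and `record_signup` are all hypothetical names. A real backend (SQS, Kafka, and so on) would be another subclass that calls its client library inside `publish`.

```python
import json
from abc import ABC, abstractmethod


class Publisher(ABC):
    """The only thing application code is allowed to know about messaging."""

    @abstractmethod
    def publish(self, name, payload):
        """Send a JSON string to the endpoint called name."""


class InMemoryPublisher(Publisher):
    """Stand-in backend that just remembers what was sent; handy in tests."""

    def __init__(self):
        self.sent = {}

    def publish(self, name, payload):
        self.sent.setdefault(name, []).append(payload)


class RoutingPublisher(Publisher):
    """Picks a backend per destination name, falling back to a default."""

    def __init__(self, default, overrides=None):
        self._default = default
        self._overrides = overrides or {}

    def publish(self, name, payload):
        self._overrides.get(name, self._default).publish(name, payload)


def record_signup(publisher, user_id):
    # Application code depends on the interface, never on a specific broker.
    publisher.publish("user.signup", json.dumps({"user_id": user_id}))
```

Because destinations are looked up by name, moving `user.signup` to a new technology means changing one entry in `overrides`, not every call site.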
  34. 34. How do I change messages safely? ● Wrap messages in a simple envelope ○ Keep metadata about the message distinct from metadata about the event it describes ● Define schemas for message bodies ○ Schemas give you a catalogue of message definitions, and the ability to version them ○ At Tapjoy, we use our TOLL to build endpoint-agnostic clients based on schemas, and register them to use Chore publishers. ● Consumers need older schemas ○ Lets them reason about how to handle older messages ○ Keep a backlog of N older versions, drop support for > N
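The envelope and versioning ideas can be sketched in a few lines. Everything here is illustrative: the field names, the `SUPPORTED_VERSIONS` set, and the made-up v1-to-v2 change are assumptions, not how TOLL or Chore actually work.

```python
import json
import time
import uuid

# Hypothetical policy: support the current version (2) plus N=1 older version.
SUPPORTED_VERSIONS = {1, 2}


def wrap(event_type, schema_version, body):
    """Put an event body inside an envelope that carries message metadata."""
    envelope = {
        # Metadata about the message itself...
        "message_id": str(uuid.uuid4()),
        "sent_at": time.time(),
        "event_type": event_type,
        "schema_version": schema_version,
        # ...kept separate from the data describing the event.
        "body": body,
    }
    return json.dumps(envelope)


def upgrade_v1_to_v2(body):
    # Hypothetical schema change: v2 split "name" into first and last name.
    first, _, last = body["name"].partition(" ")
    return {"first_name": first, "last_name": last}


def unwrap(raw):
    """Return the body in the current (v2) shape, upgrading older messages."""
    envelope = json.loads(raw)
    version = envelope["schema_version"]
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema version: {version}")
    body = envelope["body"]
    if version == 1:
        body = upgrade_v1_to_v2(body)
    return body
```

The consumer only ever reasons about the newest shape; older messages are upgraded on the way in, and anything older than the supported window fails loudly instead of being misread.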
  35. 35. In Conclusion
  36. 36. Keep in mind ● Distributed Systems - all about tradeoffs ○ Never trade “P” ● Understand your needs ○ Latency, Throughput, Availability, Durability ● Understand how it fits into your architecture ● Interfaces are your friend ○ They can give you a lot of flexibility
  37. 37. Keep in mind ● Use schemas and versioning to support changes to messages themselves ● Just pick something ○ Build a prototype, or two (or three) ○ Your second try will probably go better ○ SNS/SQS is a decent choice, if latency isn’t a concern ● Tapjoy is a great place to work on these kinds of problems at huge scale
  38. 38. Messaging “Just pick something” Sean Kelly @StabbyCutyou