How we approached building a reliable system that must handle complex failures and scale, while keeping our Clojure codebase simple, extensible, and verifiable, without losing our minds.
3. About me - Kapil
Staff Engineer @ Helpshift
Clojure
Distributed Systems
Games
Music
Books/Comics
Football
4. Helpshift is a mobile CRM SaaS product. We help app developers connect with their customers, since everything is now on mobile.
5. Scale
• 600M+ MAU
• 60k RPS
• 500GB / day
These are some of the scale numbers we have reached at Helpshift.
6. Reliable
• Fail fast
• Detect non-recoverable errors
• Resilient
• Retry recoverable errors
• Backpressure
• Detect degraded state
At this scale, services need to be reliable. We need precise control over how things behave under failure conditions.
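The split between "fail fast" and "retry recoverable errors" can be sketched in plain Clojure. This is only an illustration, not the code from the talk: the `:recoverable?` key in `ex-data`, the `with-retries` name, and the linear backoff are all assumptions made for the example.

```clojure
(defn recoverable?
  "Hypothetical classifier: an exception counts as recoverable only when
  its ex-data carries {:recoverable? true}."
  [e]
  (true? (:recoverable? (ex-data e))))

(defn with-retries
  "Calls (f). Retries up to max-retries times on recoverable errors,
  sleeping backoff-ms * attempt between tries. Non-recoverable errors,
  and recoverable ones that exhaust the retries, are rethrown."
  [f {:keys [max-retries backoff-ms]}]
  (loop [attempt 0]
    ;; recur is not allowed inside catch, so the outcome is captured
    ;; as a map and the decision to loop is taken outside the try.
    (let [result (try
                   {:ok (f)}
                   (catch Exception e
                     (if (and (recoverable? e) (< attempt max-retries))
                       {:retry e}
                       (throw e))))]
      (if (contains? result :ok)
        (:ok result)
        (do (Thread/sleep (long (* backoff-ms (inc attempt))))
            (recur (inc attempt)))))))

;; Usage: the first two calls fail with a recoverable error, the third succeeds.
(def calls (atom 0))
(with-retries
  (fn []
    (if (< (swap! calls inc) 3)
      (throw (ex-info "timeout" {:recoverable? true}))
      :written))
  {:max-retries 5 :backoff-ms 1})
;; => :written
```

An exception without the `:recoverable?` flag is rethrown on the first attempt, which is the fail-fast path.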
7. Let’s take a look at the problem. We will build the solution iteratively once we understand the scope.
We do a lot of writes to Elasticsearch, but those writes can be done asynchronously, so application servers just push updates to a Kafka topic. We need to write a Kafka consumer that reads from the topic and performs the writes to Elasticsearch. But wait! Elasticsearch has a bulk API, so what we really need is a Kafka consumer that bulk-writes to Elasticsearch.
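The batching step can be sketched without any Kafka or Elasticsearch client. The names `add-doc` and `consume`, the size-only flush rule, and the `bulk-write!` callback are assumptions for illustration; a real consumer would also flush on a timer and commit offsets only after a successful bulk write.

```clojure
(defn add-doc
  "Adds one update to the pending batch. Returns the new :pending vector
  and, once max-size is reached, a :full vector to send as one bulk request."
  [pending doc max-size]
  (let [batch (conj pending doc)]
    (if (>= (count batch) max-size)
      {:pending [] :full batch}
      {:pending batch :full nil})))

(defn consume
  "Feeds updates through add-doc, calling bulk-write! once per full batch.
  Returns the updates that are still pending at the end."
  [updates max-size bulk-write!]
  (reduce (fn [pending doc]
            (let [{next-pending :pending full :full} (add-doc pending doc max-size)]
              (when full (bulk-write! full))
              next-pending))
          []
          updates))

;; Usage: seven updates with a batch size of three.
(def sent (atom []))
(consume (range 7) 3 #(swap! sent conj %))
;; => [6]
@sent
;; => [[0 1 2] [3 4 5]]
```

Keeping `add-doc` pure means the batching rule can be tested on its own, separately from the I/O in `bulk-write!`.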
8. Testing becomes simpler. It’s just putting things on a channel and verifying the FSM state.
Tests generate signals and data. The assertion checks which state the FSM goes into.
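A test in this style might look like the sketch below. The state names, signal names, and transition table are invented for the example and are not the FSM from the talk. It assumes the `org.clojure/core.async` library is on the classpath.

```clojure
(require '[clojure.core.async :as a])

(def transitions
  "Hypothetical state table: {current-state {signal next-state}}."
  {:consuming {:es-error :retrying
               :es-down  :paused}
   :retrying  {:es-ok    :consuming
               :es-down  :paused}
   :paused    {:es-ok    :consuming}})

(defn step
  "Pure transition function: unknown signals leave the state unchanged."
  [state signal]
  (get-in transitions [state signal] state))

(defn run-fsm
  "Folds step over every signal read from ch until ch closes.
  Returns a channel that yields the final state."
  [initial ch]
  (a/reduce step initial ch))

;; Test: put signals on the channel, close it, assert on the final state.
(let [ch (a/chan 10)]
  (doseq [signal [:es-error :es-down]]
    (a/>!! ch signal))
  (a/close! ch)
  (assert (= :paused (a/<!! (run-fsm :consuming ch)))))
```

Because `step` is a pure function, the same transitions can also be checked without any channel at all, for example `(step :paused :es-ok)`.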
9. Scale
• V1 - 150 rps
• Today - 5k rps
Scaling a reliable service that recovers from error states is simple. It is basically a Kafka consumer that handles all the happy and unhappy paths, so scaling it just means adding more instances of the service, or more FSMs in the same service.
10. Extensibility / Maintainability
• First version - 5 weeks
• MongoDB - 2 weeks
• Active maintainer - 1 engineer - 20% time
• In Production - 2 years
• LOC - 2k
• Project as a library
11. Summary
• Reliable system == Predictable failures and happy paths
• Use CSP / core.async to decouple components
• Central FSM that receives all data and control signals to take decisions
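As a rough sketch of the last two points, the loop below reads from a data channel and a control channel in one place, so every decision is taken by a single function. The channel names, the `:mode` and `:seen` keys, and the choice to give control signals priority are assumptions for illustration. It assumes `org.clojure/core.async` is on the classpath.

```clojure
(require '[clojure.core.async :as a])

(defn central-fsm
  "Blocks on both channels and folds whatever arrives into one state map.
  Control signals set :mode; data items are appended to :seen.
  Returns the final state when either channel closes."
  [data-ch control-ch]
  (loop [state {:mode :running :seen []}]
    ;; :priority true tries control-ch first when both are ready.
    (let [[v port] (a/alts!! [control-ch data-ch] :priority true)]
      (cond
        (nil? v)            state
        (= port control-ch) (recur (assoc state :mode v))
        :else               (recur (update state :seen conj v))))))

;; Usage: producers only know about their channel, never about the FSM.
(let [data-ch    (a/chan 10)
      control-ch (a/chan 10)]
  (a/>!! control-ch :paused)
  (a/>!! data-ch 1)
  (a/>!! data-ch 2)
  (a/close! data-ch)
  (central-fsm data-ch control-ch))
;; => {:mode :paused, :seen [1 2]}
```

The producers and the FSM share nothing but channels, which is what lets each side be replaced or tested independently.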