This talk describes how Yelp deploys Zipkin and integrates it with its 250+ services. It also covers the challenges we faced while scaling the pipeline and how we tuned it.
2. About me
- Prateek Agarwal
- Software Engineer
- Infrastructure team @ Yelp
- Have worked on:
  - Python Swagger clients
  - Zipkin infrastructure
  - Maintaining Cassandra and ES clusters
6. Infrastructure overview
- 250+ services
- We <3 Python
- Pyramid/uwsgi framework
- SmartStack for service discovery
- Swagger for API schema declaration
- Zipkin transport: Kafka | Zipkin datastore: Cassandra
- Traces are sampled from live traffic at a very low rate (0.005%)
- A trace can also be generated on demand by passing a particular query param (see the sketch below)
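As a rough sketch, the sampling decision could look like the following; the `should_trace` helper and the `force_trace` parameter name are hypothetical, since the talk does not name the actual query param.

```python
import random

SAMPLE_RATE_PCT = 0.005            # percent of live traffic that gets traced (from the talk)
FORCE_TRACE_PARAM = 'force_trace'  # hypothetical name; the real query param is not given in the talk


def should_trace(query_params):
    """Decide whether to record a Zipkin trace for this request.

    Traces are sampled at a very low percentage of live traffic, but a
    developer can force one on demand via a special query parameter.
    """
    if FORCE_TRACE_PARAM in query_params:
        return True
    return random.random() * 100 < SAMPLE_RATE_PCT
```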
8. pyramid_zipkin
- A simple decorator around every request (configuration sketch below)
- Handles both scribe and Kafka transports
- Attaches a `unique_request_id` to every request
- No changes needed in the service logic
- Ability to add annotations using Python's `logging` module
- Ability to add custom spans
[Diagram: Service B stack: pyramid_zipkin wrapping pyramid on uwsgi]
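As a rough illustration, wiring pyramid_zipkin into a Pyramid app could look like the sketch below. The `zipkin.transport_handler` and `zipkin.tracing_percent` setting names follow the library's documented include pattern, and the `kafka_transport`/`main` helpers are assumptions for illustration, not code from the talk.

```python
from pyramid.config import Configurator


def kafka_transport(stream_name, message):
    # Placeholder transport: in production this would publish the encoded
    # span onto the Zipkin Kafka topic via a real producer.
    pass


def main(global_config, **settings):
    # Setting names assumed from pyramid_zipkin's documentation; verify
    # them against the version you run.
    settings['zipkin.transport_handler'] = kafka_transport
    settings['zipkin.tracing_percent'] = 0.005   # sample 0.005% of requests

    config = Configurator(settings=settings)
    config.include('pyramid_zipkin')  # registers the tracing tween around every request
    return config.make_wsgi_app()
```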
10. swagger_zipkin
- Eliminates the manual work of attaching Zipkin headers
- Decorates swagger clients (usage sketch below):
  - swaggerpy (Swagger v1.2)
  - bravado (Swagger v2.0)
[Diagram: Service A stack: swagger_zipkin wrapping swagger_client]
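A minimal usage sketch, assuming swagger_zipkin exposes a client decorator as shown; the `ZipkinClientDecorator` import path and the petstore example endpoint are illustrative and should be checked against the library's docs.

```python
from bravado.client import SwaggerClient
from swagger_zipkin.zipkin_decorator import ZipkinClientDecorator

# Build a regular bravado client, then wrap it so outgoing calls carry
# the current trace's Zipkin headers automatically.
client = SwaggerClient.from_url('http://petstore.swagger.io/v2/swagger.json')
traced_client = ZipkinClientDecorator(client)

# Calls look exactly like calls on the undecorated client.
pet = traced_client.pet.getPetById(petId=42).result()
```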
11. Lessons Learned
- Cassandra is an excellent datastore for heavy writes
  - Typical prod load: 15k writes/sec
  - It was even able to handle 100k writes/sec
12. Lessons Learned
- Allocating off-heap memory for Cassandra reduced write latency by 2x (see the config sketch below)
- Pending compactions also went down
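A minimal sketch of the relevant cassandra.yaml knobs, assuming "off-heap memory" here refers to off-heap memtables (available in Cassandra 2.1+); the sizing value is illustrative, not Yelp's actual setting.

```yaml
# cassandra.yaml (Cassandra 2.1+): keep memtables off the JVM heap
memtable_allocation_type: offheap_objects   # alternatives: heap_buffers, offheap_buffers
memtable_offheap_space_in_mb: 2048          # illustrative size, not Yelp's actual value
```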
13. Lessons Learned
- As more services were added, fetching from Kafka became a bottleneck
- Solutions tried:
  - Adding more Kafka partitions
  - Running more instances of the collector
  - Adding multiple Kafka consumer threads
    - with appropriate changes in openzipkin/zipkin
    - WIN
  - Batching up messages before sending to Kafka (see the sketch below)
    - with appropriate changes in openzipkin/zipkin
    - BIG WIN
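The batching idea can be sketched roughly on the producer side using kafka-python, as below. The topic name, batch threshold, and JSON wire format are assumptions for illustration; this is not the actual openzipkin/zipkin change.

```python
import json

from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(bootstrap_servers='kafka:9092')  # illustrative broker address

TOPIC = 'zipkin'    # Zipkin's default Kafka topic name
BATCH_SIZE = 100    # illustrative flush threshold
_buffer = []


def emit_span(span_dict):
    """Buffer spans and send one Kafka message per batch instead of one per span."""
    _buffer.append(span_dict)
    if len(_buffer) >= BATCH_SIZE:
        # The wire format here (a JSON list of spans) is an assumption; the
        # actual change used the collector's own span encoding.
        producer.send(TOPIC, json.dumps(_buffer).encode('utf-8'))
        del _buffer[:]
```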
18. Future Plans
- Use traces during deployments to check for degradations
  - Validate differences in the number of downstream calls (see the sketch below)
  - Catch any new dependency sneaking in
  - Compare time differences in the spans
- Create trace aggregation infrastructure using Splunk (WIP)
  - A missing part of Zipkin
- Redeploy the Zipkin dependency graph service after improvements
  - The service was unprovisioned because it created 100s of GB of /tmp files
  - The files got purged after each run (in ~1-2 hours), but meanwhile ops got alerted about low remaining disk space
  - It didn't add much value at the time
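A rough sketch of what such a deploy-time check could look like against the Zipkin v1 query API; the host, the choice of which traces to compare, and the span-counting heuristic are assumptions, not an existing Yelp tool.

```python
from collections import Counter

import requests

ZIPKIN_API = 'http://zipkin.example.com:9411/api/v1'  # illustrative host; v1 query API


def downstream_call_counts(trace_id):
    """Heuristically count spans per downstream service in one trace."""
    spans = requests.get('{}/trace/{}'.format(ZIPKIN_API, trace_id)).json()
    counts = Counter()
    for span in spans:
        # Attribute each span to the first serviceName found in its binary
        # annotations (v1 span JSON); rough, but enough for a diff.
        for annotation in span.get('binaryAnnotations', []):
            endpoint = annotation.get('endpoint') or {}
            if endpoint.get('serviceName'):
                counts[endpoint['serviceName']] += 1
                break
    return counts


def diff_deploy(before_trace_id, after_trace_id):
    """Flag downstream services whose call counts changed across a deploy,
    including brand-new dependencies sneaking in."""
    before = downstream_call_counts(before_trace_id)
    after = downstream_call_counts(after_trace_id)
    for service in sorted(set(before) | set(after)):
        if before[service] != after[service]:
            print('{}: {} -> {} calls'.format(service, before[service], after[service]))
```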