OpenZipkin conf: Zipkin at Yelp

The talk describes how Yelp deploys Zipkin and integrates it with its 250+ services. It also goes through the challenges we faced while scaling it up and how we tuned it.

  1. Zipkin @ Yelp - Prateek Agarwal (@prat0318)
  2. About me - Prateek Agarwal - Software Engineer, Infrastructure team @ Yelp - Have worked on: Python Swagger clients, Zipkin infrastructure, maintaining Cassandra and ES clusters
  3. Yelp's Mission - Connecting people with great local businesses.
  4. Yelp Stats - as of Q1 2016 (headline figures on the slide: 90M, 102M, 3270%)
  5. Agenda - Zipkin infrastructure - pyramid_zipkin / swagger_zipkin - Lessons learned - Future plans
  6. Infrastructure overview - 250+ services - We <3 Python - Pyramid/uwsgi framework - SmartStack for service discovery - Swagger for API schema declaration - Zipkin transport: Kafka | Zipkin datastore: Cassandra - Traces are generated on live traffic at a very low rate (0.005%) - Traces can also be generated on demand by passing a particular query param
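
The wiring below is a minimal sketch of how a Pyramid service might enable this kind of tracing, assuming the settings names documented in the pyramid_zipkin README (zipkin.tracing_percent, zipkin.transport_handler); the exact Yelp configuration and the Kafka producer wiring are not shown in the talk.

    from pyramid.config import Configurator

    def kafka_transport(encoded_span):
        # Ship the encoded span to the 'zipkin' Kafka topic; producer setup is
        # omitted here, and the handler signature can vary between
        # pyramid_zipkin versions.
        pass

    def main(global_config, **settings):
        settings.update({
            'zipkin.tracing_percent': 0.005,           # sample ~0.005% of live traffic
            'zipkin.transport_handler': kafka_transport,
        })
        config = Configurator(settings=settings)
        config.include('pyramid_zipkin')               # wraps every request in a server span
        return config.make_wsgi_app()
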
  7. Infrastructure overview - Let's talk about a scenario where service A calls B.
  8. pyramid_zipkin - A simple decorator around every request - Able to handle scribe | Kafka transport - Attaches a `unique_request_id` to every request - No changes needed in the service logic - Ability to add annotations using Python's `logging` module - Ability to add custom spans (diagram: Service B stack of uwsgi / pyramid / pyramid_zipkin)
  9. pyramid_zipkin - Ability to add custom spans (same Service B stack diagram)
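
As a rough illustration of the custom-span and annotation features, here is a sketch using the zipkin_span context manager from py_zipkin, the library this span machinery was later extracted into; the import path and the logger name are assumptions that depend on the pyramid_zipkin version in use.

    import logging
    import time

    from py_zipkin.zipkin import zipkin_span

    def expensive_view_logic():
        # Inside a request already traced by pyramid_zipkin, a nested
        # zipkin_span produces a child span around this block.
        with zipkin_span(service_name='service_b', span_name='recompute_rankings'):
            time.sleep(0.1)  # stand-in for real work

        # Extra annotations can be attached through the logging module; the
        # logger name follows the pyramid_zipkin docs of that era and is an
        # assumption.
        zipkin_logger = logging.getLogger('pyramid_zipkin.logger')
        zipkin_logger.debug({
            'annotations': {'rankings_recomputed': time.time()},
            'binary_annotations': {'cache': 'miss'},
        })
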
  10. swagger_zipkin - Eliminates the manual work of attaching Zipkin headers - Decorates swagger clients: swaggerpy (Swagger v1.2) and bravado (Swagger v2.0) (diagram: Service A calling through swagger_zipkin wrapped around the swagger client)
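
A minimal sketch of what the calling side might look like, assuming the ZipkinClientDecorator shown in the swagger_zipkin README and a bravado (Swagger v2.0) client; the service URL, resource, and operation names are placeholders.

    from bravado.client import SwaggerClient
    from swagger_zipkin.zipkin_decorator import ZipkinClientDecorator

    # Build a normal swagger client for service B, then wrap it so every
    # outgoing call carries the current trace's X-B3-* headers.
    client = SwaggerClient.from_url('http://service-b.internal/swagger.json')
    zipkin_client = ZipkinClientDecorator(client)

    # Calls look exactly like they do on the undecorated client.
    business = zipkin_client.business.get_business(business_id='some-id').result()
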
  11. Lessons Learned - Cassandra is an excellent datastore for heavy writes - Typical prod writes/sec: 15k - It was able to handle even 100k writes/sec
  12. Lessons Learned - Allocating off-heap memory for Cassandra helped cut write latency by 2x - Pending compactions also went down
  13. Lessons Learned - With more services added, fetching from Kafka became a bottleneck - Solutions tried: adding more Kafka partitions; running more collector instances; adding multiple Kafka consumer threads, with appropriate changes in openzipkin/zipkin (WIN); batching up messages before sending to Kafka, with appropriate changes in openzipkin/zipkin (BIG WIN)
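
For flavor, here is a producer-side batching sketch using kafka-python's built-in batching knobs (linger_ms, batch_size). It is only illustrative: the change described above involved batching spans into Kafka messages, with matching support added to the collector in openzipkin/zipkin.

    from kafka import KafkaProducer

    # kafka-python groups records that arrive within linger_ms into one request
    # per partition, so spans are shipped in batches instead of one by one.
    producer = KafkaProducer(
        bootstrap_servers=['kafka-broker:9092'],  # placeholder broker address
        linger_ms=50,                             # wait up to 50 ms to fill a batch
        batch_size=64 * 1024,                     # up to 64 KB of spans per batch
    )

    def kafka_transport(encoded_span):
        # Plugged in as the zipkin transport handler; the handler signature
        # varies between pyramid_zipkin versions.
        producer.send('zipkin', encoded_span)
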
  18. Future Plans - Use traces during deployments to check for degradations: validate differences in the number of downstream calls, catch any new dependency sneaking in, and compare time differences in the spans - Create trace aggregation infrastructure using Splunk (WIP), a missing part of Zipkin - Redeploy the Zipkin dependency graph service after improvements: the service was unprovisioned because it created 100s of GB of /tmp files; the files got purged after the run (in ~1-2 hours), but in the meantime ops got alerted about low remaining disk space, and the graph didn't add much value
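
Since the deploy-time check is still a plan, the sketch below is purely hypothetical: it pulls recent traces for a service from Zipkin's v1 HTTP API and counts spans per span name, so counts taken before and after a deploy can be compared to spot new dependencies or extra downstream calls. The Zipkin URL and service name are placeholders.

    import collections
    import requests

    def downstream_call_counts(zipkin_url, service_name, limit=20):
        # Fetch recent traces for the service and count spans per span name,
        # a rough proxy for how many downstream calls each request fans out to.
        traces = requests.get(
            '%s/api/v1/traces' % zipkin_url,
            params={'serviceName': service_name, 'limit': limit},
        ).json()
        counts = collections.Counter()
        for trace in traces:
            for span in trace:
                counts[span['name']] += 1
        return counts

    before = downstream_call_counts('http://zipkin.internal:9411', 'service_a')
    # ... deploy service_a ...
    after = downstream_call_counts('http://zipkin.internal:9411', 'service_a')
    print(after - before)  # span names that appear more often after the deploy
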
  19. @YelpEngineering | fb.com/YelpEngineers | engineeringblog.yelp.com | github.com/yelp
