
OpenZipkin Conf: Zipkin at Yelp

This talk describes how Yelp deploys Zipkin and integrates it with its 250+ services. It also covers the challenges we faced while scaling it and how we tuned it.


  1. Zipkin @ Yelp - Prateek Agarwal (@prat0318)
  2. About me
     - Prateek Agarwal
     - Software Engineer, Infrastructure team @ Yelp
     - Have worked on:
       - Python Swagger clients
       - Zipkin infrastructure
       - Maintaining Cassandra and ES clusters
  3. Yelp’s Mission: Connecting people with great local businesses.
  4. Yelp Stats (as of Q1 2016): 90M, 102M, 3270%
  5. Agenda
     - Zipkin infrastructure
     - pyramid_zipkin / swagger_zipkin
     - Lessons learned
     - Future plans
  6. Infrastructure overview
     - 250+ services
     - We <3 Python
     - Pyramid/uWSGI framework
     - SmartStack for service discovery
     - Swagger for API schema declaration
     - Zipkin transport: Kafka | Zipkin datastore: Cassandra (configuration sketched below)
     - Traces are generated on live traffic at a very low rate (0.005%)
     - Traces can also be generated on demand by passing a particular query param
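     A minimal sketch of how a Pyramid service could wire pyramid_zipkin to Kafka with that 0.005% sample rate is shown below. The setting names follow the public pyramid_zipkin documentation of that era (newer releases pass only the encoded message to the transport handler); the broker address and topic are placeholders, and Yelp's internal wiring is not described in the talk.

        from kafka import KafkaProducer
        from pyramid.config import Configurator

        # Hypothetical broker address; the talk does not describe the actual Kafka setup.
        producer = KafkaProducer(bootstrap_servers='kafka.example.com:9092')

        def kafka_transport(stream_name, message):
            # pyramid_zipkin hands the encoded span payload to this callable;
            # forward it to the Kafka topic the Zipkin collector reads from.
            producer.send(stream_name, message)

        def main(global_config, **settings):
            settings['zipkin.transport_handler'] = kafka_transport
            settings['zipkin.stream_name'] = 'zipkin'      # Kafka topic name
            settings['zipkin.tracing_percent'] = 0.005     # trace roughly 0.005% of live requests
            config = Configurator(settings=settings)
            config.include('pyramid_zipkin')               # installs the tracing tween around every request
            return config.make_wsgi_app()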
  7. Infrastructure overview - Let’s talk about a scenario where service A calls B.
  8. pyramid_zipkin
     - A simple decorator around every request
     - Able to handle scribe | kafka transport
     - Attaches a `unique_request_id` to every request
     - No changes needed in the service logic
     - Ability to add annotations using Python’s `logging` module
     - Ability to add custom spans
     (diagram: Service B request stack - pyramid_zipkin, pyramid, uwsgi)
  9. pyramid_zipkin (continued)
     - Ability to add custom spans (example sketched below)
     (diagram: Service B request stack - pyramid_zipkin, pyramid, uwsgi)
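     For the custom-span bullet above, a rough sketch using the span context manager from the py_zipkin family (py_zipkin.zipkin.zipkin_span) follows; the view function, span name, and cache_lookup helper are invented for illustration and are not from the talk.

        from py_zipkin.zipkin import zipkin_span

        def get_business(request):
            # Wrap an interesting unit of work in its own span; it attaches to
            # the trace that pyramid_zipkin already started for this request.
            with zipkin_span(service_name='service_b', span_name='lookup_business_in_cache'):
                business = cache_lookup(request.matchdict['business_id'])  # hypothetical helper
            return business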
  10. swagger_zipkin
      - Eliminates the manual work of attaching zipkin headers
      - Decorates over swagger clients:
        - swaggerpy (swagger v1.2)
        - bravado (swagger v2.0)
      (diagram: Service A stack - swagger_client, swagger_zipkin; usage sketched below)
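      Roughly following the public swagger_zipkin README, wrapping a bravado client looks like the sketch below; the spec URL, resource, and operation names are placeholders.

        from bravado.client import SwaggerClient
        from swagger_zipkin.zipkin_decorator import ZipkinClientDecorator

        # Build a normal bravado client from a (placeholder) Swagger spec ...
        client = SwaggerClient.from_url('http://service-b.example.com/swagger.json')

        # ... and decorate it so outgoing calls carry the Zipkin headers
        # (X-B3-TraceId, X-B3-SpanId, ...) of the current request.
        zipkin_client = ZipkinClientDecorator(client)

        business = zipkin_client.business.get_business(business_id=42).result()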
  11. Lessons Learned
      - Cassandra is an excellent datastore for heavy writes
      - Typical prod writes/sec: 15k
      - It was able to handle even 100k writes/sec
  12. Lessons Learned
      - Allocating off-heap memory for Cassandra helped reduce write latency by 2x (config sketched below)
      - Pending compactions also went down
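      The slides don't say which settings were changed; one common way to move Cassandra memtables off the JVM heap (Cassandra 2.1+) is via cassandra.yaml, sketched here with an illustrative size.

        # cassandra.yaml (excerpt) - allocate memtables off the JVM heap
        memtable_allocation_type: offheap_objects
        memtable_offheap_space_in_mb: 2048   # illustrative size, tune per node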
  13. Lessons Learned
      - With more services added, fetching from Kafka became a bottleneck
      - Solutions tried:
        - Adding more kafka partitions
        - Running more instances of the collector
        - Adding multiple kafka consumer threads (with appropriate changes in openzipkin/zipkin) - WIN
        - Batching up messages before sending to Kafka (with appropriate changes in openzipkin/zipkin) - BIG WIN
      (producer-side batching sketched below)
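      The batching win itself packed multiple spans into each Kafka message and needed matching decode changes in openzipkin/zipkin, which isn't reproduced here. As a loosely related illustration, kafka-python's producer exposes knobs that batch queued messages into fewer broker requests:

        from kafka import KafkaProducer

        # Queue spans for up to 50 ms (or until 64 KB accumulates) so many
        # small span messages travel to the broker in one request.
        producer = KafkaProducer(
            bootstrap_servers='kafka.example.com:9092',  # hypothetical broker
            linger_ms=50,
            batch_size=64 * 1024,
        )

        def kafka_transport(stream_name, message):
            producer.send(stream_name, message)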
  18. Future Plans
      - Use traces during deployments to check for degradations (see the sketch below):
        - Validate differences in the number of downstream calls
        - Check against any new dependency sneaking in
        - Compare time differences in the spans
      - Create trace aggregation infrastructure using Splunk (WIP)
        - A missing part of Zipkin
      - Redeploy the zipkin dependency graph service after improvements
        - The service was unprovisioned because it created hundreds of GB of /tmp files
        - These files got purged after the run (in ~1-2 hours)
        - Meanwhile, ops got alerted due to low remaining disk space
        - It didn’t add much value
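      As a purely hypothetical sketch of that deploy-time check, one could fetch spans for comparable traces before and after a deploy and diff the per-service call counts. The flat span shape with a serviceName key is a simplifying assumption; the real Zipkin API nests service names inside annotation endpoints.

        from collections import Counter

        def downstream_call_counts(spans):
            # spans: simplified span dicts carrying a 'serviceName' key.
            return Counter(span['serviceName'] for span in spans)

        def diff_downstream_calls(before_spans, after_spans):
            before = downstream_call_counts(before_spans)
            after = downstream_call_counts(after_spans)
            # Report every service whose call count changed, including brand-new
            # dependencies that sneak in with the deploy (count goes 0 -> n).
            return {
                service: (before.get(service, 0), after.get(service, 0))
                for service in set(before) | set(after)
                if before.get(service, 0) != after.get(service, 0)
            }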
  19. @YelpEngineering | fb.com/YelpEngineers | engineeringblog.yelp.com | github.com/yelp
