This talk describes how Yelp deploys Zipkin and integrates it with its 250+ services. It also covers the challenges faced while scaling it and how we tuned it.
2.
About me
- Prateek Agarwal
- Software Engineer
- Infrastructure team @ Yelp
- Has worked on:
  - Python Swagger clients
  - Zipkin infrastructure
  - Maintaining Cassandra and Elasticsearch clusters
3.
Yelp’s Mission
Connecting people with great
local businesses.
6.
Infrastructure overview
- 250+ services
- We <3 Python
- Pyramid/uWSGI framework
- SmartStack for service discovery
- Swagger for API schema declaration
- Zipkin transport: Kafka | Zipkin datastore: Cassandra
- Traces are sampled from live traffic at a very low rate (0.005%); see the sketch below
- Traces can also be generated on demand by passing a particular query param
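As a rough illustration of the sampling policy above, here is a minimal sketch in Python; the `force_trace` query param and the `should_trace` helper are hypothetical stand-ins, not Yelp's actual names:

    import random

    TRACING_PERCENT = 0.005  # percentage of live requests that get traced

    def should_trace(request):
        """Decide whether this request should generate a Zipkin trace."""
        # On-demand tracing: a (hypothetical) query param forces a trace.
        if request.params.get('force_trace') == '1':
            return True
        # Otherwise sample a very small fraction of live traffic.
        return random.random() * 100 < TRACING_PERCENT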
7.
Infrastructure overview
Let’s talk about a scenario where service A calls B.
8.
pyramid_zipkin
- A simple decorator around every request
- Handles both Scribe and Kafka transports
- Attaches a `unique_request_id` to every request
- No changes needed in the service logic
- Ability to add annotations using Python’s `logging` module
- Ability to add custom spans; see the sketch below
[Diagram: Service B’s stack, with pyramid_zipkin layered over pyramid and uwsgi]
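A minimal sketch of wiring pyramid_zipkin into a Pyramid app. The setting names follow pyramid_zipkin's documentation, but the transport handler body is an illustrative stub (its exact signature has varied across versions) and `service_b` is a placeholder service name:

    from pyramid.config import Configurator

    def main(global_config, **settings):
        # Picked up by pyramid_zipkin's tween.
        settings['service_name'] = 'service_b'
        settings['zipkin.tracing_percent'] = 0.005

        def transport_handler(message):
            # Ship the serialized span data off, e.g. produce it to the
            # Zipkin Kafka topic. Left as a stub here.
            pass

        settings['zipkin.transport_handler'] = transport_handler

        config = Configurator(settings=settings)
        # Registers a tween that traces every request; no service-code changes.
        config.include('pyramid_zipkin')
        return config.make_wsgi_app()

Custom spans inside request handlers can then be opened with py_zipkin's `zipkin_span` context manager (in versions where pyramid_zipkin is built on py_zipkin).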
10.
swagger_zipkin
- Eliminates the manual work of attaching Zipkin headers
- Decorates Swagger clients (see the sketch below):
  - swaggerpy (Swagger v1.2)
  - bravado (Swagger v2.0)
[Diagram: Service A’s swagger_client wrapped by swagger_zipkin]
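A minimal sketch of decorating a bravado client with swagger_zipkin, roughly following the library's README; the spec URL and the `getPetById` operation are placeholders:

    from bravado.client import SwaggerClient
    from swagger_zipkin.zipkin_decorator import ZipkinClientDecorator

    client = SwaggerClient.from_url('http://petstore.swagger.io/v2/swagger.json')

    # The decorator forwards the current trace's Zipkin headers
    # (X-B3-TraceId, X-B3-SpanId, ...) on every outgoing call.
    zipkin_wrapped_client = ZipkinClientDecorator(client)

    pet = zipkin_wrapped_client.pet.getPetById(petId=42).result()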
11.
Lessons Learned
- Cassandra is an excellent datastore for write-heavy workloads
  - Typical production write rate: ~15k writes/sec
  - It was able to handle even 100k writes/sec
12.
Lessons Learned
- Allocating off-heap memory for Cassandra reduced write latency by 2x
- Pending compactions also went down
13.
Lessons Learned
- As more services were added, consuming from Kafka became a bottleneck
- Solutions tried:
  - Adding more Kafka partitions
  - Running more instances of the collector
  - Adding multiple Kafka consumer threads
    - with appropriate changes in openzipkin/zipkin
    - WIN
  - Batching up messages before sending to Kafka (see the sketch below)
    - with appropriate changes in openzipkin/zipkin
    - BIG WIN
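A rough sketch of the batching idea on the producer side, using kafka-python purely for illustration; the actual payload encoding has to match what the (modified) Zipkin collector expects, and the class, topic, and broker names here are made up:

    from kafka import KafkaProducer

    class BatchingSpanTransport(object):
        """Buffers encoded spans and produces them to Kafka in batches
        instead of one Kafka message per span."""

        def __init__(self, topic='zipkin', max_batch=100):
            self.producer = KafkaProducer(
                bootstrap_servers=['kafka:9092'],
                batch_size=64 * 1024,  # let the producer coalesce on the wire too
                linger_ms=50,
            )
            self.topic = topic
            self.max_batch = max_batch
            self.buffer = []

        def send(self, encoded_span):
            self.buffer.append(encoded_span)
            if len(self.buffer) >= self.max_batch:
                self.flush()

        def flush(self):
            if not self.buffer:
                return
            # One Kafka message carrying many spans; the collector must be
            # able to decode a batched payload (hence the zipkin changes).
            self.producer.send(self.topic, b''.join(self.buffer))
            self.buffer = []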
18.
Future Plans
- To be used during deployments to check for degradations
  - Validate differences in the number of downstream calls (see the sketch after this list)
  - Check against any new dependency sneaking in
  - Compare timing differences in the spans
- Create trace aggregation infrastructure using Splunk (work in progress)
  - A missing part of Zipkin
- Redeploy the Zipkin dependency graph service after improvements
  - The service was unprovisioned because it created hundreds of gigabytes of /tmp files
  - These files got purged after the run (in ~1-2 hours)
  - Meanwhile, ops got alerted due to low disk space remaining
  - It didn’t add much value at the time
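As a sketch of how the downstream-call validation could work against Zipkin's v1 HTTP API; the host, and the idea of diffing two individual traces taken before and after a deploy, are illustrative assumptions:

    import collections
    import requests

    ZIPKIN_API = 'http://zipkin.example.com:9411/api/v1'  # placeholder host

    def downstream_call_counts(trace_id):
        """Count spans per (service, span name) in one trace."""
        spans = requests.get('{}/trace/{}'.format(ZIPKIN_API, trace_id)).json()
        counts = collections.Counter()
        for span in spans:
            # Each annotation carries the endpoint (service) that emitted it.
            services = {a['endpoint']['serviceName']
                        for a in span.get('annotations', [])
                        if a.get('endpoint')}
            for service in services:
                counts[(service, span['name'])] += 1
        return counts

    def compare(before_trace_id, after_trace_id):
        """Diff two traces taken before and after a deployment."""
        before = downstream_call_counts(before_trace_id)
        after = downstream_call_counts(after_trace_id)
        for key in sorted(set(before) | set(after)):
            if before.get(key, 0) != after.get(key, 0):
                # A new dependency or a changed call count shows up here.
                print(key, before.get(key, 0), '->', after.get(key, 0))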