Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using Flink"

Distributed tracing is used to analyze performance and error cases in service-oriented architectures. The Observability team at Airbnb recently created Upshot, a data pipeline that uses Flink to analyze over 40 million trace events per minute. Summaries of the resulting data are sent to Druid, Datadog, and other downstream datastores. This talk will focus on how we use Flink and how we analyzed and addressed scaling issues we encountered while building Upshot.

Published in: Technology

  1. 1. Upshot: Distributed Tracing with Flink / Brian Wolfe / 2018-09-05 / Flink Forward 2018
  2. 2. Tracing with Flink ● Observability at Airbnb ● Trace reconstruction job ● Successes (and failures)
  3. 3. Tracing with Flink ● Observability at Airbnb ● Trace reconstruction job ● Successes (and failures)
  4. 4. Airbnb: 5M+ listings worldwide; 81,000 cities with listings; 191+ countries with listings; 300 million+ total guest arrivals
  5. 5. Observability
  6. 6. Observability: Ensure that software engineers at Airbnb have the monitoring and introspection tools to successfully operate and develop their services.
  7. 7. The Reality
  8. 8. Reality 1: many tools
  9. 9. Reality 1: many tools ?
  10. 10. Reality 1: many tools
  11. 11. Reality 2: data silos
  12. 12. Reality 3: choosing the right data is hard (diagram labels: Present, Enrich, DS, Validate)
  13. 13. Reality 3: choosing the right data is hard (diagram labels: Present, Enrich, DS, Validate; callout: Late error in request)
  14. 14. Reality 3: choosing the right data is hard (diagram labels: Present, Enrich, DS, Validate; callouts: Late error in request; Give me ALL the data!!)
  15. 15. Tracing with Flink ● Observability at Airbnb ● Trace reconstruction job ● Successes (and failures)
  16. 16. Upshot Pipeline
  17. 17. Flink Aggregation
  18. 18. Flink Aggregation: CollectTrace, after one event
  19. 19. Flink Aggregation: CollectTrace, after more events
  20. 20. Flink Aggregation: CollectTrace, after all events
  21. 21. Flink Aggregation: CollectTrace, after all events
  22. 22. Tracing with Flink ● Observability at Airbnb ● Trace reconstruction job ● Successes (and failures)
  23. 23. Success: it works. 3 million events/s
  24. 24. Success: One Instrumentation, Several Uses. High dynamic range histograms in Druid; individual request visualizations (chart labels: Response time (ms), Quantile, ms, Service call)
  25. 25. Success: Post-Sampling of Events
  26. 26. Success: testing by dual-reading
  27. 27. Problem: serialization
  28. 28. Problem: serialization performance. 3x performance improvement with a custom serializer.
  29. 29. Problem: serialization performance. 3x performance improvement with a custom serializer. String deserialization, allocation, and GC are still > 25% of CPU cost.
  30. 30. Problem: coupling between all operators
  31. 31. Problem: coupling between all operators
  32. 32. Problem: coupling between all operators
  33. 33. Problem: coupling between all operators
  34. 34. Problem: coupling between all operators
  35. 35. Problem: coupling between all operators
  36. 36. Problem: coupling between all operators
  37. 37. Problem: coupling between all operators
  38. 38. Problem: coupling between all operators
  39. 39. Problem: coupling between all operators. Better after Flink 1.5, but still a problem
  40. 40. Redesigning the data flow
  41. 41. Problem: Coupling Between All Operators. Use Flink partitioner to write to Kafka
  42. 42. Problem: Coupling Between All Operators. Use Flink partitioner to write to Kafka. Limit fan-out
  43. 43. What is next?
  44. 44. Next Steps ● Localize impact of degraded performance ● Load shed when necessary ● Add more links between observability tools
  45. 45. Observability Team: Willie Yao, Nelson Gauthier, Jerry Chung, Rong Hu, Chen Luo, Sarah Wada, Harry Shoff, Nathan Baxter, Joseph Sofaer
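
The CollectTrace slides (18 to 21) and the post-sampling slide (25) give only titles, so the following is a minimal sketch of the general idea rather than Upshot's actual code. It is plain Python, not the Flink API, and every name in it is hypothetical: the `TraceEvent` fields, the idle-timeout rule for deciding a trace is complete, and the "keep all errors, sample the rest" policy are assumptions made for illustration.

```python
import zlib
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TraceEvent:
    # Hypothetical event shape; the slides do not show the real schema.
    trace_id: str
    span_id: str
    service: str
    timestamp_ms: int
    is_error: bool = False


@dataclass
class TraceCollector:
    # Assumed completion rule: a trace is done once it has been idle this long.
    idle_timeout_ms: int = 5_000
    _events: dict = field(default_factory=lambda: defaultdict(list))
    _last_seen: dict = field(default_factory=dict)

    def add(self, event: TraceEvent) -> None:
        """Buffer one event under its trace ID (stands in for keyed state)."""
        self._events[event.trace_id].append(event)
        prev = self._last_seen.get(event.trace_id, event.timestamp_ms)
        self._last_seen[event.trace_id] = max(prev, event.timestamp_ms)

    def flush(self, now_ms: int) -> list:
        """Return and drop every trace that has been idle past the timeout."""
        done = [tid for tid, seen in self._last_seen.items()
                if now_ms - seen >= self.idle_timeout_ms]
        traces = []
        for tid in done:
            events = sorted(self._events.pop(tid), key=lambda e: e.timestamp_ms)
            del self._last_seen[tid]
            traces.append(events)
        return traces


def keep_trace(events: list, sample_every: int = 100) -> bool:
    """Post-sampling: decide only after the whole trace has been collected."""
    if any(e.is_error for e in events):
        return True  # assumed policy: never drop a trace containing an error
    # Deterministic 1-in-N sampling keyed on the trace ID.
    return zlib.crc32(events[0].trace_id.encode()) % sample_every == 0


collector = TraceCollector(idle_timeout_ms=1_000)
collector.add(TraceEvent("t1", "b", "enrich", 20))
collector.add(TraceEvent("t1", "a", "present", 10))
collector.add(TraceEvent("t2", "c", "validate", 900, is_error=True))
for trace in collector.flush(now_ms=2_000):
    print(trace[0].trace_id, len(trace), keep_trace(trace))
```

Deciding after collection is what lets a late error anywhere in the request keep the entire trace, which sampling at the first event cannot do. In a real Flink job the buffer and the timeout would more likely be keyed state and timers than an in-memory dictionary.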
