
Pintrace: Distributed tracing@Pinterest

#VelocityConf 2017 talk

Published in: Engineering

  1. Pintrace: Distributed Tracing @ Pinterest. Suman Karumuri
  2. About me • Passionate about distributed tracing, monitoring and cloud infrastructure. • Lead on Visibility team at Pinterest. • Lead for Zipkin project at Twitter (briefly). • Author of “Distributed tracing” (upcoming) from O’Reilly. • Ex-(Twitter, Facebook, Amazon, Yahoo, Goldman Sachs).
  3. Motivation
  4. Speed improves engagement
  5. Micro-services broke our tools. How did this request execute?
  6. Metrics: Aggregate events per service. Understand trends and alerts. Cheap. Service-level overview. No per-request overview.
  7. Logs: Record discrete events. Manual correlation. Expensive. Flexible but very brittle.
  8. Project Prestige Pinpoint Manual Tracing
  9. What is Distributed Tracing? Record events in a request with causal ordering.
  10. What is Distributed Tracing? Structured logging on steroids. Annotation, span, trace.
  11. Pintrace. Trace requests: record events in a request with causal ordering, across mobile clients, backend services and databases. Zipkin-based tracing solution. More expensive.
  12. Building Pintrace: 5 Challenges
  13. Build instrumentation (Challenge 1). Hard & tedious. One instrumentation per (language, framework, thread pool, protocol) combination. OpenTracing Python tracer, Finagle Zipkin tracer.
  14. Span report and aggregation (Challenge 2). First company-wide span aggregation pipeline.
  15. Deploy instrumentation (Challenge 3). 3 instrumentations. 100+ services. 40 teams. Sampling <1% traffic.
  16. Trace processing and storage (Challenge 4). Open sourced our streaming pipeline: github.com/openzipkin/zipkin-sparkstreaming
  17. Trace visualization (Challenge 5). Pintrace architecture.
  18. Traces are data. Zipkin UI.
  19. Applications of trace data: understand, debug and tune distributed systems.
  20. Identifying services interacting with a request (understand request timeline).
  21. Identifying duplicate computation (understand request timeline). 5% latency improvement (20 ms) while halving the load.
  22. Which cluster served this request? (Debug distributed system.)
  23. Custom application spans (debug distributed system).
  24. Identify clock skew (debug distributed system). Clock skew is very common in cloud environments. It is easily identified in a trace. Zipkin UI corrects for clock skew.
  25. Identifying serial execution (tune distributed system). A step pattern in a trace signifies serial execution. Parallel get_many after the bug fix.
  26. More applications of trace data • Tracking down p99 latencies. • Identify architectural optimizations. • Latency pipeline. • Service dependency analysis. • Improve time to triage. • Automated root cause analysis.
  27. Lessons learned • User awareness and education are very important to make tracing successful. • Begin with the end in mind. • Trace most valuable paths in the application. • Distributed tracing landscape is confusing. • Quality of traces is more important than quantity.
  28. Questions? https://tinyurl.com/pintrace-architecture https://tinyurl.com/pintrace-applications skarumuri@pinterest.com twitter: @mansu
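
Slides 10 and 25 mention the annotation/span/trace vocabulary and the "step pattern" that signals serial execution. The sketch below illustrates those two ideas only. It is not Pintrace or Zipkin code: the `Span` class, its field names, the `children_run_serially` helper, and the sample timings are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One timed operation in a trace (hypothetical model, not Zipkin's schema)."""
    trace_id: str                 # shared by every span of one request
    span_id: str
    parent_id: Optional[str]      # None for the root span; gives causal ordering
    name: str
    start_us: int                 # start time in microseconds
    duration_us: int              # elapsed time in microseconds
    annotations: Dict[str, str] = field(default_factory=dict)

    @property
    def end_us(self) -> int:
        return self.start_us + self.duration_us


def children_run_serially(spans: List[Span], parent_id: str) -> bool:
    """Return True when the parent's child spans never overlap in time.

    Non-overlapping siblings are what shows up as a step pattern in a
    trace timeline. Fewer than two children cannot form a pattern.
    """
    children = sorted(
        (s for s in spans if s.parent_id == parent_id),
        key=lambda s: s.start_us,
    )
    if len(children) < 2:
        return False
    return all(a.end_us <= b.start_us for a, b in zip(children, children[1:]))


def make_trace(child_starts: List[int]) -> List[Span]:
    """Build a toy trace: one root span plus one 10 us child per start time."""
    root = Span("t1", "root", None, "get_many", 0, 30)
    kids = [
        Span("t1", f"c{i}", "root", "get", start, 10, {"cluster": "assumed-name"})
        for i, start in enumerate(child_starts)
    ]
    return [root] + kids


serial_trace = make_trace([0, 10, 20])    # children run one after another
parallel_trace = make_trace([0, 0, 0])    # children all start together

print(children_run_serially(serial_trace, "root"))
print(children_run_serially(parallel_trace, "root"))
```

With these toy timings the first call reports the step pattern and the second does not. A production check would also need to tolerate clock skew between hosts, which this sketch ignores.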
