Upshot: Distributed Tracing with Flink
Brian Wolfe / 2018-09-05 / Flink Forward 2018
Tracing with Flink
● Observability at Airbnb
● Trace reconstruction job
● Successes (and failures)
Airbnb
● 5M+ listings worldwide
● 81,000 cities with listings
● 191+ countries with listings
● 300 million+ total guest arrivals
Observability
Ensure that software engineers at Airbnb have the monitoring and introspection tools to successfully operate and develop their services.
The Reality
Reality 1: many tools
Reality 2: data silos
Reality 3: choosing the right data is hard
[Diagram labels: Present, Enrich, DS, Validate]
Late error in request
Give me ALL the data!!
Trace reconstruction job
Upshot Pipeline
Flink Aggregation
Flink Aggregation: CollectTrace
After one event
After more events
After all events
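The CollectTrace build above (one event, more events, all events) can be illustrated with a minimal sketch in plain Python rather than Flink. Everything here is an assumption for illustration: the `Event` fields, the `gap_ms` parameter, and above all the completion rule (a trace is emitted once it has been idle for `gap_ms`). The slides do not say how the real job decides that a trace is complete.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    # Hypothetical event shape; the real Upshot schema is not shown in the slides.
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    timestamp_ms: int


class CollectTrace:
    """Buffers events per trace ID and emits a trace after an idle gap."""

    def __init__(self, gap_ms):
        self.gap_ms = gap_ms
        self.events = defaultdict(list)  # trace_id -> buffered events
        self.last_seen = {}              # trace_id -> newest event timestamp

    def add(self, event):
        tid = event.trace_id
        self.events[tid].append(event)
        prev = self.last_seen.get(tid, event.timestamp_ms)
        self.last_seen[tid] = max(prev, event.timestamp_ms)

    def flush(self, now_ms):
        """Return and drop every trace that has been idle for at least gap_ms."""
        done = [t for t, seen in self.last_seen.items()
                if now_ms - seen >= self.gap_ms]
        traces = {}
        for tid in done:
            traces[tid] = sorted(self.events.pop(tid),
                                 key=lambda e: e.timestamp_ms)
            del self.last_seen[tid]
        return traces
```

In a keyed stream job the buffer would live in per-key state and `flush` would be driven by a timer; the dictionary here only stands in for that state.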
Successes (and failures)
Success: it works
3 million events/s
Success: One Instrumentation, Several Uses
● High dynamic range histograms in Druid
● Individual request visualizations
[Figure labels: Response time (ms); Quantile; ms; Service call]
Success: Post-Sampling of Events
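Post-sampling means the keep-or-drop decision is made after a trace has been collected, so a late error in the request can still be captured. A minimal sketch in plain Python, assuming a hypothetical policy (keep every trace that contains an error, plus a fixed fraction of the rest); the function name, the `error` field, and `base_rate` are illustrative and not taken from the talk:

```python
import hashlib


def keep_trace(trace_id, events, base_rate=0.01):
    """Decide whether to keep a fully collected trace.

    events: list of dicts, each optionally carrying a truthy "error" field
    (a hypothetical shape). base_rate: fraction of error-free traces to keep.
    """
    # Always keep traces in which any event reported an error.
    if any(e.get("error") for e in events):
        return True
    # Otherwise sample deterministically on the trace ID, so that the same
    # trace gets the same decision wherever it is evaluated.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < base_rate * 2**64
```

Hashing the trace ID instead of drawing a random number keeps the decision reproducible, which helps when comparing the output of two readers of the same stream.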
Success: testing by dual-reading
Problem: serialization
Problem: serialization performance
3x performance improvement with a custom serializer
String deserialization, allocation, and GC still account for more than 25% of CPU cost.
Problem: coupling between all operators
Better after Flink 1.5, but still a problem
Redesigning the data flow
Problem: Coupling Between All Operators
Use Flink partitioner to write to Kafka
Limit fan-out
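One possible reading of this slide: if events are written to Kafka already partitioned by trace ID, each downstream subtask only needs the partitions that hold its keys, instead of every upstream operator shuffling to every downstream operator. The sketch below is plain Python under that assumption. The CRC32 hash and the modulo assignment are stand-ins chosen for illustration; they are not Flink's key-group hashing or the Kafka connector's actual partitioner.

```python
import zlib


def partition_for(trace_id, num_partitions):
    """Kafka partition for a trace ID (illustrative hash, not Flink's)."""
    return zlib.crc32(trace_id.encode("utf-8")) % num_partitions


def partitions_for_subtask(subtask, num_subtasks, num_partitions):
    """Partitions a given consumer subtask reads under modulo assignment."""
    return [p for p in range(num_partitions) if p % num_subtasks == subtask]


def subtask_for(trace_id, num_subtasks, num_partitions):
    """Subtask that sees every event of a trace, with no further shuffle."""
    return partition_for(trace_id, num_partitions) % num_subtasks
```

With 12 partitions and 4 subtasks, each subtask reads exactly 3 partitions, and all events of one trace arrive at a single subtask, so the fan-out per operator stays bounded.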
What is next?
Next Steps
● Localize impact of degraded performance
● Load shed when necessary
● Add more links between observability tools
Observability Team
Willie Yao, Nelson Gauthier, Jerry Chung
Rong Hu, Chen Luo, Sarah Wada
Harry Shoff, Nathan Baxter, Joseph Sofaer
Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using Flink"
