SlideShare a Scribd company logo
1 of 26
Download to read offline
Distributed Tracing
Insights from
internal SAAS team
Calendar
> Background
> Our internal DistTrace problems
> Searching for better solution
> Conclusion
> @dxhuy (フィ)
> LINE’s Observability Team EM
Introduction
> Large scale metrics platform (400M
datapoints per min)
> ~~ Logs platform (1M log per min)
> ~~ Dist Tracing platform (2M spans
per min)
Our team
https://speakerdeck.com/line_devday2019/deep-dive-into-lines-time-series-database-flash
> Large scale metrics platform (400M
datapoints per min)
> ~~ Logs platform (1M log per min)
> ~~ Dist Tracing platform (2M spans
per min)
Our team
> What is Distributed Tracing
> Request scope DistTrace
> Concepts
Prerequisite know-how
Elasticsearch
Our setup
Brave

base client
Zipkin base

internal server
Elasticsearch
Elasticsearch
Thrift
Custom Armeria

based collector
https://github.com/line/armeria
NvME based

High spec machines
Our setup
> Inhouse customized Zipkin UI
- Now became upstream default
(zipkin-lens)
Problems with
OSS multi-tenant DistTracing
> Storage cost + scalability
> Standard war (instrument lib)
> UI / UX for multi-tenant
> User voice: high implement
cost, useful spans be sampled
out..
Storage problem
Infra cost
Useful data
If we sampling 

100% data
Infra cost for Trace
>>
Infra cost for app
Sampling problem
> Sampling to reduce storage cost
> All OSS employ “head-based”
sampling (unbiased sampling)
- Many useful data will be
sampling out
- What is useful data btw?
OSS UI / UX problem
> Search by “serviceName” is hard
when you have 100 teams and each
team has 50 services
> Time range query is mostly
useless when you have 1000 rps
Standard war
Zipkin B3 header
OpenTracing
OpenCencus
OpenTelemetry
Language base (go, jvm)
Middleware base (htrace..)
How could we
make it better?
Trace without trace
> The Mystery Machine: End-to-end performance
analysis of large-scale Internet services
- First trace solution at facebook (before
Canopi)
- Use “LOG” to calculate causal analysis
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.478.3942&rep=rep1&type=pdf
Trace without trace (cont)
> Canopy: An End-to-End Performance Tracing
And Analysis System (Facebook)
> From scratch solution (even trace API,
trace standard)
> Head based sampling (token bucket)
https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/
Trace without Trace (cont)
> Transparent tracing of microservice-based
applications
- Utilize Proxy for trace interception
- Utilize Linux Syscall for trace
instrumentation
https://dl.acm.org/doi/abs/10.1145/3297280.3297403
Unify standard
> Universal context propagation for
distributed system instrumentation
- Propose universal “standard” for context
data to overcome standard war
https://dl.acm.org/doi/abs/10.1145/3190508.3190526
Better sampling method
> Sifter: Scalable Sampling for Distributed
Traces, without Feature Engineering
- Tailed-based sampling method
- Utilize Machine-Learning to overcome
feature selection problem
https://dl.acm.org/doi/abs/10.1145/3357223.3362736
Better sampling method
> Honeycomb refinery (DistTrace vendor)
- Feature based, tail-based sampling
- https://docs.honeycomb.io/working-with-
your-data/tracing/refinery/
https://dl.acm.org/doi/abs/10.1145/3357223.3362736
Better sampling method (cont)
> Sifter: Scalable Sampling for Distributed
Traces, without Feature Engineering
- Tailed-based sampling method
- Utilize Machine-Learning to overcome
feature selection problem
https://dl.acm.org/doi/abs/10.1145/3357223.3362736
What we could do
Some insights from our company
> Log is the most informative one
of 3 pillars
> Tail-based sampling seems the
cure for “usefulness” of
DistTrace service
OSS move
Firehorse mode of Zipkin and Jeager
(100% sampling, skip indexing)
> https://cwiki.apache.org/
confluence/display/ZIPKIN/
Firehose+mode
> https://github.com/jaegertracing/
jaeger/issues/1731
Propose architecture
Firehose mode
Trace Client
Feature base

Sampler
Storage 

Log Client
Inject trace ID
Sample-in
-Error trace
-High Latency trace
Construct Trace by
Buffer span in memory
No more search UI
for trace
> User traverse to trace by trace
ID only (no more time range base
search)
> User get trace ID from LOG
search UI (you need search-able
LOG, example: Kibana)
Thanks you
for listening

More Related Content

What's hot

Wayfair Use Case: The four R's of Metrics Delivery
Wayfair Use Case: The four R's of Metrics DeliveryWayfair Use Case: The four R's of Metrics Delivery
Wayfair Use Case: The four R's of Metrics DeliveryInfluxData
 
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Hadoop User Group
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slidesryancox
 
SignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series DatabaseSignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series DatabaseDataStax Academy
 
HBase at Flurry
HBase at FlurryHBase at Flurry
HBase at Flurryddlatham
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanNarayana B
 
XNAT Tuning & Monitoring
XNAT Tuning & MonitoringXNAT Tuning & Monitoring
XNAT Tuning & MonitoringJohn Paulett
 
Flume and Hadoop performance insights
Flume and Hadoop performance insightsFlume and Hadoop performance insights
Flume and Hadoop performance insightsOmid Vahdaty
 
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...Cloudera, Inc.
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed Storage
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed StorageHBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed Storage
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed StorageCloudera, Inc.
 
Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems confluent
 
Backup, Restore, and Disaster Recovery
Backup, Restore, and Disaster RecoveryBackup, Restore, and Disaster Recovery
Backup, Restore, and Disaster RecoveryMongoDB
 
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQLHBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQLCloudera, Inc.
 
Exadata下的数据并行加载、并行卸载及性能监控
Exadata下的数据并行加载、并行卸载及性能监控Exadata下的数据并行加载、并行卸载及性能监控
Exadata下的数据并行加载、并行卸载及性能监控Kaiyao Huang
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance ComputersDave Hiltbrand
 
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB,  or how we implemented a 10-times faster CassandraSeastar / ScyllaDB,  or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB, or how we implemented a 10-times faster CassandraTzach Livyatan
 

What's hot (20)

Wayfair Use Case: The four R's of Metrics Delivery
Wayfair Use Case: The four R's of Metrics DeliveryWayfair Use Case: The four R's of Metrics Delivery
Wayfair Use Case: The four R's of Metrics Delivery
 
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
 
SignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series DatabaseSignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series Database
 
HBase at Flurry
HBase at FlurryHBase at Flurry
HBase at Flurry
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_Plan
 
MongoDB Backup & Disaster Recovery
MongoDB Backup & Disaster RecoveryMongoDB Backup & Disaster Recovery
MongoDB Backup & Disaster Recovery
 
XNAT Tuning & Monitoring
XNAT Tuning & MonitoringXNAT Tuning & Monitoring
XNAT Tuning & Monitoring
 
Flume and Hadoop performance insights
Flume and Hadoop performance insightsFlume and Hadoop performance insights
Flume and Hadoop performance insights
 
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed Storage
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed StorageHBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed Storage
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed Storage
 
Hadoop architecture by ajay
Hadoop architecture by ajayHadoop architecture by ajay
Hadoop architecture by ajay
 
Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
 
Backup, Restore, and Disaster Recovery
Backup, Restore, and Disaster RecoveryBackup, Restore, and Disaster Recovery
Backup, Restore, and Disaster Recovery
 
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQLHBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
 
Exadata下的数据并行加载、并行卸载及性能监控
Exadata下的数据并行加载、并行卸载及性能监控Exadata下的数据并行加载、并行卸载及性能监控
Exadata下的数据并行加载、并行卸载及性能监控
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance Computers
 
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB,  or how we implemented a 10-times faster CassandraSeastar / ScyllaDB,  or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
 

Similar to Distributed Tracing, from internal SAAS insights

Splunk app for stream
Splunk app for stream Splunk app for stream
Splunk app for stream csching
 
Best practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at RenaultBest practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at RenaultDataWorks Summit
 
Kafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINEKafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINEkawamuray
 
LarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Databricks
 
Threat hunting on the wire
Threat hunting on the wireThreat hunting on the wire
Threat hunting on the wireInfoSec Addicts
 
Using Istio to Secure & Monitor Your Services
Using Istio to Secure & Monitor Your ServicesUsing Istio to Secure & Monitor Your Services
Using Istio to Secure & Monitor Your ServicesAlcide
 
TechWiseTV Workshop: Catalyst Switching Programmability
TechWiseTV Workshop: Catalyst Switching ProgrammabilityTechWiseTV Workshop: Catalyst Switching Programmability
TechWiseTV Workshop: Catalyst Switching ProgrammabilityRobb Boyd
 
Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal)
Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal)Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal)
Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal)Jakub Botwicz
 
Distributed Database practicals
Distributed Database practicals Distributed Database practicals
Distributed Database practicals Vrushali Lanjewar
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Provectus
 
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...GeeksLab Odessa
 
Web Template Mechanisms in SOC Verification - DVCon.pdf
Web Template Mechanisms in SOC Verification - DVCon.pdfWeb Template Mechanisms in SOC Verification - DVCon.pdf
Web Template Mechanisms in SOC Verification - DVCon.pdfSamHoney6
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareDatabricks
 
Automated prevention of ransomware with machine learning and gpos
Automated prevention of ransomware with machine learning and gposAutomated prevention of ransomware with machine learning and gpos
Automated prevention of ransomware with machine learning and gposPriyanka Aash
 
SPO2-T11_Automated-Prevention-of-Ransomware-with-Machine-Learning-and-GPOs
SPO2-T11_Automated-Prevention-of-Ransomware-with-Machine-Learning-and-GPOsSPO2-T11_Automated-Prevention-of-Ransomware-with-Machine-Learning-and-GPOs
SPO2-T11_Automated-Prevention-of-Ransomware-with-Machine-Learning-and-GPOsRod Soto
 
Monitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with PrometheusMonitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with PrometheusFabian Reinartz
 
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get Diagnostics
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get DiagnosticsBoris Stoyanov - Troubleshooting the Virtual Router - Run and Get Diagnostics
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get DiagnosticsShapeBlue
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Amazon Web Services
 

Similar to Distributed Tracing, from internal SAAS insights (20)

Splunk app for stream
Splunk app for stream Splunk app for stream
Splunk app for stream
 
Best practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at RenaultBest practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at Renault
 
Kafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINEKafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINE
 
LarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - Introduction
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
 
Threat hunting on the wire
Threat hunting on the wireThreat hunting on the wire
Threat hunting on the wire
 
Using Istio to Secure & Monitor Your Services
Using Istio to Secure & Monitor Your ServicesUsing Istio to Secure & Monitor Your Services
Using Istio to Secure & Monitor Your Services
 
TechWiseTV Workshop: Catalyst Switching Programmability
TechWiseTV Workshop: Catalyst Switching ProgrammabilityTechWiseTV Workshop: Catalyst Switching Programmability
TechWiseTV Workshop: Catalyst Switching Programmability
 
Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal)
Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal)Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal)
Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal)
 
Distributed Database practicals
Distributed Database practicals Distributed Database practicals
Distributed Database practicals
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
 
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
 
Web Template Mechanisms in SOC Verification - DVCon.pdf
Web Template Mechanisms in SOC Verification - DVCon.pdfWeb Template Mechanisms in SOC Verification - DVCon.pdf
Web Template Mechanisms in SOC Verification - DVCon.pdf
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
Automated prevention of ransomware with machine learning and gpos
Automated prevention of ransomware with machine learning and gposAutomated prevention of ransomware with machine learning and gpos
Automated prevention of ransomware with machine learning and gpos
 
SPO2-T11_Automated-Prevention-of-Ransomware-with-Machine-Learning-and-GPOs
SPO2-T11_Automated-Prevention-of-Ransomware-with-Machine-Learning-and-GPOsSPO2-T11_Automated-Prevention-of-Ransomware-with-Machine-Learning-and-GPOs
SPO2-T11_Automated-Prevention-of-Ransomware-with-Machine-Learning-and-GPOs
 
Monitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with PrometheusMonitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with Prometheus
 
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get Diagnostics
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get DiagnosticsBoris Stoyanov - Troubleshooting the Virtual Router - Run and Get Diagnostics
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get Diagnostics
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
 

More from Huy Do

Some note about GC algorithm
Some note about GC algorithmSome note about GC algorithm
Some note about GC algorithmHuy Do
 
Engineering Efficiency in LINE
Engineering Efficiency in LINEEngineering Efficiency in LINE
Engineering Efficiency in LINEHuy Do
 
GOCON Autumn (Story of our own Monitoring Agent in golang)
GOCON Autumn (Story of our own Monitoring Agent in golang)GOCON Autumn (Story of our own Monitoring Agent in golang)
GOCON Autumn (Story of our own Monitoring Agent in golang)Huy Do
 
Story Writing Byte Serializer in Golang
Story Writing Byte Serializer in GolangStory Writing Byte Serializer in Golang
Story Writing Byte Serializer in GolangHuy Do
 
Akka と Typeの話
Akka と Typeの話Akka と Typeの話
Akka と Typeの話Huy Do
 
[Scalameetup]spark shuffle
[Scalameetup]spark shuffle[Scalameetup]spark shuffle
[Scalameetup]spark shuffleHuy Do
 
DI in ruby
DI in rubyDI in ruby
DI in rubyHuy Do
 
Itlc2015
Itlc2015Itlc2015
Itlc2015Huy Do
 
Consistent Hashingの小ネタ
Consistent Hashingの小ネタConsistent Hashingの小ネタ
Consistent Hashingの小ネタHuy Do
 
Thriftを用いた分散型のNyancatを作ってきた
Thriftを用いた分散型のNyancatを作ってきたThriftを用いた分散型のNyancatを作ってきた
Thriftを用いた分散型のNyancatを作ってきたHuy Do
 
NoSQL for great good [hanoi.rb talk]
NoSQL for great good [hanoi.rb talk]NoSQL for great good [hanoi.rb talk]
NoSQL for great good [hanoi.rb talk]Huy Do
 
実践Akka
実践Akka実践Akka
実践AkkaHuy Do
 
CA15卒勉強会 メタプログラミングについて
CA15卒勉強会 メタプログラミングについてCA15卒勉強会 メタプログラミングについて
CA15卒勉強会 メタプログラミングについてHuy Do
 
Making CLI app in ruby
Making CLI app in rubyMaking CLI app in ruby
Making CLI app in rubyHuy Do
 
CacheとRailsの簡単まとめ
CacheとRailsの簡単まとめCacheとRailsの簡単まとめ
CacheとRailsの簡単まとめHuy Do
 
[Htmlday]present
[Htmlday]present[Htmlday]present
[Htmlday]presentHuy Do
 

More from Huy Do (16)

Some note about GC algorithm
Some note about GC algorithmSome note about GC algorithm
Some note about GC algorithm
 
Engineering Efficiency in LINE
Engineering Efficiency in LINEEngineering Efficiency in LINE
Engineering Efficiency in LINE
 
GOCON Autumn (Story of our own Monitoring Agent in golang)
GOCON Autumn (Story of our own Monitoring Agent in golang)GOCON Autumn (Story of our own Monitoring Agent in golang)
GOCON Autumn (Story of our own Monitoring Agent in golang)
 
Story Writing Byte Serializer in Golang
Story Writing Byte Serializer in GolangStory Writing Byte Serializer in Golang
Story Writing Byte Serializer in Golang
 
Akka と Typeの話
Akka と Typeの話Akka と Typeの話
Akka と Typeの話
 
[Scalameetup]spark shuffle
[Scalameetup]spark shuffle[Scalameetup]spark shuffle
[Scalameetup]spark shuffle
 
DI in ruby
DI in rubyDI in ruby
DI in ruby
 
Itlc2015
Itlc2015Itlc2015
Itlc2015
 
Consistent Hashingの小ネタ
Consistent Hashingの小ネタConsistent Hashingの小ネタ
Consistent Hashingの小ネタ
 
Thriftを用いた分散型のNyancatを作ってきた
Thriftを用いた分散型のNyancatを作ってきたThriftを用いた分散型のNyancatを作ってきた
Thriftを用いた分散型のNyancatを作ってきた
 
NoSQL for great good [hanoi.rb talk]
NoSQL for great good [hanoi.rb talk]NoSQL for great good [hanoi.rb talk]
NoSQL for great good [hanoi.rb talk]
 
実践Akka
実践Akka実践Akka
実践Akka
 
CA15卒勉強会 メタプログラミングについて
CA15卒勉強会 メタプログラミングについてCA15卒勉強会 メタプログラミングについて
CA15卒勉強会 メタプログラミングについて
 
Making CLI app in ruby
Making CLI app in rubyMaking CLI app in ruby
Making CLI app in ruby
 
CacheとRailsの簡単まとめ
CacheとRailsの簡単まとめCacheとRailsの簡単まとめ
CacheとRailsの簡単まとめ
 
[Htmlday]present
[Htmlday]present[Htmlday]present
[Htmlday]present
 

Recently uploaded

The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...kalichargn70th171
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfkalichargn70th171
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 
full course of software engineering mid term.pdf
full course of software engineering mid term.pdffull course of software engineering mid term.pdf
full course of software engineering mid term.pdfAbdul salam
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
Mastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxMastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxAS Design & AST.
 
logical backup of Oracle Datapump-detailed.pptx
logical backup of Oracle Datapump-detailed.pptxlogical backup of Oracle Datapump-detailed.pptx
logical backup of Oracle Datapump-detailed.pptxRemote DBA Services
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jNeo4j
 
Business Analyzopedia - Your Pocket Gita for Business Analysis
Business Analyzopedia - Your Pocket Gita for Business AnalysisBusiness Analyzopedia - Your Pocket Gita for Business Analysis
Business Analyzopedia - Your Pocket Gita for Business AnalysisDEEPRAJ PATHAK
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
Transform your Corporate Strategy Office - Harness OnePlan’s Strategic Portfo...
Transform your Corporate Strategy Office - Harness OnePlan’s Strategic Portfo...Transform your Corporate Strategy Office - Harness OnePlan’s Strategic Portfo...
Transform your Corporate Strategy Office - Harness OnePlan’s Strategic Portfo...OnePlan Solutions
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flinkconfluent
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
Effort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software ProjectsEffort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software ProjectsDEEPRAJ PATHAK
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxRTS corp
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 

Recently uploaded (20)

The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
 
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdfPros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
Pros and Cons of Selenium In Automation Testing_ A Comprehensive Assessment.pdf
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 
full course of software engineering mid term.pdf
full course of software engineering mid term.pdffull course of software engineering mid term.pdf
full course of software engineering mid term.pdf
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Mastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptxMastering Project Planning with Microsoft Project 2016.pptx
Mastering Project Planning with Microsoft Project 2016.pptx
 
logical backup of Oracle Datapump-detailed.pptx
logical backup of Oracle Datapump-detailed.pptxlogical backup of Oracle Datapump-detailed.pptx
logical backup of Oracle Datapump-detailed.pptx
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 
Business Analyzopedia - Your Pocket Gita for Business Analysis
Business Analyzopedia - Your Pocket Gita for Business AnalysisBusiness Analyzopedia - Your Pocket Gita for Business Analysis
Business Analyzopedia - Your Pocket Gita for Business Analysis
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Transform your Corporate Strategy Office - Harness OnePlan’s Strategic Portfo...
Transform your Corporate Strategy Office - Harness OnePlan’s Strategic Portfo...Transform your Corporate Strategy Office - Harness OnePlan’s Strategic Portfo...
Transform your Corporate Strategy Office - Harness OnePlan’s Strategic Portfo...
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flink
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
Effort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software ProjectsEffort Estimation Techniques used in Software Projects
Effort Estimation Techniques used in Software Projects
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptx
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 

Distributed Tracing, from internal SAAS insights

  • 2. Calendar > Background > Our internal DistTrace problems > Searching for better solution > Conclusion
  • 3. > @dxhuy (フィ) > LINE’s Observability Team EM Introduction
  • 4. > Large scale metrics platform (400M datapoints per min) > ~~ Logs platform (1M log per min) > ~~ Dist Tracing platform (2M spans per min) Our team https://speakerdeck.com/line_devday2019/deep-dive-into-lines-time-series-database-flash
  • 5. > Large scale metrics platform (400M datapoints per min) > ~~ Logs platform (1M log per min) > ~~ Dist Tracing platform (2M spans per min) Our team
  • 6. > What is Distributed Tracing > Request scope DistTrace > Concepts Prerequisite know-how
  • 7. Elasticsearch Our setup Brave
 base client Zipkin base
 internal server Elasticsearch Elasticsearch Thrift Custom Armeria
 based collector https://github.com/line/armeria NvME based
 High spec machines
  • 8. Our setup > Inhouse customized Zipkin UI - Now became upstream default (zipkin-lens)
  • 9. Problems with OSS multi-tenant DistTracing > Storage cost + scalability > Standard war (instrument lib) > UI / UX for multi-tenant > User voice: high implement cost, useful spans be sampled out..
  • 10. Storage problem Infra cost Useful data If we sampling 
 100% data Infra cost for Trace >> Infra cost for app
  • 11. Sampling problem > Sampling to reduce storage cost > All OSS employ “head-based” sampling (unbiased sampling) - Many useful data will be sampling out - What is useful data btw?
  • 12. OSS UI / UX problem > Search by “serviceName” is hard when you have 100 teams and each team has 50 services > Time range query is mostly useless when you have 1000 rps
  • 13. Standard war Zipkin B3 header OpenTracing OpenCencus OpenTelemetry Language base (go, jvm) Middleware base (htrace..)
  • 14. How could we make it better?
  • 15. Trace without trace > The Mystery Machine: End-to-end performance analysis of large-scale Internet services - First trace solution at facebook (before Canopi) - Use “LOG” to calculate causal analysis http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.478.3942&rep=rep1&type=pdf
  • 16. Trace without trace (cont) > Canopy: An End-to-End Performance Tracing And Analysis System (Facebook) > From scratch solution (even trace API, trace standard) > Head based sampling (token bucket) https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/
  • 17. Trace without Trace (cont) > Transparent tracing of microservice-based applications - Utilize Proxy for trace interception - Utilize Linux Syscall for trace instrumentation https://dl.acm.org/doi/abs/10.1145/3297280.3297403
  • 18. Unify standard > Universal context propagation for distributed system instrumentation - Propose universal “standard” for context data to overcome standard war https://dl.acm.org/doi/abs/10.1145/3190508.3190526
  • 19. Better sampling method > Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering - Tailed-based sampling method - Utilize Machine-Learning to overcome feature selection problem https://dl.acm.org/doi/abs/10.1145/3357223.3362736
  • 20. Better sampling method > Honeycomb refinery (DistTrace vendor) - Feature based, tail-based sampling - https://docs.honeycomb.io/working-with- your-data/tracing/refinery/ https://dl.acm.org/doi/abs/10.1145/3357223.3362736
  • 21. Better sampling method (cont) > Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering - Tailed-based sampling method - Utilize Machine-Learning to overcome feature selection problem https://dl.acm.org/doi/abs/10.1145/3357223.3362736
  • 22. What we could do Some insights from our company > Log is the most informative one of 3 pillars > Tail-based sampling seems the cure for “usefulness” of DistTrace service
  • 23. OSS move Firehorse mode of Zipkin and Jeager (100% sampling, skip indexing) > https://cwiki.apache.org/ confluence/display/ZIPKIN/ Firehose+mode > https://github.com/jaegertracing/ jaeger/issues/1731
  • 24. Propose architecture Firehose mode Trace Client Feature base
 Sampler Storage 
 Log Client Inject trace ID Sample-in -Error trace -High Latency trace Construct Trace by Buffer span in memory
  • 25. No more search UI for trace > User traverse to trace by trace ID only (no more time range base search) > User get trace ID from LOG search UI (you need search-able LOG, example: Kibana)