SlideShare a Scribd company logo
1 of 44
Download to read offline
bertjan@openvalue.eu
Debugging distributed systems
Bert Jan Schrijver
@bjschrijver
Debugging distributed systems: the good parts
bertjan@openvalue.eu
Bert Jan Schrijver
@bjschrijver
Networking 101
How the internet works
Why?
Bert Jan Schrijver
L e t ’ s m e e t
@bjschrijver
Why are distributed
systems difficult?
Networking 101
What?
Why?
✅
Demo
War stories
Conclusion
W h a t ‘ s n e x t ?
Outline
A structured approach
@bjschrijver
• Level 1: Non-concurrency


• Level 2: Concurrency


• Level 3: Distribution
The hierarchy of complexity for debugging
Source:	https://maximilianmichels.com/2020/debugging-distributed-systems
What is a distributed system?
A distributed system is a system whose
components are located on different
networked computers, which
communicate and coordinate their actions
by passing messages to one another.
• Concurrency of components


• Lack of a global clock


• Independent failure of components


• Distributed systems are harder to reason
about
Characteristics of distributed systems
Source:	http://www.nasa.gov/images/content/218652main_STOCC_FS_img_lg.jpg
Working with distributed systems is
fundamentally different from writing
software on a single computer - and the
main difference is that there are lots of new
and exciting ways for things to go wrong.


- Martin Kleppmann
“
”
Photo:	Dave	Lehl
Why do things go wrong?
“ ”
Photo:	Dave	Lehl
The fallacies of distributed computing
are a set of assertions made by L Peter
Deutsch and others at Sun Microsystems
describing false assumptions that
programmers new to distributed
applications invariably make.
1. The network is reliable;


2. Latency is zero;


3. Bandwidth is infinite;


4. The network is secure;


5. Topology doesn't change;


6. There is one administrator;


7. Transport cost is zero;


8. The network is homogeneous.
Fallacies of distributed computing
What could possibly go wrong?
“ ”
Photo:	Dave	Lehl
OSI & TCP/IP
Source:	https://www.guru99.com/difference-tcp-ip-vs-osi-model.html
.. in your browser’s address bar and press Enter
What happens when you type google.com…
Source:	https://github.com/alex/what-happens-when
17
In a distributed system, there may well be
some parts of the system that are broken in
some unpredictably way, even though other
parts of the system are working fine…
“
”
Source:	Martin	Kleppmann
“
”
Source:	Martin	Kleppmann
… but in a system with thousands of
nodes, it is reasonable to assume that
something is always broken.


- Martin Kleppmann
Are we done yet?
Source:	https://7216-presscdn-0-76-pagely.netdna-ssl.com/wp-content/uploads/2011/12/confused-man-single-good-men.jpg
Where do I start?
A structured approach
to debugging distributed systems
@bjschrijver
Check DNS & routing
Check connection
Debug client side
Create minimal reproducer
Debug server side
Observe & document
Wrap up & post mortem
Inspect traffic / messages
Step 1: Observe & document
• What do you know about the problem?


• Inspect logging, errors, metrics, tracing


• Draw the path from source to target - what’s
in between? Focus on details!


• Document what you know


• Can we reproduce in a test?


• By injecting errors, for example
Tools


Whiteboard,
documentation, logging,
metrics, tracing
(opentracing.io), tests,
Jepsen,
Step 1: Observe & document
Step 2: Create minimal reproducer
• Goal: maximise the amount of debugging
cycles


• Focus on short development iterations /
feedback loops


• Get close to the action!
Tools


IDE, Shell scripts,


SSH tunnels, Curl
Step 3: Debug client side
• Focus on eliminating anything that could be
wrong on the client side


• Are we connecting to the right host?


• Do we send the right message?


• Do we receive a response?


• Not much different from local


debugging
Tools


IDE, debugger,
logging
Step 4: Check DNS & routing
• DNS:


• Make sure you know what IP address the
hostname should resolve to


• Verify that this actually happens


at the client


• Routing:


• Verify you can reach the


target machine
Tools


host, nslookup,
dig, whois, ping,
traceroute,
nslookup.io,
dnschecker.org
Step 5: Check connection
• Can we connect to the port?


• If not, do we get a REJECT or a DROP?


• Does the connection open and stay open?


• Are we talking TLS?


• What is the connection speed


between us?
Tools


telnet, nc, curl,
iperf
Step 6: Inspect traffic / messages
• Do we send the right request?


• Do we receive the right response?


• How do we know?


• How do we handle TLS?


• Are there any load balancers


or proxies in between?
Tools


curl, wireshark,
tcpdump, network
tab in browser,
mitm/tls proxy
Step 7: Debug server side
• Inspect the remote host


• Can we attach a remote debugger?


• See https://youtube.com/OpenValue


• Profiling


• Strace Tools


SSH tunnels,
remote debugger,
profiler, strace
Step 8: Wrap up & post mortem
• Document the issue:


• Timeline


• What did we see?


• Why did it happen?


• What was the impact?


• How did we find out?


• What did we do to mitigate and fix?


• What should we do to prevent


repetition?
Tools


Whiteboard,
documentation
If you really want a reliable system, you
have to understand what its failure modes
are. You have to actually have witnessed
it misbehaving.


- Jason Cahoon
“
”
Distributed systems war stories
The time where it worked half of the time…
The one where two services didn’t speak
the same language…
The expensive one…
The one at a school…
The one where only one country was
affected…
The one with breaking news…
Summary: a structured approach
to debugging distributed systems
@bjschrijver
Check DNS & routing
Check connection
Debug client side
Create minimal reproducer
Debug server side
Observe & document
Wrap up & post mortem
Inspect traffic / messages
Source:	https://cdn2.vox-cdn.com/thumbor/J9OqPYS7FgI9fjGhnF7AFh8foVY=/148x0:1768x1080/1280x854/cdn0.vox-cdn.com/uploads/chorus_image/image/46147742/cute-success-kid-1920x1080.0.0.jpg
THAT’S IT.


NOW GO KICK SOME ASS!
Questions?


@bjschrijver
Thanks for your time.
Got feedback? Tweet it!
All pictures belong


to their respective


authors
@bjschrijver

More Related Content

What's hot

Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and visionStephan Ewen
 
Keynote: Apache HBase at Yahoo! Scale
Keynote: Apache HBase at Yahoo! ScaleKeynote: Apache HBase at Yahoo! Scale
Keynote: Apache HBase at Yahoo! ScaleHBaseCon
 
Logical Clocks (Distributed computing)
Logical Clocks (Distributed computing)Logical Clocks (Distributed computing)
Logical Clocks (Distributed computing)Sri Prasanna
 
Prometheus: From technical metrics to business observability
Prometheus: From technical metrics to business observabilityPrometheus: From technical metrics to business observability
Prometheus: From technical metrics to business observabilityJulien Pivotto
 
8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating Systems8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating SystemsDr Sandeep Kumar Poonia
 
Message passing ( in computer science)
Message   passing  ( in   computer  science)Message   passing  ( in   computer  science)
Message passing ( in computer science)Computer_ at_home
 
Google File System
Google File SystemGoogle File System
Google File Systemguest2cb4689
 
Group Communication (Distributed computing)
Group Communication (Distributed computing)Group Communication (Distributed computing)
Group Communication (Distributed computing)Sri Prasanna
 
Aggrement protocols
Aggrement protocolsAggrement protocols
Aggrement protocolsMayank Jain
 
Transport Layer Security
Transport Layer SecurityTransport Layer Security
Transport Layer SecurityHuda Seyam
 
Distributed Computing system
Distributed Computing system Distributed Computing system
Distributed Computing system Sarvesh Meena
 
Synchronization Pradeep K Sinha
Synchronization Pradeep K SinhaSynchronization Pradeep K Sinha
Synchronization Pradeep K SinhaJawwad Rafiq
 
Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...
Thomas Lamirault_Mohamed Amine Abdessemed  -A brief history of time with Apac...Thomas Lamirault_Mohamed Amine Abdessemed  -A brief history of time with Apac...
Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...Flink Forward
 
Operating system concepts ninth edition (2012), chapter 2 solution e1
Operating system concepts ninth edition (2012), chapter 2 solution e1Operating system concepts ninth edition (2012), chapter 2 solution e1
Operating system concepts ninth edition (2012), chapter 2 solution e1Navid Daneshvaran
 
Distributed operating system
Distributed operating systemDistributed operating system
Distributed operating systemudaya khanal
 
File Transfer Protocol (FTP)
File Transfer Protocol (FTP)File Transfer Protocol (FTP)
File Transfer Protocol (FTP)AxelXrest
 
Snowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD PipelinesSnowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD PipelinesDrew Hansen
 

What's hot (20)

Distributed systems scheduling
Distributed systems schedulingDistributed systems scheduling
Distributed systems scheduling
 
12. dfs
12. dfs12. dfs
12. dfs
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and vision
 
Keynote: Apache HBase at Yahoo! Scale
Keynote: Apache HBase at Yahoo! ScaleKeynote: Apache HBase at Yahoo! Scale
Keynote: Apache HBase at Yahoo! Scale
 
Logical Clocks (Distributed computing)
Logical Clocks (Distributed computing)Logical Clocks (Distributed computing)
Logical Clocks (Distributed computing)
 
Prometheus: From technical metrics to business observability
Prometheus: From technical metrics to business observabilityPrometheus: From technical metrics to business observability
Prometheus: From technical metrics to business observability
 
8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating Systems8. mutual exclusion in Distributed Operating Systems
8. mutual exclusion in Distributed Operating Systems
 
Message passing ( in computer science)
Message   passing  ( in   computer  science)Message   passing  ( in   computer  science)
Message passing ( in computer science)
 
Google File System
Google File SystemGoogle File System
Google File System
 
Group Communication (Distributed computing)
Group Communication (Distributed computing)Group Communication (Distributed computing)
Group Communication (Distributed computing)
 
Aggrement protocols
Aggrement protocolsAggrement protocols
Aggrement protocols
 
Transport Layer Security
Transport Layer SecurityTransport Layer Security
Transport Layer Security
 
Distributed Computing system
Distributed Computing system Distributed Computing system
Distributed Computing system
 
Synchronization Pradeep K Sinha
Synchronization Pradeep K SinhaSynchronization Pradeep K Sinha
Synchronization Pradeep K Sinha
 
Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...
Thomas Lamirault_Mohamed Amine Abdessemed  -A brief history of time with Apac...Thomas Lamirault_Mohamed Amine Abdessemed  -A brief history of time with Apac...
Thomas Lamirault_Mohamed Amine Abdessemed -A brief history of time with Apac...
 
Chapter 6 os
Chapter 6 osChapter 6 os
Chapter 6 os
 
Operating system concepts ninth edition (2012), chapter 2 solution e1
Operating system concepts ninth edition (2012), chapter 2 solution e1Operating system concepts ninth edition (2012), chapter 2 solution e1
Operating system concepts ninth edition (2012), chapter 2 solution e1
 
Distributed operating system
Distributed operating systemDistributed operating system
Distributed operating system
 
File Transfer Protocol (FTP)
File Transfer Protocol (FTP)File Transfer Protocol (FTP)
File Transfer Protocol (FTP)
 
Snowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD PipelinesSnowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD Pipelines
 

Similar to Debugging distributed systems

JUG CH September 2021 - Debugging distributed systems
JUG CH September 2021 - Debugging distributed systemsJUG CH September 2021 - Debugging distributed systems
JUG CH September 2021 - Debugging distributed systemsBert Jan Schrijver
 
Devoxx Belgium 2022 - Debugging distributed systems
Devoxx Belgium 2022 - Debugging distributed systemsDevoxx Belgium 2022 - Debugging distributed systems
Devoxx Belgium 2022 - Debugging distributed systemsBert Jan Schrijver
 
Arnhem JUG March 2023 - Debugging distributed systems
Arnhem JUG March 2023 - Debugging distributed systemsArnhem JUG March 2023 - Debugging distributed systems
Arnhem JUG March 2023 - Debugging distributed systemsBert Jan Schrijver
 
GOTO night April 2022 - Debugging distributed systems
GOTO night April 2022 - Debugging distributed systemsGOTO night April 2022 - Debugging distributed systems
GOTO night April 2022 - Debugging distributed systemsBert Jan Schrijver
 
JavaLand 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systemsJavaLand 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systemsBert Jan Schrijver
 
Mastering Microservices 2022 - Debugging distributed systems
Mastering Microservices 2022 - Debugging distributed systemsMastering Microservices 2022 - Debugging distributed systems
Mastering Microservices 2022 - Debugging distributed systemsBert Jan Schrijver
 
Incident Response Fails
Incident Response FailsIncident Response Fails
Incident Response FailsMichael Gough
 
Monitoring microservices
Monitoring microservicesMonitoring microservices
Monitoring microservicesWilliam Brander
 
WTF is Penetration Testing v.2
WTF is Penetration Testing v.2WTF is Penetration Testing v.2
WTF is Penetration Testing v.2Scott Sutherland
 
Troubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider DisciplineTroubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider DisciplineSagi Brody
 
When Security Tools Fail You
When Security Tools Fail YouWhen Security Tools Fail You
When Security Tools Fail YouMichael Gough
 
Hyperledger Blockchain
Hyperledger BlockchainHyperledger Blockchain
Hyperledger BlockchainAfraz Khan
 
6 Scope & 7 Live Data Collection
6 Scope & 7 Live Data Collection6 Scope & 7 Live Data Collection
6 Scope & 7 Live Data CollectionSam Bowne
 
1 (20 files merged).ppt
1 (20 files merged).ppt1 (20 files merged).ppt
1 (20 files merged).pptseshas1
 
Chaos Engineering Without Observability ... Is Just Chaos
Chaos Engineering Without Observability ... Is Just ChaosChaos Engineering Without Observability ... Is Just Chaos
Chaos Engineering Without Observability ... Is Just ChaosCharity Majors
 
CNIT 152: 6. Scope & 7. Live Data Collection
CNIT 152: 6. Scope & 7. Live Data CollectionCNIT 152: 6. Scope & 7. Live Data Collection
CNIT 152: 6. Scope & 7. Live Data CollectionSam Bowne
 

Similar to Debugging distributed systems (20)

JUG CH September 2021 - Debugging distributed systems
JUG CH September 2021 - Debugging distributed systemsJUG CH September 2021 - Debugging distributed systems
JUG CH September 2021 - Debugging distributed systems
 
Devoxx Belgium 2022 - Debugging distributed systems
Devoxx Belgium 2022 - Debugging distributed systemsDevoxx Belgium 2022 - Debugging distributed systems
Devoxx Belgium 2022 - Debugging distributed systems
 
Arnhem JUG March 2023 - Debugging distributed systems
Arnhem JUG March 2023 - Debugging distributed systemsArnhem JUG March 2023 - Debugging distributed systems
Arnhem JUG March 2023 - Debugging distributed systems
 
GOTO night April 2022 - Debugging distributed systems
GOTO night April 2022 - Debugging distributed systemsGOTO night April 2022 - Debugging distributed systems
GOTO night April 2022 - Debugging distributed systems
 
JavaLand 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systemsJavaLand 2022 - Debugging distributed systems
JavaLand 2022 - Debugging distributed systems
 
Debugging distributed systems
Debugging distributed systemsDebugging distributed systems
Debugging distributed systems
 
Mastering Microservices 2022 - Debugging distributed systems
Mastering Microservices 2022 - Debugging distributed systemsMastering Microservices 2022 - Debugging distributed systems
Mastering Microservices 2022 - Debugging distributed systems
 
Heartbleed
HeartbleedHeartbleed
Heartbleed
 
Incident Response Fails
Incident Response FailsIncident Response Fails
Incident Response Fails
 
Monitoring microservices
Monitoring microservicesMonitoring microservices
Monitoring microservices
 
WTF is Penetration Testing v.2
WTF is Penetration Testing v.2WTF is Penetration Testing v.2
WTF is Penetration Testing v.2
 
Troubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider DisciplineTroubleshooting: A High-Value Asset For The Service-Provider Discipline
Troubleshooting: A High-Value Asset For The Service-Provider Discipline
 
When Security Tools Fail You
When Security Tools Fail YouWhen Security Tools Fail You
When Security Tools Fail You
 
Drop, Stop & Roll
Drop, Stop & RollDrop, Stop & Roll
Drop, Stop & Roll
 
Hyperledger Blockchain
Hyperledger BlockchainHyperledger Blockchain
Hyperledger Blockchain
 
6 Scope & 7 Live Data Collection
6 Scope & 7 Live Data Collection6 Scope & 7 Live Data Collection
6 Scope & 7 Live Data Collection
 
1 (20 files merged).ppt
1 (20 files merged).ppt1 (20 files merged).ppt
1 (20 files merged).ppt
 
Chaos Engineering Without Observability ... Is Just Chaos
Chaos Engineering Without Observability ... Is Just ChaosChaos Engineering Without Observability ... Is Just Chaos
Chaos Engineering Without Observability ... Is Just Chaos
 
CNIT 152: 6. Scope & 7. Live Data Collection
CNIT 152: 6. Scope & 7. Live Data CollectionCNIT 152: 6. Scope & 7. Live Data Collection
CNIT 152: 6. Scope & 7. Live Data Collection
 
Dmk bo2 k8_ccc
Dmk bo2 k8_cccDmk bo2 k8_ccc
Dmk bo2 k8_ccc
 

Recently uploaded

办理澳洲USYD文凭证书学历认证【Q微/1954292140】办理悉尼大学毕业证书真实成绩单GPA修改/办理澳洲大学文凭证书Offer录取通知书/在读证明...
办理澳洲USYD文凭证书学历认证【Q微/1954292140】办理悉尼大学毕业证书真实成绩单GPA修改/办理澳洲大学文凭证书Offer录取通知书/在读证明...办理澳洲USYD文凭证书学历认证【Q微/1954292140】办理悉尼大学毕业证书真实成绩单GPA修改/办理澳洲大学文凭证书Offer录取通知书/在读证明...
办理澳洲USYD文凭证书学历认证【Q微/1954292140】办理悉尼大学毕业证书真实成绩单GPA修改/办理澳洲大学文凭证书Offer录取通知书/在读证明...vmzoxnx5
 
Basic Security.pptx is a awsome PPT on your mobiel
Basic Security.pptx is a awsome PPT on your mobielBasic Security.pptx is a awsome PPT on your mobiel
Basic Security.pptx is a awsome PPT on your mobielpratamakiki860
 
Is DNS ready for IPv6, presented by Geoff Huston at IETF 119
Is DNS ready for IPv6, presented by Geoff Huston at IETF 119Is DNS ready for IPv6, presented by Geoff Huston at IETF 119
Is DNS ready for IPv6, presented by Geoff Huston at IETF 119APNIC
 
Summary IGF 2013 Bali - English (tata kelola internet / internet governance)
Summary  IGF 2013 Bali - English (tata kelola internet / internet governance)Summary  IGF 2013 Bali - English (tata kelola internet / internet governance)
Summary IGF 2013 Bali - English (tata kelola internet / internet governance)ICT Watch - Indonesia
 
Power of Social Media for E-commerce.pdf
Power of Social Media for E-commerce.pdfPower of Social Media for E-commerce.pdf
Power of Social Media for E-commerce.pdfrajats19920
 
draft-harrison-sidrops-manifest-number-01, presented at IETF 119
draft-harrison-sidrops-manifest-number-01, presented at IETF 119draft-harrison-sidrops-manifest-number-01, presented at IETF 119
draft-harrison-sidrops-manifest-number-01, presented at IETF 119APNIC
 
IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119APNIC
 
Summary ID-IGF 2016 National Dialogue - English (tata kelola internet / int...
Summary  ID-IGF 2016 National Dialogue  - English (tata kelola internet / int...Summary  ID-IGF 2016 National Dialogue  - English (tata kelola internet / int...
Summary ID-IGF 2016 National Dialogue - English (tata kelola internet / int...ICT Watch - Indonesia
 
Making an RFC in Today's IETF, presented by Geoff Huston at IETF 119
Making an RFC in Today's IETF, presented by Geoff Huston at IETF 119Making an RFC in Today's IETF, presented by Geoff Huston at IETF 119
Making an RFC in Today's IETF, presented by Geoff Huston at IETF 119APNIC
 
IPv6 Operational Issues (with DNS), presented by Geoff Huston at IETF 119
IPv6 Operational Issues (with DNS), presented by Geoff Huston at IETF 119IPv6 Operational Issues (with DNS), presented by Geoff Huston at IETF 119
IPv6 Operational Issues (with DNS), presented by Geoff Huston at IETF 119APNIC
 
Tari Eason Warriors Come Out To Play T Shirts
Tari Eason Warriors Come Out To Play T ShirtsTari Eason Warriors Come Out To Play T Shirts
Tari Eason Warriors Come Out To Play T Shirtsrahman018755
 

Recently uploaded (11)

办理澳洲USYD文凭证书学历认证【Q微/1954292140】办理悉尼大学毕业证书真实成绩单GPA修改/办理澳洲大学文凭证书Offer录取通知书/在读证明...
办理澳洲USYD文凭证书学历认证【Q微/1954292140】办理悉尼大学毕业证书真实成绩单GPA修改/办理澳洲大学文凭证书Offer录取通知书/在读证明...办理澳洲USYD文凭证书学历认证【Q微/1954292140】办理悉尼大学毕业证书真实成绩单GPA修改/办理澳洲大学文凭证书Offer录取通知书/在读证明...
办理澳洲USYD文凭证书学历认证【Q微/1954292140】办理悉尼大学毕业证书真实成绩单GPA修改/办理澳洲大学文凭证书Offer录取通知书/在读证明...
 
Basic Security.pptx is a awsome PPT on your mobiel
Basic Security.pptx is a awsome PPT on your mobielBasic Security.pptx is a awsome PPT on your mobiel
Basic Security.pptx is a awsome PPT on your mobiel
 
Is DNS ready for IPv6, presented by Geoff Huston at IETF 119
Is DNS ready for IPv6, presented by Geoff Huston at IETF 119Is DNS ready for IPv6, presented by Geoff Huston at IETF 119
Is DNS ready for IPv6, presented by Geoff Huston at IETF 119
 
Summary IGF 2013 Bali - English (tata kelola internet / internet governance)
Summary  IGF 2013 Bali - English (tata kelola internet / internet governance)Summary  IGF 2013 Bali - English (tata kelola internet / internet governance)
Summary IGF 2013 Bali - English (tata kelola internet / internet governance)
 
Power of Social Media for E-commerce.pdf
Power of Social Media for E-commerce.pdfPower of Social Media for E-commerce.pdf
Power of Social Media for E-commerce.pdf
 
draft-harrison-sidrops-manifest-number-01, presented at IETF 119
draft-harrison-sidrops-manifest-number-01, presented at IETF 119draft-harrison-sidrops-manifest-number-01, presented at IETF 119
draft-harrison-sidrops-manifest-number-01, presented at IETF 119
 
IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119
 
Summary ID-IGF 2016 National Dialogue - English (tata kelola internet / int...
Summary  ID-IGF 2016 National Dialogue  - English (tata kelola internet / int...Summary  ID-IGF 2016 National Dialogue  - English (tata kelola internet / int...
Summary ID-IGF 2016 National Dialogue - English (tata kelola internet / int...
 
Making an RFC in Today's IETF, presented by Geoff Huston at IETF 119
Making an RFC in Today's IETF, presented by Geoff Huston at IETF 119Making an RFC in Today's IETF, presented by Geoff Huston at IETF 119
Making an RFC in Today's IETF, presented by Geoff Huston at IETF 119
 
IPv6 Operational Issues (with DNS), presented by Geoff Huston at IETF 119
IPv6 Operational Issues (with DNS), presented by Geoff Huston at IETF 119IPv6 Operational Issues (with DNS), presented by Geoff Huston at IETF 119
IPv6 Operational Issues (with DNS), presented by Geoff Huston at IETF 119
 
Tari Eason Warriors Come Out To Play T Shirts
Tari Eason Warriors Come Out To Play T ShirtsTari Eason Warriors Come Out To Play T Shirts
Tari Eason Warriors Come Out To Play T Shirts
 

Debugging distributed systems

  • 2. Debugging distributed systems: the good parts bertjan@openvalue.eu Bert Jan Schrijver @bjschrijver Networking 101 How the internet works
  • 4. Bert Jan Schrijver L e t ’ s m e e t @bjschrijver
  • 5. Why are distributed systems difficult? Networking 101 What? Why? ✅ Demo War stories Conclusion W h a t ‘ s n e x t ? Outline A structured approach @bjschrijver
  • 6. • Level 1: Non-concurrency • Level 2: Concurrency • Level 3: Distribution The hierarchy of complexity for debugging Source: https://maximilianmichels.com/2020/debugging-distributed-systems
  • 7. What is a distributed system?
  • 8. A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another.
  • 9. • Concurrency of components • Lack of a global clock • Independent failure of components 
 • Distributed systems are harder to reason about Characteristics of distributed systems Source: http://www.nasa.gov/images/content/218652main_STOCC_FS_img_lg.jpg
  • 10. Working with distributed systems is fundamentally different from writing software on a single computer - and the main difference is that there are lots of new and exciting ways for things to go wrong. - Martin Kleppmann “ ” Photo: Dave Lehl
  • 11. Why do things go wrong? “ ” Photo: Dave Lehl
  • 12. The fallacies of distributed computing are a set of assertions made by L Peter Deutsch and others at Sun Microsystems describing false assumptions that programmers new to distributed applications invariably make.
  • 13. 1. The network is reliable; 2. Latency is zero; 3. Bandwidth is infinite; 4. The network is secure; 5. Topology doesn't change; 6. There is one administrator; 7. Transport cost is zero; 8. The network is homogeneous. Fallacies of distributed computing
  • 14. What could possibly go wrong? “ ” Photo: Dave Lehl
  • 16. .. in your browser’s address bar and press Enter What happens when you type google.com… Source: https://github.com/alex/what-happens-when
  • 17. 17
  • 18. In a distributed system, there may well be some parts of the system that are broken in some unpredictably way, even though other parts of the system are working fine… “ ” Source: Martin Kleppmann
  • 19. “ ” Source: Martin Kleppmann … but in a system with thousands of nodes, it is reasonable to assume that something is always broken. - Martin Kleppmann
  • 20. Are we done yet?
  • 22. A structured approach to debugging distributed systems @bjschrijver Check DNS & routing Check connection Debug client side Create minimal reproducer Debug server side Observe & document Wrap up & post mortem Inspect traffic / messages
  • 23. Step 1: Observe & document • What do you know about the problem? • Inspect logging, errors, metrics, tracing • Draw the path from source to target - what’s in between? Focus on details! • Document what you know • Can we reproduce in a test? • By injecting errors, for example Tools Whiteboard, documentation, logging, metrics, tracing (opentracing.io), tests, Jepsen,
  • 24. Step 1: Observe & document
  • 25. Step 2: Create minimal reproducer • Goal: maximise the amount of debugging cycles • Focus on short development iterations / feedback loops • Get close to the action! Tools IDE, Shell scripts, SSH tunnels, Curl
  • 26. Step 3: Debug client side • Focus on eliminating anything that could be wrong on the client side • Are we connecting to the right host? • Do we send the right message? • Do we receive a response? • Not much different from local 
 debugging Tools IDE, debugger, logging
  • 27. Step 4: Check DNS & routing • DNS: • Make sure you know what IP address the hostname should resolve to • Verify that this actually happens 
 at the client • Routing: • Verify you can reach the 
 target machine Tools host, nslookup, dig, whois, ping, traceroute, nslookup.io, dnschecker.org
  • 28. Step 5: Check connection • Can we connect to the port? • If not, do we get a REJECT or a DROP? • Does the connection open and stay open? • Are we talking TLS? • What is the connection speed 
 between us? Tools telnet, nc, curl, iperf
  • 29. Step 6: Inspect traffic / messages • Do we send the right request? • Do we receive the right response? • How do we know? • How do we handle TLS? • Are there any load balancers 
 or proxies in between? Tools curl, wireshark, tcpdump, network tab in browser, mitm/tls proxy
  • 30. Step 7: Debug server side • Inspect the remote host • Can we attach a remote debugger? • See https://youtube.com/OpenValue • Profiling • Strace Tools SSH tunnels, remote debugger, profiler, strace
  • 31. Step 8: Wrap up & post mortem • Document the issue: • Timeline • What did we see? • Why did it happen? • What was the impact? • How did we find out? • What did we do to mitigate and fix? • What should we do to prevent 
 repetition? Tools Whiteboard, documentation
  • 32. If you really want a reliable system, you have to understand what its failure modes are. You have to actually have witnessed it misbehaving. - Jason Cahoon “ ”
  • 34. The time where it worked half of the time…
  • 35. The one where two services didn’t speak the same language…
  • 37. The one at a school…
  • 38.
  • 39. The one where only one country was affected…
  • 40. The one with breaking news…
  • 41. Summary: a structured approach to debugging distributed systems @bjschrijver Check DNS & routing Check connection Debug client side Create minimal reproducer Debug server side Observe & document Wrap up & post mortem Inspect traffic / messages
  • 44. Thanks for your time. Got feedback? Tweet it! All pictures belong to their respective authors @bjschrijver