Disaster porn and the value of a generalist

Published in: Technology, Business

  1. Disaster Porn... and the importance of being a generalist
  2. About Me
     Scott Sanders
     Senior Systems Administrator, RideCharge, Inc.
     @scott_sanders
     ssanders@taximagic.com
  3. Surge Conference 2011
     ● Ben Fried's (Google CIO) keynote speech talks about the importance of being a generalist
     ● I think specializing is fine (and normal as your career advances), but it's VITAL to keep a generalist perspective
     ● Disaster porn!
     ● I have no affiliation with OmniTI or Surge, but I highly recommend you attend the conference in Baltimore on Sept. 27th - 28th
  4. Background
     ● Taxi Magic
        ○ Mobile applications to book/track/pay for taxis
        ○ Web booking integration for taxi fleets
        ○ In-car payment hardware (PIM)
     ● What's a PIM?
        ○ Passenger Information Monitor
        ○ 7" HD touchscreen
        ○ Credit card swipe
        ○ Wired into cab hardware and dispatch system
        ○ Uses cellular communication to talk to TM
        ○ Regular GPS events over UDP
        ○ Payment transactions over HTTPS
  5. The problems begin... (June 5th)
     ● A handful of cab drivers in Los Angeles begin reporting failures when swiping CCs
     ● The embedded hardware team recalls a few cabs and investigates the local log files
     ● They report problems during the SSL handshake to RideCharge servers
     ● The Tech Ops team remaps httpd to the same libcrypto.so and libssl.so versions as the PIM using libmap.conf(5)
     ● Problem vanishes! HOORAY!!! Beer!
  6. Fast forward to June 12th...
     ● SHTF
     ● Widespread reports of failing CC swipes across the entire SoCal region
     ● Hardware team pulls more vehicles and notices the same SSL handshake problem
     ● Tech Ops team is unable to correlate this to a drop in traffic
     ● Furthermore, Tech Ops is still seeing regular GPS updates from ALL active cabs!
  7. WTF?
  8. Diving in...
     ● Our cellular ISP insists they aren't having any problems
     ● (Sound familiar to anyone?)
     ● I start running the standard toolkit looking for patterns
        ○ tcpdump
        ○ traceroute
        ○ nmap
     ● nmap is giving me some inconsistent results
  9. Understanding how TCP/IP works
     ● How do you establish a TCP connection?
        ○ SYN (Hey, you there?)
        ○ SYN/ACK (Yeah, what's up?)
        ○ ACK (Cool, let's talk!)
     ● What happens if you connect to a port that doesn't have a service bound to it?
        ○ SYN (Hey, you there?)
        ○ RST (Leave me alone!)
     ● So why am I only getting a RST every now and then? Why do I see timeouts instead?
     ● This is starting to smell like a routing problem
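The outcomes on this slide map onto the reason strings that nmap prints for a TCP connect scan when run with `--reason`. A minimal sketch of that mapping follows; the helper name `classify_reason` and the wording of each diagnosis are illustrative assumptions, not part of the original talk.

```shell
#!/usr/bin/env bash

# classify_reason is a hypothetical helper: it turns the reason string from
# nmap's --reason column into a plain statement about the network path.
classify_reason () {
  case "$1" in
    syn-ack)      echo "open: handshake completed" ;;
    conn-refused) echo "closed: RST received, path works both ways" ;;
    no-response)  echo "timeout: SYN or its reply was lost or filtered" ;;
    *)            echo "unknown: $1" ;;
  esac
}

classify_reason conn-refused
classify_reason no-response
```

A closed port that answers with a RST is good news here, because it shows packets travel in both directions. A timeout on that same port means something on the path dropped traffic.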
  10. Proving the problem exists
     ● Since I am receiving GPS updates over UDP from all the cabs, I can use them to identify the IP of a cab and its location at a point in time
     ● We know the expected behavior when attempting a connection to a closed port
     ● Let's run some tests and gather some data
  11. comm_test.sh

      #!/usr/bin/env bash

      test_connection () {
        # fork a subshell to handle the tcp connect test
        (
          # the result is either no-response or conn-refused
          result=$(nmap -P0 -T1 -sT -p22 --reason -q "$4" | awk '/^22/{print $4}')
          echo "$1 $2 $3 $4 $result $8 $9" >> results.txt
        ) &
      }

      # connect to the gps receiver host and monitor real-time UDP gps updates
      ssh -t gps001.iad1.prod.rws tail -F gps_updates.csv | while read -r line ; do
        # line format: Jun 16 15:14:45, 184.251.233.91, 0, 20, 2577,
        #              33.9822566666667, -118.4593
        line=$(echo "$line" | tr -d ,)
        # left unquoted on purpose so the line splits into positional fields
        test_connection $line
      done
  12. Results

      % comm_test.sh
      Jun 16 15:28:00 102.122.93.194 conn-refused 33.8221321105957 -116.548851013184
      Jun 16 15:27:57 176.135.73.0 conn-refused 32.8885866666667 -97.0376933333333
      Jun 16 15:27:59 181.251.163.200 conn-refused 33.9004183333333 -118.387591666667
      Jun 16 15:27:53 178.156.201.182 conn-refused 44.9484977722168 -93.2568588256836
      Jun 16 15:27:28 180.229.138.141 no-response 39.766675 -104.940496666667
      Jun 16 15:27:28 187.231.74.250 no-response 33.80945 -118.206921666667
      Jun 16 15:28:00 181.255.84.59 conn-refused 34.0593466666667 -118.24536
      Jun 16 15:27:55 78.6.67.236 conn-refused 34.0581833333333 -118.415878333333
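With rows in this layout, a quick tally of outcomes is a natural first look at the data. The sketch below assumes the seven-column format shown in the results (month, day, time, IP, result, latitude, longitude); the helper name `tally_results` is hypothetical, and the sample rows are copied from the results slide.

```shell
#!/usr/bin/env bash

# tally_results is a hypothetical helper. It reads rows in the results.txt
# layout on stdin and counts how often each outcome (column 5) occurs.
tally_results () {
  awk '{ count[$5]++ }
       END { for (r in count) print r, count[r] }' | sort
}

# Three sample rows taken from the results slide.
tally_results <<'EOF'
Jun 16 15:28:00 102.122.93.194 conn-refused 33.8221321105957 -116.548851013184
Jun 16 15:27:28 180.229.138.141 no-response 39.766675 -104.940496666667
Jun 16 15:27:28 187.231.74.250 no-response 33.80945 -118.206921666667
EOF
```

Against the real file this would be run as `tally_results < results.txt`.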
  13. Visualize the problem
     ● An awesome way to get non-techies on your side and impress some management :-)
  14. Beating up your ISP (figuratively)
     ● After more than a dozen calls to the ISP and as many "escalations", we landed on a conference call with some lead network engineers
     ● After 6 hours on this conference call reiterating the problem and showing the data, one engineer asks us to "hold tight"
     ● Things get very quiet...
     ● Like magic, all of my tests start succeeding!
  15. WTF!?!
  16. The backstory
     ● On June 5th, the ISP migrated the SoCal region to a new datacenter in Anaheim. This was an epic failure and they rolled back
     ● On June 12th, the ISP migrated again to Anaheim "successfully"
     ● Cell traffic is pooled by connection, and one of the pools was routing asymmetrically
     ● Asymmetric routing + stateful firewalls = BAD
     ● Updating the routing tables fixed everything
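Why asymmetric routing and stateful firewalls mix badly can be shown with a toy model. A stateful firewall only passes mid-connection packets for flows it saw begin, so a reply that returns through a different firewall finds no matching state. This sketch is an illustration under assumptions: the firewall names `fw_a` and `fw_b`, the flow labels, and the helper `simulate_firewalls` are all hypothetical and do not describe the ISP's actual topology.

```shell
#!/usr/bin/env bash

# simulate_firewalls is a hypothetical toy model. Input is one packet per
# line as "firewall flow flags". Each firewall keeps its own connection
# table, and the firewalls do not share state.
simulate_firewalls () {
  awk '{
    key = $1 SUBSEP $2
    if ($3 == "SYN") {
      seen[key] = 1        # a SYN creates state on the firewall it crosses
      verdict = "pass"
    } else if (key in seen) {
      verdict = "pass"     # packet belongs to a flow this firewall saw begin
    } else {
      verdict = "drop"     # mid-connection packet with no matching state
    }
    print $1, $2, $3, verdict
  }'
}

# flow1 is symmetric (out and back through fw_a).
# flow2 is asymmetric (out through fw_a, back through fw_b).
simulate_firewalls <<'EOF'
fw_a flow1 SYN
fw_a flow1 SYN-ACK
fw_a flow2 SYN
fw_b flow2 SYN-ACK
EOF
```

Only the last packet is dropped: `fw_b` never saw the SYN for `flow2`, so the sender waits for a reply that never arrives and the attempt ends in a timeout.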
  17. Being a generalist
     ● A DevOps culture requires generalists
     ● Understanding the full stack means being able to troubleshoot problems at all layers
     ● Fluid communication between sysadmins, developers, hardware engineers, and network engineers requires generalists
     ● Fewer people in the war room results in faster problem solving
     ● This saves time and money and makes your team more valuable to the business
  18. Thank you! We're hiring!
