Leon Torres 
October 15, 2014
Web Startup Challenges 
• Low-friction development 
• Hodgepodge of technologies 
• Hodgepodge of infrastructures 
• Legacy support 
• Constant migrations and upgrades 
• Bottom line: 
High rate of change and no time to check!
A Gordian Knot 
• How utilized is our Hadoop cluster? 
• How utilized is our DC? 
• Are all of our services running correctly? 
• Is our latency OK at every layer in the stack? 
• Someone changed something: were there any 
negative ripple effects? 
• Are we hitting any scaling issues?
A Network Knot 
• Our products live on the internet 
• Our data centers are global 
– Some of them are virtual 
• Network effects are a fact of life 
– Network partitions 
– Latency makes information late 
– Noise is natural and frequent 
– Data just goes missing 
– High availability compounds the problem
– Richard W. Hamming
Solution Design 
• Hypothesize the existence of 
system state: 
a time-varying stream of state components 
• Build it by measuring our systems in toto 
• Stream all measurements to one place 
• Gain insight by inspecting this stream 
computationally and ad-hoc
Separation of Concerns 
• State collection 
• State computation 
• State visualization
Collecting State 
• Define a state event ADT capturing: 
– Host 
– Service 
– State 
– Timestamp 
– Any additional key/value fields 
• Find something to collect it
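
The event ADT listed above could be sketched as follows. This is a minimal illustration, not the deck's actual definition: the class name `StateEvent`, the field name `attributes`, and the `to_dict` helper are assumptions made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class StateEvent:
    """One observation of one service on one host (hypothetical shape)."""
    host: str
    service: str
    state: str
    # Seconds since the epoch; defaults to "now" when the event is created.
    timestamp: float = field(default_factory=time.time)
    # Any additional key/value fields.
    attributes: Dict[str, str] = field(default_factory=dict)

    def to_dict(self) -> dict:
        """Flatten to a single mapping; core fields win over extra fields."""
        flat = dict(self.attributes)
        flat.update(
            host=self.host,
            service=self.service,
            state=self.state,
            timestamp=self.timestamp,
        )
        return flat
```

Keeping the extra fields in a separate mapping lets collectors attach whatever they know (data center, job id, and so on) without changing the core type.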
Riemann 
• Riemann accepts state events as a stream 
• Riemann indexes the stream, provides stream 
processing facilities and some alerting tools 
• Also provides downstream pipes: 
– Unix domain sockets 
– Web sockets 
– Graphite stream comes free 
– Create your own
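
Once events flow out of a downstream pipe, inspecting them ad hoc can be as simple as filtering decoded messages. The sketch below assumes each message is one JSON object; the function name `grep_stream` and the plain iterable standing in for the socket are illustrative, not part of Riemann.

```python
import json
from typing import Callable, Iterable, Iterator


def grep_stream(
    messages: Iterable[str],
    predicate: Callable[[dict], bool],
) -> Iterator[dict]:
    """Yield decoded events that satisfy the predicate.

    Malformed messages are skipped rather than raised, since noise and
    missing data are expected on a networked stream.
    """
    for raw in messages:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(event, dict) and predicate(event):
            yield event
```

In practice `messages` would be fed by whatever client reads the pipe (a websocket loop, a socket file object); the filter itself does not care where the text comes from.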
Internal State Relays 
• Poll third party monitors for state 
• Map to Riemann events 
• Send to Riemann 
• Fill in holes with custom monitors 
– Hadoop jobs, load balancer state, etc. 
• Foundation in place to know everything about 
our global DC state
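
The poll, map, send loop above might look like the sketch below. The record shape (`host_name`, `service_description`, `current_state`, `last_check`) is an assumed example of what a poller could return, and `send` is a placeholder for whatever client pushes events to Riemann. Only the numeric status convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) is the standard Nagios plugin one.

```python
from typing import Callable, Iterable

# Standard Nagios plugin status codes mapped to event state strings.
NAGIOS_STATES = {0: "ok", 1: "warning", 2: "critical", 3: "unknown"}


def map_check(check: dict) -> dict:
    """Map one polled check record (assumed shape) to a state event."""
    return {
        "host": check["host_name"],
        "service": check["service_description"],
        "state": NAGIOS_STATES.get(check["current_state"], "unknown"),
        "timestamp": check["last_check"],
        "source": "nagios",  # extra key/value field naming the relay
    }


def relay(checks: Iterable[dict], send: Callable[[dict], None]) -> int:
    """Map every polled record and hand it to `send`; return the count."""
    count = 0
    for check in checks:
        send(map_check(check))
        count += 1
    return count
```

Because the mapping is a pure function, each third-party monitor only needs its own small `map_*` function; the relay loop stays the same.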
Network Monitors 
• Static monitors around the world 
– Constantly check HTTP state of services 
• Poll third party monitors (Pingdom, etc.) 
• Deduce network state from aggregate streams 
• Detect outages from user perspective 
• Can extend with PhantomJS to get a Gomez-like 
waterfall and do whatever we want!
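
One way to deduce network state from aggregate streams is to compare the same HTTP check across monitor locations: if every location fails, the service is probably down; if only some fail, the network path from those locations is suspect. The sketch below assumes results arrive as `(location, service, ok)` tuples; the names and the three verdict strings are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def deduce_network_state(
    results: Iterable[Tuple[str, str, bool]],
) -> Dict[str, str]:
    """Classify each service from the latest check at each location."""
    by_service: Dict[str, Dict[str, bool]] = defaultdict(dict)
    for location, service, ok in results:
        by_service[service][location] = ok  # later results overwrite earlier

    verdicts = {}
    for service, per_location in by_service.items():
        failures = sorted(loc for loc, ok in per_location.items() if not ok)
        if not failures:
            verdicts[service] = "ok"
        elif len(failures) == len(per_location):
            verdicts[service] = "service outage"
        else:
            verdicts[service] = "partial: " + ", ".join(failures)
    return verdicts
```

A real deployment would also need to treat a silent monitor as missing data rather than as a pass or a failure, which this sketch does not attempt.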
Demo Time 
• Ad hoc demo 
– Grep the stream 
– Quickly analyze state of disk utilization 
• Hadoop global state 
– It just pipes Nagios data! 
• Network monitoring demo 
– Let’s combine Pingdom + network monitors 
– And iterate! Awesome dashboard
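
The disk utilization analysis from the ad hoc demo can be approximated offline. This sketch assumes disk events carry a service name starting with `disk` and a numeric `metric` field holding the fraction used; both are assumptions for the example, as is the function name `disk_hotspots`.

```python
from typing import Iterable, List, Tuple


def disk_hotspots(
    events: Iterable[dict],
    threshold: float = 0.9,
) -> List[Tuple[str, str, float]]:
    """Return (host, service, metric) for disks at or above the threshold.

    Only the most recent event per (host, service) pair is considered.
    """
    latest = {}
    for event in events:
        service = str(event.get("service", ""))
        metric = event.get("metric")
        if not service.startswith("disk") or not isinstance(metric, (int, float)):
            continue
        key = (event["host"], service)
        if key not in latest or event["timestamp"] >= latest[key]["timestamp"]:
            latest[key] = event
    return sorted(
        (host, service, ev["metric"])
        for (host, service), ev in latest.items()
        if ev["metric"] >= threshold
    )
```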
Distributed Gotchas 
• Riemann can scale, but with some nasty surprises 
– Events on a TCP connection are processed serially 
– If event rate gets too high, stream gets saturated 
and backs up into OS network buffers, then into 
Netty’s unbounded buffers. This ultimately 
starves heap and crashes Riemann. 
– Solution is to use large connection pools at the 
clients that push events
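
The client-side fix can be sketched as a pool that spreads events across several connections, so no single connection's serial processing carries the whole event rate. This is an illustration only: `RoundRobinPool` and the `connect` factory are hypothetical names, the connection objects are assumed to expose a `send(event)` method, and the sketch is not thread-safe.

```python
import itertools
from typing import Callable


class RoundRobinPool:
    """Distribute events over a fixed set of connections in turn."""

    def __init__(self, connect: Callable[[], object], size: int):
        if size < 1:
            raise ValueError("pool size must be at least 1")
        # Open all connections up front; each one is processed serially
        # by the server, so more connections means more parallelism.
        self._connections = [connect() for _ in range(size)]
        self._next = itertools.cycle(self._connections)

    def send(self, event: dict) -> None:
        """Send the event on the next connection in the rotation."""
        next(self._next).send(event)
```

Round-robin trades away per-connection ordering: events for the same host and service may arrive out of order, so consumers should rely on the event timestamp rather than arrival order.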
Distributed Gotchas 
• Network outages and partitions are difficult 
– Riemann must not go down 
– Riemann must deal with split-brain 
• Highly available SRE solution planned 
– Virtual IP, heartbeat (similar to LB solution) 
• Riemann servers in separate locations 
– End up with two masters on partition => double 
the alerts but at least we get something
Are we cutting the knot?

Distributed monitoring


Editor's Notes

  1. At no point can we sit down, sift through our architecture, and say this situation is an error and that situation is OK. We cannot classify things like that because the classifications become defunct within a month, sometimes within days. OK, we can do it for certain things, but for most application-level stuff we have no way to do it. We have to somehow monitor *everything* and figure out how we can know what went wrong from that. Note that this requires us to be experts at every level of the system, as Bilke covered in the last presentation.
  2. Let’s take a look at some things we may want to know. These are some gnarly but super important questions.
  3. Our life is complicated by the distributed nature of our systems, so we need to ensure that whatever solution we have takes into account the network.
  4. Here are some existing solutions we have tried over the years.
  5. However, our experience is that these do not work. They each solve different problems, sometimes very well, but they all fail to answer the knotty questions about the overall system. We have to drill down into many of these applications to get an idea of what the heck is going on. I don’t know about you, but I’m getting log-in fatigue whenever a problem happens. And the situation is getting worse with all these pay-ware hosted third-party solutions. So is there a better way? We need to clear our minds of these approaches and look at the fundamental problem from a fresh perspective.
  6. If we really get back to the basics, we’re talking information theory and computer science, really thinking about the problem as far down as we need to go. And I’m not being academic. Hamming’s quote illustrates a highly pragmatic wisdom despite his heavily mathematical work. It’s also quite on topic: we will take a deep look at what we’re really trying to do here, to come up with a solution design that considers our desire for insight and how we can piece it together numerically from our chaotic mess of systems, people and processes.
  7. Each of the existing tools we just swiped off the table purported to yield insight from some data, but they somehow failed to tell us what we need to know: the state of our system. Let’s look at a solution design that involves the so-called state of our system. (read slide) Now, much of this was motivated by a project called Riemann, which was designed by a physics nerd. In science, when you model something, you choose to represent the system as state vectors in some convenient topological space, and then you run gnarly computations to see if the model matches reality. This is a powerful approach that has consistently yielded great insights on the nature of the universe. We will repeat this process here because hey, our computer systems are a subset of the universe.
  8. This makes it straightforward to implement, debug, scale and maintain.
  9. The point of all this is to be able to operate on the stream as needed. Note that you don’t need to write Clojure code to do this; you can simply open a socket and stream it into Python or whatever. Later on there will be demos that I cobbled together using JavaScript over websockets.
  10. What about monitoring the data center? It turns out we don’t have to reinvent the wheel. Monitoring systems such as Nagios and New Relic each have an API that allows us to poll their state and map it to Riemann-friendly events. This is great because we can leverage our existing expertise with these monitoring systems and get a huge return right off the bat.
  11. Pingdom is great, but it lacks some features, such as telling us what the network state is in general. We can deduce the network state by creating our own series of monitors. This also gives us a platform to replicate the latency waterfall for web pages as done by Gomez and Akamai.
  12. Demo time
  13. I wrote something about the Riemann Java client being lousy. The network monitors have to reconnect on timeout, but that wasn’t supported. So I implemented my own connection logic with one TCP connection and ended up getting burned rather nicely by this. So now I have to contribute to the Java client or roll my own. Exciting stuff!
  14. It’s too soon to say, but I have been using this system during recent outages and it’s starting to look quite useful. We can expect a follow-up to cover the problem of insight and whether this kind of streaming state processor helps at all. There are some additional preliminary and exciting ideas that I haven’t covered here. It’s shaping up to be an interesting body of work. Finally, who would have known: monitoring seems like such a dry topic, until you realize it’s actually very deep.