
How our ISP cost us a full day of the entire R&D team - Lior Redlus - DevOpsDays Tel Aviv 2017


  1. 1. How our ISP cost us a full day of the entire R&D team • Lior Redlus, Co-founder and Chief Data Scientist, Coralogix
  2. 2. About Myself • 32 years old, scientist at heart • B.Sc. and M.Sc. in Neuroscience and Information Processing (BIU) • Co-founder and Chief Data Scientist @ Coralogix
  3. 3. About Coralogix • A Machine Learning powered scalable Log Analysis solution • Log Management already included: indexing, querying, filtering, alerting etc. • Coralogix Analytics: • Turns your data into patterns and flows • Gives you deep insights on your system • Automatically detects production problems • Finds system behavior changes between code deployments
  4. 4. Interacting with your logging data • Coralogix provides 3 ways to get insights from your logs: 1. Coralogix Dashboard – a simple and powerful dashboard with machine learning capabilities 2. Elastic’s Kibana – with a rich query language and flexible visualizations 3. Elasticsearch API – for deep technical querying and aggregations
  5. 5. Good product, happy customers! • Everything worked smoothly for months • Until we got a call from a customer (0.5TB / day) • Some of his heavier dashboards could not be loaded • He was not happy • And neither were we
  6. 6. Well, of course. This makes no one happy! • The error message was replicated in our offices as well
  7. 7. Kibana – technical overview • [Architecture diagram; labels: Customer, Public domain, Port 5601, Node.js server, Angular.js client, localhost, Port 9200 (twice), Docker container (three)] • Our proprietary Kibana proxy: • Emulates Elasticsearch for Kibana • Confines customers to accessing only their own data • Parses queries for various SLA restrictions
  8. 8. So what could have gone wrong..? • We looked into everything we could think of: • Was the customer’s dashboard defined properly? • It was. • Was any indexed Elasticsearch data corrupted? • No. • Was a large Kibana dashboard overloading our Kibana proxy? • Not according to the CPU and memory monitoring. • Was there a hidden bug in our Kibana proxy for certain queries? • Replies seemed to be correct for every query we researched. • Was any Docker container replaced recently, possibly with different settings? • Yes, but no new settings were introduced. • Was any Docker networking bug (and there are many…) at play here? • None that we could find.
  9. 9. Everything looked perfect! • However, we did have one odd finding: • When we were connected to our VPN, all the problems disappeared! • Late at night and disappointed, we decided to call it a day:
  10. 10. Connecting the dots… • Returning home, we each loaded the dashboard, and to our surprise – everything worked! • The same ISP served us and the customer, but not our homes. • The new suspect was our Internet Service Provider!
  11. 11. Results – 1 • The next day, confident, we experimented: • SSL vs. no SSL • Kibana’s standard port 5601 vs. HTTPS port 443 • Adding our Kibana CNAME to Cloudflare • The results were staggering!
  12. 12. Results – 2 • Loading the dashboard without SSL through port 5601: • Loading the same dashboard with SSL through port 443 and Cloudflare:
  13. 13. Results, solution and conclusion • The ISP was throttling our requests, causing timeouts and packet loss – eventually crashing heavily loaded dashboards • Adding our Kibana to Cloudflare under port 443 solved our problems • (aside from wasting a whole day of our R&D team!) • Conclusion: trust no one!
  14. 14. Questions? • Please feel free to contact me directly: Lior Redlus, Chief Data Scientist, One month free trial @
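
The comparison described in slides 11 and 12 (the same dashboard loaded over plain HTTP on port 5601 versus SSL on port 443) can be approximated with a small timing probe. The following is a minimal sketch, not the method the team actually used: the names `ProbeResult`, `fetch_url`, `probe` and `compare` are hypothetical, and any URLs passed in are placeholders you would supply yourself. It relies only on the Python standard library.

```python
import http.client
import statistics
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProbeResult:
    label: str
    attempts: int
    failures: int           # attempts that timed out or errored
    median_seconds: float   # median over successful attempts; NaN if none


def fetch_url(url: str, timeout: float = 30.0) -> None:
    """Download the full response body; raises on timeout or HTTP error."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def probe(
    label: str,
    url: str,
    attempts: int = 5,
    fetch: Callable[[str], None] = fetch_url,
    clock: Callable[[], float] = time.perf_counter,
) -> ProbeResult:
    """Fetch `url` several times, counting failures and timing successes."""
    durations: List[float] = []
    failures = 0
    for _ in range(attempts):
        start = clock()
        try:
            fetch(url)
        except (OSError, http.client.HTTPException):
            # Timeouts, resets and HTTP errors all count as a failed load.
            failures += 1
            continue
        durations.append(clock() - start)
    median = statistics.median(durations) if durations else float("nan")
    return ProbeResult(label, attempts, failures, median)


def compare(
    endpoints: Dict[str, str],
    attempts: int = 5,
    fetch: Callable[[str], None] = fetch_url,
) -> List[ProbeResult]:
    """Probe each labelled endpoint in turn and return one result per label."""
    return [probe(label, url, attempts, fetch) for label, url in endpoints.items()]
```

To use it, call `compare` with a dictionary mapping labels to the two variants of your own endpoint (for example, a plain-HTTP URL on port 5601 and an HTTPS URL on port 443), then repeat the run from a second network or through a VPN. A failure count or median that differs sharply by network for the same endpoint would point at the path between client and server rather than at the server itself.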