How our ISP cost us a full day of the entire R&D team - Lior Redlus - DevOpsDays Tel Aviv 2017
Co-founder and Chief Data Scientist
• 32 years old, scientist at heart
• B.Sc and M.Sc in Neuroscience and Information Processing (BIU)
• Co-founder and Chief Data Scientist @ Coralogix
• A Machine Learning-powered, scalable Log Analysis solution
• Log Management already included: indexing, querying, filtering, and more
• Coralogix Analytics:
• Turns your data into patterns and flows
• Gives you deep insights on your system
• Automatically detects production problems
• Finds system behavior changes between code deployments
Interacting with your logging data
• Coralogix provides 3 ways to get insights from your logs:
1. Coralogix Dashboard – a simple and powerful dashboard with machine learning insights
2. Elastic’s Kibana – with a rich query language and flexible visualizations
3. Elasticsearch API – for deep technical querying and aggregations
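To make the third option concrete, here is a minimal sketch of querying the Elasticsearch search API directly for an aggregation. The endpoint, index name, and "severity" field are hypothetical placeholders, not Coralogix's actual setup:

```python
# Minimal sketch: run a terms aggregation against the Elasticsearch API.
# The URL, index, and "severity" field are hypothetical placeholders.
import requests

ES_URL = "https://es.example.com:9200"   # hypothetical cluster endpoint
INDEX = "logs-2017.11"                   # hypothetical index name

query = {
    "size": 0,  # only aggregation buckets are needed, not documents
    "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
    "aggs": {"by_severity": {"terms": {"field": "severity", "size": 10}}},
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, timeout=30)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_severity"]["buckets"]:
    print(f'{bucket["key"]}: {bucket["doc_count"]} logs')
```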
Good product, happy customers!
• Everything worked smoothly for months
• Until we got a call from a customer (0.5TB / day)
• Some of his heavier dashboards could not be loaded
• He was not happy
• And neither were we
Well, of course. This makes no one happy!
• We reproduced the error message in our offices as well
Kibana – technical overview
Our proprietary Kibana proxy (sketched below):
• Emulates Elasticsearch for Kibana
• Confines customers to accessing only their own data
• Parses queries for various SLA checks
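A rough sketch of the tenant-confinement idea, assuming a small Flask proxy in front of Elasticsearch that wraps every incoming query with a filter on a hypothetical tenant_id field. The real proxy is proprietary; this only illustrates the principle:

```python
# Sketch of a Kibana-facing proxy that confines each customer to their
# own data. ES_URL, the route shape, and "tenant_id" are hypothetical.
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)
ES_URL = "http://localhost:9200"  # hypothetical upstream Elasticsearch

def confine_to_tenant(query_body, tenant_id):
    """Wrap the original query so it can only match the tenant's documents."""
    original = query_body.get("query", {"match_all": {}})
    query_body["query"] = {
        "bool": {
            "must": [original],
            "filter": [{"term": {"tenant_id": tenant_id}}],
        }
    }
    return query_body

@app.route("/<index>/_search", methods=["POST"])
def search(index):
    # In a real system the tenant would come from authentication,
    # not from a client-supplied header.
    tenant_id = request.headers.get("X-Tenant-Id", "unknown")
    body = confine_to_tenant(request.get_json(force=True), tenant_id)
    resp = requests.post(f"{ES_URL}/{index}/_search", json=body, timeout=30)
    return jsonify(resp.json()), resp.status_code
```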
So what could have gone wrong...?
• We looked into everything we could think of:
• Was the customer’s dashboard defined properly?
• It was.
• Was any indexed Elasticsearch data corrupted?
• It wasn't.
• Was a large Kibana dashboard overloading our Kibana Proxy?
• Not according to the CPU and memory monitoring.
• Was there a hidden bug in our Kibana Proxy for certain queries?
• Replies seemed correct for every query we tested.
• Was any Docker container replaced recently, possibly with different settings?
• Yes, but new settings were not introduced.
• Was any Docker networking bug (and there are many…) interacting here?
• None that we could find.
Everything looked perfect!
• However, we did have one odd finding:
• When we were connected to our VPN, all the problems disappeared!
• Late at night and disappointed, we decided to call it a day.
Connecting the dots…
• Returning home, we each loaded the dashboard, and to our surprise – everything worked!
• The same ISP served us and the customer, but not our homes.
• The new suspect was our Internet Service Provider!
Results – 1
• The next day, confident, we experimented (see the timing sketch after this list):
• SSL vs. no SSL
• Kibana’s standard port 5601 vs. 443 https port
• Adding our Kibana CNAME to Cloudflare
• The results were staggering!
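A rough sketch of how such an experiment can be scripted: time the same request over plain HTTP on port 5601 and over HTTPS on port 443 behind Cloudflare. The hostnames and paths are placeholders:

```python
# Timing sketch: fetch the same endpoint over plain HTTP on port 5601
# and over HTTPS on port 443 (behind Cloudflare), and compare the
# wall-clock times. URLs below are hypothetical.
import time
import requests

ENDPOINTS = {
    "no SSL, port 5601": "http://kibana.example.com:5601/api/status",
    "SSL + Cloudflare, port 443": "https://kibana.example.com/api/status",
}

for label, url in ENDPOINTS.items():
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=60)
        elapsed = time.monotonic() - start
        print(f"{label}: HTTP {resp.status_code} in {elapsed:.2f}s")
    except requests.RequestException as exc:
        print(f"{label}: failed after {time.monotonic() - start:.2f}s ({exc})")
```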
Results – 2
• Loading the dashboard without SSL through port 5601: throttled requests and timeouts
• Loading the same dashboard with SSL through port 443 and Cloudflare: everything loaded correctly
Results, solution and conclusion
• The ISP was throttling our requests, causing timeouts and packet losses – eventually crashing heavily loaded dashboards
• Adding our Kibana to Cloudflare under port 443 solved our problems (a configuration sketch follows below)
• (aside from wasting a whole day of our R&D team!)
• Conclusion: trust no one!
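For reference, a hedged sketch of the fix using Cloudflare's v4 DNS API to create a proxied CNAME, so Kibana traffic rides over HTTPS on port 443 through Cloudflare's edge instead of hitting the origin directly on port 5601. Zone ID, hostnames, and credentials are placeholders:

```python
# Sketch: create a proxied CNAME record via Cloudflare's v4 API.
# Zone ID, hostnames, and credentials below are placeholders.
import requests

ZONE_ID = "your-zone-id"
API = f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/dns_records"
HEADERS = {
    "X-Auth-Email": "email@example.com",
    "X-Auth-Key": "your-api-key",
}

record = {
    "type": "CNAME",
    "name": "kibana.example.com",     # public hostname served on 443
    "content": "origin.example.com",  # origin host running Kibana
    "proxied": True,                  # route through Cloudflare's edge
}

resp = requests.post(API, headers=HEADERS, json=record, timeout=30)
resp.raise_for_status()
print(resp.json()["success"])
```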
• Please feel free to contact me directly:
Lior Redlus, Chief Data Scientist, firstname.lastname@example.org
One month free trial @ http://www.coralogix.com