Monitoring Big Data Systems - "The Simple Way"

Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem in itself.

In the talk we’ll mention all of the aspects that you should take into consideration when monitoring a distributed system that uses tools like:
Web Services, Apache Spark, Cassandra, MongoDB, Amazon Web Services.
And beyond the tools: what should you monitor about the actual data that flows through the system?

We’ll also cover the simplest solution, built with your day-to-day open source tools. The surprising thing is that it doesn’t come from an Ops guy.

Demi Ben-Ari is a Co-Founder and CTO @ Panorays.
Demi has over 9 years of experience in building various systems, from near-real-time applications to distributed Big Data systems.
He describes himself as a software development groupie, interested in tackling cutting-edge technologies.

Demi is also a co-founder of the “Big Things” Big Data community: http://somebigthings.com/big-things-intro/

Published in: Engineering

  1. 1. Monitoring Big Data Systems Done "The Simple Way" Demi Ben-Ari - CTO @ Panorays
  2. 2. About Me Demi Ben-Ari, Co-Founder & CTO @ Panorays ● B.Sc. Computer Science – Academic College Tel-Aviv Yaffo ● Co-Founder “Big Things” Big Data Community In the Past: ● Sr. Data Engineer - Windward ● Team Leader & Sr. Java Software Engineer, Missile defense and Alert System - “Ofek” – IAF Interested in almost every kind of technology – A True Geek
  3. 3. Agenda ● A lot of (NOT) funny Jokes ● Problem definition and Environment ● Monitoring pipeline solutions ○ Metrics ○ Datastore ○ Dashboards ○ Alerting ● Summary ● (Not going to address Service discovery and monitoring)
  4. 4. Say “Distributed”, Say “Big Data”, Say….
  5. 5. What is Big Data (IMHO)? And What to Monitor? ● Systems involving the “3 Vs”: What are the right questions we want to ask? ○ Volume - How much? ■ Amount per second / minute / hour / day…. ■ Gigabytes, Terabytes, Petabytes… ○ Velocity - How fast? ■ Count per second / minute / hour / day…. ○ Variety ■ What kind? (Difference) ■ Sensor Data, Logs, Data Streams, Financial Transactions, Geo Locations...
  6. 6. Monolith Structure OS CPU Memory Disk Processes Java Application Server Database Web Server Load Balancer Users - Other Applications Monitoring System UI
  7. 7. Distributed Microservices Architecture Service A Queue DB Service B DB Cache Cache DB Service C Web Server DB Analytics Cluster Master Slave Slave Slave Monitoring System???
  8. 8. Some basic concepts
  9. 9. Basic Concepts ● Monitoring ○ Collecting, processing, aggregating, and displaying real-time quantitative data about a system ● White-box ○ Monitoring based on metrics exposed by the internals of the system ○ Logs, interfaces such as the JVM’s JMX, etc. ● Black-box ○ Testing externally visible behavior as a user would see it. ● Dashboard ○ An application that provides a summary view of a service’s core metrics.
  10. 10. Basic Concepts ● Alert ○ A notification intended to be read by a human and that is pushed to a system such as a bug or ticket queue, an email alias or a pager. ● Root cause ○ A defect in a software or human system that, if repaired, instills confidence that this event won’t happen again in the same way. ● Node and machine ○ Used interchangeably to indicate a single instance (physical server, virtual machine or container). There might be multiple services worth monitoring on a single machine. ● Push ○ Any change to a service’s running software or its configuration. ● KPI - Key Performance Indicator
  11. 11. Data flow and Environment (Our Use Case)
  12. 12. Data Flow Diagram External Data Source Analytics Layers Data Pipeline Parsed Raw Entity Resolution Process Building insights on top of the entities Data Output Layer Anomaly Detection Trends UI for End Users
  13. 13. Environment Description Cluster Dev Testing Live Staging Production Env OB1K RESTful Java Services
  14. 14. Situations
  15. 15. MongoDB + Spark Worker 1 Worker 2 …. …. … … Worker N Spark Cluster Master Write Read Master Sharded MongoDB Replica Set
  16. 16. Cassandra + Spark Worker 1 Worker 2 …. …. … … Worker N Cassandra Cluster Spark Cluster Write Read
  17. 17. Cassandra + Serving Cassandra Cluster Write Read UI Client UI Client UI Client UI Client Web Service Web Service Web Service Web Service
  18. 18. Problems ● Multiple physical servers ● Multiple logical services ● Want Scaling => More Servers ● Even if you had all of the metrics ○ You’ll have an overflow of the data ● Your monitoring becomes a “Big Data” problem itself
  19. 19. What “Distributed” Really Means The DevOps Guy (It might be you)
  20. 20. So...Let’s Start!
  21. 21. Report to Where? ● We chose: ● Graphite (InfluxDB) + Grafana ● Can correlate System and Application metrics in one place :)
  22. 22. Report to Where? ● Save DevOps efforts if you’re willing to Pay :) ● Hosted Graphite ○ https://www.hostedgraphite.com/ ● Throwing the “Big Data” volume monitoring problem at someone else
  23. 23. Connections Connections... http://www.mememaker.net/meme/connections-connections-everywhere2/
  24. 24. Drivers to Datastores ● Actions they usually do: ○ Open connection ○ Apply actions ■ Select ■ Insert ■ Update ■ Delete ○ Close connection ● Do you monitor each? ○ Hint: Yes!!!! Hell Yes!!! ● Creating a wrapper in any programming language and reporting the metrics ○ Count, execution times, errors… ○ A bit of Infrastructure code that will give great visibility
  25. 25. Monitoring Operating System
  26. 26. Monitoring Operating System Metrics ● What to measure: ○ CPU ○ Memory ○ Disk Space ● How to measure: ○ CollectD or StatsD reporting to Graphite ○ New Relic ■ Nice and easy UI ■ Even the free account gives you a great tool ■ Alerting on thresholds
  27. 27. Monitoring Cassandra
  28. 28. Monitoring Cassandra ● OpsCenter - by DataStax
  29. 29. Monitoring Cassandra ● Is that enough?... We can connect it to Graphite too (Blog: “Monitoring the hell out of Cassandra”) ● Plug & Play the metrics to Graphite - Internal Cassandra mechanism ● Back to the basics: dstat, iostat, iotop, jstack
  30. 30. Monitoring Cassandra
  31. 31. Monitoring Cassandra - Alternative
  32. 32. Monitoring Cassandra - Some more :) ● Cyanite: http://cyanite.io/ Graphite with Cassandra backend as a datasource. ● Nodetool - Cassandra tool ● Back to the basics: dstat, iostat, iotop, jstack
  33. 33. Some help from “the Cloud”
  34. 34. Monitoring via AWS’s CloudWatch
  35. 35. Google Stackdriver (GCP) ● Can integrate both GCP and Amazon accounts
  36. 36. Monitoring Spark
  37. 37. What to monitor in an Apache Spark Cluster ● Application execution ● Resource consumption and allocation ● Task Failures ● Environment and Amount of servers ● Physical OS metrics ● Infrastructure services
  38. 38. Ways to Monitor Spark ● Sending Metrics: Spark → Graphite (Execution) ● http://spark.apache.org/docs/latest/monitoring.html
  39. 39. Ways to Monitor Spark ● Sending Metrics: Spark → Graphite (JVM metrics) ● http://spark.apache.org/docs/latest/monitoring.html
  40. 40. Ways to Monitor Spark ● Grafana-spark-dashboards ○ Blog: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/ ● Spark UI - Online, for each running application ● Spark History Server - Offline (after the application finishes) ● Spark REST API ○ Querying via internal tools for ad-hoc monitoring ● Back to the basics: dstat, iostat, iotop, jstack ● Blog post by Tzach Zohar - “Tips from the Trenches”
  41. 41. Monitoring Your Data https://memegenerator.net/instance/53617544
  42. 42. Data Questions? ● Did all of the computation occur? ● Are there any data layers missing? ● How much data do we have? (Volume) ● Is all of the data in the Database?
  43. 43. Data Answers! ● KISS (Keep it simple, stupid) ● Jenkins + Maven (JUnit) to the rescue ● Creating a Maven “monitoring” project. ○ Running scheduled tasks, each for the relevant data source ■ Database data existence ■ S3 files existence ■ Data flow that keeps on coming from sensors ■ (Any other data source that you can imagine…) ○ Scheduled tasks that write data amount metrics to Graphite -> Dashboards ○ Report task execution to Graphite
  44. 44. Data Answers! ● The method doesn’t really matter, as long as you: ○ Can follow the results over time ○ Know what your data flow is and where things might fail ○ Make it easy for anyone to add more monitoring (for the ones that add the new data each time…) ○ Don’t trust others to add monitoring (it will always end up being the DevOps’s “fault” -> no monitoring will be applied)
  45. 45. Logging? Monitoring? https://lh4.googleusercontent.com/DFVcH-E5XKj8cbhEtI0qabmf_wwVqWWvk0pK5H5rnC_kVxY2tXClKfzV- LvAH61YRLJUEvtO9amjWfjcY4Z57VBYCuQ95_hdAVEHgLAuepJiArH0wJERWuzzmgnPysCiIA
  46. 46. ELK - Elasticsearch + Logstash + Kibana ● Elastic ● Architecture: Server Server Server Shippers Queue Indexer Web UI Storage
  47. 47. ELK - Elasticsearch + Logstash + Kibana ● (Simpler) Architecture: ○ The problem: Log4J only works with TCP :( => Log4J2 works with UDP too Server Server Server Indexer Web UI Storage TCP / UDP
  48. 48. ELK - Elasticsearch + Logstash + Kibana http://www.digitalgov.gov/2014/05/07/analyzing-search-data-in-real-time-to-drive-decisions/
  49. 49. ELK - Elasticsearch + Logstash + Kibana http://blog.takipi.com/log-management-tools-face-off-splunk-vs-logstash-vs-sumo-logic/
  50. 50. Who else Logs? ● Graylog2 ● …. ● Logging As a Service :) ○ Logz.io (http://logz.io/blog/deploy-elk-production) ○ Loggly ○ Sematext
  51. 51. How does it look in real life? ● http://www.digitalgov.gov/2015/01/07/elk/ ● http://www.ragedsyscoder.com/monitoring-slides/file/img/tvs.jpg
  52. 52. Did someone say “Dashboard”? http://www.funpic.hu/_files/pictures/original/86/71/27186.jpg
  53. 53. Redash ● http://redash.io/ ● Open Source: https://github.com/getredash/redash ● Started as one of many open source tools from Everything.me ● Created and maintained by Arik Fraimovich (You rock!) ● Written in Python ● Has both on-premise and hosted solutions
  54. 54. Redash - Data Sources
  55. 55. Redash - Screenshots
  56. 56. Redash - Screenshots
  57. 57. Redash - the “Why?” ● Having multiple data sources in the organization ● Wanting to see a combination of all the data sources in one place ● It’s open source and ready to use ● Why implement a fancy UI and spend a lot of time?!?!? ● So...just use it!
  58. 58. Alerting
  59. 59. Alerting ● Seyren - Open source ● Reporting to: ○ Email, Flowdock, HipChat, HTTP, Hubot, IRCcat, PagerDuty, Pushover, SLF4J, Slack, SNMP, Twilio
  60. 60. ELK - And what about alerting??? ● Elastalert ● http://engineeringblog.yelp.com/2015/10/elastalert-alerting-at-scale-with-elasticsearch.html ● http://engineeringblog.yelp.com/2016/03/elastalert-part-two.html
  61. 61. Some more alerting ● CloudWatch and Stackdriver have their own alerting mechanisms ● New Relic has its own alerting too ● Even with our Jenkins tests we’ve created alerting via emails ○ Beware of “Spam” ● Pick whichever solution you like, as long as: ○ You notice what is wrong when it’s wrong ○ You are able to “Acknowledge” your errors ○ It does something you won’t be able to ignore :)
  62. 62. Summary - Monitoring Stack Alerting Metrics Collection Datastore Dashboard Data Monitoring Log Monitoring
  63. 63. Conclusions ● Correlate Application and System metrics!!!! ● Ask the right monitoring questions and answer them with Dashboards ● KISS - simple is key; what’s hard, we tend not to do at all ● Alert about what you can actually react to (and alert the relevant person) ● Measure whatever you can - it’s the only way to know if you’re improving ● Monitor your business KPIs too. ● If all of the above is not enough, Graphs are fricking cool! http://www.rantlifestyle.com/2013/09/23/how-happy-this-baby-is-will-shock-you/
  64. 64. Questions?
  65. 65. Thanks! My contact: Demi Ben-Ari ● LinkedIn ● Twitter: @demibenari ● Blog: http://progexc.blogspot.com/ ● Email: demi.benari@gmail.com ● “Big Things” Community Meetup, YouTube, Facebook, Twitter ● GDG Cloud
  66. 66. Resources ● Monitoring distributed systems - A case study in how Google monitors its complex systems
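
Slide 24 suggests wrapping datastore drivers and reporting count, execution time and errors for every action. The slides do not show the wrapper itself, so the following is only a minimal sketch of the idea, not the speaker's implementation. The class name `MonitoredDriver`, the metric naming scheme and the `send_to_graphite` helper (with its default host) are all hypothetical; the only external convention relied on is Graphite's plaintext protocol, one `<path> <value> <timestamp>` line per metric, conventionally received on port 2003.

```python
import socket
import time
from contextlib import contextmanager


def graphite_line(path, value, timestamp):
    """Format one metric in Graphite's plaintext protocol."""
    return f"{path} {value} {int(timestamp)}\n"


def send_to_graphite(line, host="localhost", port=2003):
    """Hypothetical sender: push one plaintext line to a Graphite listener."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))


class MonitoredDriver:
    """Wraps datastore actions and reports count, time and errors.

    `send` is any callable that accepts a formatted metric line, so the
    wrapper can be exercised without a running Graphite server.
    """

    def __init__(self, prefix, send):
        self.prefix = prefix
        self.send = send

    @contextmanager
    def timed(self, action):
        start = time.perf_counter()
        try:
            yield
        except Exception:
            # Count the failure, then let the caller see the original error.
            self._report(f"{action}.errors", 1)
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self._report(f"{action}.count", 1)
            self._report(f"{action}.time_ms", round(elapsed_ms, 3))

    def _report(self, name, value):
        self.send(graphite_line(f"{self.prefix}.{name}", value, time.time()))


# Usage: collect lines in memory here; pass send_to_graphite in production.
sent = []
driver = MonitoredDriver("app.cassandra", sent.append)
with driver.timed("select"):
    pass  # run the real query here
```

Passing the sender in as a parameter keeps this small piece of infrastructure code testable and makes it easy to swap Graphite for another backend later.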

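Slide 43 describes scheduled tasks that check whether data exists in each source and write data amount metrics, plus a record of the task execution, to Graphite. The talk implements these as JUnit tests run by Jenkins; the Python below is a substitute sketch of the same idea, and every name in it (`run_data_checks`, the `monitoring.data` prefix, the example check names) is an assumption made for illustration.

```python
import time


def run_data_checks(checks, prefix="monitoring.data", now=None):
    """Run each check and return (metric_lines, failures).

    `checks` maps a check name to a zero-argument callable returning how
    many items (rows, files, events) were found in that data source.
    """
    timestamp = int(now if now is not None else time.time())
    lines, failures = [], []
    for name, count_fn in checks.items():
        try:
            count = count_fn()
        except Exception as exc:
            # The check itself could not run: record that it did not execute.
            failures.append((name, f"check raised {exc!r}"))
            lines.append(f"{prefix}.{name}.executed 0 {timestamp}\n")
            continue
        lines.append(f"{prefix}.{name}.count {count} {timestamp}\n")
        lines.append(f"{prefix}.{name}.executed 1 {timestamp}\n")
        if count <= 0:
            failures.append((name, "no data found"))
    return lines, failures


# Usage with stand-in checks; real ones would query the database, list S3
# keys, or count recent sensor events.
lines, failures = run_data_checks(
    {"db_rows": lambda: 42, "s3_files": lambda: 0},
    now=1000,
)
```

The metric lines use Graphite's plaintext format so they can be charted over time, while the `failures` list is what a scheduler such as Jenkins could turn into a failed build and an email alert.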