From an external observer's perspective, the components of a distributed system start and terminate unpredictably, which makes monitoring challenging. A component can also run several times on a single server as well as on multiple machines. The Hadoop ecosystem is one example of such a distributed application and the primary example of this talk. The fundamental question is: how can such unpredictable distributed systems be monitored? This talk presents a general analysis of the problem and of existing solutions. Based on this analysis, a new theoretical concept is developed and realized in a practical solution, and a fully automated monitoring solution for distributed systems is demonstrated. The solution is flexible and portable and can therefore also be applied outside the Hadoop environment. This new solution is an efficient and promising contribution to the monitoring community.
3. Page 3 Monitoring Distributed Systems | Philip Griesbacher | 07.11.2018 @ OSMC
ABOUT ME AND THIS TALK
Past: 5 years at ConSol's Monitoring Team
Finished my Master of Science in the field of Computer Science
- Specialization: Distributed Systems
- Master Thesis: Concepts for Monitoring Distributed Systems
Now: Big Data, Machine Learning, Artificial Intelligence department at BMW, as System Architect for Cloud Native topics
THE BMW GROUP IN NUMBERS (2017)
129,900
Employees worldwide
2,463,526
Cars sold worldwide
BIGDATA, MACHINE LEARNING, ARTIFICIAL INTELLIGENCE @ BMW GROUP
Storage:
- ~3 PB Hadoop
- ~2 PB Streaming and other
Memory:
- ~25 TB Hadoop
- ~25 TB Streaming and other
~1 TB data growth per day
INTRODUCTION
DISTRIBUTED SYSTEMS
“A distributed system is a collection of independent computers that appears to its users as a single coherent system.”
- Tanenbaum & van Steen (2006)
INTRODUCTION
DISTRIBUTED SYSTEMS
Hadoop Ecosystem
YARN
HDFS
MapReduce
(Spark)
INTRODUCTION
RESEARCH QUESTION
How to monitor distributed systems?
ANALYSIS OF MONITORING CONCEPTS AND SYSTEMS
ANALYSIS OF MONITORING CONCEPTS
Push vs. Pull
Blackbox vs. Whitebox
Agent-based vs. Agent-less
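The push/pull distinction can be illustrated with a small in-memory sketch. It is not the solution presented in this talk; the class names `Component`, `PullMonitor`, and `PushMonitor` and the component names are hypothetical and exist only for illustration. The sketch shows one consequence of the choice: a pull-based monitor only sees targets it already knows about, while a push-based monitor also receives data from components that started without prior registration.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical monitored process that can report its own state."""
    name: str
    value: float = 1.0

    def metrics(self) -> dict:
        return {"name": self.name, "value": self.value}


@dataclass
class PullMonitor:
    """Pull: the monitor asks every target it knows about."""
    targets: list = field(default_factory=list)
    samples: list = field(default_factory=list)

    def scrape(self) -> None:
        for target in self.targets:
            self.samples.append(target.metrics())


@dataclass
class PushMonitor:
    """Push: components send their samples to the monitor."""
    samples: list = field(default_factory=list)

    def receive(self, sample: dict) -> None:
        self.samples.append(sample)


known = Component("datanode-1")          # registered with the pull monitor
short_lived = Component("map-task-42")   # started later, never registered

pull = PullMonitor(targets=[known])
pull.scrape()

push = PushMonitor()
push.receive(known.metrics())
push.receive(short_lived.metrics())

pull_names = [s["name"] for s in pull.samples]
push_names = [s["name"] for s in push.samples]
print(pull_names)  # ['datanode-1']
print(push_names)  # ['datanode-1', 'map-task-42']
```

In practice, pull-based systems close this gap with some form of target discovery, and push-based systems need a way to notice that a component has stopped sending data.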
ANALYSIS OF MONITORING CONCEPTS AND SYSTEMS
REQUIREMENTS ON MONITORING SYSTEMS
Scalability
Robustness
Extensibility
Manageability / Administratively Scalable
Portability
Overhead
Security
ANALYSIS OF MONITORING CONCEPTS AND SYSTEMS
CONCLUSION
Ganglia, Nagios, Prometheus
Pro:
- Manageability / Administratively Scalable
- Extensibility
- Robustness
- Portability
- Overhead
Contra:
- Geographical Scaling
- Security
- Overhead
- Manageability / Administratively Scalable
→ Nagios + Prometheus
MONITORING SOLUTION
SHOWCASE ENVIRONMENT
4 (ResourceManagers and NameNodes)
+ 252 * 2 (DataNodes and NodeManagers)
+ 252 * 3 (ApplicationMaster, Map and Reduce)
= 1264 processes at minimum.
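The sum above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the constant names are illustrative, and reading the factor 252 as the number of worker nodes is an assumption based on the formula on the slide.

```python
# Minimum number of processes to monitor in the showcase environment.
MASTER_PROCESSES = 4          # ResourceManagers and NameNodes
WORKER_NODES = 252            # assumption: the factor 252 is the worker node count
DAEMONS_PER_WORKER = 2        # DataNode and NodeManager
JOB_PROCESSES_PER_WORKER = 3  # ApplicationMaster, Map and Reduce


def minimum_process_count() -> int:
    """Return the lower bound on concurrently running processes."""
    per_worker = DAEMONS_PER_WORKER + JOB_PROCESSES_PER_WORKER
    return MASTER_PROCESSES + WORKER_NODES * per_worker


print(minimum_process_count())  # 1264
```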
MONITORING SOLUTION
CHALLENGE
“If a human operator needs to touch your system during normal operations, you have a bug.”
- Carla Geisser, Google