In this talk, I introduce Portainer, Prometheus and Grafana for effective monitoring of an infrastructure. We also look at the Prometheus configuration in detail with optimization in mind, followed by a small demo of a simple use case. Finally, we cover some basic challenges of running Prometheus in a production environment and possible steps to overcome them.
2. Agenda
Machine Learning Reply | DSCAdria 23 | May 18th
1 About Reply & Abhisar Bharti
2 Monitoring and Machine Learning Operations
3 Prometheus and Grafana: Advantages, Limitations, Challenges & Outcomes
4 Summary
4. About Reply & Abhisar Bharti
Global presence and growth
Reply at a glance - Our growth in revenue and employees

Growth factors
▪ State-of-the-art technologies: High investments in future technologies and the establishment of leading-edge technologies lead to value creation and long-term customer relationships.
▪ Top talents: Hiring top talent and developing employees into accepted leaders within their industry or technology.
▪ Start-up concept: Creation of Reply start-ups for new trends and technologies, knowledge building and "survival of the fittest".

Locations
▪ Croatia: Zagreb
▪ Brazil: São Paulo, Belo Horizonte
▪ Germany: Berlin, Bremen, Düsseldorf, Frankfurt, Gütersloh, Hamburg, Munich
▪ France & Benelux: Paris, Amsterdam, Brussels, Luxembourg
▪ Italy: Bari, Milan, Padua, Rome, Turin, Trieste, Verona
▪ Great Britain: London, Basingstoke, Chester, Cockpole Green
▪ Poland & Romania: Katowice, Bucharest
▪ Belarus: Minsk
▪ USA: Chicago, Detroit, Seattle
▪ China: Beijing
▪ Austria: Vienna

Revenue and employees by year
Year   Revenue (million €)   Employees
2006   230                   1925
2007   277                   2272
2008   330                   2686
2009   340                   2994
2010   384                   3149
2011   440                   3422
2012   495                   3725
2013   560                   4253
2014   632                   4689
2015   706                   5245
2016   781                   6015
2017   884                   6456
2018   1036                  7606
2019   1182                  8157
2020   1250.2                9059
2021   1483.8                10579
2022   1891.1                13467
5. Machine Learning Reply - United by the Digitalization
Reply expertise and multiple services
▪ Artificial Intelligence: Intelligent Automation, Machine Learning
▪ Cloud Platforms: Development & Operations, Data
▪ Cyber Security: Security Operation Center, Security Consulting
▪ Internet of Things: Autonomous Vehicles, Industrial Systems, Connected Products, Energy Ecosystems, Healthcare
▪ Industry Platforms: Retail & CPG, Energy, Healthcare, Manufacturing & Logistics, Financial Services, Telecom & Media
▪ Customer Experience: Immersive Experience, Design & UX, Video, Social Media & Storytelling, Digital Ecosystem
6. Who am I?
Abhisar Bharti
Data Science Consultant
a.bharti@reply.de
▪ Joined Machine Learning Reply in February 2022
▪ Time-series enthusiast
▪ Contributor to research on Structured Literature Review techniques
▪ M.Sc. (Data and Knowledge Engineering)
▪ Data storyteller
9. It is time for a question…
10. Monitoring and Machine Learning Operations
Metrics monitoring
▪ Is the system up?
▪ Is the system down?
▪ Is the system degrading?
▪ Collect relevant data that explains the system state
▪ Label it
▪ Analyse it
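The collect, label and analyse steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `Gauge` class is a hypothetical stand-in written for this example (it is not the Prometheus client library), and the metric and label names are invented. It renders labeled samples in the Prometheus text exposition format, which is what a monitoring server would later collect and analyse.

```python
from dataclasses import dataclass, field


@dataclass
class Gauge:
    """A minimal gauge: one float value per unique label combination."""
    name: str
    help_text: str
    samples: dict = field(default_factory=dict)

    def set(self, value: float, **labels: str) -> None:
        # Sort labels so the same label set always maps to the same key.
        key = tuple(sorted(labels.items()))
        self.samples[key] = float(value)

    def render(self) -> str:
        """Render the gauge in the Prometheus text exposition format.

        Simplification: label values are not escaped here.
        """
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} gauge",
        ]
        for key, value in sorted(self.samples.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            selector = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{self.name}{selector} {value}")
        return "\n".join(lines) + "\n"


# Hypothetical metric and label names, chosen only for illustration.
up = Gauge("ml_service_up", "1 if the service answered its health check, else 0")
up.set(1, service="recommender", env="prod")
up.set(0, service="ranker", env="prod")

exposition = up.render()
print(exposition)
```

The labels (`service`, `env`) are what make the later analysis possible: the same metric name can be filtered or aggregated per service or per environment.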
11. The Importance of Real-Time Monitoring
KPI monitoring
▪ Over time, an unmonitored model deviates from the desired results
▪ Wrong predictions can lead to degrading business value
[Diagram: unregulated lifecycle of wrong feedback, wrong user behavior, wrong model and wrong predictions]
[Chart: business value over time, monitored vs. unmonitored]
12. It is time for a question (again)…
13. What to monitor?
Operational
▪ System (soft- and hardware): latency; IO/memory/CPU usage; system uptime; disk utilization
Functional
▪ Model (predictions): accuracy; precision, recall, F1; AUC-ROC; RMSE
▪ Data (context, user behavior, predictions): Population Stability Index (PSI); Characteristic Stability Index (CSI); Kolmogorov–Smirnov test; Kullback–Leibler divergence
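Two of the data-level metrics named above can be computed with the standard library alone. The following is a minimal sketch, not a reference implementation: the equal-width binning for PSI, the `eps` floor for empty bins, and the sample data are all assumptions made for this example.

```python
import math
from bisect import bisect_right


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference sample

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    p_exp, p_act = proportions(expected), proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(p_exp, p_act))


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Invented data: a reference feature and the same feature shifted by 0.3.
reference = [i / 100 for i in range(100)]
shifted = [x + 0.3 for x in reference]

print("PSI, no drift:", psi(reference, reference))
print("PSI, shifted: ", psi(reference, shifted))
print("KS,  shifted: ", ks_statistic(reference, shifted))
```

Both values are zero when the two samples are identical and grow as the current data moves away from the reference, which is what makes them usable as drift signals on a dashboard.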
15. How to monitor?
Maximizing the Value of Your Monitoring Data: Insights and Analysis
[Diagram: ML system, structured / unstructured data, alarm]
Prometheus
▪ Open-source monitoring solution.
▪ Pull-based approach.
▪ Easy to configure, deploy and maintain.
▪ Container ready.
▪ Orchestration ready (dynamic config).
Grafana
▪ Open-source tool for time series analysis.
▪ Alert notification feature.
▪ Support for a wide variety of data sources.
▪ Support for advanced visualization and metrics.
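The pull-based approach listed above means that the monitored service only exposes its current metrics over HTTP, and the monitoring server fetches them on its own schedule. The sketch below imitates both sides with the Python standard library. It is an illustration under assumptions: the metric name, the `scrape` helper and the simplified line parser are invented for this example and are not part of Prometheus.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric exposed by the monitored ML service.
METRICS_TEXT = (
    "# HELP predictions_total Number of predictions served.\n"
    "# TYPE predictions_total counter\n"
    'predictions_total{model="demo"} 42\n'
)


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = METRICS_TEXT.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def scrape(url: str) -> dict:
    """Pull one scrape and return {series: value}, skipping comment lines.

    Simplification: assumes 'series value' lines without timestamps.
    """
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        text = response.read().decode("utf-8")
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            series, value = line.rsplit(" ", 1)
            samples[series] = float(value)
    return samples


# Port 0 asks the OS for any free port; the target only listens and waits.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

scraped = scrape(f"http://127.0.0.1:{server.server_port}/metrics")
print(scraped)

server.shutdown()
server.server_close()
```

The service never needs to know where the monitoring server lives. That is one reason the pull model fits containers and orchestrated environments, where targets come and go.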
16. Setting up an optimized Prometheus
[Figure source: Monitoring A Spring Boot Application, Part 2: Prometheus (Tom Gregory)]
Need for Prometheus optimization
▪ Influx of a huge amount of metrics
▪ Outages in visualization
▪ Scrape time in minutes
▪ Megabytes of data collected per scrape
17. Ways to optimize Prometheus
[Figure: Sample dashboard for MLOps using Grafana]
1 First, identify the metrics that are not used in visualization
2 Drop the identified metrics
3 Increase the scrape interval
4 Migrate to a cloud provider such as AWS
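The first three steps above can be sketched as follows. This is a rough illustration, not the setup used in the talk: the metric names, dashboard queries, job name, target and the 15s starting interval are all invented. Finding unused metrics by matching tokens in the queries is a simple heuristic that could also match label or function names. The keys in the generated dictionary (`scrape_interval`, `static_configs`, `metric_relabel_configs` with `source_labels`, `regex` and `action: drop`) follow the Prometheus scrape configuration. It is printed as JSON here only for inspection, and a real `prometheus.yml` would carry the same keys in YAML.

```python
import json
import re

# Hypothetical inputs: metric names a target exposes, and the PromQL
# expressions used by the dashboard panels.
scraped_metrics = {
    "http_requests_total",
    "prediction_latency_seconds",
    "jvm_gc_pause_seconds",
    "jvm_buffer_count",
}
dashboard_queries = [
    "rate(http_requests_total[5m])",
    "avg(prediction_latency_seconds)",
]


def unused_metrics(metrics, queries):
    """Step 1: metric names that appear in no dashboard query (token match)."""
    tokens = set()
    for query in queries:
        tokens.update(re.findall(r"[a-zA-Z_:][a-zA-Z0-9_:]*", query))
    return sorted(set(metrics) - tokens)


def drop_rule(names):
    """Step 2: a metric_relabel_configs entry that drops the given names."""
    return {
        "source_labels": ["__name__"],
        # Metric names contain only [a-zA-Z0-9_:], so no regex escaping is needed.
        "regex": "|".join(names),
        "action": "drop",
    }


def samples_per_day(series_count, scrape_interval_s):
    """Step 3: samples ingested per day for a given scrape interval."""
    return series_count * 86_400 // scrape_interval_s


to_drop = unused_metrics(scraped_metrics, dashboard_queries)
scrape_config = {
    "job_name": "ml-service",  # hypothetical job name
    "scrape_interval": "60s",  # raised from an assumed 15s
    "static_configs": [{"targets": ["localhost:8000"]}],  # hypothetical target
    "metric_relabel_configs": [drop_rule(to_drop)],
}
print(json.dumps(scrape_config, indent=2))
print(samples_per_day(1_000, 15), "->", samples_per_day(1_000, 60))
```

With these invented numbers, 1,000 series scraped every 15 seconds produce 5,760,000 samples per day, and raising the interval to 60 seconds cuts that to 1,440,000. Dropping unused series lowers the count further.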
18. Challenges
1 Catering to the needs of diverse stakeholders
2 Complexity
3 Handling government regulations
4 Need for Prometheus optimization
20. Summary
Key Takeaways
Overview of current and upcoming collaboration
1 General overview
❑ Overview of Monitoring
❑ Need for Model monitoring
❑ Metrics to monitor
2 Monitoring workflow
❑ Overview of sample workflow
❑ Introduction to Prometheus and Grafana
3 Optimization and challenges
❑ Need for Prometheus optimization
❑ Steps to optimize Prometheus
❑ Challenges with Machine Learning operations
4 Results
❑ Design & Implementation of a dashboard
❑ Final performance KPIs