Service quality monitoring system architecture

1
Service Quality Monitoring System Architecture
Author: Matsuo Sawahashi
Division: GTS Japan, Solutioning, Chief Architect
Mail: matsuos@jp.ibm.com

2
Self-introduction
Name: Matsuo Sawahashi
Company: IBM Japan
Division: Global Technology Services
Title: Executive Architect / Chief Architect
Current job:
• Connected Vehicle Project at my client
• Design multi-cloud networking architecture leveraging SD-WAN and Cloud-Exchanges
• Design connected-vehicle platform architecture on Azure based on Zero Trust Security concept
• Design service quality monitoring system based on SRE (Site Reliability Engineering) principle
• GTS Japan Technical Vitalization Community Leader
• Provide mentoring and round table session for junior engineers
• Provide leading-edge technical seminars
• JUAS (Japan System Users Association) part time instructor
Certifications
• TOGAF9 certification
• The Open Group Distinguished Architect
Publications
• OpenStack Deep Technique Guide

3
Executive Summary
• Modern cloud-based distributed systems have very complex configurations in which a single service spans many components
• Customer-facing applications are part of the product, so service quality targets that are directly linked to business indicators are needed
• Legacy monitoring based on traditional system management is neither linked to business indicators nor able to measure service quality
• Google advocates site reliability engineering (SRE) and has introduced practices for measuring quality of service
• Based on the SRE concept, the service quality monitoring system collects and analyzes logs from all components: not only application code but also every infrastructure component
• Because very large amounts of data must be processed in real time, the system must be designed carefully with reference to big data architectures
• With this system, you can measure the quality of service and continuously improve it

4
Problem Statement
• Legacy approach to service management
• Each component is monitored individually and independently
• Access logs, error logs, CPU / RAM usage, etc.
• Application, server, network and storage
• Monitoring indicators are not tied to business indicators
• What is the problem with the legacy approach?
• It is difficult to measure business service quality
• It is difficult to understand users’ frustrations directly
• How many users are frustrated by the response time?
• Which functions are rarely used?
• Which components are degrading performance?
• Approach
• We need to know what is going on in the whole system, including application, middleware, server, storage and network

5
Referenced Vision – SRE “Site Reliability Engineering”
• What is SRE?
• A methodology for system management and service operation
• Advocated and practiced by Google
• The goal is to keep improving site reliability
• What to do in SRE
• Define what business and IT alignment means in practice
• Define Service Level Indicators (SLIs) to measure service reliability
• Define a Service Level Objective (SLO) for each SLI
• Monitor everything: performance, availability and scalability
• Continuously improve based on the monitoring results
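The relationship between an SLI and its SLO can be illustrated with a minimal Python sketch. The `Request` type, the two indicator functions, the 100 ms latency threshold and the SLO targets are all hypothetical values chosen for illustration; real targets would be derived from business goals.

```python
from dataclasses import dataclass

@dataclass
class Request:
    status_code: int
    response_ms: float

def availability_sli(requests):
    """Fraction of requests that did not fail with a server error (5xx)."""
    good = sum(1 for r in requests if r.status_code < 500)
    return good / len(requests)

def latency_sli(requests, threshold_ms):
    """Fraction of requests answered within threshold_ms."""
    fast = sum(1 for r in requests if r.response_ms <= threshold_ms)
    return fast / len(requests)

# Hypothetical SLO targets, one per SLI.
SLO = {"availability": 0.99, "latency": 0.75}

sample = [Request(200, 21), Request(200, 340), Request(500, 15), Request(200, 48)]
sli = {
    "availability": availability_sli(sample),
    "latency": latency_sli(sample, threshold_ms=100),
}

# Compare each measured SLI against its objective.
for name, value in sli.items():
    status = "met" if value >= SLO[name] else "missed"
    print(f"{name}: SLI={value:.2f} SLO={SLO[name]:.2f} -> {status}")
```

In this sample the availability objective is missed while the latency objective is just met, which is the kind of result that would drive the continuous improvement step.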

6
Service Quality Monitoring System
• What is this?
• A system that collects and analyzes logs from all the components making up a system and presents statistics to evaluate whether the SLOs have been achieved
• How does it work?
• Captures all transaction logs related to each user’s interactions, across application components and infrastructure components
• Provides a dashboard with search and analysis functions
• Benefits
• Can monitor the operating status against business goals
• Can understand the user experience (UX) systematically
• Can immediately identify where a problem occurred
• Can explain the cause of a problem as soon as an inquiry arrives

7
Architectural Overview Diagram
[Diagram: logs flow from the monitored components through a five-stage pipeline to the users of the system]
Monitored components: User Device, Load balancer, Application Server, Database Server, Firewall (application components and infrastructure components, each with a Log Collector)
Pipeline stages:
• Log Collector: collecting logs
• Log Aggregator / Message Queuing: aggregating logs
• Real-time Streaming Processing: filtering, indexing, joining
• Store w/Search & Analyze: storing, indexing, searching and analyzing data
• Visualization / Dashboard: dashboard
Users: SRE Team, Operator, Management
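The stages of the overview can be mimicked with a small in-memory Python sketch. It is only a toy model: the `deque` stands in for the message queue, the list stands in for the searchable store, and the component names and log lines are invented for illustration.

```python
from collections import Counter, deque

def collect(component, lines):
    """Log Collector: tag each raw line with its source component."""
    return [{"component": component, "line": line} for line in lines]

queue = deque()  # stands in for the Log Aggregator / Message Queuing stage
store = []       # stands in for the Store w/Search & Analyze stage

def process(event):
    """Streaming stage: drop debug noise and split out the log level."""
    level, _, message = event["line"].partition(" ")
    if level == "DEBUG":
        return None
    return {"component": event["component"], "level": level, "message": message}

# Two components write logs; collectors push them onto the queue.
queue.extend(collect("load_balancer", ["INFO request accepted", "DEBUG probe"]))
queue.extend(collect("app_server", ["ERROR login failed", "INFO login ok"]))

# The streaming stage drains the queue and writes to the store.
while queue:
    record = process(queue.popleft())
    if record is not None:
        store.append(record)

# Dashboard: error count per component.
errors = Counter(r["component"] for r in store if r["level"] == "ERROR")
print(dict(errors))
```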

8
Big data architecture patterns: Lambda Architecture
[Diagram: Lambda architecture. Data Source → Queuing → hot path: Real-time processing in the Speed Layer (Real-time View); cold path: Batch Layer (Master Data) → Service Layer (Batch View); both views are queried by the Analytics Client]
• The Speed Layer (hot path) analyzes data in real time
• The Batch Layer (cold path) stores all incoming data in its raw form and performs batch processing on it
• The Service Layer indexes the batch view for efficient querying
• The Speed Layer updates the Service Layer with incremental updates based on the most recent data
• The Lambda architecture was first proposed by Nathan Marz, author of Storm in 2012
• To realize the service quality monitoring system, we need to handle huge volumes of log data produced by a variety of components
• Very large data sets require a long processing time to run the kinds of queries that clients need
• These queries need algorithms such as MapReduce that operate in parallel across the entire data set, and they cannot be performed in real time
• We want some results in real time, sometimes at the cost of some accuracy, so we combine batch results and real-time results using one of the architecture patterns presented here
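How a query combines the batch view with the real-time view can be sketched in a few lines of Python. This is a deliberately simplified, single-process model: the names `ingest`, `run_batch` and `query` are hypothetical, and a real deployment would run the batch job concurrently and expire speed-layer data more carefully than the plain reset used here.

```python
from collections import Counter

master_data = []           # Batch Layer: append-only raw events
batch_view = Counter()     # Service Layer: precomputed view, may be stale
realtime_view = Counter()  # Speed Layer: events seen since the last batch run

def ingest(event):
    """Every incoming event is sent down both the cold and the hot path."""
    master_data.append(event)
    realtime_view[event["component"]] += 1

def run_batch():
    """Recompute the batch view from all raw data, then reset the speed layer."""
    batch_view.clear()
    batch_view.update(e["component"] for e in master_data)
    realtime_view.clear()

def query(component):
    """Merge the batch result with the incremental real-time result."""
    return batch_view[component] + realtime_view[component]

ingest({"component": "app_server"})
ingest({"component": "app_server"})
run_batch()                          # batch view now covers the first two events
ingest({"component": "app_server"})  # only the speed layer has seen this one
print(query("app_server"))
```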

9
Big data architecture patterns: Kappa Architecture
[Diagram: Kappa architecture. Data Source → Queuing → Real-time processing in the Speed Layer (Real-time View) → Analytics Client]
• A drawback of the Lambda architecture is its complexity: processing logic appears in two different places (the cold and hot paths) using different frameworks
• The Kappa architecture uses a stream processing system, and all data flows through a single path
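The single-path idea can be sketched as follows. In this toy Python model the list `log` stands in for a replayable queue, and the two `count_errors_*` functions are hypothetical versions of the processing logic; when the logic changes, the same stream is simply replayed through the new function instead of maintaining a separate batch job.

```python
log = []  # append-only, replayable stream (stands in for the queuing layer)

def count_errors_v1(view, event):
    """Original logic: only 5xx responses count as errors."""
    if event["status"] >= 500:
        view[event["component"]] = view.get(event["component"], 0) + 1

def count_errors_v2(view, event):
    """Revised logic: 4xx responses count as errors too."""
    if event["status"] >= 400:
        view[event["component"]] = view.get(event["component"], 0) + 1

def build_view(process, from_offset=0):
    """Replay the stream through one processing function to build a view."""
    view = {}
    for event in log[from_offset:]:
        process(view, event)
    return view

log.extend([
    {"component": "app_server", "status": 200},
    {"component": "app_server", "status": 500},
    {"component": "load_balancer", "status": 404},
])

view_v1 = build_view(count_errors_v1)
view_v2 = build_view(count_errors_v2)  # reprocessing = replay from offset 0
print(view_v1, view_v2)
```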

10
Implementation Example
[Diagram: example products for each stage of the pipeline]
• Components (App / Infra): Application Code and Azure Infrastructure Components write out logs; labels shown on the source side include OS, Azure Blob, Azure Application Insight, Azure Monitor and Azure Hubs
• Log Collector (collecting logs): Filebeat, Logstash w/Azure plugin
• Log Aggregator & Message Queuing (aggregating logs): Kafka
• Real-time Streaming Processing (filtering, indexing, joining): Apache Storm OR Apache Spark Streaming
• Store w/Search & Analyze (storing, indexing, searching and analyzing data): Elasticsearch
• Visualization / Dashboard (dashboard): Kibana
• A batch processing loop is added if needed

11
Architectural Decision Example
Issue: Which architecture should be adopted for processing large volumes of log data in both real time and batch?
Decision: Kappa architecture
Status: Completed
Category: Platform
Assumptions: A real-time processing capability is required for viewing the latest service quality measures, and a large-scale batch processing capability is also required for viewing statistical data over long periods.
Options: 1. Lambda architecture; 2. Kappa architecture
Arguments (Rationale): Both architectures would support our requirements. However, the Lambda architecture has a complex structure and requires many servers, so running costs may increase. The Kappa architecture has a simple structure.
Risk: None
Implications: None
Notes: None

12
Architectural Decision Example
Issue: Which product should be used to realize the Log Aggregator & Message Queuing function?
Decision: Kafka
Status: Completed
Category: Platform
Assumptions: This function collects large volumes of event logs from a variety of different systems and data sources.
Options: 1. Kafka; 2. Redis
Arguments (Rationale): Redis is an in-memory store and would be much faster than the disk-based Kafka. However, Redis’s in-memory store is small and cannot hold large amounts of data for long periods of time. Kafka supports parallelism through log partitioning of data; Redis does not.
Risk: None
Implications: None
Notes: None
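The parallelism argument rests on partitioning a log by key. The Python sketch below imitates that idea without using Kafka itself: `NUM_PARTITIONS`, the transaction keys and the use of CRC32 are assumptions made for illustration (Kafka's own default partitioner uses a different hash function), but the property shown is the same: records with the same key always land in the same partition and keep their order there, while different partitions can be consumed in parallel.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count

def partition_for(key: str) -> int:
    """Stable hash of the key, modulo the number of partitions."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

partitions = {p: [] for p in range(NUM_PARTITIONS)}

records = [
    ("txn-1", "login start"),
    ("txn-2", "search"),
    ("txn-1", "login ok"),
]

# Append each record to the partition chosen by its key.
for key, value in records:
    partitions[partition_for(key)].append((key, value))

# All records of one transaction sit in one partition, in arrival order.
txn1 = [v for k, v in partitions[partition_for("txn-1")] if k == "txn-1"]
print(txn1)
```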

13
Architectural Decision Example
Issue: Which product should be used to realize the Real-time Streaming Processing function?
Decision: (pending)
Status: Under investigation
Category: Platform
Assumptions: This function processes streaming data in real time, for example adding indexes and performing calculations. Exactly-once capability is as important as speed, since this system must be able to analyze the cause and location of a problem promptly and reliably.
Options: 1. Storm; 2. Spark Streaming
Arguments (Rationale): Storm provides a true streaming model for stream processing via the core Storm layer. Spark Streaming acts as a wrapper over batch processing. Storm supports three message processing modes: at least once, at most once, and exactly once. Spark supports only one message processing mode, “at least once”.
Risk: None
Implications: None
Notes: None
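Why the processing mode matters for this system can be shown with a small Python sketch. It does not reproduce how Storm or Spark implement their guarantees; it only illustrates one common technique, under the assumption that every event carries a unique `id`: with at-least-once delivery an event may arrive twice, and de-duplicating by ID keeps the aggregated result equal to what exactly-once processing would give.

```python
def process_deduplicated(deliveries):
    """Sum response times per component, ignoring redelivered events."""
    seen_ids = set()
    totals = {}
    for event in deliveries:
        if event["id"] in seen_ids:
            continue  # duplicate caused by a retry; already counted
        seen_ids.add(event["id"])
        component = event["component"]
        totals[component] = totals.get(component, 0) + event["response_ms"]
    return totals

def process_naive(deliveries):
    """Same aggregation without de-duplication, for comparison."""
    totals = {}
    for event in deliveries:
        component = event["component"]
        totals[component] = totals.get(component, 0) + event["response_ms"]
    return totals

# Event e1 is delivered twice, as can happen under at-least-once delivery.
deliveries = [
    {"id": "e1", "component": "app_server", "response_ms": 21},
    {"id": "e2", "component": "app_server", "response_ms": 30},
    {"id": "e1", "component": "app_server", "response_ms": 21},
]

print(process_naive(deliveries), process_deduplicated(deliveries))
```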

14
Architectural Decision Example
Issue: Which product should be used to realize the Store w/Search & Analyze function?
Decision: Elasticsearch
Status: Completed
Category: Platform
Assumptions: This function stores logs and adds indexes for analysis.
Options: 1. Elasticsearch; 2. Splunk
Arguments (Rationale): Elasticsearch is an open source software product and would avoid vendor lock-in. Elasticsearch is free, but extended features require a paid subscription. Splunk is proprietary commercial software with a high price level. Elasticsearch supports many plugins. Elasticsearch has now overtaken Splunk in terms of Google search popularity.
Risk: None
Implications: None
Notes: None

15
Architectural Decision Example
Issue: Which product should be used to realize the Visualization with Dashboard function?
Decision: Kibana
Status: Completed
Category: Platform
Assumptions: This function displays analyzed log data and metrics and provides a dashboard.
Options: 1. Kibana; 2. Grafana
Arguments (Rationale): Grafana is designed for analyzing and visualizing metrics, and it does not allow full-text data querying. Kibana is the ‘K’ in the ELK Stack produced by Elasticsearch and is the most popular open source log analysis platform. Kibana supports not only metrics but also analysis of log messages. Grafana has built-in user control and authentication features, whereas Kibana requires either X-Pack, a commercial (not free) bundle of ELK add-ons for access control and authentication, or an added open source solution such as SearchGuard.
Risk: None
Implications: None
Notes: None

16
Use Case Example
Each use case lists: Trigger, Input, Outcome, TAT (turnaround time). The Remark column is empty for all use cases.

UC001
Trigger: Failure inquiries from end users (unavailable, freezing, different results, etc.)
Input: User ID; Time (optional); Screen ID (optional); Error Code (optional)
Outcome: Identification of where the failure (delay) occurred (application component or infrastructure component) and suggestion of a workaround and solution
TAT: Within 5 minutes

UC002
Trigger: Failure inquiries from monitoring operators (large numbers of alerts, unknown alerts, events obviously different from normal times, etc.)
Input: Time (optional); Alert Message (optional)
Outcome: Same as above
TAT: Within 5 minutes

UC003
Trigger: Inquiries from the system administrator (Is it working normally? Is there any problem? What is the capacity situation? What is the performance situation?)
Input: N/A
Outcome: Dashboard (number of users within 1 hour, error rate, delay rate, capacity upper limit and current usage rate for each component, delay rate within the most recent hour for each component, trend graph)
TAT: Within 5 minutes

UC004
Trigger: Monthly report
Input: N/A
Outcome: Transition graph of the number of users, error rate and delay rate for the current month, and the same information per component
TAT: Within 24 hours
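A lookup in the spirit of UC001 can be sketched in Python. The record fields (`user_id`, `transaction_id`, `component`, `status`, `response_ms`), the sample records and the 1,000 ms slowness threshold are assumptions made for this illustration; in the real system the search would run against the log store rather than an in-memory list.

```python
logs = [
    {"user_id": "U1", "transaction_id": "T1", "component": "Load_Balancer",
     "status": "Success", "response_ms": 3},
    {"user_id": "U1", "transaction_id": "T1", "component": "Login_Component",
     "status": "Success", "response_ms": 21},
    {"user_id": "U1", "transaction_id": "T1", "component": "Database_Server",
     "status": "Error", "response_ms": 5000},
    {"user_id": "U2", "transaction_id": "T2", "component": "Login_Component",
     "status": "Success", "response_ms": 18},
]

def locate_failure(records, user_id, slow_ms=1000):
    """Return the user's records that failed or were slow, slowest first."""
    suspects = [
        r for r in records
        if r["user_id"] == user_id
        and (r["status"] != "Success" or r["response_ms"] > slow_ms)
    ]
    return sorted(suspects, key=lambda r: r["response_ms"], reverse=True)

# An inquiry arrives from user U1: which component caused the problem?
for r in locate_failure(logs, "U1"):
    print(r["transaction_id"], r["component"], r["status"], r["response_ms"])
```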

17
Log Data Structure and Format Example
Data Architecture for gathering log data from application component

Log format
Item               Type       Sample
Transaction ID     Text       A3828OQZAG8367483
Current time       Datetime   2018-09-10T09:01:48Z
Service name       Text       Authorization_Service
Component name     Text       Login_Component
API name (URL)     Text       http://www.company.com/login_api
HTTP method        Text       GET
HTTP status code   Text       200
Request status     Text       Success
Response time      Text       21

[Diagram: A concept of “Service”, “Component” and “API” structure including logging points. Labels shown: Push Button1, Push Button2, Service 1, Service 2, Component A (API A-1, API A-2), Component B (API B-1), Component C (API C-1), Logging point]
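One way an application component could emit a record in this format is sketched below in Python. The field names mirror the items of the log format and the values are the slide's samples; the `LogRecord` class and the choice of one JSON object per line are assumptions, since the slide does not specify an encoding.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class LogRecord:
    transaction_id: str
    current_time: str       # ISO 8601 timestamp in UTC
    service_name: str
    component_name: str
    api_name: str
    http_method: str
    http_status_code: str   # typed as Text in the log format
    request_status: str
    response_time: str      # typed as Text in the log format

record = LogRecord(
    transaction_id="A3828OQZAG8367483",
    current_time="2018-09-10T09:01:48Z",
    service_name="Authorization_Service",
    component_name="Login_Component",
    api_name="http://www.company.com/login_api",
    http_method="GET",
    http_status_code="200",
    request_status="Success",
    response_time="21",
)

# Written at each logging point as a single line for the log collector.
line = json.dumps(asdict(record))
print(line)
```

Carrying the same transaction ID through every component is what allows the records of one user interaction to be joined later.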

18
Performance & Capacity Assumption Example
• Average number of components involved in a request
• For instance: 10
• Average number of API calls in each component
• For instance: 5
• Average log record length
• For instance: 2 KB
• Number of log records per request
• 10 x 5 = 50 records
• Log size per request
• 50 records x 2 KB = 100 KB
• Number of requests to the application (peak)
• For instance: 1,000 req/sec
• Number of writes to the log store (peak)
• 50 records x 1,000 req/sec = 50,000 records/sec
• Average number of requests over 24 hours
• For instance: 1,000,000 req/day
• Log size per day
• 100 KB x 1,000,000 req/day = 100,000,000 KB/day ≈ 95 GB/day (with 1 GB = 1,024 x 1,024 KB)
[Diagram: Sizing Model. 10 components x 5 API calls per component, each call writing a 2 KB log; 1,000 req/sec (peak) to the application becomes 50,000 req/sec to the log store; 1,000,000 req/day produces 95 GB/day of logs]
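The sizing arithmetic above can be reproduced with a short Python function, which makes it easy to try other assumptions. The function name `size_logs` and its parameter names are hypothetical; the input values are the "for instance" figures from the slide, and 1 GB is taken as 1,024 x 1,024 KB.

```python
def size_logs(components, api_calls, log_kb, peak_rps, requests_per_day):
    """Derive log volume from the per-request assumptions."""
    records_per_request = components * api_calls
    kb_per_request = records_per_request * log_kb
    return {
        "records_per_request": records_per_request,
        "kb_per_request": kb_per_request,
        "peak_records_per_sec": records_per_request * peak_rps,
        "gb_per_day": kb_per_request * requests_per_day / 1024 / 1024,
    }

result = size_logs(
    components=10,
    api_calls=5,
    log_kb=2,
    peak_rps=1_000,
    requests_per_day=1_000_000,
)

for name, value in result.items():
    print(f"{name}: {value:,.0f}")
```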
