1
Service Quality Monitoring System Architecture
Author: Matsuo Sawahashi
Division: GTS Japan, Solutioning, Chief Architect
Mail: matsuos@jp.ibm.com
2
Self-introduction
Name: Matsuo Sawahashi
Company: IBM Japan
Division: Global Technology Services
Title: Executive Architect / Chief Architect
Current job:
• Connected-vehicle project for a client
• Designing a multi-cloud networking architecture leveraging SD-WAN and cloud exchanges
• Designing a connected-vehicle platform architecture on Azure based on the Zero Trust security concept
• Designing a service quality monitoring system based on SRE (Site Reliability Engineering) principles
• GTS Japan Technical Vitalization Community leader
• Providing mentoring and round-table sessions for junior engineers
• Providing leading-edge technical seminars
• JUAS (Japan System Users Association) part-time instructor
Certifications
• TOGAF9 certification
• The Open Group Distinguished Architect
Publications
• OpenStack Deep Technique Guide
3
Executive Summary
• Modern cloud-based distributed systems are highly complex, with each service spanning many components
• Customer-facing applications are part of the product, so service quality targets must be linked directly to business indicators
• Legacy monitoring based on traditional systems management is neither linked to business indicators nor able to measure service quality
• Google advocates site reliability engineering (SRE) and has introduced practices for measuring quality of service
• Based on the SRE concept, the service quality monitoring system collects and analyzes logs from all components: not only application code but also every infrastructure component
• Because very large amounts of data must be processed in real time, the system must be designed carefully with reference to big data architectures
• With this system, you can measure quality of service and continuously improve it
4
Problem Statement
• Legacy approach to service management
  • Each component is monitored individually and independently
    • Access logs, error logs, CPU / RAM usage, etc.
    • Application, server, network and storage
  • Monitoring indicators are not tied to business indicators
• What is the problem with the legacy approach?
  • It is difficult to measure business service quality
  • It is difficult to understand users’ frustrations directly
    • How many users are frustrated by the response time?
    • Which functions are rarely used?
    • Which components are degrading performance?
• Approach
  • We need to know what is going on across the whole system, including application, middleware, server, storage and network
5
Referenced Vision – SRE “Site Reliability Engineering”
• What is SRE?
  • A methodology for system management and service operation
  • Advocated and practiced by Google
  • The goal is to keep improving site reliability
• What to do in SRE
  • Define what business and IT alignment means in practice
  • Define Service Level Indicators (SLIs) to measure service reliability
  • Define a Service Level Objective (SLO) for each SLI
  • Monitor everything: performance, availability and scalability
  • Continuously improve based on the monitoring results
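As a minimal sketch of the SLI and SLO definitions above, the following Python computes two illustrative SLIs from request records and compares each with a target. The record fields, the 300 ms latency threshold and the target values are assumptions chosen for illustration, not values from this deck.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status_code: int         # HTTP status code of the response
    response_time_ms: float  # measured response time (assumed unit: ms)


def availability_sli(requests):
    """Fraction of requests that did not end in a server error (5xx)."""
    good = sum(1 for r in requests if r.status_code < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms=300.0):
    """Fraction of requests answered within the (assumed) threshold."""
    fast = sum(1 for r in requests if r.response_time_ms <= threshold_ms)
    return fast / len(requests)


def meets_slo(sli_value, slo_target):
    """An SLO is met when the measured SLI reaches the target."""
    return sli_value >= slo_target


sample = [
    Request(200, 21), Request(200, 120), Request(500, 40), Request(200, 900),
]
availability = availability_sli(sample)  # 3 of 4 requests are good
latency = latency_sli(sample)            # 3 of 4 requests are fast enough
```

Each SLI here is a ratio of good events to all events, so an SLO can be written as a single target number per SLI, for example `meets_slo(availability, 0.999)`.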
6
Service Quality Monitoring System
• What is this?
  • A system that collects and analyzes logs from all the components that make up a system, and presents statistics to evaluate whether the SLOs have been achieved
• How does it work?
  • Captures all transaction logs related to each user’s interactions, across both application components and infrastructure components
  • Provides a dashboard with search and analysis functions
• Benefits
  • Can monitor operating status against business goals
  • Can understand the user experience (UX) systematically
  • Can immediately identify where a problem occurred
  • Can explain the cause of a problem as soon as an inquiry comes in
7
Architectural Overview Diagram
[Diagram: logs flow from the components through the stages below]
• Sources: application components and infrastructure components (user device, load balancer, application server, database server, firewall), each with a Log Collector
• Log Collector: collecting logs
• Log Aggregator / Message Queuing: aggregating logs
• Real-time Streaming Processing: filtering, indexing, joining
• Store w/ Search & Analyze: storing, indexing, searching and analyzing data
• Visualization / Dashboard: dashboard used by the SRE team, operators and management
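The filtering, indexing and joining steps named for the real-time streaming processing stage can be sketched in plain Python as follows. This is only an illustration under assumptions: the record fields (`transaction_id`, `time`, `component`, `response_ms`) and the idea that every component logs a shared transaction ID are hypothetical, and a real deployment would run this logic inside a stream processor rather than over an in-memory list.

```python
from collections import defaultdict


def filter_records(records):
    """Filtering: drop records that carry no transaction ID."""
    return [r for r in records if r.get("transaction_id")]


def index_by_transaction(records):
    """Indexing: group records by transaction ID for fast lookup."""
    index = defaultdict(list)
    for r in records:
        index[r["transaction_id"]].append(r)
    return index


def join_transactions(index):
    """Joining: merge the records of one transaction into one summary."""
    joined = {}
    for tx_id, recs in index.items():
        ordered = sorted(recs, key=lambda r: r["time"])  # ISO 8601 strings sort by time
        joined[tx_id] = {
            "components": [r["component"] for r in ordered],
            "total_ms": sum(r["response_ms"] for r in ordered),
        }
    return joined


raw = [
    {"transaction_id": "T1", "time": "2018-09-10T09:01:49Z",
     "component": "application_server", "response_ms": 21},
    {"transaction_id": "T1", "time": "2018-09-10T09:01:48Z",
     "component": "load_balancer", "response_ms": 2},
    {"transaction_id": None, "time": "2018-09-10T09:01:50Z",
     "component": "firewall", "response_ms": 1},
    {"transaction_id": "T2", "time": "2018-09-10T09:01:51Z",
     "component": "database_server", "response_ms": 7},
]
result = join_transactions(index_by_transaction(filter_records(raw)))
```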
8
Big data architecture patterns
Lambda Architecture
• To realize the service quality monitoring system, we need to handle huge volumes of log data produced by a variety of components
• Very large data sets need a long processing time to run the kinds of queries that clients need
• These queries need algorithms such as MapReduce that operate in parallel across the entire data set, and they cannot be performed in real time
• We want some results in real time, sometimes at some loss of accuracy, and then combine the batch results with the real-time results using the architecture patterns below
[Diagram: Lambda architecture. A data source feeds a hot path (Queuing, real-time processing, Speed Layer with Real-time View) and a cold path (Batch Layer with Master Data, Serving Layer with Batch View); the analytics client reads the views]
• The Speed Layer (hot path) analyzes data in real time
• The Batch Layer (cold path) stores all incoming data in its raw form and performs batch processing on it
• The Serving Layer indexes the batch view for efficient querying
• The Speed Layer updates the serving layer with incremental updates based on the most recent data
• The Lambda architecture was first proposed by Nathan Marz, the author of Storm, in 2012
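A minimal sketch of how a client query could combine a batch view with a real-time view in the Lambda pattern is shown below. The error-count metric, the record fields and the function names are hypothetical, and the sketch assumes the real-time view only covers records that the last batch run has not yet processed.

```python
from collections import Counter


def build_batch_view(master_data):
    """Batch layer: recompute error counts per component from all raw records."""
    return Counter(r["component"] for r in master_data if r["is_error"])


def update_realtime_view(view, record):
    """Speed layer: update the view incrementally for each new record."""
    if record["is_error"]:
        view[record["component"]] += 1


def query(batch_view, realtime_view, component):
    """Serving: answer a query by combining both views."""
    return batch_view[component] + realtime_view[component]


master_data = [
    {"component": "Login_Component", "is_error": True},
    {"component": "Login_Component", "is_error": True},
    {"component": "Login_Component", "is_error": False},
]
batch_view = build_batch_view(master_data)

realtime_view = Counter()
for rec in [
    {"component": "Login_Component", "is_error": True},
    {"component": "Payment_Component", "is_error": True},
]:
    update_realtime_view(realtime_view, rec)

login_errors = query(batch_view, realtime_view, "Login_Component")
```

Note that the counting logic appears twice, once per layer, which is the duplication that the Lambda pattern is often criticized for.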
9
Big data architecture patterns
Kappa Architecture
• A drawback of the Lambda architecture is its complexity: processing logic appears in two different places, the cold and hot paths, using different frameworks
• The Kappa architecture uses a stream processing system, and all data flows through a single path
[Diagram: Kappa architecture. Data Source, Queuing, real-time processing, Speed Layer with Real-time View, Analytics Client]
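The single-path idea can be sketched as follows: every event is appended to one log, and each view is derived by running a processing function over that log, so a changed or new view is produced by replaying the log rather than by a separate batch framework. The `EventLog` class is a hypothetical in-memory stand-in for the queue, and the event fields and the 300 ms threshold are assumptions.

```python
class EventLog:
    """Append-only log standing in for the message queue (hypothetical)."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the appended event

    def read_from(self, offset=0):
        return iter(self._events[offset:])


def build_view(log, process, offset=0):
    """Derive a view by streaming events from the log through one function."""
    view = {}
    for event in log.read_from(offset):
        process(view, event)
    return view


def count_errors(view, event):
    if event["status"] >= 500:
        view[event["component"]] = view.get(event["component"], 0) + 1


def count_slow(view, event, threshold_ms=300):
    if event["response_ms"] > threshold_ms:
        view[event["component"]] = view.get(event["component"], 0) + 1


log = EventLog()
log.append({"component": "A", "status": 500, "response_ms": 20})
log.append({"component": "B", "status": 200, "response_ms": 950})
log.append({"component": "A", "status": 200, "response_ms": 30})

error_view = build_view(log, count_errors)
slow_view = build_view(log, count_slow)  # a new view built by replaying the same log
```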
10
Implementation Example
[Diagram: example products for each function]
• Components (App / Infra): application code and Azure infrastructure components write out logs
• Log Collector (collecting logs): Filebeat, Logstash w/ Azure plugin; the diagram also shows OS, Azure Blob, Azure Application Insight, Azure Monitor and Azure Hubs on the collection side
• Log Aggregator & Message Queuing (aggregating logs): Kafka
• Real-time Streaming Processing (filtering, indexing, joining): Apache Storm or Apache Spark Streaming
• Store w/ Search & Analyze (storing, indexing, searching and analyzing data): Elasticsearch
• Visualization / Dashboard (dashboard): Kibana
• Batch processing loop if needed
11
Architectural Decision Example
Issue: Which architecture should be adopted for processing large volumes of log data in both real time and batch?
Decision: Kappa architecture
Status: Completed
Category: Platform
Assumptions: A real-time processing feature is required for viewing the latest service quality measures, and a large-scale batch processing feature is also required for viewing statistical data over long periods.
Options: 1. Lambda architecture 2. Kappa architecture
Arguments (Rationale): Both architectures would support our requirements. However, the Lambda architecture has a complex structure and requires many servers, so running costs may increase. The Kappa architecture has a simple structure.
Risk: None
Implications: None
Notes: None
12
Architectural Decision Example
Issue: Which product should be used to realize the Log Aggregator & Message Queuing function?
Decision: Kafka
Status: Completed
Category: Platform
Assumptions: This function collects large volumes of event logs from a variety of systems and data sources.
Options: 1. Kafka 2. Redis
Arguments (Rationale): Redis is an in-memory store and would be much faster than disk-based Kafka. However, Redis’s in-memory store is small and cannot hold large amounts of data for long periods of time. Kafka supports parallelism through log partitioning of data; Redis does not.
Risk: None
Implications: None
Notes: None
13
Architectural Decision Example
Issue: Which product should be used to realize the Real-time Streaming Processing function?
Decision: Not yet decided
Status: Under investigation
Category: Platform
Assumptions: This function processes streaming data in real time, for example adding indexes and performing calculations. Both speed and exactly-once capability are important, since this system must be able to analyze the cause and location of a problem promptly and reliably.
Options: 1. Storm 2. Spark Streaming
Arguments (Rationale): Storm provides a true streaming model for stream processing via the core Storm layer. Spark Streaming acts as a wrapper over batch processing. Storm supports three message processing modes: at least once, at most once, and exactly once. Spark supports only one message processing mode, “at least once”.
Risk: None
Implications: None
Notes: None
14
Architectural Decision Example
Issue: Which product should be used to realize the Store w/ Search & Analyze function?
Decision: Elasticsearch
Status: Completed
Category: Platform
Assumptions: This function stores logs and adds indexes for analysis.
Options: 1. Elasticsearch 2. Splunk
Arguments (Rationale): Elasticsearch is an open source software product and would avoid vendor lock-in. Elasticsearch is free, but extended features require a paid subscription. Splunk is proprietary commercial software with a high price level. Elasticsearch supports many plugins. Elasticsearch has now overtaken Splunk in terms of Google search popularity.
Risk: None
Implications: None
Notes: None
15
Architectural Decision Example
Issue: Which product should be used to realize the Visualization with Dashboard function?
Decision: Kibana
Status: Completed
Category: Platform
Assumptions: This function displays analyzed log data and metrics, and provides a dashboard.
Options: 1. Kibana 2. Grafana
Arguments (Rationale): Grafana is designed for analyzing and visualizing metrics, and it does not allow full-text data querying. Kibana is the ‘K’ in the ELK Stack from Elasticsearch, the most popular open source log analysis platform. Kibana supports not only metrics but also the analysis of log messages. Grafana has built-in user control and authentication features, whereas Kibana requires either X-Pack, a commercial (not free) bundle of ELK add-ons, or an open source solution such as SearchGuard for access control and authentication.
Risk: None
Implications: None
Notes: None
16
Use Case Example
Fields: #, Trigger, Input, Outcome, TAT, Remark (no remarks are given)

UC001
Trigger: Failure inquiries from end users (unavailable, hardening, different results, etc.)
Input: User ID; Time (optional); Screen ID (optional); Error Code (optional)
Outcome: Identification of where the failure (delay) occurred (application component or infrastructure component), and suggestion of a workaround and solution
TAT: Within 5 minutes

UC002
Trigger: Failure inquiries from monitoring operators (large number of alerts, unknown alerts, events obviously different from normal times, etc.)
Input: Time (optional); Alert Message (optional)
Outcome: Same as above
TAT: Within 5 minutes

UC003
Trigger: Inquiries from the system administrator (Is it working normally? Is there any problem? What is the capacity situation? What is the performance situation?)
Input: N/A
Outcome: Dashboard (number of users within the last hour, error rate, delay rate, capacity upper limit and current usage rate for each component, delay rate within the most recent hour for each component, trend graph)
TAT: Within 5 minutes

UC004
Trigger: Monthly report
Input: N/A
Outcome: Transition graph of the number of users, error rate and delay rate for the current month, and the same information per component
TAT: Within 24 hours
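As a rough sketch of UC001, the function below takes a user ID and returns the component calls that failed or were delayed, which is the "where did it occur" part of the outcome. Everything here is an assumption for illustration: the record fields (including a `user_id` on each log record), the 1,000 ms delay threshold, and the in-memory list standing in for a query against the log store.

```python
def locate_failures(records, user_id, delay_threshold_ms=1000):
    """Return (transaction_id, component, reason) for a user's bad calls."""
    findings = []
    for r in records:
        if r["user_id"] != user_id:
            continue
        if r["status_code"] >= 500:
            findings.append((r["transaction_id"], r["component"], "error"))
        elif r["response_ms"] > delay_threshold_ms:
            findings.append((r["transaction_id"], r["component"], "delay"))
    return findings


records = [
    {"user_id": "U1", "transaction_id": "T1", "component": "Login_Component",
     "status_code": 200, "response_ms": 21},
    {"user_id": "U1", "transaction_id": "T1", "component": "Database_Server",
     "status_code": 200, "response_ms": 4200},
    {"user_id": "U2", "transaction_id": "T2", "component": "Login_Component",
     "status_code": 500, "response_ms": 15},
]
u1_findings = locate_failures(records, "U1")
```

The optional inputs of UC001 (time, screen ID, error code) would become additional filters on the same lookup.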
17
Log Data Structure and Format Example
Data architecture for gathering log data from application components
• Concept of the “Service”, “Component” and “API” structure, including logging points
[Diagram: Push Button1, Push Button2; Service 1, Service 2; Component A (API A-1, API A-2), Component B (API B-1), Component C (API C-1); logging points]
• Log format
Item                Type        Sample
Transaction ID      Text        A3828OQZAG8367483
Current time        Datetime    2018-09-10T09:01:48Z
Service name        Text        Authorization_Service
Component name      Text        Login_Component
API name (URL)      Text        http://www.company.com/login_api
HTTP method         Text        GET
HTTP status code    Text        200
Request status      Text        Success
Response time       Text        21
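A logging point that emits one record in the format above could look like the sketch below. The key names, the JSON serialization and the "Failure" value for non-successful requests are assumptions (the slide only shows "Success"); the sample values are the ones from the table.

```python
import json
from datetime import datetime, timezone


def make_log_record(transaction_id, service, component, api_url,
                    method, status_code, response_time, now=None):
    """Build one log record with the items listed in the log format."""
    now = now or datetime.now(timezone.utc)
    return {
        "transaction_id": transaction_id,
        "current_time": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "service_name": service,
        "component_name": component,
        "api_name": api_url,
        "http_method": method,
        "http_status_code": str(status_code),  # typed as Text in the format
        # "Failure" is an assumed label; the slide only shows "Success".
        "request_status": "Success" if status_code < 400 else "Failure",
        "response_time": str(response_time),   # typed as Text in the format
    }


record = make_log_record(
    "A3828OQZAG8367483", "Authorization_Service", "Login_Component",
    "http://www.company.com/login_api", "GET", 200, 21,
    now=datetime(2018, 9, 10, 9, 1, 48, tzinfo=timezone.utc),
)
line = json.dumps(record)  # one line per record, ready for a log collector
```

Passing the same transaction ID to every component involved in a request is what later allows the records to be joined per transaction.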
18
Performance & Capacity Assumption Example
• Average number of components involved in a request
  • For instance: 10
• Average number of API calls in each component
  • For instance: 5
• Average log length
  • For instance: 2 KB
• Number of logs per request
  • 10 x 5 = 50 records
• Log size per request
  • 50 records x 2 KB = 100 KB
• Number of accesses to the application (peak)
  • For instance: 1,000 req/sec
• Number of accesses to the log store
  • 50 records x 1,000 req/sec = 50,000 req/sec
• Average number of accesses per 24 hours
  • For instance: 1,000,000 req/day
• Log size per day
  • 100 KB x 1,000,000 req/day = 100,000,000 KB/day ≈ 95 GB/day
[Diagram: Sizing Model. 1,000 req/sec at peak (1,000,000 req/day), 10 components x 5 API calls per component, log size 2 KB, giving 50,000 req/sec to the log store and 95 GB/day]
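The sizing arithmetic above can be reproduced with a few lines of Python, which makes it easy to rerun with different assumptions. The function and parameter names are illustrative; the default values are the "for instance" figures from the slide, and 1 GB is taken as 1,024 x 1,024 KB, which is what yields the 95 GB/day figure.

```python
def sizing(components_per_request=10, api_calls_per_component=5,
           log_size_kb=2, peak_req_per_sec=1_000, req_per_day=1_000_000):
    """Derive log volume figures from the sizing assumptions."""
    logs_per_request = components_per_request * api_calls_per_component
    kb_per_request = logs_per_request * log_size_kb
    kb_per_day = kb_per_request * req_per_day
    return {
        "logs_per_request": logs_per_request,
        "kb_per_request": kb_per_request,
        "log_store_req_per_sec": logs_per_request * peak_req_per_sec,
        "gb_per_day": kb_per_day / (1024 * 1024),
    }


result = sizing()
```

For example, `sizing(log_size_kb=4)` doubles both the per-request log size and the daily volume while leaving the request rate on the log store unchanged.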
19
Dashboard Example

Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 

Recently uploaded (20)

Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 

Service quality monitoring system architecture

  • 1. Service Quality Monitoring System Architecture
      Author: Matsuo Sawahashi
      Division: GTS Japan, Solutioning, Chief Architect
      Mail: matsuos@jp.ibm.com
  • 2. Self-introduction
      Name: Matsuo Sawahashi
      Company: IBM Japan
      Division: Global Technology Services
      Title: Executive Architect / Chief Architect
      Current job:
      • Connected Vehicle Project at my client
        • Design multi-cloud networking architecture leveraging SD-WAN and Cloud-Exchanges
        • Design connected-vehicle platform architecture on Azure based on Zero Trust Security concept
        • Design service quality monitoring system based on SRE (Site Reliability Engineering) principle
      • GTS Japan Technical Vitalization Community Leader
        • Provide mentoring and round table session for junior engineers
        • Provide leading-edge technical seminars
      • JUAS (Japan System Users Association) part-time instructor
      Certifications:
      • TOGAF9 certification
      • The Open Group Distinguished Architect
      Publications:
      • OpenStack Deep Technique Guide
  • 3. Executive Summary
      • Modern cloud-based distributed systems are very complicated configurations that span many components
      • Customer-facing applications are part of the product, so service quality targets linked directly to business indicators are needed
      • Legacy monitoring based on traditional system management is neither linked to business indicators nor able to measure service quality
      • Google advocates site reliability engineering (SRE) and has introduced practices for measuring quality of service
      • Based on the SRE concept, the service quality monitoring system collects and analyzes logs from all components: not only application code but also the whole infrastructure
      • Because very large amounts of data must be processed in real time, the system must be designed carefully with reference to big data architectures
      • Using this system makes it possible to measure quality of service and to improve it continuously
  • 4. Problem Statement
      • Legacy approach in service management
        • Each component is monitored individually and independently
          • Access logs, error logs, CPU / RAM usage, etc.
          • Application, server, network and storage
        • Monitoring indicators are not tied to business indicators
      • What is the problem with the legacy approach?
        • It is difficult to measure business service quality
        • It is difficult to understand the user's frustrations directly
          • How many users are frustrated by response time?
          • Which functions are rarely used?
          • Which components are making performance worse?
      • Approach
        • We need to know what is going on in the whole system, including application, middleware, server, storage and network
  • 5. Referenced Vision – SRE "Site Reliability Engineering"
      • What is SRE?
        • A methodology of system management and service operation
        • Advocated and practiced by Google
        • The goal is to keep improving site reliability
      • What to do in SRE
        • Define what business and IT alignment means in practice
        • Define a Service Level Indicator (SLI) to measure service reliability
        • Define a Service Level Objective (SLO) for each SLI
        • Monitor everything: performance, availability and scalability
        • Perform continuous improvement based on the results of monitoring
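The relationship between an SLI and its SLO can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the two indicators (availability and latency), the 300 ms threshold, the target ratios and the sample data are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A target for one SLI, e.g. 'at least 99% of requests are good'."""
    name: str
    target_ratio: float  # hypothetical target between 0 and 1


def availability_sli(status_codes):
    """Share of requests that did not end in a server error (HTTP 5xx)."""
    if not status_codes:
        return 1.0  # assumption: no traffic counts as fully available
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)


def latency_sli(response_times_ms, threshold_ms):
    """Share of requests answered within threshold_ms."""
    if not response_times_ms:
        return 1.0
    fast = sum(1 for t in response_times_ms if t <= threshold_ms)
    return fast / len(response_times_ms)


def meets(slo, sli_value):
    """An SLO is met when the measured SLI reaches the target ratio."""
    return sli_value >= slo.target_ratio


# Hypothetical measurements for ten requests.
codes = [200, 200, 500, 200, 404, 200, 200, 200, 200, 200]
times = [21, 35, 900, 40, 18, 22, 310, 25, 30, 27]

availability = availability_sli(codes)
latency = latency_sli(times, threshold_ms=300)
print("availability", availability, meets(Slo("availability", 0.99), availability))
print("latency", latency, meets(Slo("latency<=300ms", 0.80), latency))
```

Each SLI is a ratio of good events to all events, so one comparison against the target is enough to decide whether the SLO has been achieved.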
  • 6. Service Quality Monitoring System
      • What is this?
        • A system that collects and analyzes logs from all the components that make up a system, and shows statistics for evaluating whether each SLO has been achieved
      • How does it work?
        • Captures all transaction logs related to a user's interactions, across application components and infrastructure components
        • Provides a dashboard including search and analysis functions
      • Benefit
        • Can monitor the operating status against business goals
        • Can understand the user's experience (UX) systematically
        • Can identify immediately where a problem occurred
        • Can answer what caused a problem as soon as an inquiry comes in
  • 7. Architectural Overview Diagram
      [Diagram: log pipeline from the monitored components to the dashboard]
      • Log sources, each with a Log Collector: application components and infrastructure components (User Device, Load balancer, Application Server, Database Server, Firewall)
      • Log Collector: collecting logs
      • Log Aggregator / Message Queuing: aggregating logs
      • Real-time Streaming Processing: filtering, indexing, joining
      • Store w/Search & Analyze: storing, indexing, searching and analyzing data
      • Visualization / Dashboard: dashboard
      • Dashboard users: SRE Team, Operator, Management
  • 8. Big data architecture patterns: Lambda Architecture
      • To realize the service quality monitoring system, we need to handle the huge volume of log data produced by a variety of components
      • Very large data sets require a long processing time to run the sort of queries that clients need
      • These queries need algorithms such as MapReduce that operate in parallel across the entire data set, and they cannot be performed in real time
      • We want some results in real time, sometimes at the cost of accuracy, so we combine the batch result and the real-time result using the architecture patterns below
      [Diagram: Lambda architecture. Data Source → Queuing → Hot path: Real-time processing, Speed Layer, Real-time View; Cold path: Batch Layer (Master Data), Serving Layer (Batch View) → Analytics Client]
      • The Speed Layer (hot path) analyzes data in real time
      • The Batch Layer (cold path) stores all incoming data in its raw form and performs batch processing on the data
      • The Serving Layer indexes the batch view for efficient querying
      • The Speed Layer updates the Serving Layer with incremental updates based on the most recent data
      • The Lambda architecture was first proposed by Nathan Marz, author of Storm in 2012
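How the batch result and the real-time result are combined in a Lambda architecture can be sketched as follows. This is a toy, in-memory illustration under stated assumptions: the record fields (`component`, `status`), the function names and the "error count per component" query are hypothetical, and a real deployment would run the two paths on separate frameworks.

```python
from collections import Counter


def batch_view(master_data):
    """Cold path: recompute error counts per component from all stored raw logs."""
    return Counter(r["component"] for r in master_data if r["status"] >= 500)


def realtime_view(recent_events):
    """Hot path: count errors in events the last batch run has not seen yet."""
    return Counter(r["component"] for r in recent_events if r["status"] >= 500)


def serve(batch, realtime):
    """Serving layer: answer queries from the batch view plus the incremental view."""
    return batch + realtime


# Hypothetical data: master data already processed in batch, plus one fresh event.
master = [
    {"component": "Login", "status": 500},
    {"component": "Login", "status": 200},
    {"component": "Search", "status": 503},
]
recent = [{"component": "Login", "status": 500}]

merged = serve(batch_view(master), realtime_view(recent))
print(dict(merged))
```

Note that the same filtering logic appears in both `batch_view` and `realtime_view`; keeping the two copies consistent is the maintenance cost that the Kappa pattern tries to remove.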
  • 9. Big data architecture patterns: Kappa Architecture
      [Diagram: Kappa architecture. Data Source → Queuing → Real-time processing, Speed Layer, Real-time View → Analytics Client]
      • A drawback of the Lambda architecture is its complexity: processing logic appears in two different places, the cold and hot paths, using different frameworks
      • The Kappa architecture uses a stream processing system, and all data flows through a single path
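The single-path idea can be sketched in Python as one processing function that serves both live events and a replay of the retained event log. This is an illustrative sketch only; the event fields and the function names `update` and `run` are assumptions, and an in-memory list stands in for the message queue.

```python
from collections import Counter


def update(view, event):
    """The only processing logic: fold one log event into the view."""
    if event["status"] >= 500:
        view[event["component"]] += 1
    return view


def run(stream):
    """Build a view by passing every event of a stream through update()."""
    view = Counter()
    for event in stream:
        update(view, event)
    return view


# Hypothetical retained log, standing in for the queue's stored events.
retained_log = [
    {"component": "Login", "status": 500},
    {"component": "Search", "status": 200},
    {"component": "Login", "status": 503},
]

# Live processing: events arrive one at a time.
live_view = Counter()
for event in retained_log:
    update(live_view, event)

# Reprocessing: replay the whole log through the same code path.
replayed_view = run(retained_log)
print(live_view == replayed_view)
```

Because there is no separate batch code, recomputing a view after a logic change means replaying the log through the updated `update` function.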
  • 10. Implementation Example
      [Diagram: the pipeline from the architectural overview mapped to products]
      • Components (App / Infra), which write out logs. Sources shown: Application Code, OS, Azure Infrastructure Components, Azure Blob, Azure Application Insights, Azure Monitor, Azure Hubs
      • Log Collector (collecting logs): Filebeat, Logstash w/Azure plugin
      • Log Aggregator & Message Queuing (aggregating logs): Kafka
      • Real-time Streaming Processing (filtering, indexing, joining): Apache Storm or Apache Spark Streaming
      • Store w/Search & Analyze (storing, indexing, searching and analyzing data): Elasticsearch
      • Visualization / Dashboard: Kibana
      • Batch processing loop if needed
  • 11. Architectural Decision Example
      Issue: Which architecture should be adopted for processing large log data in real time and in batch
      Decision: Kappa architecture
      Status: Completed
      Category: Platform
      Assumptions: A real-time processing feature is required for viewing the latest service quality measures, and a large batch processing feature is also required for viewing statistical data over a long period.
      Options: 1. Lambda architecture 2. Kappa architecture
      Arguments (Rationale): Both architectures would support our requirements. However, the Lambda architecture has a complex structure and requires many servers, so running costs may increase. The Kappa architecture has a simple structure.
      Risk: None
      Implications: None
      Notes: None
  • 12. Architectural Decision Example
      Issue: Which product should be used to realize the Log Aggregator & Message Queuing function
      Decision: Kafka
      Status: Completed
      Category: Platform
      Assumptions: This function collects large volumes of event logs from a variety of different systems and data sources.
      Options: 1. Kafka 2. Redis
      Arguments (Rationale): Redis is an in-memory store and would be much faster than the disk-based Kafka. However, Redis's in-memory store is small and cannot hold large amounts of data for long periods of time. Kafka supports parallelism through log partitioning of data. Redis does not have parallelism.
      Risk: None
      Implications: None
      Notes: None
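The parallelism argument rests on partitioning: records are spread over partitions by key, so several consumers can work at once while records with the same key stay together and in order. The sketch below illustrates that idea in plain Python. It is not Kafka's actual partitioner; the CRC32 hash, the choice of the transaction ID as key, and the sample records are assumptions for illustration.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a deterministic hash (illustrative only)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def distribute(records, num_partitions):
    """Append each record to the partition chosen by its transaction ID."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        index = partition_for(record["transaction_id"], num_partitions)
        partitions[index].append(record)
    return partitions


# Hypothetical log records; two of them belong to the same transaction.
records = [
    {"transaction_id": "A3828OQZAG8367483", "component": "Login_Component"},
    {"transaction_id": "A3828OQZAG8367483", "component": "Token_Component"},
    {"transaction_id": "B0000000000000001", "component": "Login_Component"},
]

partitions = distribute(records, num_partitions=4)
for index, partition in enumerate(partitions):
    print(index, [r["component"] for r in partition])
```

Keying by transaction ID keeps all records of one user transaction in one partition, which makes it easier for a downstream stream processor to join them.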
  • 13. Architectural Decision Example
      Issue: Which product should be used to realize the Real-time Streaming Processing function
      Decision: (not yet decided)
      Status: Under investigation
      Category: Platform
      Assumptions: This function processes streaming data in real time, for example adding indexes and performing calculations. Not only speed but also exactly-once capability is important, since this system must be able to analyze the cause and location of a problem promptly and reliably.
      Options: 1. Storm 2. Spark Streaming
      Arguments (Rationale): Storm provides a true streaming model for stream processing via the core Storm layer. Spark Streaming acts as a wrapper over batch processing. Storm supports three message processing modes: at least once, at most once, and exactly once. Spark supports only one message processing mode, i.e. "at least once".
      Risk: None
      Implications: None
      Notes: None
  • 14. Architectural Decision Example
      Issue: Which product should be used to realize the Store w/Search & Analyze function
      Decision: Elasticsearch
      Status: Completed
      Category: Platform
      Assumptions: This function stores logs and adds indexes for analysis.
      Options: 1. Elasticsearch 2. Splunk
      Arguments (Rationale): Elasticsearch is an open source software product and would avoid vendor lock-in. Elasticsearch is free, but extended features require a paid subscription. Splunk is proprietary commercial software with a high price level. Elasticsearch supports many plugins. Elasticsearch has now overtaken Splunk in terms of Google search popularity.
      Risk: None
      Implications: None
      Notes: None
  • 15. Architectural Decision Example
      Issue: Which product should be used to realize the Visualization with Dashboard function
      Decision: Kibana
      Status: Completed
      Category: Platform
      Assumptions: This function displays analyzed log data and metrics, and provides a dashboard.
      Options: 1. Kibana 2. Grafana
      Arguments (Rationale): Grafana is designed for analyzing and visualizing metrics, and it does not allow full-text data querying. Kibana is the "K" in the ELK Stack from Elastic, the most popular open source log analysis platform. Kibana supports not only metrics but also analysis of log messages. Grafana has built-in user control and authentication features, whereas Kibana requires either X-Pack, a commercial (not free) bundle of ELK add-ons, or an added open source solution such as SearchGuard for access control and authentication.
      Risk: None
      Implications: None
      Notes: None
  • 16. Use Case Example
      UC001
        Trigger: Failure inquiries from end users (unavailable, freezing, different results, etc.)
        Input: User ID; Time (optional); Screen ID (optional); Error Code (optional)
        Outcome: Identification of where the failure (delay) occurred (application component or infrastructure component), and suggestion of a workaround and solution
        TAT: Within 5 minutes
      UC002
        Trigger: Failure inquiries from monitoring operators (large number of alerts, unknown alerts, events obviously different from normal times, etc.)
        Input: Time (optional); Alert Message (optional)
        Outcome: Same as above
        TAT: Within 5 minutes
      UC003
        Trigger: Inquiries from the system administrator (Is it working normally? Is there any problem? What is the capacity situation? What is the performance situation?)
        Input: N/A
        Outcome: Dashboard (number of users within 1 hour, error rate, delay rate, capacity upper limit and current usage rate for each component, delay rate within the most recent hour for each component, trend graph)
        TAT: Within 5 minutes
      UC004
        Trigger: Monthly report
        Input: N/A
        Outcome: Transition graph of the number of users, error rate and delay rate for the current month; the same information per component
        TAT: Within 24 hours
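The outcome of UC001, identifying the component where a failure or delay occurred, can be sketched as a search over collected log records. This is a hypothetical illustration: it assumes the inquiry has already been resolved to a transaction ID, and the field names, the 1000 ms delay threshold, the component names and the sample records are invented for the example. In the architecture described here the search itself would run in the log store.

```python
def locate_problem(logs, transaction_id, slow_ms=1000):
    """Return (kind, component) for the first failure or the worst delay found."""
    records = [r for r in logs if r["transaction_id"] == transaction_id]

    # A server-side error takes priority over a slow response.
    failed = [r for r in records if r["http_status"] >= 500]
    if failed:
        return ("failure", failed[0]["component"])

    slow = [r for r in records if r["response_time_ms"] > slow_ms]
    if slow:
        worst = max(slow, key=lambda r: r["response_time_ms"])
        return ("delay", worst["component"])

    return ("ok", None)


# Hypothetical records collected from two transactions.
logs = [
    {"transaction_id": "T1", "component": "Login_Component",
     "http_status": 200, "response_time_ms": 21},
    {"transaction_id": "T1", "component": "Profile_Component",
     "http_status": 200, "response_time_ms": 4200},
    {"transaction_id": "T2", "component": "Login_Component",
     "http_status": 503, "response_time_ms": 15},
]

print(locate_problem(logs, "T1"))
print(locate_problem(logs, "T2"))
```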
  • 17. Log Data Structure and Format Example
      Data architecture for gathering log data from application components
      • The concept of "Service", "Component" and "API", including logging points
        [Diagram: Push Button1 and Push Button2 invoke Service 1 and Service 2; the services call Component A (API A-1, API A-2), Component B (API B-1) and Component C (API C-1); logging points are marked in the structure]
      • Log format
        Item                 Type      Sample
        Transaction ID       Text      A3828OQZAG8367483
        Current time         Datetime  2018-09-10T09:01:48Z
        Service name         Text      Authorization_Service
        Component name       Text      Login_Component
        API name (URL)       Text      http://www.company.com/login_api
        HTTP method          Text      GET
        HTTP status code     Text      200
        Request status       Text      Success
        Response time        Text      21
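One way an application component might emit a record in this format is as a single JSON line per logging point. The sketch below is an assumption-laden illustration: the slide defines the items and their types but not the wire format, so the JSON encoding, the snake_case field names, the function name `build_log_record` and the "Failure" status value are all hypothetical. Values typed as Text on the slide are kept as strings.

```python
import json
from datetime import datetime, timezone


def build_log_record(transaction_id, service, component, api_url,
                     method, status_code, response_time, now=None):
    """Build one log record with the items listed in the log format."""
    now = now or datetime.now(timezone.utc)
    return {
        "transaction_id": transaction_id,
        "current_time": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "service_name": service,
        "component_name": component,
        "api_name": api_url,
        "http_method": method,
        "http_status_code": str(status_code),
        # Assumption: anything below 400 counts as "Success".
        "request_status": "Success" if status_code < 400 else "Failure",
        # The unit of the response time is not stated in the source.
        "response_time": str(response_time),
    }


record = build_log_record(
    transaction_id="A3828OQZAG8367483",
    service="Authorization_Service",
    component="Login_Component",
    api_url="http://www.company.com/login_api",
    method="GET",
    status_code=200,
    response_time=21,
    now=datetime(2018, 9, 10, 9, 1, 48, tzinfo=timezone.utc),
)
line = json.dumps(record)
print(line)
```

Carrying the same transaction ID through every component and API call is what lets the log store join the records of one user interaction later.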
  • 18. Performance & Capacity Assumption Example
      • Average number of components per request
        • For instance: 10
      • Average number of API calls in each component
        • For instance: 5
      • Average log length
        • For instance: 2 KB
      • Number of logs per request
        • 10 x 5 = 50 records
      • Log size per request
        • 50 records x 2 KB = 100 KB
      • Number of application requests (peak)
        • For instance: 1,000 req/sec
      • Number of requests to the log store
        • 50 records x 1,000 req/sec = 50,000 req/sec
      • Average number of requests over 24 hours
        • For instance: 1,000,000 req/day
      • Log size per day
        • 100 KB x 1,000,000 req/day = 100,000,000 KB/day = about 95 GB/day
      [Diagram: Sizing Model. 1,000 req/sec (peak) and 1,000,000 req/day enter the application; 10 components x 5 API calls each write logs of 2 KB; the log store receives 50,000 req/sec and 95 GB/day]
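The sizing arithmetic above can be reproduced with a small Python function, which makes it easy to re-run with different assumptions. The function and parameter names are hypothetical; the input values are the "for instance" figures from the slide, and GB is computed as KB / 1024 / 1024, which is what yields the slide's figure of about 95.

```python
def size_log_store(components_per_request, api_calls_per_component,
                   avg_log_kb, peak_requests_per_sec, requests_per_day):
    """Derive log volume figures from the per-request assumptions."""
    logs_per_request = components_per_request * api_calls_per_component
    kb_per_request = logs_per_request * avg_log_kb
    kb_per_day = kb_per_request * requests_per_day
    return {
        "logs_per_request": logs_per_request,
        "kb_per_request": kb_per_request,
        "log_writes_per_sec_peak": logs_per_request * peak_requests_per_sec,
        "kb_per_day": kb_per_day,
        "gb_per_day": kb_per_day / 1024 / 1024,
    }


sizing = size_log_store(
    components_per_request=10,
    api_calls_per_component=5,
    avg_log_kb=2,
    peak_requests_per_sec=1_000,
    requests_per_day=1_000_000,
)
for name, value in sizing.items():
    print(name, value)
```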