Debug production server by counter
羅仲成 Roy Lou
17 Media
2016 July
About me
- 17 Media Architect
- Past
- HTC: cloud backend
- Google: Google Fiber, embedded system
- NVIDIA: VLSI hardware
- roylou@gmail.com
About HTC CSI Project
- Cloud service infrastructure for mobile apps (similar to Parse.com)
- Backed 5+ apps and 3M+ users
- 50 < # of VMs < 200 (Autoscaled)
- ~15 microservices
- Team of 15 engineers
One Gallery Umadeit
(Fun Fit)
- External outage
- Internet Connectivity
- ZooKeeper Down
- Application Errors
- Intranet Connectivity
- Redis Down
- DB Down
Problems to Solve
Need utility to monitor, alert, debug production cluster issues:
- Infrastructure outage
- Application outage
What choices do I have?
Infrastructure monitoring
Application monitoring (for weakly typed languages)
Counter
func (s *Store) Get(key string) ([]byte, error) {
	// Record total processing time when the function returns.
	defer ctr.Time("get.proc_time", time.Now())
	// Serve from the cache when possible.
	if val, err := s.Cache.Get(key); err == nil {
		ctr.Event("get.cache_hit", 1)
		return val, nil
	}
	// Cache miss: fall back to the database.
	val, err := s.DB.Get(key)
	if err != nil {
		ctr.Event("get.db.err", 1)
		return nil, err
	}
	return val, nil
}
Counter Example - Read Cache
[Diagram: Client, Cache, DB]
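The slides do not show how the `ctr` package itself works. As a minimal sketch, assuming counters are plain in-memory totals behind a mutex, `Event` and `Time` could look like the following. `Registry`, `NewRegistry`, `Get`, and the `.usec` / `.count` suffixes are hypothetical names chosen for illustration, not the author's implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Registry is a hypothetical in-process counter store; the slides do not
// show how ctr is implemented.
type Registry struct {
	mu     sync.Mutex
	totals map[string]uint64
}

// NewRegistry returns an empty Registry.
func NewRegistry() *Registry {
	return &Registry{totals: make(map[string]uint64)}
}

// Event adds n to the named counter.
func (r *Registry) Event(name string, n uint64) {
	r.mu.Lock()
	r.totals[name] += n
	r.mu.Unlock()
}

// Time adds the time elapsed since start (in microseconds) to name+".usec"
// and counts one sample in name+".count".
// Intended use: defer r.Time("x", time.Now())
func (r *Registry) Time(name string, start time.Time) {
	elapsed := time.Since(start)
	r.mu.Lock()
	r.totals[name+".usec"] += uint64(elapsed.Microseconds())
	r.totals[name+".count"]++
	r.mu.Unlock()
}

// Get returns the current total for a counter (0 if it was never touched).
func (r *Registry) Get(name string) uint64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.totals[name]
}

func main() {
	ctr := NewRegistry()
	func() {
		// The arguments of a deferred call are evaluated immediately, so
		// time.Now() here captures the start of the function.
		defer ctr.Time("get.proc_time", time.Now())
		ctr.Event("get.cache_hit", 1)
	}()
	fmt.Println(ctr.Get("get.cache_hit"), ctr.Get("get.proc_time.count"))
}
```

The `defer ... time.Now()` idiom is what makes timing a one-liner: the start time is captured when the `defer` statement runs, and the elapsed time is computed when the function returns.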
func (t *RoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	ctr.Event("qps", 1)
	// ContentLength is -1 when the body size is unknown; skip it in that case.
	if req.ContentLength > 0 {
		ctr.Event("send.bytes", uint64(req.ContentLength))
	}
	defer ctr.Time("latency", time.Now())
	res, err := t.rt.RoundTrip(req)
	if err == nil {
		ctr.Event(fmt.Sprintf("status.%d", res.StatusCode), 1)
	} else {
		ctr.Err("internal.err", 1)
	}
	return res, err
}
[Diagram: Client, Counter Roundtripper, Server, Roundtrip]
Counter Example - HTTP Roundtrip
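To show where such a round tripper plugs in, here is a small sketch that installs a counting transport on an `http.Client`. The `counts` type stands in for the `ctr` package, and `CountingTransport`, `stubTransport`, and `demo` are hypothetical names; the stub answers requests locally so the example runs without a network.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// counts is a stand-in for the ctr package, which the slides do not show.
type counts struct {
	mu sync.Mutex
	m  map[string]uint64
}

func (c *counts) Event(name string, n uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.m == nil {
		c.m = make(map[string]uint64)
	}
	c.m[name] += n
}

func (c *counts) Get(name string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.m[name]
}

// CountingTransport wraps another http.RoundTripper and counts requests,
// response statuses, and transport errors.
type CountingTransport struct {
	Base http.RoundTripper
	Ctr  *counts
}

func (t *CountingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	t.Ctr.Event("qps", 1)
	res, err := t.Base.RoundTrip(req)
	if err != nil {
		t.Ctr.Event("internal.err", 1)
		return nil, err
	}
	t.Ctr.Event(fmt.Sprintf("status.%d", res.StatusCode), 1)
	return res, nil
}

// stubTransport answers every request with a fixed status code, so the
// example needs no network access.
type stubTransport struct{ status int }

func (s stubTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return &http.Response{StatusCode: s.status, Body: http.NoBody, Request: req}, nil
}

// demo sends one request through a wrapped client and returns the counters.
func demo(status int) *counts {
	c := &counts{}
	client := &http.Client{
		Transport: &CountingTransport{Base: stubTransport{status}, Ctr: c},
	}
	res, err := client.Get("http://example.invalid/")
	if err == nil {
		res.Body.Close()
	}
	return c
}

func main() {
	c := demo(200)
	fmt.Println(c.Get("qps"), c.Get("status.200"))
}
```

In a real service `Base` would typically be `http.DefaultTransport`, so every outgoing call made through the client is counted without touching the call sites.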
Counter Pipeline
[Diagram: App, Container, Fluentd Agent, VM]
ES alternative: Prometheus
How Frequently Should I Send Counters?
Option 1: Forward every counter to Elasticsearch
Option 2: Aggregate locally before forwarding
1000 counters / container * 100 counts / second = 100k qps
For us, aggregate and send every 30 sec
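Option 2 can be sketched as an in-memory aggregator whose totals are swapped out on every flush. This is an illustration under assumptions, not the pipeline from the talk: `Aggregator`, `NewAggregator`, and `Flush` are hypothetical names, and the loop simulates a 30-second window instead of waiting for a timer.

```go
package main

import (
	"fmt"
	"sync"
)

// Aggregator accumulates counts in memory between flushes (hypothetical).
type Aggregator struct {
	mu     sync.Mutex
	totals map[string]uint64
}

// NewAggregator returns an empty Aggregator.
func NewAggregator() *Aggregator {
	return &Aggregator{totals: make(map[string]uint64)}
}

// Event adds n to the named counter; nothing is sent anywhere yet.
func (a *Aggregator) Event(name string, n uint64) {
	a.mu.Lock()
	a.totals[name] += n
	a.mu.Unlock()
}

// Flush returns the totals accumulated since the previous flush and starts
// a fresh window. The caller forwards the returned map downstream.
func (a *Aggregator) Flush() map[string]uint64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	out := a.totals
	a.totals = make(map[string]uint64)
	return out
}

func main() {
	agg := NewAggregator()
	// Simulate one 30-second window at 100 events/second on one counter.
	for i := 0; i < 30*100; i++ {
		agg.Event("qps", 1)
	}
	window := agg.Flush()
	fmt.Printf("events seen: %d, records forwarded: %d\n", window["qps"], len(window))
}
```

With this shape, 3,000 increments to one counter become a single forwarded record per window. A long-running process would call `Flush` from a `time.Ticker` loop set to the chosen interval.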
How Long Can I Store Counters?
- 50,000 counters
- 1 record every 30 seconds
To save counters for 1 year:
50,000 (counters) * 4 (bytes) * 2 (records/minute) * 525,600 (mins/year)
= 210,240,000,000 bytes
= 210.24 GB
Need to aggregate for long term storage
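Aggregating for long-term storage amounts to merging fine-grained samples into coarser time buckets. A minimal sketch, assuming each stored record is a (window start, total) pair and that totals can simply be summed; `Sample` and `Downsample` are hypothetical names, not part of the talk.

```go
package main

import "fmt"

// Sample is one aggregated counter value for a fixed-length window.
type Sample struct {
	StartSec int64  // window start, in non-negative Unix seconds
	Total    uint64 // events counted in the window
}

// Downsample merges samples into coarser windows of bucketSec seconds by
// summing their totals. The input must be sorted by StartSec.
func Downsample(in []Sample, bucketSec int64) []Sample {
	var out []Sample
	for _, s := range in {
		// Align the sample to the start of its coarse bucket.
		start := s.StartSec - s.StartSec%bucketSec
		if n := len(out); n > 0 && out[n-1].StartSec == start {
			out[n-1].Total += s.Total
			continue
		}
		out = append(out, Sample{StartSec: start, Total: s.Total})
	}
	return out
}

func main() {
	// Twenty 30-second samples (10 minutes) of 5 events each.
	var raw []Sample
	for i := int64(0); i < 20; i++ {
		raw = append(raw, Sample{StartSec: i * 30, Total: 5})
	}
	// Collapse them into 5-minute buckets.
	for _, s := range Downsample(raw, 300) {
		fmt.Println(s.StartSec, s.Total)
	}
}
```

Summing is the right merge for event counts; timing counters would need the sum and the sample count kept separately so that averages stay correct after merging.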
[Diagram: App, Container, Fluentd Agent, VM, Counter Aggregator]
Counter Granularity:
- Past 10 days: 30 sec
- Past 1 month: 5 min
- Past 3 months: 30 min
- Past year: 1 hr
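The effect of these tiers on storage can be worked out with the same per-record size used in the one-year estimate (50,000 counters, 4 bytes per record). The sketch below assumes the tiers do not overlap and that a month is 30 days, so the bands cover days 1-10, 11-30, 31-90, and 91-365; both assumptions are mine, not stated on the slide.

```go
package main

import "fmt"

const (
	counters     = 50000 // number of counters, from the slide
	bytesPerRec  = 4     // bytes per record, from the slide
	secondsInDay = 24 * 60 * 60
)

// tier describes one retention band: how many days it covers and how many
// seconds each stored record spans.
type tier struct {
	days        int64
	resolutionS int64
}

// recordsPerCounter returns how many records one counter needs across tiers.
func recordsPerCounter(tiers []tier) int64 {
	var n int64
	for _, t := range tiers {
		n += t.days * secondsInDay / t.resolutionS
	}
	return n
}

// totalBytes returns the storage needed for all counters across tiers.
func totalBytes(tiers []tier) int64 {
	return recordsPerCounter(tiers) * counters * bytesPerRec
}

func main() {
	raw := []tier{{365, 30}} // one year kept at 30-second resolution
	tiered := []tier{
		{10, 30},       // past 10 days at 30 s
		{20, 5 * 60},   // days 11-30 at 5 min (assumed non-overlapping)
		{60, 30 * 60},  // days 31-90 at 30 min
		{275, 60 * 60}, // days 91-365 at 1 hr
	}
	fmt.Println(totalBytes(raw), totalBytes(tiered))
}
```

Under these assumptions a counter needs 44,040 records per year instead of 1,051,200, which brings the total from 210,240,000,000 bytes down to 8,808,000,000 bytes (about 8.8 GB), roughly 24 times smaller.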
Time series counter
Topology View
Deploy with Counters
[Diagram: git push, code review, CI, docker push, Docker Registry, deploy]
- Mon night: Code freeze
- Tue morning: Deploy to staging
- If okay, deploy to production
30% => 50% => 100%
Rolling to X%
- Health check
- Manually inspect counters
- Minimal e2e test
- Compare counters with last deploy
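The last check, comparing counters against the previous deploy, can be sketched as a comparison of error rates over comparable windows. The `Snapshot` type, the `Regressed` function, and the one-percentage-point tolerance below are hypothetical; the talk does not say which counters or thresholds were compared.

```go
package main

import "fmt"

// Snapshot holds request and error totals for one deploy, taken over a
// comparable time window.
type Snapshot struct {
	Requests uint64
	Errors   uint64
}

// ErrorRate returns errors as a fraction of requests, or 0 when there was
// no traffic.
func (s Snapshot) ErrorRate() float64 {
	if s.Requests == 0 {
		return 0
	}
	return float64(s.Errors) / float64(s.Requests)
}

// Regressed reports whether the new deploy's error rate exceeds the previous
// deploy's by more than tolerance (absolute, so 0.01 is one percentage point).
func Regressed(prev, next Snapshot, tolerance float64) bool {
	return next.ErrorRate() > prev.ErrorRate()+tolerance
}

func main() {
	prev := Snapshot{Requests: 10000, Errors: 50} // 0.5% on the last deploy
	next := Snapshot{Requests: 3000, Errors: 90}  // 3.0% on the 30% rollout
	fmt.Println(Regressed(prev, next, 0.01))
}
```

Comparing rates rather than raw totals matters during a partial rollout, because the new version only sees a fraction of the traffic.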
Monitor / Alert with Counters
[Diagram: App, Container, Fluentd Agent, VM, Counter Aggregator, Cron Server]
eQstr = 'host:"prod-cg-docvcs-group" AND pkg:docvcs_worker AND name:overall.err'
rQstr = 'host:"prod-cg-docvcs-group" AND pkg:docvcs_worker AND name:overall.request'
errors = esq_scalar('sum', 'total', eQstr, 'now-5m', 'now')
requests = esq_scalar('sum', 'total', rQstr, 'now-5m', 'now')
error_rate = errors * 100 / requests
-- Fail rate (errors as a percentage of requests) should be less than 10%
alert_p2('docvcs fail_rate', error_rate, '>', 10, '15m')
alert_p0('docvcs fail_rate', error_rate, '>', 10, '45m')
Alarm when high error rate
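The same rule can be sketched in Go. Two things here are assumptions rather than facts from the slide: that the last argument of `alert_p2` / `alert_p0` means "the condition has held for this long", and that a window with zero requests should count as a 0% error rate instead of dividing by zero. `Rule`, `Observe`, and `ErrorRatePercent` are hypothetical names.

```go
package main

import "fmt"

// ErrorRatePercent mirrors errors * 100 / requests, but returns 0 when there
// were no requests instead of dividing by zero.
func ErrorRatePercent(errors, requests float64) float64 {
	if requests == 0 {
		return 0
	}
	return errors * 100 / requests
}

// Rule fires once a value has stayed above Threshold for at least ForSec
// seconds (an assumed reading of the '15m' / '45m' arguments).
type Rule struct {
	Threshold float64
	ForSec    int64

	breaching bool  // true while the value is above Threshold
	since     int64 // time (seconds) at which the current breach started
}

// Observe records one evaluation at nowSec and reports whether to alert.
func (r *Rule) Observe(nowSec int64, value float64) bool {
	if value <= r.Threshold {
		r.breaching = false
		return false
	}
	if !r.breaching {
		r.breaching = true
		r.since = nowSec
	}
	return nowSec-r.since >= r.ForSec
}

func main() {
	p2 := &Rule{Threshold: 10, ForSec: 15 * 60}
	rate := ErrorRatePercent(30, 200) // 15 percent
	// Evaluate every 5 minutes, as a cron job might.
	for m := int64(0); m <= 20; m += 5 {
		fmt.Println(m, p2.Observe(m*60, rate))
	}
}
```

Requiring the breach to persist is what lets the same query back both a lower-priority alert after 15 minutes and a higher-priority one after 45.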
Debug with Counters
- GDB
- Bisect with log
- Bisect with counters
Autoscale with Counters
[Diagram: App, Container, Fluentd Agent, VM, Counter Aggregator, Cron Server, gcloud cli]
qstr = 'name: docvcs.jobs.min.outstanding'
outstanding = esq_scalar(qstr, 'now-10m', 'now')
workload = outstanding / 200
autoscale(workload, 'docvcs', 6, 30, 6, 'diff', 0.65, 0.2, 2/3)
Arguments annotated on the slide:
- minimum # of instances
- maximum # of instances
- maximum # of VMs to be scaled
- target workload
- safeguard
[Figure: ▵Instance versus workload; labeled values 0.45, 0.65, 0.85, and 6; safeguard]
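How `autoscale()` combines its arguments is not shown on the slide, so the following is only one plausible reading, written as a sketch. It assumes workload is the average utilization per instance, that no scaling happens while workload is within 0.2 of the 0.65 target (the 0.45 to 0.85 range in the figure), that one run changes the instance count by at most 6, and that the result stays between 6 and 30 instances. `Policy` and `Delta` are hypothetical names; the `'diff'` and `2/3` arguments are left out because their meaning is not given.

```go
package main

import (
	"fmt"
	"math"
)

// Policy holds the knobs named on the slide. The way they are combined
// below is an assumption, not the original autoscale() logic.
type Policy struct {
	Min, Max int     // minimum and maximum # of instances
	MaxStep  int     // maximum # of VMs added or removed in one run
	Target   float64 // target workload per instance
	Band     float64 // no change while |workload - Target| <= Band
}

// Delta returns how many instances to add (positive) or remove (negative).
func (p Policy) Delta(current int, workload float64) int {
	if math.Abs(workload-p.Target) <= p.Band {
		return 0 // inside the dead band around the target
	}
	// Instance count that would bring per-instance workload back to Target.
	desired := int(math.Ceil(float64(current) * workload / p.Target))
	delta := desired - current
	// Limit the size of a single scaling step.
	if delta > p.MaxStep {
		delta = p.MaxStep
	}
	if delta < -p.MaxStep {
		delta = -p.MaxStep
	}
	// Keep the resulting cluster size within [Min, Max].
	next := current + delta
	if next < p.Min {
		next = p.Min
	}
	if next > p.Max {
		next = p.Max
	}
	return next - current
}

func main() {
	p := Policy{Min: 6, Max: 30, MaxStep: 6, Target: 0.65, Band: 0.2}
	fmt.Println(p.Delta(10, 0.7), p.Delta(10, 1.3), p.Delta(10, 0.2))
}
```

The dead band and the step limit both serve to keep a noisy counter from causing the cluster to oscillate or to scale by a large amount on a single bad reading.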
Business Logic with Counters
What else can counters do?
What can't counters do?
Counters can solve a problem that affects 90% of users.
Counters can't solve a problem that affects only 1 user.
For that, you need logs.
Summary of Counter
A line of code. Can be used for:
- Rolling update
- Monitor / alert
- Debug cluster
- Autoscale cluster
- Simple business logic
- And many others (use your imagination)
Thank You
roylou@gmail.com
