Lifting the Blinds: Monitoring Windows Server 2012

Operating systems monitor resources continuously in order to effectively schedule processes. In this webinar, Evan Mouzakitis (Datadog) discusses how to get operational data from Windows Server 2012 using a variety of native tools.

Read the full guide at: http://www.datadoghq.com/blog/monitoring-windows-server/
Lifting the Blinds: Monitoring Windows Server

Datadog Overview
• SaaS based infrastructure and app monitoring
• Open Source Agent
• Time series data (metrics and events)
• Processing nearly a trillion data points per day
• Intelligent Alerting and Insightful Dashboards

Monitor Everything
Operating Systems, Cloud Providers (AWS), Containers, Web Servers, Datastores, Caches, Queues and more...

Agenda
- Why should I monitor Windows Server?
- What are some indicators of performance issues?
- How can I collect performance metrics for analysis?

CPU metrics
- PercentProcessorTime
- ContextSwitchesPersec
- ProcessorQueueLength
- DPCsQueuedPersec
- PercentPrivilegedTime
- PercentDPCTime
- PercentInterruptTime

CPU: ContextSwitchesPersec
What it tracks:
Number of times the processor switched to a new thread
Correlate with:
Memory: PageFaultsPersec
Disk: DiskTransfersPersec
Network: BytesSentPersec/BytesReceivedPersec
Issue resolution:
Adding processors, thread partitioning, DPC partitioning,
hardware interrupt partitioning, disable I/O counters
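
The "Correlate with" step above can be sketched in code. This is a minimal sketch, assuming each metric has already been collected as a list of samples on the same interval. The sample values are made up for illustration, and the helper names (`pearson`, `rank_correlates`) are hypothetical, not part of any Windows or Datadog API.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_correlates(target, candidates):
    """Return (metric name, r) pairs, strongest absolute correlation first."""
    scored = [(name, pearson(target, series)) for name, series in candidates.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)


# Made-up samples, one value per collection interval (assumption).
context_switches = [12000, 15000, 30000, 45000, 44000, 16000]
candidates = {
    "PageFaultsPersec": [900, 950, 920, 940, 910, 930],
    "DiskTransfersPersec": [200, 260, 610, 900, 880, 300],
    "BytesSentPersec": [5000, 5200, 4900, 5100, 5050, 4950],
}
ranking = rank_correlates(context_switches, candidates)
```

In this invented data set the context-switch spike tracks disk transfers most closely, which would point the investigation at disk I/O first. A strong correlation only suggests where to look; it does not establish the cause.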

CPU: PercentProcessorTime
What it tracks:
Percentage of time spent performing work (not idle)
Correlate with:
ProcessorQueueLength
Issue resolution:
More processors, bigger instance, optimize offending application

CPU: ProcessorQueueLength
What it tracks:
Size of processor queue
Correlate with:
CPU: PercentProcessorTime, PercentPrivilegedTime, PercentDPCTime, PercentInterruptTime
Issue resolution:
Adding processors, thread partitioning, DPC partitioning,
hardware interrupt partitioning, disable I/O counters
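
A queue-length reading is easier to interpret relative to the number of processors and over several samples. The sketch below is one possible check, not a method from the webinar. The limit of 2 queued threads per processor and the requirement of 3 consecutive samples are assumptions chosen for illustration, and the function name is hypothetical.

```python
def sustained_queue_pressure(queue_samples, logical_processors,
                             per_cpu_limit=2.0, min_consecutive=3):
    """Return True if ProcessorQueueLength per processor stays above
    per_cpu_limit for at least min_consecutive samples in a row.

    per_cpu_limit and min_consecutive are illustrative defaults (assumptions).
    """
    if logical_processors < 1:
        raise ValueError("logical_processors must be at least 1")
    streak = 0
    for queue_length in queue_samples:
        if queue_length / logical_processors > per_cpu_limit:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0  # a single calm sample resets the streak
    return False


# Made-up samples from a 4-processor host.
busy = sustained_queue_pressure([1, 9, 10, 12, 3], logical_processors=4)
spiky = sustained_queue_pressure([1, 9, 2, 10, 3], logical_processors=4)
```

Requiring consecutive samples keeps a single burst from raising an alert, which matches the idea of correlating the queue with the CPU time metrics before acting.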

CPU: DPCsQueuedPersec
What it tracks:
Deferred procedure call (DPC) enqueue rate
Correlate with:
CPU: PercentDPCTime
Disk: DiskTransfersPersec
Network: BytesSentPersec/BytesReceivedPersec
Issue resolution:
Remove buggy device, rollback driver

CPU: PercentPrivilegedTime/PercentDPCTime/PercentInterruptTime
What they track:
Percentage of time CPU spent in privileged mode/deferred procedure calls/interrupts
Correlate with:
ContextSwitchesPersec/PercentPrivilegedTime/PercentDPCTime/PercentInterruptTime
Issue resolution:
Adding processors, thread partitioning, DPC partitioning,
hardware interrupt partitioning, disable I/O counters

Memory metrics
- PoolNonpagedBytes
- PageFaultsPersec
- PagesInputPersec

Memory: PoolNonpagedBytes
What it tracks:
Amount of non-paged memory in use
Correlate with:
Windows Event 2019 “Nonpaged Memory Pool Empty”
Issue resolution:
Identify troublesome driver/roll back to known good state

Memory: PageFaultsPersec
What it tracks:
Rate of page faults
Correlate with:
PagesInputPersec
Issue resolution:
Increase system memory

Memory: PagesInputPersec
What it tracks:
Rate pages are read (from disk) into memory
Correlate with:
PageFaultsPersec/DiskTransfersPersec
Issue resolution:
Increase system memory, move page file to separate physical disk
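
PageFaultsPersec counts both soft faults (resolved in memory) and hard faults (resolved from disk), so reading it next to PagesInputPersec helps separate the two cases. The sketch below is a rough triage helper under stated assumptions: the `pages_input_limit` of 50 pages per second is an invented threshold, and the function name and labels are hypothetical.

```python
def classify_paging(page_faults_per_sec, pages_input_per_sec,
                    pages_input_limit=50.0):
    """Roughly label one sample of the two memory counters.

    pages_input_limit is an illustrative threshold (assumption); tune it
    against a baseline from the host being monitored.
    """
    if pages_input_per_sec >= pages_input_limit:
        # Pages are being read from disk: hard faults are in play.
        return "hard-fault pressure"
    if page_faults_per_sec > 0:
        # Faults occur but little is read from disk: mostly soft faults.
        return "mostly soft faults"
    return "no paging activity"


# Made-up readings.
labels = [
    classify_paging(4000.0, 3.0),
    classify_paging(4200.0, 310.0),
    classify_paging(0.0, 0.0),
]
```

Only the "hard-fault pressure" case lines up with the resolutions on the slides (more memory, or a page file on a separate physical disk); a high fault rate with little disk input is usually less urgent.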

Disk Metrics
- AvgDiskQueueLength
- DiskTransfersPersec
- PercentIdleTime

Disk: AvgDiskQueueLength
What it tracks:
Running average of I/O ops in queue
Correlate with:
DiskTransfersPersec
Issue resolution:
Move data for I/O-intensive applications to separate disk; add disks to system

Disk: DiskTransfersPersec
What it tracks:
Aggregate I/O rate
Correlate with:
AvgDiskQueueLength
Issue resolution:
Move data for I/O-intensive applications to separate disk; add disks to
system; increase disk cache

Disk: PercentIdleTime
What it tracks:
Percent of time disk is idle
Correlate with:
AvgDiskQueueLength
Issue resolution:
Move page file to separate disk; add disks to system; use SSDs
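
PercentIdleTime and AvgDiskQueueLength are more telling together than apart: a disk that is rarely idle and also has requests waiting is likely a bottleneck. The sketch below shows one way to combine them. The limits (`busy_limit=90.0`, `queue_limit=2.0`) are assumptions for illustration, and both function names are hypothetical.

```python
def disk_busy_percent(percent_idle_time):
    """Convert PercentIdleTime into a busy percentage, clamped to 0..100."""
    return max(0.0, min(100.0, 100.0 - percent_idle_time))


def looks_saturated(percent_idle_time, avg_disk_queue_length,
                    busy_limit=90.0, queue_limit=2.0):
    """True when the disk is both mostly busy and has I/O waiting in queue.

    busy_limit and queue_limit are illustrative values (assumptions).
    """
    busy = disk_busy_percent(percent_idle_time)
    return busy >= busy_limit and avg_disk_queue_length >= queue_limit


# Made-up readings: (PercentIdleTime, AvgDiskQueueLength).
readings = [(5.0, 4.2), (5.0, 0.3), (60.0, 4.2)]
flags = [looks_saturated(idle, queue) for idle, queue in readings]
```

Requiring both conditions avoids flagging a disk that is busy but keeping up, or one that shows a brief queue while mostly idle.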

PowerShell
- Windows’ scripting language (no more batch files!)
- Powerful language with deep OS support
- Integrates with C# natively
- Output is typed (unlike *NIX)
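
Because PowerShell returns typed objects, its output can be serialized to JSON and consumed by another program without text scraping. The sketch below shows that hand-off from Python. It assumes a Windows host with `powershell.exe` on the PATH, and it queries the WMI class `Win32_PerfFormattedData_PerfOS_System`, which exposes `ContextSwitchesPersec` and `ProcessorQueueLength`. The function names and the sample JSON string are hypothetical and only stand in for real output.

```python
import json
import subprocess

# PowerShell pipeline: query the formatted system counters and emit JSON.
PS_COMMAND = (
    "Get-CimInstance -ClassName Win32_PerfFormattedData_PerfOS_System | "
    "Select-Object ContextSwitchesPersec, ProcessorQueueLength | "
    "ConvertTo-Json"
)


def parse_counters(json_text):
    """Turn the JSON emitted by the pipeline into a dict of integers."""
    data = json.loads(json_text)
    return {name: int(value) for name, value in data.items()}


def collect_counters():
    """Run the pipeline and return the parsed counters (Windows only)."""
    result = subprocess.run(
        ["powershell.exe", "-NoProfile", "-Command", PS_COMMAND],
        capture_output=True, text=True, check=True,
    )
    return parse_counters(result.stdout)


# Illustrative output only (assumption), so the parser can be tried anywhere.
sample = '{"ContextSwitchesPersec": 18234, "ProcessorQueueLength": 2}'
parsed = parse_counters(sample)
```

On a Windows Server host, `collect_counters()` could be called on a schedule and its result forwarded to whatever stores the time series; the parsing step is kept separate so it can be tested without PowerShell.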

Windows Performance Toolkit
Requires Windows Assessment and Deployment Kit (formerly Windows Performance Toolkit)
https://www.microsoft.com/en-US/download/details.aspx?id=39982

Questions?
Evan Mouzakitis
Research Engineer
Twitter: @vagelim
Email: evan@datadoghq.com
Read the full guide at: http://www.datadoghq.com/blog/monitoring-windows-server/

The tooling for building chatbots has exploded. Putting chatbots into production is now easier than ever. In this presentation, I focus on how you can use Azure Bot Service, Azure Search, and Cosmos DB to create a scalable backend for your chatbot. By using a fully managed, serverless architecture with continuous deployment, you can get your chatbot up and running quickly. Check out this deck to learn how to combine cloud computing and artificial intelligence so you can help humans and machines achieve more together. Learn more at http://www.neona.chat

Observability at Scale

Knoldus Inc.

Observability has emerged as one of the hottest topics on the DevOps landscape. Organizations seek to improve visibility into their cloud infrastructure and applications and identify production issues that may negatively impact #customerexperience. ➡️ But what are some of the best practices for scaling observability for modernapplications? ➡️ What challenges are #cloudplatforms facing? Explore how to overcome the challenges and unlock speed, observability, and automation across your DevOps lifecycle.

Map ReduceSri Prasanna

Azure DevOps Best Practices Webinar

Cambay Digital

Prometheus: A Next Generation Monitoring System (FOSDEM 2016)

Brian Brazil

Azure Chat Bot application

Vivek Singh

DevOps y DevSecOps son palabras de moda. Hay muchos artículos que describen qué son y qué no son. Creo que podemos estar de acuerdo en que son culturas, una forma de trabajo. También estoy seguro de que la mayoría de nosotros tenemos una impresión general de cómo debería ser: desarrollo, operaciones y seguridad trabajando juntos, rompiendo silos, entregando más rápido, automatizando, etc. En la mayoría de las discusiones que hemos tenido con los profesionales de la industria, una pregunta que surge una y otra vez con respecto a DevSecOps, es "¿Hay un marco para ¿Adopción de DevSecOps?" Ahora hay buenas razones para esta pregunta y una es que muchas personas de operaciones empresariales conocen marcos como ITIL y Cobit. La respuesta a esa pregunta es: ”CALMS”

Demystifying observability

Abigail Bangser

Debugging Your Debugging Tools: What to do When Your Service Mesh Goes Down

Aspen Mesh

In this CNCF Member Webinar, Neeraj Poddar (Aspen Mesh) and John Howard (Google) shared information on debugging your debugging tools when your service mesh goes down in production. Service meshes are widely used as a means to enforce policies and at the same time gain visibility into your application behavior and performance. As more organizations adopt service mesh in their architectures, they are relying more heavily on the metrics, tracing and other traffic management and security capabilities provided by the service mesh. But what happens when a critical piece of your infrastructure like Istio has issues while in production? In this webinar we will cover the debugging in production aspects of Istio, in particular the following topics will be covered: * How to debug and diagnose issues with your sidecar proxy Envoy * How to monitor and debug the Istio control plane * How to use operational tools like “istioctl” to understand issues with your configuration * Using profiling to identify bottlenecks * Recommendations for a production ready secure Istio deployment

Kubeflow

Karane Vieira

Introduction to Distributed Tracing

petabridge

As more and more developers move to distributed architectures such as micro services, distributed actor systems, and so forth it becomes increasingly complex to understand, debug, and diagnose. In this talk we're going to introduce the emerging OpenTracing standard and talk about how you can instrument your applications to help visualize every operation, even across process and service boundaries. We'll also introduce Zipkin, one of the most popular implementations of the OpenTracing standard.

DevSecOps - The big picture

Stefan Streichsbier

Introduction To DevOps | Devops Tutorial For Beginners | DevOps Training For ...

Simplilearn

This presentation on "Introduction to DevOps" will help you understand what is waterfall model, what is an agile model, what is DevOps, DevOps phases, DevOps tools and DevOps advantages. In traditional software development lifecycle, there is a lot of gap between development and operations team. DevOps addresses the gap between developers and operations. The development team will submit the application to the operations team for implementation. Operations team will monitor the application and provide relevant feedback to developers. According to DevOps practices, the workflow in software development and delivery is divided into 8 phases, Now, let us get started and understand these 8 phases in DevOps. Below topics are explained in this "Introduction to DevOps" presentation: 1. Waterfall model 2. Agile model 3. What is DevOps? 4. DevOps phases 5. DevOps tools 6. DevOps advantages Simplilearn's DevOps Certification Training Course will prepare you for a career in DevOps, the fast-growing field that bridges the gap between software developers and operations. You’ll become an expert in the principles of continuous development and deployment, automation of configuration management, inter-team collaboration and IT service agility, using modern DevOps tools such as Git, Docker, Jenkins, Puppet and Nagios. DevOps jobs are highly paid and in great demand, so start on your path today. Why learn DevOps? Simplilearn’s DevOps training course is designed to help you become a DevOps practitioner and apply the latest in DevOps methodology to automate your software development lifecycle right out of the class. You will master configuration management; continuous integration deployment, delivery and monitoring using DevOps tools such as Git, Docker, Jenkins, Puppet and Nagios in a practical, hands-on and interactive approach. 
The Devops training course focuses heavily on the use of Docker containers, a technology that is revolutionizing the way apps are deployed in the cloud today and is a critical skillset to master in the cloud age. Who should take this course? DevOps career opportunities are thriving worldwide. DevOps was featured as one of the 11 best jobs in America for 2017, according to CBS News, and data from Payscale.com shows that DevOps Managers earn as much as $122,234 per year, with DevOps engineers making as much as $151,461. DevOps jobs are the third-highest tech role ranked by employer demand on Indeed.com but have the second-highest talent deficit. 1. This DevOps training course will be of benefit the following professional roles: 2. Software Developers 3. Technical Project Managers 4. Architects 5. Operations Support 6. Deployment engineers 7. IT managers 8. Development managers Learn more at: https://www.simplilearn.com/

DevOps and Tools

Mohammed Fazuluddin

Google cloud study jam 2019 #cloud studyjam

Wessam ElSharawy

AzureOpenAI.pptx

Udaiappa Ramachandran

Azure OpenAI Service provides REST API access to OpenAI's powerful language models, including the GPT-3, GPT-4, DALL-E, Codex, and Embeddings model series. These models can be easily adapted to any specific task, including but not limited to content generation, summarization, semantic search, translation, transformation, and code generation. Microsoft offers the accessibility of the service through REST APIs, Python or C# SDK, or the Azure OpenAI Studio.

What is DevOps | DevOps Introduction | DevOps Training | DevOps Tutorial | Ed...

Edureka!

OpenTelemetry: From front- to backend (2022)

Sebastian Poxhofer

Distributed tracing using open tracing & jaeger 2

Chandresh Pancholi

Scaling monitoring with Datadogalexismidon

Monitoring kubernetes across data center and cloud

Datadog

What's hot

Devops as a service

Saravanan Subburayal

Dataday Texas 2016 - Datadog

Datadog

Frappé Framework - A Full Stack Web Framework

rushabh_mehta

Meetup OpenTelemetry Intro

DimitrisFinas1

Relational Database CI/CD

Jasmin Fluri

Hadoop And Their Ecosystem ppt

sunera pathan

REX: Cloud Native Apps on a K8S stack

Mathieu Herbert

Keep CALMS and DevSecOps

Luciano Moreira da Cruz

Demystifying observability

Abigail Bangser

Debugging Your Debugging Tools: What to do When Your Service Mesh Goes Down

Aspen Mesh

Kubeflow

Karane Vieira

Introduction to Distributed Tracing

petabridge

DevSecOps - The big picture

Stefan Streichsbier

Introduction To DevOps | Devops Tutorial For Beginners | DevOps Training For ...

Simplilearn

DevOps and Tools

Mohammed Fazuluddin

Google cloud study jam 2019 #cloud studyjam

Wessam ElSharawy

AzureOpenAI.pptx

Udaiappa Ramachandran

What is DevOps | DevOps Introduction | DevOps Training | DevOps Tutorial | Ed...

Edureka!

OpenTelemetry: From front- to backend (2022)

Sebastian Poxhofer

Distributed tracing using open tracing & jaeger 2

Chandresh Pancholi

What's hot (20)

Devops as a service

Dataday Texas 2016 - Datadog

Frappé Framework - A Full Stack Web Framework

Meetup OpenTelemetry Intro

Relational Database CI/CD

Hadoop And Their Ecosystem ppt

REX: Cloud Native Apps on a K8S stack

Keep CALMS and DevSecOps

Demystifying observability

Debugging Your Debugging Tools: What to do When Your Service Mesh Goes Down

Kubeflow

Introduction to Distributed Tracing

DevSecOps - The big picture

Introduction To DevOps | Devops Tutorial For Beginners | DevOps Training For ...

DevOps and Tools

Google cloud study jam 2019 #cloud studyjam

AzureOpenAI.pptx

What is DevOps | DevOps Introduction | DevOps Training | DevOps Tutorial | Ed...

OpenTelemetry: From front- to backend (2022)

Distributed tracing using open tracing & jaeger 2

Viewers also liked

Scaling monitoring with Datadogalexismidon

Monitoring kubernetes across data center and cloud

Datadog

Application Monitoring using Datadog

Mukta Aphale

Running & Monitoring Docker at Scale

Datadog

Containerization (à la Docker) is increasing the elastic nature of cloud infrastructure by an order of magnitude. If you have adopted Docker, or are considering it, you are probably facing questions like: - How many containers can you run on a given Amazon EC2 instance type? - Which metric should you look at to measure contention? - How do you manage fleets of containers at scale? Datadog’s CTO, Alexis Lê-Quôc, presents the challenges and benefits of running Docker containers at scale. Alexis explains how to use quantitative performance patterns to monitor your infrastructure at the new level of magnitude and increased complexity introduced by containerization.

Why Visibility into Your Stack Matters

Amazon Web Services

When running any amount of systems, gaining visibility into what they are doing can be a non-trivial matter. Starting on the path to monitoring can prove bumpy, and if you don’t measure, you don’t know. In this session, Michael Fiedler, Director of TechOps, will speak on personal experience with scalability, deployment, and monitoring challenges prior to using Datadog - and how that changed. He will cover how to get started, and examples of where monitoring the company's platform with Datadog provided the guiding light towards the team solving scalability problems.

Datadog- Monitoring In Motion

Cloud Native Apps SF

Datadog + VictorOps Webinar

Datadog

CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

CloudCamp Chicago

The May 2015 CloudCamp "unconference" focused on "Big Data and Cloud" About CloudCamp: the event features short lightning talks, an "unpanel" with audience participation and questions, and small breakout clusters around beers and pizza. Hosted by Cohesive Networks at TechNexus. Slides for the night's Lightning Talks: "Big Data without Big Infrastructure" - Dan Chuparkoff, VP of Product at Civis Analytics @Chuparkoff "Simplicity, Storytelling and Big Data" - Craig Booth, Data Engineer at Narrative Science @craigmbooth "Spark: A Quick Ignition" - Matthew Kemp, Architect of Things at Signal @mattkemp "Building warehousing systems on Redshift" - Tristan Crockett, Software Engineer at Edgeflip @thcrock Join us next time. Register at cloudcampchicago.eventbrite.com

Elastic Data Analytics Platform @Datadog

C4Media

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L. Doug Daniels discusses the cloud-based platform they have built at DataDog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com. Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.

Native container monitoring

Rohit Jnagal

20161108 datadog and_sushi

Masahiro Hattori

Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud

Sylvain Kalache

Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Redis Labs

Think you have big data? What about high availability requirements? At DataDog we process billions of data points every day including metrics and events, as we help the world monitor the their applications and infrastructure. Being the world’s monitoring system is a big responsibility, and thanks to Redis we are up to the task. Join us as we discuss how the DataDog team monitors and scales Redis to power our SaaS based monitoring offering. We will discuss our usage and deployment patterns, as well as dive into monitoring best practices for production Redis workloads

Data Logging and TelemetryFrancesco Meschia

Deep-Dive to Application Insights

Gunnar Peipman

Intro to open source telemetry linux con 2016

Matthew Broberg

Abstract As part of the team delivering Snap, an open telemetry framework, I've run through dozens of use cases where gathering disparate metrics from services can roll up into meaningful diagrams for operations engineers and developers alike. We will use Snap's plugin model to collect, process and publish these measurements into meaningful graphs using open source tools. By joining this session, you can follow along and install industry-standard open source projects, deploy them and then use Snap to collect, process and visualize these metrics. Audience Anyone with an operations-background (or future ahead of them) that wants to see the breadth of available open source tooling around telemetry. This proposal is designed for the hands-on user, who is comfortable running containers or virtual machines locally. Experience Level Intermediate Benefits to the Ecosystem By joining this session, you can follow along and install industry-standard open source projects, deploy them and then use Snap to collect, process and visualize these metrics. This empowers users within the Linux ecosystem to see their knowledge as powerful when visualized next to other layers of the datacenter.

Sysdig Monitorama Slides

Loris Degioanni

RMG203 Cloud Infrastructure and Application Monitoring with Amazon CloudWatch...

Amazon Web Services

Amazon CloudWatch provides AWS customers the monitoring platform for keeping tabs on their cloud infrastructure and applications. In this session, we show you how to use CloudWatch to monitor vital operational resource data such as EC2 Instance CPU Utilization, ELB Request Counts, RDS Read Throughput and much more. Learn how to configure CloudWatch Alarms to alert you any time services are operating outside of ranges you define. Finally, see how you can monitor applications on your EC2 instances or outside of AWS.

Volta: Logging, Metrics, and Monitoring as a Service

LN Renganarayana

Our Logging, Metrics and Monitoring as a Service, Volta, is aimed at providing a scalable logging and metrics service for applications and services across the stack: starting from low level networks and core openstack services to platform services to Symantec products. Volta integrates with Keystone to provide secure authentication and multi-tenancy which is used to limit the visibility of logs/metrics to specific users/tenants or to specific services (e.g., only nova or only swift). Volta also provides features for setting up Alerts on log and metric events. In this session, we will share with you how we have built Volta using battle tested open source / OpenStack components such as Keystone, Kafka, Storm, ElasticSearch, InfluxDB, Logstash, Kibana, and Grafana. We will also present our Keystone based authentication and multi-tenancy model and its implementation for limiting the visibility of logs and metrics for queries and alerts.

'The History of Metrics According to me' by Stephen Day

Docker, Inc.

Metrics and monitoring are a time honored tradition for any engineering discipline. It is how we ensure the systems we use are working the way we expect. If this is a time honored tradition, why is it not a built into every piece of software we create, from the ground up? With software engineering, usually the trick to solving anything is to make it easier. By solving the hard parts of application metrics in Docker, we should make it more likely that metrics are a part of your services from the start.

Viewers also liked (20)

Scaling monitoring with Datadog

Monitoring kubernetes across data center and cloud

Application Monitoring using Datadog

Running & Monitoring Docker at Scale

Why Visibility into Your Stack Matters

Datadog- Monitoring In Motion

Datadog + VictorOps Webinar

CloudCamp Chicago - Big Data & Cloud May 2015 - All Slides

Elastic Data Analytics Platform @Datadog

Native container monitoring

20161108 datadog and_sushi

Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud

Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog

Data Logging and Telemetry

Deep-Dive to Application Insights

Intro to open source telemetry linux con 2016

Sysdig Monitorama Slides

RMG203 Cloud Infrastructure and Application Monitoring with Amazon CloudWatch...

Volta: Logging, Metrics, and Monitoring as a Service

'The History of Metrics According to me' by Stephen Day

Similar to Lifting the Blinds: Monitoring Windows Server 2012

Perfmon And Profiler 101

Quest Software

SharePoint 2013 Performance and Capacity Management

jems7

Web Performance Part 3 "Server-side tips"

Binary Studio

Testing pc’s performance lf

iteclearners

Ch14.run time support systemsYi-Jun Zheng

Performance Whackamole (short version)PostgreSQL Experts, Inc.

#SUGCON 2015 Sitecore Monitoring

chriswoj

Optimization In Mobile Systemsmomobangalore

Sql Server Performance Tuning

Bala Subra

This session is for you if you want to learn tips and techniques that are used to optimize database development with special emphasis on SQL Server 2005. If you write lot of stored procedures and want to learn the tools of a DBA, this is the session for you. If you are new to SQL Server development environment, you will learn how the various constructs compare to each other and better performance can be produced every time with a brief introduction to understanding Execution Plans.

Big data meet_up_08042016

Mark Smith

SQL 2005 Disk IO PerformanceInformation Technology

Testing pc’s performance

iteclearners

Windows Internal - Ch9 memory management

Kent Huang

Netezza fundamentals for developersBiju Nair

Introductiontoasp netwindbgdebugging-100506045407-phpapp01Camilo Alvarez Rivera

Google Cloud Computing on Google Developer 2008 Dayprogrammermag

Sql server troubleshooting

Nathan Winters

How Data Instant Replay and Data Progression Work Together

Compellent Technologies

16. PagingImplementIssused.pptx

MyName1sJeff

Application Performance LectureVishwanath Ramdas

Similar to Lifting the Blinds: Monitoring Windows Server 2012 (20)

Perfmon And Profiler 101

SharePoint 2013 Performance and Capacity Management

Web Performance Part 3 "Server-side tips"

Testing pc’s performance lf

Ch14.run time support systems

Performance Whackamole (short version)

#SUGCON 2015 Sitecore Monitoring

Optimization In Mobile Systems

Sql Server Performance Tuning

Big data meet_up_08042016

SQL 2005 Disk IO Performance

Testing pc’s performance

Windows Internal - Ch9 memory management

Netezza fundamentals for developers

Introductiontoasp netwindbgdebugging-100506045407-phpapp01

Google Cloud Computing on Google Developer 2008 Day

Sql server troubleshooting

How Data Instant Replay and Data Progression Work Together

16. PagingImplementIssused.pptx

Application Performance Lecture

More from Datadog

What it Means to be a Next-Generation Managed Service Provider

Datadog

Webinar that took place on July 12 2017. The emergence of cloud-based infrastructure has dramatically reshaped the IT landscape for managed service providers and their customers. Infrastructure is now dynamic, elastic, and instantly available to any individual or organization. Customers are becoming increasingly aware of the value of cloud services, and with this heightened awareness comes the desire to partner with providers who can guide them toward innovative business solutions and high-performance environments. But in this new landscape, gaining insight into the status and performance of dynamic infrastructure and applications is more challenging than ever. Join us as we host Thomas Robinson, Solutions Architect at Amazon Web Services, and Patrick Hannah, VP of Engineering at CloudHesive, to discuss what it means to be a next-generation managed service provider and how Datadog provides visibility into modern cloud infrastructure and helps you adopt new approaches to remain competitive in this ever-changing environment.

Docker Usage Patterns - Meetup Docker Paris - November, 10th 2015

Datadog

PyData NYC 2015 - Automatically Detecting Outliers with Datadog

Datadog

Monitoring even a modestly-sized systems infrastructure quickly becomes untenable without automated alerting. For many metrics it is nontrivial to define ahead of time what constitutes “normal” versus “abnormal” values. This is especially true for metrics whose baseline value fluctuates over time. To make this problem more tractable, Datadog provides outlier detection functionality to automatically identify any host (or group of hosts) that is behaving abnormally compared to its peers. These slides cover the algorithms we use for outlier detection, and show how easy they are to implement using Python. This presentation also covers the lessons we've learned from using outlier detection on our own systems, along with some real-life examples on how to avoid false positives and negatives. Learn more at www.datadoghq.com.

Monitoring Docker at Scale - Docker San Francisco Meetup - August 11, 2015

Datadog

Monitoring Docker containers - Docker NYC Feb 2015

Datadog

Treating Infrastructure as Garbage

Datadog

Events and metrics the Lifeblood of Webops

Datadog

The Data Mullet: From all SQL to No SQL back to Some SQLDatadog

Big (IT) data

Datadog

Deep dive into Nagios analytics

Datadog

Just enough web ops for web developers

Datadog

Customer Ops: DevOps <3 customer support

Datadog

I <3 graphs in 20 slides

Datadog

Effective monitoring with StatsD

Datadog

Alerting: more signal, less noise, less pain

Datadog

Fact based monitoringDatadog

Fact-Based Monitoring

Datadog

Your configuration management is fact-based. Your orchestration is fact-based. Is your monitoring fact-based? What does that even mean? Monitoring is very similar to configuration, at least in its expression. Configuration cares about files, services, and hosts being present and in a certain state (""nginx should be running with the following configuration""). Monitoring cares about services being present, running, and in a certain state. Both describe your infrastructure as it should be (""nginx should be running and respond in less than 200ms""). Fact-based monitoring is about being able to control monitoring with the same facts that Puppet uses (""monitor nginx latency wherever Puppet says it should run""). This is in contrast with imperative monitoring (""monitor nginx on host a, b and c"") that gets out of sync and leads to mailbox meltdowns from spurious alerts. Using open source and commercial examples, this talk will help you express your monitoring in a way that will feel very natural to your Puppet configuration.

Monitoring NGINX (plus): key metrics and how-to

Datadog

NGINX just works and that's why we use it. That does not mean that it should be left unmonitored. As a web server, it plays a central role in a modern infrastructure. As a gatekeeper, it sees every interaction with the application. If you monitor it properly it can explain a lot about what is happening in the rest of your infrastructure. In this talk you will learn more about NGINX (plus) metrics, what they mean and how to use them. You will also learn different methods (status, statsd, logs) to monitor NGINX with their pros and cons, illustrated with real data coming from real servers.

What’s in this Cookbook? - Mike Fiedler

Datadog

I Love Graphs - Alexis Lê-Quôc

Datadog

More from Datadog (20)

What it Means to be a Next-Generation Managed Service Provider

Docker Usage Patterns - Meetup Docker Paris - November, 10th 2015

PyData NYC 2015 - Automatically Detecting Outliers with Datadog

Monitoring Docker at Scale - Docker San Francisco Meetup - August 11, 2015

Monitoring Docker containers - Docker NYC Feb 2015

Treating Infrastructure as Garbage

Events and metrics the Lifeblood of Webops

The Data Mullet: From all SQL to No SQL back to Some SQL

Big (IT) data

Deep dive into Nagios analytics

Just enough web ops for web developers

Customer Ops: DevOps <3 customer support

I <3 graphs in 20 slides

Effective monitoring with StatsD

Alerting: more signal, less noise, less pain

Fact based monitoring

Fact-Based Monitoring

Monitoring NGINX (plus): key metrics and how-to

What’s in this Cookbook? - Mike Fiedler

I Love Graphs - Alexis Lê-Quôc

Lifting the Blinds: Monitoring Windows Server 2012

1. Lifting the Blinds: Monitoring Windows Server Read the full guide at: http://www.datadoghq.com/blog/monitoring-windows-server/

2. Datadog Overview • SaaS based infrastructure and app monitoring • Open Source Agent • Time series data (metrics and events) • Processing nearly a trillion data points per day • Intelligent Alerting and Insightful Dashboards

3. Monitor Everything Operating Systems, Cloud Providers (AWS), Containers, Web Servers, Datastores, Caches, Queues and more...

4. Agenda - Why should I monitor Windows Server? - What are some indicators of performance issues? - How can I collect performance metrics for analysis?

6. What to monitor?

8. CPU metrics - PercentProcessorTime - ContextSwitchesPersec - ProcessorQueueLength - DPCsQueuedPersec - PercentPrivilegedTime - PercentDPCTime - PercentInterruptTime

9. CPU: ContextSwitchesPersec What it tracks: Number of times the processor switched to a new thread Correlate with: Memory: PageFaultsPersec Disk: DiskTransfersPersec Network: BytesSentPersec/BytesReceivedPersec Issue resolution: Adding processors, thread partitioning, DPC partitioning, hardware interrupt partitioning, disable I/O counters

10. CPU: PercentProcessorTime What it tracks: Percentage of time spent performing work (not idle) Correlate with: ProcessorQueueLength Issue resolution: More processors, bigger instance, optimize offending application,

11. CPU: ProcessorQueueLength What it tracks: Size of processor queue Correlate with: CPU: PercentProcessorTime, PercentPrivilegedTime, PercentDPCTime, PercentInterruptTime Issue resolution: Adding processors, thread partitioning, DPC partitioning, hardware interrupt partitioning, disable I/O counters

12. CPU: DPCsQueuedPersec What it tracks: Deferred procedure call (DPC) enqueue rate Correlate with: CPU: PercentDPCTime Disk: DiskTransfersPersec Network: BytesSentPersec/BytesReceivedPersec Issue resolution: Remove buggy device, rollback driver

13. CPU: PercentPrivilegedTime/PercentDPCTime/PercentInterruptTime What they track: Percentage of time CPU spent in privileged mode/deferred procedure calls/interrupts Correlate with: ContextSwitchesPersec/PercentPrivilegedTime/PercentDPCTime/PercentInterruptTime Issue resolution: Adding processors, thread partitioning, DPC partitioning, hardware interrupt partitioning, disable I/O counters

14. Memory metrics - PoolNonpagedBytes - PageFaultsPersec - PagesInputPersec

15. Memory: PoolNonpagedBytes What it tracks: Amount of non-paged memory in use Correlate with: Windows Event 2019 “Nonpaged Memory Pool Empty” Issue resolution: Identify troublesome driver/roll back to known good state

16. Memory: PageFaultsPersec What it tracks: Rate of page faults Correlate with: PagesInputPersec Issue resolution: Increase system memory

17. Memory: PagesInputPersec What it tracks: Rate pages are read (from disk) into memory Correlate with: PageFaultsPersec/DiskTransfersPersec Issue resolution: Increase system memory, move page file to separate physical disk

18. Disk Metrics - AvgDiskQueueLength - DiskTransfersPersec - PercentIdleTime

19. Disk: AvgDiskQueueLength What it tracks: Running average of I/O ops in queue Correlate with: DiskTransfersPersec Issue resolution: Move data for I/O-intensive applications to separate disk; add disks to system

20. Disk: DiskTransfersPersec What it tracks: Aggregate I/O rate Correlate with: AvgDiskQueueLength Issue resolution: Move data for I/O-intensive applications to separate disk; add disks to system; increase disk cache

21. Disk: PercentIdleTime What it tracks: Percent of time disk is idle Correlate with: AvgDiskQueueLength Issue resolution: Move page file to separate disk; add disks to system; use SSDs

22. Tooling

23. Word of Warning

24. Powershell - Windows’ scripting language (no more batch files!) - Powerful language with deep OS support - Integrates with C# natively - Output is typed (unlike *NIX)

25. Powershell

26. Powershell

27. Perfmon

28. Windows Performance Toolkit Requires Windows Assessment and Deployment Kit (formerly Windows Performance Toolkit) https://www.microsoft.com/en-US/download/details.aspx?id=39982

29. Windows Performance Recorder

30. Questions? Evan Mouzakitis Research Engineer Twitter: @vagelim Email: evan@datadoghq.com Read the full guide at: http://www.datadoghq.com/blog/monitoring-windows-server/

Editor's Notes

Our goal is to help you monitor everything from all levels of your stack so that you can make intelligent, data-based decisions about your applications and infrastructure.
Why monitor Windows in the first place? Monitoring the performance of the applications that run your business is critical, but applications don’t live in a vacuum. Applications often interact with the underlying operating system to request resources, preempt the execution of other processes, access hardware devices, and more. Being aware of the health and performance of the operating system gives you more information when troubleshooting issues anywhere higher up in the stack (not to mention that monitoring the operating system is critical for insight into hardware issues). For example, is a SQL Server database query slow because of the query itself, or because the SQL Server is also hosted alongside Exchange and they are competing for disk access? These kinds of issues can only be surfaced when you monitor both the application in question and the underlying operating system.
A monitoring plan typically tries to cover Work metrics, Resource metrics, and non-metric data like events or code changes. As the broker between applications and hardware resources, when monitoring Windows server we are primarily focused on resource metrics, because that is what the operating system is managing. Work metrics are usually more applicable to application-level monitoring, but as you will see there are a few work metrics related to disk access that we’ll cover here too.
What kind of resources are we interested in monitoring? What kinds of metrics can we surface from those resources? Generally speaking, the most useful resources to monitor are CPU, RAM, disk, and network. Things like power consumption, thermal monitoring, noise and data of a similar nature, while useful, don’t usually add meaningful context to application or operating system performance issues.
At the highest level, the following metrics are useful in assessing CPU performance, and can shed light on performance bottlenecks depending on the kind of work the CPU spends most of its time performing.
ContextSwitchesPersec tracks the number of times the processor switched to a new execution context. Context switches are computationally expensive; before the processor can enter the execution context of another thread, it must first save the current context, push the old context to the bottom of its priority queue, find the highest priority queue containing an executable thread, pop it from its queue, load its context, and finally execute the thread. In a multi-core machine (common today), context switching adds significant overhead. By default, the Windows Task Manager measures I/O per-process, and attributing I/O to a particular process in a multi-core multithreaded environment can have a drastic performance impact under heavy I/O loads. If that’s the case, you would benefit from disabling global and per-process I/O counters by adding a CountOperations entry as a REG_DWORD with a value of 0 to the registry under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\I/O System\
PercentProcessorTime is a metric most everyone is familiar with, even if they don’t know the name. It tracks the percentage of time the CPU was doing something. In and of itself, this metric isn’t all that useful. For example, if I’m analyzing data on a single core machine, I’d expect the CPU to be in use 100% of the time. However, when correlated with ProcessorQueueLength, which tracks the number of pending threads, you have enough information to determine whether or not the system is suffering a CPU bottleneck. A queue length greater than 2 * the number of processors, coupled with prolonged periods of maxed out CPU utilization, very clearly indicates that the system does not have enough processor resources to perform all of its tasks.
The processor queue length is a value which reflects the number of threads that are ready to run, but are not able to use the processor. A healthy processor queue length is no more than about 2 * the number of processors on the system. Even on multicore machines, there is only one ProcessorQueueLength performance counter. High values for this counter very clearly indicate CPU contention. You can correlate this metric with other CPU metrics like PercentProcessorTime, PercentPrivilegedTime, PercentDPCTime, and PercentInterruptTime to determine where the CPU is spending its time, and to narrow down whether the CPU is the bottleneck causing the backed-up queue.
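The rule of thumb for a CPU bottleneck can be written as a small check. This is a minimal sketch in Python, not something shown in the webinar: the function name `cpu_bottleneck`, the 95% cutoff standing in for "maxed out", and the idea of passing lists of samples are all assumptions for illustration. Only the 2 * processors queue threshold comes from the notes.

```python
from statistics import mean


def cpu_bottleneck(cpu_percent_samples, queue_length_samples, processors,
                   busy_threshold=95.0):
    """Return True when the samples suggest the CPU is the bottleneck.

    Rule of thumb from the notes: a processor queue longer than
    2 * the number of processors, together with prolonged maxed-out CPU.
    busy_threshold (95%) is an assumed stand-in for "maxed out".
    """
    if not cpu_percent_samples or not queue_length_samples:
        return False
    # "Prolonged" is approximated as: every sample in the window is busy.
    sustained_busy = min(cpu_percent_samples) >= busy_threshold
    # Compare the average queue length over the window with the threshold.
    queue_backed_up = mean(queue_length_samples) > 2 * processors
    return sustained_busy and queue_backed_up
```

A busy CPU with a short queue returns False here, which matches the point made above that high utilization on its own is not a problem.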
Hardware requirements demand real-time, unfettered access to the CPU in order to ensure that high-priority work (like accepting keyboard input) is performed when it is needed. Interrupts provide a means by which devices can interrupt the processor and force it to perform the requested operation (triggering the processor to perform a context switch). Some work from devices may be put off until later, but still must be accomplished in a timely manner. Enter DPCs. Through DPCs, real-time processes like device drivers can schedule lower-priority tasks to be completed after higher-priority interrupts are handled. DPCs are created by the kernel, and can only be called by kernel mode programs. A large or near-constant number of DPCs could point to issues with low-level system software. An unused but buggy sound driver could be the culprit, for example.
This trio of metrics, taken together, helps to shed light on where the CPU is spending its time. In particular, privileged time reflects the time spent executing instructions for kernel-mode programs. Code executing in privileged mode has unrestricted access to the system’s hardware. This includes device drivers, core operating system functions, etc. If you observe a system spending 30 percent or more of its time processing privileged instructions, check the values of PercentDPCTime and PercentInterruptTime. If either of those two metrics reports values greater than 20%, it is likely that a poorly written device driver or a very busy peripheral is the culprit.
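The triage order described here (look at privileged time first, then at DPC and interrupt time) can be sketched as follows. The function name and the returned strings are hypothetical; the 30% and 20% thresholds are the ones given in the notes.

```python
def triage_privileged_time(privileged_pct, dpc_pct, interrupt_pct):
    """Apply the 30% / 20% rule of thumb from the notes to one sample."""
    if privileged_pct < 30:
        return "ok"
    # Privileged time is high: see which of the two sub-metrics explains it.
    suspects = []
    if dpc_pct > 20:
        suspects.append("PercentDPCTime")
    if interrupt_pct > 20:
        suspects.append("PercentInterruptTime")
    if suspects:
        return "suspect driver or busy peripheral: " + ", ".join(suspects)
    return "high privileged time, cause unclear"
```

In practice you would probably apply this to an average over several samples and not to a single reading, since short spikes in any of the three counters are normal.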
As with CPU metrics, Windows exposes a wealth of performance counters tracking memory statistics. We’ve omitted AvailableMemory and similar metrics from this webinar because they are pretty self-explanatory. The three listed here, PageFaultsPersec, PoolNonpagedBytes, and PagesInputPersec provide insight into the nature of issues which may be impacting performance. We’ll touch on each in turn, but at a high level, PageFaultsPersec tracks the rate of page faults, PoolNonpagedBytes describes the current size of non-pageable memory, and the last, PagesInputPersec, describes the rate of pages read from disk (which is distinct from the number of page reads from disk).
Windows maintains two general pools of memory: a paged pool and non paged pool. The paged pool is for general use and is the pool used by all user space applications for memory allocation. Because user space applications are more tolerant to latency, or, to put it another way, because user space applications don’t generally have real-time requirements, they can get by if the requested memory needs to be read in (or paged in) from disk. Because kernel-level software has real-time execution requirements, device drivers and the like make use of the non paged pool. The non paged pool is guaranteed to reside in physical memory at all times, with no possibility of being paged to disk (hence the name “non paged”). This significantly reduces latency by preventing the possibility of page faults. No memory pool is infinite, and poorly written device drivers could end up exhausting the entire non paged pool if left unchecked. If you are seeing reports of Event 2019, it’s already too late. But keeping an eye on the size of this pool and its growth over time are necessary to identify and deal with any troublesome drivers or hardware.
Page faults occur when a thread references a page that is not in the current set of memory-resident pages. Because the thread can’t perform its work without the requested memory, a hardware interrupt occurs, the processor enters into kernel-mode (resulting in a context switch—both upon entering and exiting kernel-mode), and attempts to locate the page in memory. If the page is found somewhere else in memory, it is that address which is returned to the requesting thread. This is called a “soft” page fault. If the page is not elsewhere in memory the kernel will look in the page file and read it into memory. This is called a “hard” page fault. Because this operation requires accessing the disk, it is more computationally expensive to perform this type of lookup. Page faults occur under normal operating conditions, but a spike in page faults could result in serious performance degradation, depending on the “hardness” of the fault. By tracking the page fault rate alongside the page input rate, you can differentiate between hard and soft page faults. High values of both metrics unequivocally indicate hard page faults. There’s not much you can do to prevent soft page faults from occurring, but increasing the amount of RAM available on the system is a straightforward way of alleviating hard page faults. It is worth mentioning that when a hard page fault does occur, Windows attempts to retrieve multiple, contiguous pages into memory, to maximize the work performed by each read. This, in turn, can potentially increase a page fault’s performance impact, as more disk bandwidth is consumed reading in potentially unneeded pages. All of this can potentially be avoided by putting your page file (see next section) on a separate physical (not logical) disk, or increasing the amount of RAM available to your system.
As I mentioned, there are two types of page faults, and tracking PagesInputPersec alongside PageFaultsPersec gives you the information you need to determine the type of page fault occurring. If you are seeing high values of both metrics, the page faults are hard. The effects of hard page faults can be exacerbated if disk is a contentious resource. To give a simplified example, if you have a system with one disk and it’s running an I/O intensive application, page faults will hit this system harder (and performance will degrade in the application) because Windows is competing with the application for disk access (and Windows always wins). This goes to show that an excessive number of page faults can be responsible for system wide effects, completely unrelated to the application experiencing performance degradation.
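The way the two counters combine can be sketched as a small classifier. The notes give no numeric definition of "high", so both cutoffs below are placeholders you would tune against your own baseline, and the function name is hypothetical.

```python
def classify_page_faults(page_faults_per_sec, pages_input_per_sec,
                         fault_threshold=1000.0, input_threshold=100.0):
    """Classify a sample as 'normal', 'soft' or 'hard' page faulting.

    Both thresholds are assumed placeholder values, not figures from
    the webinar; "high" depends on the workload and the hardware.
    """
    if page_faults_per_sec < fault_threshold:
        return "normal"  # page faults occur under normal operation
    if pages_input_per_sec >= input_threshold:
        return "hard"    # faults are being satisfied by reads from disk
    return "soft"        # faults are being resolved from memory
```

A "hard" result is the case where adding RAM or moving the page file to a separate physical disk is worth considering.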
Though there are many disk metrics worth tracking, I’ve distilled the list to the most essential, while omitting the obvious, like PercentFreeSpace.
The AvgDiskQueueLength counter gives an estimated average of the number of I/O operations currently awaiting execution. Generally speaking, this counter should not exceed 2 * the number of drives on the system. If you are seeing greater values than that, it means the system cannot service the number of I/O requests it’s receiving in a timely manner, which can lead to processing delays, degraded application performance, and more.
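The disk queue guideline is a single comparison. A minimal sketch, assuming you already have the counter value and know how many drives the system has (the function name is hypothetical):

```python
def disk_queue_saturated(avg_disk_queue_length, drives):
    """True when the queue exceeds the 2 * drives guideline from the notes."""
    if drives < 1:
        raise ValueError("drives must be at least 1")
    return avg_disk_queue_length > 2 * drives
```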
DiskTransfersPersec is an aggregate measure of both disk reads and writes. It is useful for shedding light on the cause of bottlenecks. High values for this metric do not always indicate issues; for example, if you are running I/O intensive applications on your server you are definitely going to observe high values for this metric (and most likely low values for PercentIdleTime as well). However, if I/O ops are not being enqueued (per the AvgDiskQueueLength metric) and applications are not hurting for memory (and thus paging to disk), there should be no observable performance impact.
PercentIdleTime is a pretty intuitive metric that tracks the percent of time disks are idle. Depending on the role of the system under investigation, low idle times may be expected, especially when running I/O intensive applications like SQL Server or Exchange. If that’s not the case, low values should be investigated. If you don’t already have your page file stored on a separate drive, you should do so. Otherwise, consider either adding disks to the system to increase performance, or swapping out HDDs for SSDs if possible.
Windows offers numerous methods by which you can collect, store, and visualize system performance data. Because the methods are so varied, I will only go through a couple of the tools that I have experience with. All of the tools mentioned are native to Windows Server 2012 R2 so you can get up and running quickly.
Reading performance counters does not generally appear to have much of an impact on system performance. In my tests, collecting 2631 counters with 1-second sample rate caused a 4 percent increase in user CPU usage (by perfmon). There are a few things to keep in mind, though: depending on the data collected and the duration of the collection, the collected data could be very large. To give you an idea about the size of the data collected, in a test collecting handle and kernel base events, pagefaults, cpu, I/O and memory samples, the data grew at a rate approaching 100 MB/min. Additionally, if you are collecting data from your local machine, you may see occasional spikes in I/O latency; in my tests I observed response times for some user space applications in excess of 2000 ms! Also, I did not attempt to collect performance counters from user applications which may have an impact on the application’s performance. And as I mentioned earlier in the CPU section, if you are sampling I/O with processor-specific information, you most certainly will observe degradation in performance.
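To put the 100 MB/min figure in context, here is a short back-of-the-envelope calculation. The helper names and the example disk size are assumptions; the growth rate is the one observed in the test described above and will differ for other collection profiles.

```python
def trace_size_mb(rate_mb_per_min, minutes):
    """Approximate size of collected data after the given duration."""
    return rate_mb_per_min * minutes


def minutes_until_full(free_mb, rate_mb_per_min):
    """How long collection can run before the free space is used up."""
    return free_mb / rate_mb_per_min


# At roughly 100 MB/min, one hour of collection is about 6000 MB.
one_hour_mb = trace_size_mb(100, 60)

# With an assumed 50,000 MB of free space, collection can run about 500 minutes.
runway_minutes = minutes_until_full(50_000, 100)
```

It is worth running this kind of estimate before starting a long collection, especially when the data is written to the same disk that is under investigation.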
Powershell is great for collecting performance counters programmatically. You can query the event log from powershell as well. You can use powershell to collect metrics from local and remote machines.
Here are some example powershell commands for retrieving CPU-related performance counters. As you can see, there is a regular pattern. For a full list of commands to retrieve performance counters for CPU, memory, disk, network, and events, check out my “How to collect Windows Server 2012 metrics” article on the datadog blog. https://www.datadoghq.com/blog/collect-windows-server-2012-metrics/#toc-powershell
Last thing about powershell, if you want to do something in powershell and there’s no pre-packaged cmdlet to get you what you want, you can always interface with WMI to get what you’re looking for.
In my honest opinion, perfmon is not nearly as useful as xperf or Windows Performance Recorder when it comes to investigating performance issues. It is a good tool to help spot issues, but not so good for getting into the nitty gritty. Here’s a screenshot of perfmon collecting “System Performance counters” a counter set provided out of the box. As you can see, there is a lot going on. My investigation was focusing on the cause of excessive memory use, visualized as the black bar nearly pinned to the 100 mark. From this image it’s clear that something is going on, but since I was only collecting the Total memory usage (as opposed to collection per-process), it isn’t clear which process is exhausting RAM. To determine the underlying cause in this case requires me to re-run perfmon, this time collecting per-process counters in addition to the total, and hoping that my issue arises again. As you’re about to see, we can do better.
The Windows Performance Toolkit contains the Windows Performance Recorder and the Windows Performance Analyzer (WPA). Though not strictly “native” since it requires a download, it is a useful, graphical tool made by Microsoft for collecting and analyzing Windows performance data.
Windows Performance Recorder is a modern replacement for xperf. It features both graphical and command line interfaces. Here you can see the available collection profiles. Collecting data with the Windows Performance Recorder is as easy as clicking “Start”. Technically, Windows Performance Recorder and xperf do not merely collect performance counters; they are tracing mechanisms for collecting fine-grained performance data. As you will see, traces are superior to performance counters when investigating performance issues.
