Capturing Comprehensive Storage
Workload Traces in Windows
Dr. Bruce Worthington
Windows Server Performance
Microsoft Corporation
My Motivation
• I’m tired of seeing storage research and performance
analysis limited by real-world trace availability.
– It’s not all that much better now than it was when I was in grad school in the early ’90s…
• I’ve been saying I would help supply researchers with
long-term real-world traces and post-processing tools for
almost a decade.
– I’m finally following through on the promise to supply
production traces.
– More importantly, Microsoft has made it easy for
anyone to capture detailed storage workload
traces on Windows systems (along with many
other types of traces and profiles).
Outline
• The tools
– Event Tracing for Windows (ETW)
– New: xperf, xperfinfo
– Old: logman, trace*
• The traces
– Benchmark (steady-state) workloads
– Production workloads
• The challenge
Event Tracing for Windows (ETW)
• ETW has been the core Windows tracing component
since Windows 2000 and is continually improved
• Many Windows components, including the kernel,
produce events describing their behavior
• Events from user-mode applications and kernel-mode
drivers can be logged
• High performance, low overhead, highly scalable
– Efficient buffering and non-blocking logging
mechanisms using per-processor buffers written to
disk by a separate thread
• Tracing can be enabled/disabled dynamically without
requiring system reboots or application restarts
Event Tracing for Windows (ETW)
• Events can be sample-based, but most are single-instance (event A
occurred at time T)
• Support for real-time consumption and file-based tracing
• Configurable logging mode, buffer size, buffer count
– Sequential traces
– Circular traces
– Circular traces in memory (black-box flight recorder) [Vista]
• Adding custom events enables better correlation of application activity with
low-level resource usage
• On a standard Vista computer, the logging API (EventWrite) takes about
5,000 cycles, mostly spent in acquiring the timestamp via
QueryPerformanceCounter (QPC)
– About 2.5% processor overhead for a sustained rate of 10,000
events/second on a 2GHz processor – not including the cost of flushing
trace buffers to disk
• Postprocessing the binary disk log correlates events with context and domain-specific knowledge
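The 2.5% figure follows directly from the numbers quoted above; a quick sketch of the arithmetic:

```python
# Back-of-the-envelope check of the ETW logging overhead quoted above:
# ~5,000 cycles per EventWrite call (mostly QPC), 10,000 events/second,
# on a 2 GHz processor. Buffer-flush cost is excluded, as on the slide.
cycles_per_event = 5_000
events_per_second = 10_000
cpu_hz = 2_000_000_000  # 2 GHz

cycles_spent_logging = cycles_per_event * events_per_second
overhead = cycles_spent_logging / cpu_hz  # fraction of one processor

print(f"{overhead:.1%}")  # → 2.5%
```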
ETW Architecture
• Provider
– Provides event traces. Can be user-mode app,
kernel-mode driver, or kernel.
– Providers use ETW APIs to register with the ETW
framework to send event traces from various
points in the code.
– When enabled, the provider sends event traces to
a specific trace session designated by the
controller.
• Controller
– Assists in starting, stopping or updating trace
sessions in the kernel as well as enabling or
disabling providers
– Sets trace session properties such as sequential or
circular file logging or direct delivery to consumers
• Consumer
– Reads trace files or listens to active trace sessions
and processes logged events
– Not aware of the providers
– Receives event traces only from trace sessions or log files
• Event Trace Session infrastructure
– Brokers the event traces from provider(s)
to consumer(s) and adds data to each
event (e.g., TimeStamp, Thread,
Process, CPU)
“NT Kernel Provider”
• Events
– Process, thread and image
– Sampled Profile
– Context Switch
– Dispatcher (Ready Thread)
– DPC (Deferred Procedure Call)
– ISR (Interrupt Service Routine)
– Disk I/O
– File I/O
– Registry
– Hardfault
– Pagefault
– Driver delay
– TCP/UDP
– Power
– ALPC
– Virtual Allocation
– Heap
– Memory
– …
• Related providers
– Thread Pool
– Power Transition
– Winlogon
– Services
– Prefetch
• Other providers
– Shell
– Internet Explorer
– Media Foundation
– Media Center
– …
System Config: ETW Instrumentation
• Automatically added to kernel traces
• Rundown of system configuration at trace start/stop time
– CPU (number of logical and physical processors, frequency)
– Memory (memory size, page size, allocation granularity)
– Disk (physical disks, partitions, volumes)
– Video adapters
– Network adapters (IPv4, IPv6)
– Services (including service tag)
– Plug-and-Play Information
– IRQ Assignment
– Power capabilities (S1 - S5)
– Network Identity (computer name, domain name)
– Group Masks (which kernel flags are enabled)
Storage-Related Instrumentation
• Disk Events:
– Read, Write, Flush Initiation/Completion
• File Events:
– Filename Create, Delete, Rundown (when trace stopped)
– File I/O Initiation, Hard Fault
• Create, Cleanup, Close, Flush, Read, Write, Set Info, Query Info,
FSCTL, Delete, Rename, Directory Enumeration, Directory
Notification
– File I/O Completion
• Driver Events:
– Driver Call, Return (Major Function)
– Driver Complete Request, Complete Request Return
– Driver Completion Routine
• Binary storage-related event sizes range from ~30-80 bytes (not
counting events that dump unique filenames)
The new tools: xperf & xperfinfo
• Extensible performance analysis toolset
• High-level control and decoding of ETW traces
– Emphasis on kernel events and system-wide
resource usage
– Support for 3rd-party events, primarily in conjunction with kernel events
• Cross-platform
– Windows XP SP1+, Vista
– Windows Server 2003, Windows Server 2008
• Cross-architecture (x86, x64, ia64)
• Capture-anywhere, process-anywhere
xperf
• Detailed interactive analysis of performance
traces
• High-level resource usage graphs on common
trace timeline with zoom capability
• Low-level discrete graphs for resource state
transitions
– Individual context switch and disk I/O events
• Powerful interactive summary tables with
dynamic grouping, sorting and aggregation
capability
Currently Available Graphs and
Summary Tables
– Disk I/O Counts
– Disk I/O Detail
– Disk Utilization
– File I/O
– DPC (Deferred
Procedure Call)
– ISR (Interrupt Service
Routine)
– Hardfault
– Pagefault
– Driver Delay
– Sample Profile
– CPU Availability
– CPU Scheduling
– Process Lifetime
– Registry counts
– Services
– Plug ’n’ Play
– Marks
– Generic
– …
Overview of xperf
Storage Activity Notes
• Disk Reads, Writes and Flushes
– Vista introduced low-priority I/Os, which are deferred in a special queue
to allow current and near-term future normal priority I/Os to complete
– Flushes may be “completed” by low-level storage drivers under certain
conditions
• Hard Faults
– Synchronous I/Os that block execution of issuing thread
– Paging-in from disk pages not currently present in memory
• Communication aspect: System read-ahead and write-back
– Asynchronous just-in-time prefetch for sequentially read buffered files
– Asynchronous buffered writes
– Issued from “System (4)” process
• xperf & xperfinfo infer disk “queue” parameters (wait time, service time,
queue depths, skip behavior) assuming a single serialized spindle
– not a valid assumption for disk arrays
– Queue depth can be thought of as “number of requests in flight”
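A minimal sketch of that inference, assuming (as above) a single serialized spindle: the disk is treated as one FCFS server, so a request's service begins when the previous request completes (or at its own issue time if the queue was empty), and everything before that is wait time. The `(issue, complete)` timestamp pairs below are illustrative, not real ETW event fields.

```python
# Infer per-request wait/service times from (issue, complete) timestamps,
# assuming a single serialized spindle (one request in service at a time).
# As noted above, this assumption is NOT valid for disk arrays.

def infer_queue_params(requests):
    """requests: list of (issue_time, complete_time), sorted by issue time.
    Returns a list of (wait_time, service_time) per request."""
    results = []
    prev_complete = 0.0
    for issue, complete in requests:
        # Service cannot start until the previous request has completed.
        start_service = max(issue, prev_complete)
        results.append((start_service - issue, complete - start_service))
        prev_complete = complete
    return results

# Two overlapping requests: the second waits for the first to finish.
reqs = [(0.0, 5.0), (1.0, 9.0)]
print(infer_queue_params(reqs))  # → [(0.0, 5.0), (4.0, 4.0)]
```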
xperf Demo
• System Config
• CPU
• DPC & ISR
• Process
• Disk Summary Table
• Disk I/O Detail
• Disk I/O Detail Summary
• Hard Faults
Sample xperf Screenshots
• Sidebar chart selection
• Selecting a time range
• CPU Usage Summary Table
• DPC and ISR CPU Usage Frames
• Disk I/O Summary Table
• Disk I/O Detail
• Disk I/O Detail Summary Table
[Screenshots of the xperf UI, with callout labels:]
• Sidebar frame selector (callouts: sidebar, frames, scrollbar)
• Selecting a time range (on CPU Usage frame)
• Go to CPU Usage Summary Table (context-menu: Summary Table)
• CPU Usage Summary Table (callouts: status bar report; % of time excluding DPC and ISR; % total time; selected time interval; close summary table)
• DPC and Interrupt CPU Usage frames
• Go to Disk I/O Summary Table (context-menu: Summary Table)
• Disk I/O Summary Table (expanded individual I/Os)
• Go to Disk I/O Detail (detail graph)
• Disk I/O Detail, Disk #0 (callouts: change disk, select processes)
• Disk I/O Detail, Disk #1
• Disk I/O Detail, Disk #1 (selection)
• Disk I/O Detail Summary Table (default sort field)
xperfinfo
• High-level control and decoding
• Merging and dumping of ETW traces
• Many command line actions to analyze and
report on various aspects of a trace
• Various buffering and log file options
• Multiple timer sources
• Traces of boot activity
xperfinfo Demo
• Start/stop
• Provider list
• Dump
• Postprocessing summaries
Taking a Kernel Trace
• Start kernel trace; run scenario; stop and merge
• Start user trace; run scenario; stop
• Hint: Retrieve all known kernel flags and groups
Overview of xperfinfo
C:\analysis> xperfinfo -on base+FILE_IO+INTERRUPT
C:\analysis> MyTestApp.exe
C:\analysis> xperfinfo -d trace.etl
C:\analysis> xperfinfo -help providers
C:\analysis> xperfinfo -start MySession -on Kerberos+MRxSmb -f kerberos.etl
C:\analysis> MyTestApp.exe
C:\analysis> xperfinfo -stop MySession
Dumping a Trace
C:\analysis> xperfinfo -i trace.etl -o trace.txt
[1/2] 100.0%
[2/2] 100.0%
C:\analysis> notepad trace.txt
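Text dumps lend themselves to quick ad hoc analysis. As a sketch, assuming a hypothetical comma-separated dump layout (the real dumper output differs; adjust the parsing to match), per-disk I/O counts could be tallied like this:

```python
# Count disk I/O events per disk from a (hypothetical) text dump.
# Real xperfinfo dump lines have a different layout; this only
# illustrates the tallying idea.
from collections import Counter

def count_ios_per_disk(lines):
    counts = Counter()
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        if fields[0] in ("DiskRead", "DiskWrite"):
            disk = fields[1]  # assumed: second field is the disk number
            counts[disk] += 1
    return counts

sample = [
    "DiskRead, 0, 12345, 8192",
    "DiskWrite, 0, 12350, 4096",
    "DiskRead, 1, 12360, 8192",
    "FileIoRead, 0, 12365, 8192",  # non-disk event, ignored
]
print(count_ios_per_disk(sample))  # → Counter({'0': 2, '1': 1})
```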
xperfinfo Named Providers
• DISK_IO: Disk I/O
• DISK_IO_INIT: Disk I/O initiation
• SPLIT_IO: Split I/O
• FILE_IO: File system op end
times/results
• FILE_IO_INIT:
Create/open/close/read/write
• FILENAME: Create/delete/rundown
• HARD_FAULTS: Hard page faults
• ALL_FAULTS: All page faults including
hard, copy-on-write, demand-zero faults
• DPC: Deferred Procedure Calls
• INTERRUPT: Interrupts
• DRIVERS: Driver events
• PROC_THREAD: Create/delete
• CSWITCH: Context switch
• COMPACT_CSWITCH
• DISPATCHER: CPU Scheduler
• PREFETCH: Prefetching
• LOADER: Image load/unload
• SYSCALL: System calls
• PROFILE: CPU sample profile
• MEMORY: Memory tracing
• POOL: Memory pool tracing
• VIRT_ALLOC: Virtual alloc reserve and
release
• NETWORKTRACE: TCP/UDP, send/rcv
• REGISTRY: Registry tracing
• POWER: Power management
• WORKER_THREAD: System worker
thread
• PERF_COUNTER: Process perf
counters
• ALPC: Advanced Local Procedure Call
• …
Available xperfinfo Reports (“actions”)
• tracestats
• sysconfig
• dumper
• diskio
• filename
• hardfault
• pagefault
• dpcisr
• process
• cswitch
• drvdelay
• marks
• perfctrs
• profile
• registry
• boot
• suspend
• shutdown
• …
The old tools
• Installed with Windows
– Logman: Collects performance counters
– Tracerpt: Processes ETW log files or real-time sessions
• Installed with Driver Development Kit
– http://www.microsoft.com/whdc/devtools/ddk/default.mspx
– Tracelog: Starts, stops, or enables trace logging
• http://msdn2.microsoft.com/en-us/library/ms797927.aspx
– Tracefmt: Dumps ETW binary files into text files
• http://msdn2.microsoft.com/en-us/library/ms797564.aspx
– Traceview: Controls and displays ETW information
• http://msdn2.microsoft.com/en-us/library/ms797981.aspx
The traces: Benchmark Workloads
• Easy to capture and make available
– TPC-C, TPC-E, TPC-H, TPC-DS?
– SAP-SD
– Terminal Server
– NetBench
– SPC?
– …
Example: TPC-C Trace
• Windows Server 2008 / SQL Server 2005
• ~32 minutes
• 93.5 million disk I/Os (58.6M read, 35.9M write)
• 16-socket, dual-core 3.4 GHz Intel Xeon (16 MB
L3 cache)
• 256 GB RAM
• 1106 15K rpm FC SCSI disks
– 79 database LUNs
Example: TPC-C Request Sizes
• 94.8% 8KB requests; 2.1% 16KB requests
• Remaining requests: [distribution chart]
Example: TPC-C Trace Locality
• ~3% of all writes are within 128 sectors of the
previous write to the same disk
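A locality metric of this kind is easy to compute during post-processing. A sketch in Python, with made-up `(disk, start_sector)` tuples standing in for real trace records:

```python
# Fraction of writes landing within `window` sectors of the previous
# write to the same disk (the locality metric described above).
def write_locality(writes, window=128):
    """writes: list of (disk, start_sector) tuples, in time order."""
    last = {}  # disk -> start sector of that disk's previous write
    near = total = 0
    for disk, sector in writes:
        if disk in last:
            total += 1
            if abs(sector - last[disk]) <= window:
                near += 1
        last[disk] = sector
    return near / total if total else 0.0

# Illustrative records: two of the three eligible writes are within
# 128 sectors of the previous write to the same disk → 2/3.
w = [(0, 100), (0, 164), (1, 5000), (0, 90000), (1, 5064)]
print(write_locality(w))
```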
Example: Terminal Server
Knowledge Worker (TS-KW) Trace
• Windows Server 2008, Office 2007
– Word, Excel, and Outlook activity
– 180 concurrent users
• ~45 seconds
– >750,000 Context Switches
• 16 thousand disk writes (in-memory working set)
– Mostly sequential
• 2-socket, quad-core 2.66 GHz Intel Xeon (4 MB L2
cache)
• 32 GB RAM (holds working set)
• One ATA 120GB disk
Example: TS-KW Interarrival Times
The traces: Production Workloads
• First set of internal Microsoft targets:
– SQL Server
– Exchange
– SharePoint
– File Server
– Web Server
– Media Server
– SAP
– Active Directory
– Security Server
– Backup
– Search
– Office Desktops, Laptops, Tablets
Example: SQL Server Replica for RADIUS Authentication
Data for RAS & Wireless (worldwide)
• Windows Server 2003 / SQL Server 2005
• Three sequential 1-hour traces
– 126.9 thousand I/Os (16.3K read, 110.5K write)
– 122.6 thousand I/Os (10.8K read, 111.8K write)
– 101.6 thousand I/Os (5.9K read, 95.8K write)
– Locality:
• ~25% sequential requests
• More than half of all read and write requests are within 100,000 sectors
of the immediately previous request (to the same disk)
– Mostly 512KB reads in first trace; no 512KB reads in other
traces
• 4-socket, hyperthreaded 1.9 GHz Intel Xeon
• 8 GB RAM
• Dual-port Gb network card
• Five 4GB “disks” (configuration unknown)
xperf Demo
• Find region of 512KB reads in Trace 1
– Identify file being read
– Examine corresponding Disk I/O Detail
• Hard Fault Frame
• Hard Fault Summary Table
– File + File Offset → Disk + Disk Offset
(Top-to-bottom disk I/O tracing)
[Screenshots of the demo, with callout labels:]
• Select read-heavy region
• Disk I/O Summary Table (one particular *.mdf file)
• Disk I/O Detail, Disk #0
• Disk I/O Detail, Disk #3 (most of this activity is to the .mdf file indicated in the Disk I/O Summary Table)
• Select hard fault region
• Hard Fault Summary Table (expand file offsets, timestamps)
RADIUS SQL Server Replica, Trace 2:
Disk Offset Distribution for Disk 0
RADIUS SQL Server Replica, Trace 2:
Disk Offset Distributions
RADIUS SQL Server Replica, Trace 2:
Interarrival Times (within each disk)
The traces: System Configurations
• From mobile devices to datacenter servers
– Scale-out and scale-up environments
• 1-32 sockets
• 1-64 cores
• 1-1000 GB RAM
• NTFS, FAT, Raw
• ATA, SATA, SCSI, SAS, FC
• Solid state drives
The traces: Postprocessing
• Simple scripts and programs (e.g., Perl and C#) will be available to:
– “Sanitize” traces by replacing some or all file,
directory, and process names with generic strings
– Extract basic statistics from xperf dumps on an
overall, R/W, per-disk, or per-size basis
• Request sizes
• Spatial distributions
• Queue lengths
• Interarrival times
• …
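As a sketch of the sanitization step (the released Perl/C# scripts may behave differently), a consistent renaming map preserves access-pattern correlations while hiding real names:

```python
# Sketch of trace sanitization: replace file/process names with generic,
# consistent placeholders so access patterns survive but names do not.
def make_sanitizer(prefix="name"):
    mapping = {}
    def sanitize(name):
        # The same input always maps to the same placeholder, preserving
        # correlations between events that touch the same file.
        if name not in mapping:
            mapping[name] = f"{prefix}{len(mapping):04d}"
        return mapping[name]
    return sanitize

s = make_sanitizer("file")
print(s(r"C:\secret\payroll.mdf"))   # → file0000
print(s(r"C:\secret\other.ldf"))     # → file0001
print(s(r"C:\secret\payroll.mdf"))   # → file0000 (stable mapping)
```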
The tools and traces: Availability
• xperf & xperfinfo will be released with the next Windows SDK
(in conjunction with Windows Vista SP1 and Windows Server
2008)
• Benchmark traces will be provided to the SNIA IOTTA group
in Sept ’07
– Scripts for sanitization and basic stats analysis included
• Production traces will be provided as they are captured and
sanitized, hopefully on a monthly basis for years to come
– Captures are in progress on multiple Microsoft IT servers
with varying workloads
– Traces will be dumped in manageable chunks
• All tools and traces have standard Microsoft disclaimers
• Microsoft would like to thank Seagate for providing disk drives
to store the internal Microsoft trace repository!!!
xperf/xperfinfo Future Enhancements
• Equivalent file block → disk block event correlation for write requests (a la hard fault reads)
– Mapped file writes
– Lazy writer
– Dirty page writer
– Unbuffered writes
– In the meantime, write-after-reads and (in some cases) sequential writes can be translated
• Built-in process/file/directory sanitization
• Extensibility
• …and much more!
Summary
• Event Tracing for Windows (ETW) = the engine
– Instrumentation built into the retail Windows operating system
– The NT Kernel Provider provides coverage of kernel-level activity
• xperf = the interactive browser
– High-level graphs
– Summary tables
– Individual event detail
• xperfinfo = the command line automation tool
– ETW controller and decoder
– Exports human-readable decoding of all trace events
– Many custom actions distilling various aspects of the trace
• Alpha version of xperf/xperfinfo can be requested from:
– wperftkt@microsoft.com
– Bruce.Worthington@microsoft.com
• Additional Resources
– Event Tracing for Windows on MSDN
http://msdn2.microsoft.com/en-us/library/aa363787.aspx
– “Windows Internals 4th edition” by Russinovich and Solomon
The challenge: Capture and Share Traces!
• Microsoft is committed to gathering long-term
(weeks/months) traces on many production systems
within the corporate IT environment
• Start the wheels rolling in your organization to allow
similar traces to be captured, sanitized, and published;
use existing tools to start with and xperf/xperfinfo when
they become available
• Create and share post-processing tools, simulators,
models, etc., via SNIA IOTTA repository
– http://iotta.snia.org
• Provide feedback on xperf & xperfinfo
– Be patient, as this is an engineering analysis tool (not an MS product) and is supported as such.
Q & A
Backup Slides
What is ETW used for?
• Debug application problems, including hangs, crashes, or unexpected behavior
• Diagnose performance problems
• Track computing resource consumption at
application transaction level for capacity
planning
ETW vs. Performance Counters
ETW
• Individual events described using
multiple standard/custom attributes
• Each event requires a timestamp
• Each event requires additional
space
• An ETW trace can be used to
compute aggregations on any group
of events (filtered by time or any
attributes) at post-processing time
– Various perspectives
• Can zoom down to individual events
Performance Counters
• Aggregate information about groups
of events
• Each sample requires a timestamp
• Each sample requires additional
space
• Very lightweight; events aggregated in place
• A sampled performance counter
trace provides a bottom aggregation
level
– Information below the bottom
aggregation level is lost
• Individual event information is lost