Solaris Performance and Tuning: Presentation Transcript

Solaris Performance and Tuning
Amer Ather, 12/03/2012

Topics
• Performance Analysis Methodology and Principles
• Enterprise Server Architecture
• Setting and Viewing System and Kernel Parameters
• CPU Scheduling, Process Management and Kernel Profiling
• Monitoring Virtual and Physical Memory Usage
• Analyzing Resource Contention and NUMA-Related Latencies
• File System Performance and IO Strategies [Not Discussed]
• IO Subsystem Tuning [Not Discussed]
• Network Stack Tuning [Not Discussed]
• Resource Management, Containers and Server Virtualization [Not Discussed]
• DTrace Basics [Not Discussed]

Module 1: Performance Management and Tuning Principles

Define Performance
PERFORMANCE (noun): "The manner in which or the efficiency with which something reacts or fulfills its intended purpose," or "the execution or accomplishment of work, acts, feats, etc."
• From this definition, it can readily be seen that the "efficiency" and overall "utilization" of resources are key characteristics of the "performance" of a system.
• The key aspect of assessing performance relates directly to the volume of productive OUTPUT a system produces over a period of time.

Performance Model
[Diagram: a computer system under observation; the workload and intrusive tasks make demands on it, performance measurements are taken, and modifications are fed back]
• The performance model depicts the system and the factors that make demands on its resources.
• A workload is a group of processes that perform a certain task; it is the target of performance improvement. Individual process performance can be influenced by CPU scheduling latencies, memory starvation, or lack of network and IO bandwidth (congestion).
• Intrusive tasks are processes that are not part of the workload but compete for resources, degrading the performance of the targeted workload.
• Too much monitoring can itself skew the data due to its overhead: https://blogs.oracle.com/clive/entry/too_much_proc_is_bad

Tuning Principles
• Problem Statement: In general, a performance analysis and tuning exercise starts with a problem statement: something has deviated from the normal (what you should expect), you don't know the reason, and you would like to find it.
• Monitoring: Identify the bottleneck using monitoring and measurement tools. Recording statistics over time may reveal long-term patterns that are missed by point-in-time stat tools.
• Identification: Use stat tools to narrow the investigation to particular resources and identify possible bottlenecks.
• Analysis: Drill down further with tracing tools to examine particular system areas.
• Tuning: Remove the bottleneck by applying software tuning or hardware/software upgrades. Make one change at a time (if possible) and observe the results.
• Repeat: Continue until an acceptable resolution is reached.

Tuning is Iterative in Nature
• System tuning and performance analysis are iterative in nature: once you remove one bottleneck, the system's processing characteristics change, resulting in a new performance profile.
• An iterative process is therefore the best methodical approach to remediation.
• Make certain that only one change is made at a time; otherwise the effects (positive or negative) cannot be quantified.
• In many cases the underlying causes of bottlenecks are several overlapping conditions, none of which individually causes performance degradation, but which together result in a bottleneck.
• For this reason, removing one bottleneck can sometimes create another "hot spot" elsewhere, requiring further investigation and/or correlation once a bottleneck has been removed.

Change Management
While not strictly part of performance tuning, change management is probably the single most important factor in tuning successfully:
• Document everything!
• Implement a proper change management process before tuning.
• Test your change first. Avoid the temptation to tweak settings on a production server.
• Never change more than one variable at a time (if possible).
• Retest parameters that supposedly improved performance; sometimes statistics come into play.
• Start each iteration of your test with the system in the same state. For example, if you are doing database testing, reset the values in the database to the same settings each time the test is run.

Reasons for Tuning
• A Service Level Agreement (SLA) defines acceptable, not optimal, performance goals. An SLA is a contract between a vendor and a user about what constitutes acceptable performance; it becomes the reference document for asserting performance goals.
• Cost saving: higher user productivity; more work can be done with less hardware and software.
• All efforts at making a system more efficient must start with observing and measuring normal system operation. Values derived from these measurements comprise the system's baseline performance.
• Tuning objectives can be:
  • Matching the workload to system capacity and configuration
  • Maximizing consistency and volume of throughput
  • Reducing user response times
  • Redeploying or balancing the load on system resources

Tuning Prerequisites
What tells you that you have a performance issue on your system?
• Users complain that the "XYZ" application is taking longer than expected to perform its task. Examples: slow database queries, slow interactive response, batch jobs taking longer to complete, slow backups.
• When was this abnormal behavior first noticed?
• The most important question to ask: what has changed? It could be new hardware, a software upgrade, excessive tuning, a new install or patches, more user load, etc.
• Frequency of the event: does it happen all the time or only at certain times and days? What is running during degraded performance?
• What is considered normal or expected?
• How long should the job/application take to run or complete? This needs to be based on data from previous runs.
• Set realistic goals about anticipated performance gain, with expectations backed by valid benchmarks and historical data.
• Are other systems running the same job or application without exhibiting these symptoms?
• Make an architecture diagram of the applicable environment, describing the interactions:
  • Use a top-down approach to problem solving instead of bottom-up.
  • Look at the entire application infrastructure and stack. Focus on the bigger picture instead of staring at network and IO statistics without proper context.
  • Ask questions. However, get a good grasp of the problem before asking.

Gray Area: Interoperability Issues
• Complex problems that span multiple stacks are normally referred to as a "gray area".
• "Gray area" is a term for an issue that breaks the mold of conventional break/fix work and starts entering the performance tuning arena.
• Break/fix usually indicates that something is clearly broken, such as a Solaris bug resulting in system crashes and outages.
• A performance tuning exercise usually starts when, for example, the business has expanded and the application architecture cannot cope with the growth. It is difficult to gauge when a situation starts down that path when application architectures are complex and involve multiple stacks; in that case one may have to deal with interoperability issues.
• In such situations the environment can get quite complex when trying to find the problematic area of interest. It is therefore important to ask questions and get a good grasp of the issue before jumping in to solve it.
• Your approach to performance analysis should be to break the issue down into smaller logical portions. This is best done by drawing an architecture diagram showing the interaction of the different stacks and the IO and network topology.

Tuning Limits
• Effective performance practices promote efficient use of system resources.
• Remember the 80/20 rule: 80% of the performance improvement comes from tuning the application, and the remaining 20% comes from tuning the infrastructure components.
• Eventually, more tuning will not yield substantial performance improvements. When this occurs, the system has reached the practical limits of its current hardware and/or software.
• Effective performance management practice has two primary objectives:
  • Reduce or eliminate aspects of the workload that provide no benefit.
  • Tailor the workload to match the system resources, or vice versa.
• Note that tuning often tailors a system toward a specific workload: the system will perform better under the intended load characteristics but will probably perform worse under different workload patterns.
• An example is tuning a system for low latency, which most of the time has an adverse effect on throughput.

Performance Measurement Terminology
• Response Time: Time it takes for control to return after a request.
• Bottleneck: Occurs at points in the system where requests arrive faster than they can be handled, or where resources, such as buffers, are insufficient to hold adequate amounts of data.
• Bandwidth: Data transfer rate; the amount of data that can be carried from one point to another in a given time period.
• Throughput: Measure of a system's overall performance in processing data by effectively using its components: processors, memory, buses, and storage devices.
• Latency: Time elapsed between a command or work being given to a computer (program or device) and its execution.
• Utilization: Measures how busy a resource is; usually represented as a percentage averaged over a time interval.
• Saturation: Measure of work queued waiting for a resource; can be measured either as an average over time or at a particular point in time.

Performance Measurement Terminology (continued)
• Total Cost of Ownership (TCO): Capital investment in hardware and software plus the indirect cost of installation, maintenance, support and downtime.
• Service Level Agreement (SLA): Defines acceptable, not optimal, performance goals, such as:
  • Transaction rates: measured for various services: database, application services, infrastructure, network services, etc.
  • Startup time: system hardware, OS boot, volume management mirroring, filesystem validation, cluster data services, etc.
  • Failover/recovery time: time to recover a failed service (includes the recovery and/or startup time of restoring the failed service).

Tunable Areas
Performance is the system's ability to meet well-defined criteria. System performance is normally evaluated and measured across the following system resources:
• Memory usage: The application's physical or virtual memory requirement may be larger than what is configured, resulting in application slowdown or failure.
• CPU utilization: The workload is CPU bound, competing for compute resources and possibly being preempted by higher priority tasks.
• IO pattern: What are the I/O characteristics of the application: sequential or random IO? How is the storage configured and the data laid out? File system choices should also be taken into account when evaluating I/O performance.
• Network bandwidth: Congestion at the protocol layer or in the network infrastructure may contribute to slow performance. The application may be getting overwhelmed by the number of concurrent connections.
• Lock contention: Lack of concurrency in the application or a kernel subsystem, resulting in higher than normal contention for shared resources.

Performance Metrics
• Assessing system performance means measuring its key attributes, called METRICS.
• Defining key metrics requires understanding the workload and then monitoring all the components that the workload depends on.
• Understanding how the various components of Solaris work will help in identifying bottlenecks and coming up with solutions to fix them.
• For example: how the kernel gives preference to one Solaris process over others; how I/O interrupts are handled; how Solaris manages virtual and physical memory; how the Solaris file system and network stack are implemented, etc.
"If you can't understand it... you can't effectively measure it, .. and if you can't measure it.. you can't assess it..." (T. Jobson, 7/2006)
http://dtrace.org/blogs/brendan/2012/02/29/the-use-method/

Business Metrics
• When doing performance analysis, it is useful to relate all performance-related metrics to what we really care about: calls per time unit, transactions per time unit, records processed per time unit, and so on.
• Performance metrics are only interesting in the context of how they relate to the business metric.
• By considering the effect on the business metric, we shift the focus of investigation from identifying metrics that are out of bounds to identifying metrics that correlate with the business metric. For example:
  • If achievement of the business metric suffers when the TCP retransmission rate increases, improving the TCP retransmission rate will likely improve the business metric.
  • Likewise, if slow disk I/O does not correlate with degradation of the business metric, improving disk I/O is not likely to improve the business metric.

Performance Metrics: Processor
The primary interest is the percentage of CPU spent in kernel versus user mode, since high kernel CPU usage can have a negative effect on application performance.
• CPU utilization: Overall CPU utilization; 100% utilization is a sign of CPU saturation.
• User and kernel time: Percentage of CPU time spent executing application and kernel code. Ideally, the CPU should spend little time in kernel mode.
• Waiting: Waiting for an IO operation to complete. It is counted as idle time, since the CPU is available to run runnable tasks during this period.
• Idle time: CPU is idle, waiting for runnable tasks.
• Load average: Number of processes running or waiting to run on a CPU.
• Context switches: Rate of switching between threads due to changes in priority, interrupts, or blocking.
• Interrupts: Number of interrupts serviced by the CPU.
• Tools: vmstat, mpstat, prstat -c (USR, SYS), prstat -mL (LAT), lockstat -Ii, sar, uptime, kstat, DTrace (profiling, stacks), fmadm faulty (FMA events related to CPU physical failure); illustrative usage follows below.
• Docs:
  • How to Analyze High CPU Utilization In Solaris (Doc ID 1008930.1)
  • Using DTrace to understand mpstat output (Doc ID 1278725.1)
  • http://dtrace.org/blogs/brendan/2011/06/18/mpstat-videos/

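As a first-pass sketch with the standard tools named above (the 5-second intervals and counts are arbitrary choices):

# mpstat 5 3       # per-CPU usr/sys/idle plus icsw (involuntary context switches)
# vmstat 5 3       # kthr:r is the run queue; us/sy/id show the CPU time split
# prstat -mL 5     # per-thread microstates; high LAT means threads waiting for a CPU

A sustained non-zero run queue in vmstat together with high LAT in prstat -mL points to CPU saturation rather than mere high utilization.
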
Performance Metrics: Memory
The primary focus is to capture system-wide memory and swap usage.
• Free memory: Solaris allocates most unused memory to caching file system blocks. ZFS uses kernel memory for caching file system blocks, which is only released when there is memory demand.
• Knowing the virtual memory requirement of the workload helps avoid application startup issues.
• Swap usage: Page scanner activity or IO to the swap device is a sign that the system is short on memory.
• Page cache: Cache allocated for the file system to improve application read/write performance.
• High kernel memory usage is a sign of a memory leak or excessive tuning.
• Tools: vmstat -p, swap -s, df -k /tmp, prstat -mLc (DFL), ps, echo ::memstat | mdb -k, sar, kstat, DTrace, fmadm faulty (cpumem-retire, ECC events); illustrative usage follows below.
• Docs:
  • Monitoring Memory and Swap Usage to Avoid A Solaris Hang (Doc ID 1348585.1)
  • How to use DTrace and mdb to Interpret vmstat Statistics (Doc ID 1009494.1)
  • http://dtrace.org/blogs/brendan/2011/04/27/vmstat-videos/

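A minimal memory health check along the lines described above (interval and count are arbitrary; ::memstat requires root privilege):

# vmstat -p 5 3              # sr column: page scanner rate; sustained non-zero means memory pressure
# swap -s                    # swap space allocated, reserved and available
# echo ::memstat | mdb -k    # breakdown of kernel, anon, page cache and free memory
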
Performance Metrics: Block Devices
The primary focus is IO service time and IO latency.
• Average service time: IO latency at the sd/ssd driver level (the driver that dispatches IO to the storage subsystem). It is the time taken for an IO to be serviced by storage, reported in milliseconds.
• Active and wait queues: Outstanding IO requests: IO waiting in the driver queue (wait) and IO that has been dispatched to storage (active). Persistently non-zero values are a sign of storage saturation.
• IO sizes: KBytes read/written per second divided by read/write IOPS gives the average IO size submitted to the IO subsystem.
• Tools: iostat -xnz, iostat -Cxnz, iostat -En, kstat, swap -l, DTrace; illustrative usage follows below.
• Docs:
  • http://joyent.com/blog/when-iostat-leads-you-astray/

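An illustrative reading of the columns mentioned above (5-second samples; -z suppresses idle devices):

# iostat -xnz 5    # wait/actv: queued vs dispatched IOs; asvc_t: service time (ms); %b: busy

The average IO size falls out of the same output, e.g., kr/s divided by r/s for reads.
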
Performance Metrics: Network Interfaces
The primary interest is finding congestion at the protocol layer and/or at the network interface.
• Protocol layer (TCP/IP) statistics
• Packets/bytes received and sent
• Collisions per second: A sign of a congested network. Collisions are rare in a properly configured network built with switches.
• Packets dropped: Count of packets dropped by the kernel due to either firewall rules or lack of network buffers.
• Errors: Number of frames marked as faulty, most likely caused by a faulty cable or NIC.
• Tools: netstat, snoop, nicstat, kstat, ping, DTrace; illustrative usage follows below.

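Illustrative starting points (nicstat is a separately distributed tool, not part of base Solaris):

# netstat -s -P tcp | grep -i retrans    # TCP retransmission counters
# netstat -i 5                           # per-interface packets, errors, collisions
# nicstat 5                              # per-NIC throughput and utilization, if installed
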
Key Points To Remember about Performance and Tuning
• Know the performance requirements and determine which metrics really matter. Perform the right test for the right purpose.
• Define how the test will be performed: what data to use, how long a test run is, what tools to use, and so on.
• Stick to the performance goals. Do not attempt to change them unless there are good reasons to do so, such as the goals being unrealistic for the environment.
• Tests should be done consistently, so that results are repeatable when exactly the same data and procedures are used.
• Change only one variable at a time when running a series of tests for problem analysis or diagnosis.

Key Points To Remember about Performance and Tuning (continued)
• Record all results methodically, with good documentation. Your records are only as good as someone else's ability to understand the historical results, visualize the tests without having seen them, and draw conclusions.
• Automate procedures whenever possible to reduce human error. However, the correctness of the automation must be verifiable so that 100% confidence in the test results can be claimed.
• Minimize sharing of resources whenever possible to avoid contamination. If sharing is inevitable, back up your system before handing resources to the next person. It is also good practice to maintain an audit trail.
• Perform the same test run at least three times to make sure the numbers are right.
System Performance Analysis and Tuning Overview (Doc ID 1450811.1)
http://dtrace.org/blogs/brendan/2012/03/01/the-use-method-solaris-performance-checklist/

Module 2: Enterprise Server Architecture Considerations

Enterprise Servers
• Enterprise servers are intended for use in data centers servicing the computing needs of large organizations. They commonly run large databases and other mission-critical applications.
• Enterprise servers are designed to provide higher availability and serviceability (RAS) by reducing planned and unplanned outages:
  • Unscheduled: Outage due to an unrecoverable malfunction in a hardware component of the server.
  • Scheduled: Outage taken for changes or updates that keep the server functioning, e.g., a disruptive patch that requires installation, or other changes recommended by the vendor.
  • Planned: Capacity or software upgrade that requires an outage.

Enterprise Server Features
• Support a large number of processors and a large amount of memory
• Capable of processing high volumes of storage and network IO
• High-speed interconnects for higher performance
• Multipath IO and network link aggregation capabilities
• Offer scalability, manageability, redundancy and other RAS features
• Five-nines availability, if requested (99.999%)
• Secure environment and problem isolation capabilities
• Data integrity features to avoid silent data corruption
• Minimize the impact of component failures: the system keeps operating with failed components and can blacklist a failed component
• Hot-swappable CPUs, memory, IO boards, power supplies and fans
• Online reconfiguration, maintenance and software upgrades

Enterprise Server Features (continued)
• Built-in ECC protection for data integrity and reduced downtime, with features like instruction-level retry, register protection and cache degradation
• Self-healing capabilities and reporting that prevent faults before they occur and bring the system back up quickly
• Automated adjustment of compute capacity: dynamic resource management deals with computing, IO and memory needs without imposing downtime
• Support for fault-isolated hardware partitioning (Dynamic Domains) and software partitioning (LDoms, Xen, VMware)
• Granular resource controls to run competing workloads without affecting each other, maximizing system utilization and quality of service
• Built-in remote management features and call-home facilities

SPARC T4 Processor
T4 servers: enterprise-level features:
• Eight-core 2.5-3.0 GHz SPARC processor; 8 threads per core for a total of 64 threads per processor
• Each core can switch among up to 8 threads, using an LRU algorithm for thread choice
• Each core has two integer execution pipelines, which means each core can execute two threads simultaneously
• The SPARC T4 processor supports 1-, 2- and 4-socket implementations
• On-chip memory controllers interface to memory via 4 Buffer-on-Board (BoB) high-speed serial links
• The T4 chip has two dual-channel DDR3 memory controllers (1.07 GHz) supporting a transfer rate of 6.4 Gbps
• Extended ECC, error correction and parity-checked memory
http://www.oracle.com/technetwork/server-storage/sun-sparc-enterprise/documentation/o11-090-sparc-t4-arch-496245.pdf

SPARC T4 Processor (continued)
• T4 has 2 coherence link controllers, each with 3 coherence links, each link running at 9.6 Gbps
• Coherence links support cache coherence (a.k.a. cache snooping)
• Coherence links allow communication among four T4 processors without requiring an external hub
• Each core has its own L1 and L2 caches:
  • L1 consists of separate data and instruction caches of 16 KB each
  • A single L2 cache of 128 KB
• The L3 cache is shared across all 8 cores of the T4 processor; it is 4 MB in size, has 8 banks and is 16-way associative
• The Memory Management Unit (MMU) supports page sizes up to 2 GB

SPARC T4-2 Server (Example)
A server built with 2 T4 processors:
• Dual SPARC T4 processors: cores: 2 x 8 = 16; threads: 16 x 8 = 128
• Up to 512 GB of memory: 4, 8 or 16 GB DDR3 DIMMs
• Ten PCIe slots
• One 10 Gb XAUI port for 4x10GbE
• 4 onboard 1 Gbps Ethernet ports
• System controller running ILOM
• Six SAS-2 disk drive slots with hot-pluggable disk drives
• Redundant, hot-swappable power supplies and fans
• Remote management
• LDoms
http://www.oracle.com/technetwork/server-storage/sun-sparc-enterprise/documentation/o11-090-sparc-t4-arch-496245.pdf

SPARC T4 Performance Features
• Single-threaded performance: SPARC T4 yields single-thread performance improvements of 5x-8x compared to the previous generation (T2+).
• Dynamic threading: Hardware dynamically and seamlessly allocates core resources such as instruction and data caches, L2 caches and TLBs among the active strands. Software activates strands by sending an interrupt to a halted strand, and deactivates strands by executing a HALT instruction on each strand to be deactivated.
• Critical threads: The Oracle Solaris scheduler is enhanced to recognize a "critical thread" when its priority is raised to 60 or above via the command line or a system call; the whole core is then dedicated to that thread. Some workloads are designed for throughput, others for low latency; the T4 hardware provides mechanisms to dynamically assign core resources to threads according to their runtime behavior.
• Multicore/multithread awareness: The scheduler is aware of the T4 NUMA latency hierarchy and can effectively balance the load across all available pipelines. Even though a T4 processor exposes itself to Solaris as 64 logical processors (8 per core), each core can run only two threads simultaneously; the Solaris scheduler understands this limit and balances active threads efficiently.
• Server virtualization (LDoms): SPARC T4 supports a multi-threaded (MT) hypervisor in the firmware to create stable virtual machines (VMs); the architecture thus supports MT across all stacks.
https://blogs.oracle.com/observatory/entry/critical_threads_optimization

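A sketch of the command-line route to marking a process's threads critical, as described above; the PID is hypothetical, and priority 60 is the threshold the slide cites:

# priocntl -s -c FX -m 60 -p 60 -i pid 1234    # fixed-priority class, user priority 60
# priocntl -d -i pid 1234                      # verify the class and priority
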
x64 Commodity Hardware
Intel Xeon based servers support a number of enterprise-level features:
• Intel Hyper-Threading: 4-12 cores per die with 2 threads per core
• Intel Turbo Boost: pushes the processor to operate cores at a higher frequency than normal
• Intel QuickPath: two bi-directional interconnects for high-speed CPU-to-CPU and CPU-to-IO communication
• DDR3 memory controllers are integrated on the processor die. Each controller has 3 channels, each with 10.6 Gbps capacity. Cache misses are handled faster and in parallel thanks to the integrated memory controller.
• Intel MCA recovery enables the system to detect and correct errors in memory and cache that were previously "uncorrectable" through ECC or other means.
• Intel Virtualization Technology offers cost-effective server virtualization options. Oracle x64-class servers support Xen and VMware hypervisors. NOTE: Oracle VM for x86 uses Xen as its hypervisor.

x64 Commodity Hardware (Example)
A server built with two Intel Xeon processors:
• Two Xeon processors, each with 4-12 cores (2 threads/core), connected via QPI
• Each has an integrated memory controller with 3 DDR3 channels and up to 3 DIMMs per channel
• The IOH (Input/Output Hub) is also connected to the CPUs via QPI lanes and provides the interface to the PCI Express bus: 16 PCIe 2.0 lanes for high-speed networking and IO
• This processor design creates a NUMA-style memory architecture, since each processor in a multi-socket system can access local memory (connected to the local memory controller) as well as remote memory connected to the other processor

x64 Commodity Hardware (Example, continued)
• Disks are hot-pluggable
• Redundant hot-swap fans
• Redundant hot-swap power supplies
• Leveraging chassis-based infrastructure components (power supplies, fans, etc.) means there are fewer moving parts on the individual server
• Power management features improve energy efficiency and performance-per-watt. When processor workload decreases, unneeded components such as cores, cache and memory are put into sleep mode to reduce power consumption.

Sun Fire X4800 M2 Server (Intel Xeon E7-8800 Processor)
• Powered by up to eight Intel Xeon E7-8800 2.4 GHz processors. Each processor has 10 cores, each capable of running 2 hyperthreads concurrently.
• Total of 8 x 10 x 2 = 160 2.4 GHz computing engines for Solaris to run on.
• 2 TB of memory (128 DIMMs):
  • 32 DIMMs per CPU
  • Choice of 4 GB, 8 GB or 16 GB DDR3-1066 MHz low-voltage ECC DIMMs
• 8 hot-swappable PCIe 2.0 Express Modules, allowing Fibre Channel, InfiniBand or Ethernet connections
• Two x4 mini-SAS-2 ports; hot-swappable SAS-2 hard drives or 8 eMLC SSDs for up to 4.8 TB of internal storage
• 2 Network Express Modules providing 8 x 1 GbE (RJ-45) and 8 x 10 GbE (SFP+) ports
• Hot-swappable I/O, disks, cooling fans and power supply units
http://www.oracle.com/us/products/servers-storage/servers/x86/x4800-m2-server-403762.html
https://shop.sun.com/store/product/2e61b9ad-9e7a-11e0-8075-080020a9ed93

Why Multi-Core
• It has become increasingly hard to improve serial performance. It takes large silicon area to enable a processor to execute instructions faster, and doing so increases the amount of power consumed and heat generated.
• Performance gains obtained through additional cores produce a processor that has the potential to do twice the work at a fraction of the power consumed.
• As the gap between processor and memory speeds widens, performance gained by ramping up the processor clock has diminishing returns, with the processor stalling while waiting for memory.
• Studies have shown that processors in most real-world server deployments spend 80% of their time stalled waiting for memory or IO; the high clock rates and deep pipelines of traditional processors are thus wasted stalling on cache refills from main memory.
• Hardware threads in a multi-core processor reduce the overhead of these frequent cache stalls and achieve maximum memory bandwidth by automatically parking stalled hardware threads and switching to the next ready-to-run hardware thread, leading to efficient processor utilization.

Multi-Core Processors: Chip Multithreading
• Chip multithreading (CMT) is similar to software multithreading (MT) but implemented in hardware on the processor chip.
• Software MT refers to the execution of multiple tasks within a process by a number of software threads running concurrently on many processors.
• A CMT processor does the same in hardware by executing many software threads simultaneously within a single physical processor.
• Chip multithreading refers to the family of processor technologies that allow a given physical processor to simultaneously execute multiple threads of execution.
• The core of a processor is the part that executes the instructions of an application. Oracle, Intel and other vendors ship processors with multiple cores.
• Intel Xeon processors and the SPARC T4 allow a single processor core to process instructions from multiple instruction streams simultaneously.
• Each hardware thread in each core is seen by Solaris as a separate CPU; CPU caches are shared among the hardware threads (see the example below).
• To gain a significant increase in throughput by utilizing all available hardware threads and cores, concurrency in the software is required.
Analyzing Performance of Chip Multi-Threading (CMT) Servers (Doc ID 1343999.1)

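For example, to see how CMT hardware threads are presented to Solaris as CPUs:

# psrinfo -pv    # physical processors, their cores, and the virtual CPUs (hardware threads)
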
Multi-Core: What is Scalability
[Figure-only slide; no text transcript available]

Multi-Core Scalability
[Figure-only slide; no text transcript available]

Server Topologies: SMP Servers
• In symmetric multiprocessor (SMP) systems, also called UMA (Uniform Memory Architecture), all physical memory is seen as a single pool, equidistant in terms of latency from the set of independent CPUs.
• All CPUs share hardware resources and the same address space: a single instance of the kernel.
• SMP systems provide more computing power because there are more CPUs on which to schedule work.

SMP Server Scalability
Ideally, SMP systems should scale linearly as more CPUs are introduced. Realistically, this is difficult to achieve due to the following factors:
• All CPUs in an SMP system compete for shared system resources (the memory bus, the I/O bus, and so on); depending on the nature of the workload, linear scalability is difficult to achieve.
• Both hardware (cache coherency) and software (multithreading) must be designed to exploit the system's parallel characteristics.
• The SMP architecture must prevent concurrent CPU access to shared resources by serializing access. This mutual exclusion limits scalability and prevents linear performance improvement.
• Building a large SMP server is difficult due to the physical limitations of the shared bus and the higher contention that comes with more CPUs.

NUMA is the Answer
• Typically, NUMA machines are made up of a number of nodes, each with its own CPUs, memory and IO devices.
• Nodes are connected via a high-speed interconnect that allows each node to access local and remote memory and IO devices.
• The NUMA architecture addresses the shortcomings of SMP systems by allowing larger system configurations, at the cost of varying memory latencies.
• NUMA is a design trade-off: it introduces larger remote-memory latencies, which in turn can impact performance.
• Each node acts as an SMP system with fast access to local memory and IO devices; access to remote memory and devices is relatively slower.

NUMA is the Answer (continued)
• Similar to an SMP system, all nodes share a common address space and run under the control of a single instance of the kernel.
• Similar to an SMP system, hardware keeps the memory cache lines coherent across all the nodes in the NUMA machine.
• Application performance on NUMA systems can be non-deterministic, especially for multithreaded applications that assume multiple parallel threads will complete a given task in a constant amount of time.
• The kernel is therefore optimized to take the locality of resources into account and strives to keep all required resources (CPU, memory, IO) close to the executing thread.
• For a 2-4 processor T4 system, the remote-to-local memory latency ratio is 1.47.

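Solaris exposes this locality information as lgroups. The tools below ship with Solaris 11 (earlier releases expose the same data through the liblgrp(3LIB) API); the PID is hypothetical:

# lgrpinfo -a    # lgroup hierarchy: CPUs, memory and load per locality group
# plgrp 1234     # home lgroup of each LWP of process 1234
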
Clusters
There are two approaches to growing computing capacity:
• Vertical scaling:
  • Adding compute capacity to an individual computer. Example: NUMA.
  • Major issues are the limits on scalability, and that a system failure results in a major outage.
• Horizontal scaling:
  • Accomplished by connecting many individual computers (nodes) using networking technologies such as Ethernet or InfiniBand.
  • Helps increase compute capacity and tolerance of individual system failures.
  • Horizontally expanded systems function as a collection of separate computers that require coordination. Example: Oracle RAC (Real Application Clusters).
  • The challenges with horizontal scaling are throughput and latency.

Clusters (continued)
• High-availability clusters: The primary purpose is to improve the availability of the services the cluster provides. 2-4 node configurations are common:
  • Redundant nodes act as backups and provide service when a node fails.
  • Heartbeat is managed by the cluster framework. Examples: Solaris Cluster, Symantec FirstWatch, Oracle 10g/11g RAC.
  • Load on cluster nodes can also be managed via load balancer software, where nodes share the computational workload and function as a single logical server.
• High-performance clusters (HPC): The primary purpose is to perform large computation tasks. Hundreds of nodes are used to perform simulations of atomic explosions, weather patterns or other scientific calculations.
  • A message passing protocol such as MPI (Message Passing Interface) is used for inter-node communication and distribution of work.
  • Nodes can be tightly coupled using a dedicated high-speed network to perform compute-intensive tasks, also called supercomputing.
  • The other extreme is grid computing, where little or no inter-node communication is required. Grid computing is well suited for applications whose computations can take place independently, without the need to communicate intermediate results between nodes.

Memory Interleaving
• Memory interleaving increases bandwidth by allowing simultaneous access to more than one bank of memory.
• Enterprise servers support interleaving because it boosts memory access performance via parallel access to multiple on-board memory units.
• Interleaving works by dividing system memory into multiple blocks, typically two or four, called 2-way or 4-way interleaving. NOTE: T4 can have 16-way interleaving.
• The access paths from the memory controller to memory are limited: only one set of paths is available for address and data, so reading large data volumes from a single memory chip can be time-consuming. For this reason, the internal memory architecture is configured as 2-4 layers (banks) to increase data read performance.
• Each block of memory is accessed by a different set of control lines that are merged together on the memory bus, so reads/writes of blocks of memory can be overlapped.
• Consecutive memory addresses are spread over the different blocks of memory to exploit interleaving.
• prtdiag reports the interleave factor used (see the example below).

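Where the platform's prtdiag output includes it, the interleave factor can be checked along these lines (output format varies by server model, and some platforms do not report it):

# prtdiag -v | grep -i interleave
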
Memory Interleaving: Intel Xeon Processors
• Both SPARC and Intel based servers support memory interleaving.
• Memory interleaving refers to how physical memory is interleaved across the physical DIMMs.
• A balanced system provides the best interleaving. An Intel Xeon system is considered balanced when all memory channels on a socket have the same amount of memory.
• The bandwidth to remote memory is limited, and remote access latency is 75% higher than local memory access. The goal should therefore always be to populate memory on both sockets when both processors are populated.
• Also consider using dual-rank DIMMs, since these offer better interleaving and hence better performance.

Hardware Caches
• The CPU cache provides faster access to data than main memory. It holds recently used data fetched from memory. There are separate caches for instructions and data.
• More than half of the CPU die area is dedicated to cache; 30% of total execution time can be due to cache misses, resulting in low CPU utilization.
• There may be multiple cache levels, with the first level smaller than the second but faster to access. Some systems even have a third level: Intel and SPARC servers support 3 cache levels.
• Data is fetched from memory in blocks of 64 bytes, called cache lines. Cache lines are always aligned at addresses that are multiples of 64. A 4 KB cache can hold 64 cache lines of 64 bytes each.
# prtpicl -v | grep cache

Intel Xeon Cache (Nehalem Architecture)
• L1 caches: 32 KB
  • Instruction cache: 4-way associative, latency of 3 cycles
  • Data cache: 8-way associative, latency of 4 cycles
• L2 cache: 256 KB
  • Combined instruction and data cache, latency of 10 cycles
• L3 cache: 8-30 MB
  • Shared cache, latency of 40 cycles
  • It is an inclusive cache, meaning it contains a copy of the contents of L1 and L2.
  • A cache miss in L3 means the data is not present in any other cache and must be read from memory.
  • A cache hit in L3 would otherwise require checking the private caches of each core to see which has the data; instead of expensive cache snooping, flags associated with each L3 cache line indicate which private caches hold the data.
http://www.behardware.com/articles/733-4/report-intel-nehalem-architecture.html

Cache Associativity
• In a direct-mapped cache, each cache line in memory maps to exactly one cache location. That location must be flushed before another cache line with the same mapped address can be placed there.
• To avoid this problem, the associativity of the cache is increased. This allows a single cache line to map to more positions in the cache, reducing the possibility of conflicts and cache misses. An n-way associative cache has n possible locations for each line.
• When two threads of the same application access the cache, higher associativity is required to avoid both threads mapping to the same cache line and displacing each other's cached data.
• Intel Xeon L1 cache is 32 KB, 8-way set associative, latency 4 cycles.
• Intel Xeon L2 cache is 256 KB, 8-way associative, latency 12 cycles.
• Intel Xeon L3 cache is 8 MB, 16-way associative, latency 40 cycles.
• SPARC T4 L3 cache is 4 MB, 16-way associative.
# prtpicl -v | grep cache-associativity
[Figure: direct-mapped vs. two-way associative cache layouts with 64-byte lines]

Page Coloring to Reduce Cache Misses
• Operating system memory allocation code can control page placement in the CPU cache to minimize cache misses caused by low cache associativity, providing consistent performance.
• Memory is 10-20 times slower than the CPU cache and thus can have a dramatic effect on performance. Caches help performance only if software uses them efficiently.
• Optimal placement of pages in the CPU cache often depends on the memory access patterns of the application.
• Rather than randomly mapping physical pages to virtual addresses, physical pages are allocated with a specific, pre-determined relationship.
• The physical page is chosen as a function of a hash algorithm, to ensure even distribution across the cache and to ensure a different address range is used for each process, so that similar processes do not contend for the same cache slots.
[Figure: page placement in a 32 KB cache]

Cache Thrashing
• Cache thrashing happens when a process frequently uses two lines of memory that must occupy the same cache location.
• Cache line conflicts can also occur when threads on different CPUs access the same set of data at the same time. This can have a drastic effect on the cache hit rate.
• If the data is modified after each access, the cache lines in the caches of the other processors are marked out-of-date, forcing main memory access.
• The T-series processors offer a rich set of performance counters for counting hardware events such as cache misses, TLB misses, loads, stores, etc. These are accessed using the cpustat(1M) command (see the sketch below).
https://blogs.oracle.com/martinm/entry/t4_performance_counters_explained
https://blogs.oracle.com/brendan/entry/amd64_pics

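Counter names differ by processor, so the event name below is a placeholder; cpustat -h prints the events your CPU actually supports (the PID and intervals are also illustrative):

# cpustat -h                               # list the hardware events this processor supports
# cpustat -c pic0=<event> 5 3              # sample the chosen counter system-wide, 5s x 3
# cputrack -T 5 -c pic0=<event> -p 1234    # same counters, scoped to a single process
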
Write-Through and Write-Back Caches
When a cache line is modified by a CPU, the changes are propagated to other caches and physical memory in one of two ways:
• Write-back cache: Changes are flushed to memory at a deferred time. Useful when cache contents are frequently modified. Write-back caches collect write operations for later delivery to main memory; however, if the system crashes before the data is copied to main memory, the data in the cache is lost. NOTE: Write-back caching yields somewhat better performance than write-through caching because it reduces the number of write operations to main memory. With this performance improvement comes a slight risk that data may be lost if the system crashes.
• Write-through cache: Memory is updated whenever the contents of the cache change. Useful when cache contents are infrequently modified. A write-through cache performs all write operations in parallel, i.e., data is written to main memory and the cache simultaneously.

Cache Snooping
• Cache coherency in SMP or ccNUMA systems is maintained by a protocol called cache snooping. This avoids the situation in which a CPU retrieves outdated data.
• In this mechanism, the write-back caches ensure that any component requesting data receives the current copy, which may come either from main memory or from the CPU cache holding it.
• Cache controllers monitor requests on the bus:
  • If a CPU requests a cache line that is not in its cache, the block is read from a higher-level cache or from main memory.
  • If a CPU modifies a cache line, all other CPUs that have that line cached are informed to discard it. Discarding a cache line is called write invalidation.

TLB: Translation Lookaside Buffer
• Applications use virtual addresses when accessing data. These virtual addresses are translated into physical memory addresses by a combination of hardware (the MMU) and software (translation tables). The physical address is then used to fetch the application data from memory into the CPU cache.
• Virtual-to-physical address translations are kept in a hardware buffer, the TLB, to avoid using the slower in-memory tables for translation.
• The TLB, like a cache, has limited capacity. When a translation is not found in the TLB, it is counted as a TLB miss; the translation is then fetched from in-memory data structures (translation tables), resulting in a performance penalty.
• A TLB miss can be due to capacity (limited entries) or conflict. Conflict happens when multiple physical pages map to the same TLB entry, resulting in eviction of the old mapping from the TLB.
• TLB entries can be programmed to cache mappings of various page sizes: 4 KB, 64 KB, ... 4 MB, up to 2 GB. Large pages allow each TLB entry to cover more physical memory, resulting in fewer TLB misses and improved application performance.
• trapstat(1M) can be used to capture TLB miss statistics (see the example below).
[Figure: TLB backed by in-memory page tables]

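On SPARC (trapstat is SPARC-only), a quick check of TLB miss cost and the page sizes in use; the PID and interval are illustrative:

# trapstat -T 5 3    # TLB/TSB miss time broken down by page size
# pagesize -a        # page sizes the platform supports
# pmap -s 1234       # page sizes actually backing each mapping of process 1234
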
Module 3: Setting and Viewing Kernel Parameters

Tuning Parameters
• The kernel maintains several variables that, when modified, can alter the performance of the system. These are generally known as "tuning parameters".
• The Solaris kernel offers a variety of parameters and settings that can be tweaked to maximize performance for a specific workload.
• Tuning has its trade-offs! Tuning one area of the kernel may have an adverse effect on another subsystem. Also, a setting that improves performance for a web server may not be ideal for a database system.
• Be cautious: some tunables change with each release and thus can only be applied to a specific kernel release.
• Changing one parameter at a time and monitoring its effect on overall system performance will help you decide whether it is worth changing.
• Remember that a parameter's default value is normally best for the majority of workloads and configurations. Avoid the temptation to tune unless required!
• If running a commercial workload such as Oracle Database, review the best practices documents for the list of parameters to change for optimum performance. For example:
  • Best practices for running Oracle in Solaris Containers: http://www.oracle.com/technetwork/server-storage/solaris10/solaris-oracle-db-wp-168019.pdf

Viewing Kernel Parameters Using kstat
• The kstat utility examines available kernel statistics.
• kstats are data structures maintained by various kernel subsystems and drivers.
• kstats provide a mechanism for exporting data from the kernel to user programs without requiring the program to read kernel memory or have superuser (root) privilege.
• Example: dump information about system memory usage:

# kstat -n system_pages
module: unix    instance: 0
name: system_pages    class: page
    availrmem    15853427
    crtime       158.566761
    desfree      128635
    fastscan     823805
    freemem      13674260
    kernelbase   16777216
    lotsfree     257270
    minfree      64317
    ...

See kstat(1M): http://docs.oracle.com/cd/E18752_01/html/816-5166/kstat-1m.html#REFMAN1Mkstat-1m

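kstat can also watch a single statistic in parseable form at an interval, e.g., free memory every 5 seconds:

# kstat -p unix:0:system_pages:freemem 5
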
Viewing and Setting Kernel Tunables via mdb
• A number of kernel subsystem and driver parameters can be viewed or set via mdb (the kernel debugger).
• With mdb one can change the default value of system parameters and thereby modify system or driver behavior.
• mdb can change a parameter on a running kernel. Take care when setting kernel parameters on a live system; it may result in an unexpected outage if you are not careful.
• Setting a kernel parameter using mdb does not persist across a reboot.
• Viewing a ZFS kernel parameter using mdb:
# echo "zfs_no_write_throttle::print" | mdb -k
• Setting ZFS file system specific kernel parameters using mdb:
# echo zfs_nocacheflush/W0t1 | mdb -kw
# echo zfs_no_write_throttle/W0t1 | mdb -kw
Using mdb to Verify Solaris Kernel Tunable Parameter Values on a 64 bit OS (Doc ID 1011832.1)

Setting Kernel Tunables Using the /etc/system File
• The /etc/system file can be used to set kernel parameters.
• Values specified in the file are read at boot time.
• For example, to set a ZFS kernel parameter, edit /etc/system and reboot:
set zfs:zfs_arc_max=11232321536    # 10G
• An incorrect value in /etc/system can sometimes result in a system boot failure. To recover from it, boot the system with:
# boot -a
When prompted, enter the name of a known-good /etc/system file, or /dev/null.
• Kernel tunables are listed in the Solaris Tunable Parameters Reference Manual: http://docs.oracle.com/cd/E18752_01/html/817-0404/index.html

Setting Kernel Parameters Using Resource Controls
• Large enterprise servers are configured with plenty of hardware resources: CPU, memory, IO channels. The Solaris resource control framework can be used to partition or control access to these system resources across multiple competing workloads.
• The Solaris resource control framework provides finer control when setting some system parameters. resource_controls(5) lists the system parameters that can be set.
• Setting these system parameters affects only a set of processes or a workload, instead of being system-wide.
• Setting these parameters does not require a system reboot, in contrast to the /etc/system file, which requires a reboot and affects all processes on the system.
• The preferred way to set IPC tunables on a system hosting multiple databases is to set process-, project- or zone-specific resource controls instead of updating /etc/system.
  • How Solaris 10 (and later) Integrates System V Inter-Process Communication (IPC) Resource Controls (Doc ID 1006158.1)
• A resource control can be applied to a process, to a group of processes that are part of the same workload (project), or to a workload running inside a container (zone). To adjust the resource control "max-file-descriptor" of a running process to 64K, type:
# prctl -n process.max-file-descriptor -r -v 65536 -i process <PID>
Solaris resource management: http://docs.oracle.com/cd/E19082-01/819-4323/819-4323.pdf
Best practices for running Oracle on a container (zone): http://www.oracle.com/technetwork/server-storage/solaris10/solaris-oracle-db-wp-168019.pdf

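For a project-wide setting, such as shared memory for a database, the flow is sketched below; the project and user names are illustrative:

# projadd -U oracle user.oracle    # create the project, if it does not already exist
# projmod -sK "project.max-shm-memory=(privileged,8G,deny)" user.oracle
# prctl -n project.max-shm-memory -i project user.oracle    # verify
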
Setting Resource Limits on a Running Shell (ulimit)
• One can set system resource limits on the current shell.
• ulimit -a shows the resource controls at the shell level.
• You can set the same resource controls using prctl to affect processes.
• The relationship between shell resource limits and process resource controls is listed below:

shell (ulimit)           process (prctl)
time (seconds)           process.max-cpu-time
file (blocks)            process.max-file-size
data (kbytes)            process.max-data-size
stack (kbytes)           process.max-stack-size
coredump (blocks)        process.max-core-size
nofiles (descriptors)    process.max-file-descriptor
vmemory (kbytes)         process.max-address-space

How To Set The Limit For The Maximum Number Of Open Files Per Process In Solaris 10 And Solaris 11 (Doc ID 1408563.1)

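For example, the two forms of raising the open-file limit, per the mapping above (the PID is hypothetical):

$ ulimit -n 65536                                                     # this shell and its children
# prctl -n process.max-file-descriptor -r -v 65536 -i process 1234   # an already-running process
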
Solaris Application Environment
• The application-to-kernel interface is well defined.
• A kernel or driver upgrade does not change the system libraries, such as libc, the interface that nearly all applications use.
• That means user applications should not be affected when upgrading to a new kernel.
• Device drivers and kernel modules are tightly coupled with the kernel and may require changes to work with a new kernel.
• Kernel tunables may become obsolete and/or be replaced by new tunables.

Kernel Tunable Areas
• Processor subsystem
  • Process priority and scheduling
  • Interrupt handling
  • NUMA awareness
• VM subsystem
  • Paging and swapping
  • Programming the TLB via large pages
• IO subsystem
  • File systems and IO characteristics
• Network stack tuning
  • NIC driver tuning
• RAS features

Module 4: CPU Utilization and Process Scheduling

What is a Process
• A process defines the run-time state of a program file.
• It is an abstraction that contains the execution environment for a user program.
• It is an instance of execution that runs on a processor.
• It is the entity to which system resources are allocated.
• A process is represented in the kernel by:
  • Kernel threads: objects that get scheduled on processors. These are the scheduling entities, not the process.
  • User threads: user-level threads maintained by a user library.
  • Lightweight processes (LWPs): for user process execution, kernel threads have a corresponding LWP; these kernel threads are scheduled for execution by the kernel on behalf of the user process.
• These data structures are the repository of all the information the kernel needs to identify the process. Examples: priority, PID, resource limits, blocking events, open files, memory segments, etc.

What is a Thread
• A thread is a sequence of instructions that can be executed in parallel within a process.
• Think of threads in a program as smaller portions of the process running concurrently.
• An LWP and its corresponding kernel thread define the virtual execution environment for a user program. There is a 1:1 relationship between user thread, LWP and kernel thread: each application thread is bound to an LWP, and each LWP has an associated kernel thread.
• Threads are not processes, but rather lightweight threads of execution. Thread creation is less expensive than process creation because it does not need to copy resources on creation.
• A multi-threaded (MT) program can have multiple user threads, LWPs and associated kernel threads.
• Example: two lightweight processes may share resources such as the process address space and open files. Whenever one modifies a shared resource, the other immediately sees the change.
• The Solaris kernel schedules user threads independently: even when one thread of a process is blocked, other threads can still be running on a CPU. Threads have their own stacks and registers, and signals can be masked on a per-thread basis.

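To observe the thread/LWP structure of a live process (the PID is hypothetical):

# prstat -L -p 1234              # one line per LWP of the process
$ ps -o pid,nlwp,comm -p 1234    # nlwp: number of LWPs in the process
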
Process States
A process can be in any of the following states:
• ONPROC (O): process is running on a processor.
• RUN (R): process is on a CPU run queue.
• SLEEP (S): process is blocked on a lock or IO.
• STOP (T): process is stopped by the shell (job control) or by a debugger.
• WAITING (W): process usage is limited using the CPU-caps resource control.
• ZOMBIE (Z): process has terminated but the parent is not waiting.
Context Switch
• A process contains software and hardware context. A process context is the execution environment of the process, or all the information needed to describe the process.
• Software context: process text, data, stack, shared memory regions and other parts of the user address space.
• Hardware context: the set of data that must be loaded into CPU registers before the process resumes its execution. It is a snapshot of the CPU registers, flags and arguments passed on the stack.
• To control the execution of processes, the kernel must be able to suspend the execution of one process and resume the execution of another. This activity is called a context switch. Process context is saved and restored during context switching; hence the overhead!
Context Switch
• Too much context switching is undesirable because the CPU has to reload its registers and displace cache contents each time to make room for the new process.
• A context switch can happen due to the following factors:
  • A process blocks on a lock or waits for a resource to become available
  • A process requests an IO operation and waits for its completion
  • A process blocks in a system call
  • A process creates a child process and waits for it to complete execution
  • A process consumes all of its allocated CPU time
  • A higher priority process or an interrupt causes a preemption
• A voluntary context switch occurs when a thread blocks for a resource.
• An involuntary context switch occurs when a process has either exhausted its allocated CPU running time (time quantum) or a higher priority thread has become runnable. This causes the kernel scheduler to preempt the running thread in order to execute the higher priority thread.
• Frequent involuntary context switching can hurt application performance. To see if a process is getting involuntary context switches, run:
# prstat -mL [ICX]
# mpstat [icsw] -- reports system-wide involuntary context switching
Use nice(1) to improve priority, or priocntl(1M) to move the process to a class that offers better CPU run time than the timeshare class, for example the FX class.
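A program can also inspect its own voluntary/involuntary counters (the same counts prstat -mL reports as VCX/ICX) via getrusage(3C); a minimal sketch:

    #include <stdio.h>
    #include <sys/resource.h>

    int
    main(void)
    {
        struct rusage ru;

        /* ... run the workload of interest here ... */

        if (getrusage(RUSAGE_SELF, &ru) != 0) {
            perror("getrusage");
            return (1);
        }
        /* ru_nvcsw: voluntary (thread blocked on a resource);
         * ru_nivcsw: involuntary (quantum expired or preempted
         * by a higher-priority thread). */
        printf("voluntary: %ld involuntary: %ld\n",
            ru.ru_nvcsw, ru.ru_nivcsw);
        return (0);
    }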
Process Priority
• Process priority is a number that determines the order in which the process is handled by the CPU.
• A process with a higher priority has a greater chance of running on a CPU. Note: it is the kernel thread associated with the user LWP that actually gets the priority assigned, not the process.
• The kernel adjusts the dynamic priority up and down as needed, using a heuristic algorithm based on process behavior and characteristics.
• A user process can change its static priority indirectly through the nice level of the process. Niceness is primarily a control over the proportion of CPU time that each thread receives.
• There are 170 global priorities. In the timeshare class, higher-priority threads are dispatched sooner but receive shorter time quanta; lower-priority threads get longer time slices when they do run.
• TS: timeshare scheduling class. Default class for user processes. Process priorities change dynamically according to recent processor usage; priorities are assigned according to the TS dispatch table.
• IA: interactive class. Same as TS, except a priority boost is given to threads of the window under focus.
• FSS: fair-share scheduling class. Normally used in conjunction with the resource management (project) framework. Processes are allocated CPU shares instead of priorities.
• FX: fixed priority class. No penalty for higher CPU usage; the priority remains fixed!
• SYS: system class. Assigned to kernel threads running in the kernel. There is no time quantum; they run until they block or complete.
• RT: real-time class. Fixed priority and time quantum. Highest priority of the scheduling classes, so RT threads can even preempt kernel threads.
NOTE: Interrupts that run as threads are serviced at the highest priorities (160-169).
Process Affinity
• CPU affinity is a scheduler property that "bonds" a process to a given set of CPUs.
• Running a process on the same CPU after it wakes up from sleep improves CPU cache usage.
• The scheduler assigns a process to a CPU queue according to the criteria listed below:
  • The priority of the thread
  • The home lgroup of the thread
  • Whether or not the thread is bound (using a processor set or pbind)
  • Dynamic load balancing by the dispatcher code
• Warm affinity, controlled by the rechoose_interval tunable (default 3), places threads back on the CPU they last ran on, thus potentially benefiting from a warm hardware cache. However, if too many cycles have passed since the thread last ran, the scheduler just selects the best CPU in the home lgroup, or in a remote lgroup if the thread does not have high priority.
• Process affinity can be overridden using pbind(1M). This causes the process to stay on the same CPU. However, it does not prevent the scheduler from assigning other processes to that CPU.
• For exclusive use, consider a processor set: psrset(1M). This dedicates a set of CPUs for exclusive use by the workload.
• NOTE: The Solaris Resource Management framework offers a more flexible way to configure processor sets, called resource pools. Resource pools should be used instead of processor sets, if possible.
http://www.solarisinternals.com/wiki/index.php/Zones_Resource_Controls
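pbind(1M) is a wrapper around the processor_bind(2) system call, which a program can also call directly. A minimal sketch that binds the calling process to one CPU; cpu_id is a placeholder for an online processor ID taken from psrinfo(1M):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/procset.h>

    int
    main(void)
    {
        processorid_t old;
        processorid_t cpu_id = 0;   /* placeholder: pick from psrinfo */

        /* Bind all LWPs of the calling process to cpu_id; the previous
         * binding (or PBIND_NONE) is returned in 'old'. */
        if (processor_bind(P_PID, P_MYID, cpu_id, &old) != 0) {
            perror("processor_bind");
            return (1);
        }
        printf("bound to CPU %d (previous binding %d)\n", cpu_id, old);
        return (0);
    }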
System Calls
• System calls are the legal method of trapping into the kernel from user space to perform a privileged operation.
• Processes only enter the kernel via a trap: system call, page fault, or other exception.
• System calls are referenced by number.
• On failure, a system call returns -1 to the caller and sets errno to the error code (for example, ENOMEM).
• System calls run in the kernel in the context of the invoking process.
• To find the system calls issued by an application, use truss(1):
# truss -aeflE -vall -o truss.out -p <pid>
NOTE: Running truss on an active process can slow the process down due to active tracing on every system call entry and return. Use DTrace instead.
fork and exec
• The fork() system call creates a new process. At fork time the parent's entire address space is not physically copied; instead, both processes share the same physical pages, which are marked read-only.
• When either process writes to a shared page, a page fault occurs and the kernel assigns a new physical page to the writer, copying the data into it. This deferred copying is called copy-on-write (COW).
• The child process usually executes its own program rather than the same executable as its parent. The exec() system call replaces the child's address space with the new program. Deferring the copy until then avoids unnecessary overhead, because copying an entire address space that is about to be discarded is slow and inefficient and wastes processor time and resources.
• When program execution has completed, the child process terminates with an exit() system call. exit() releases most of the data structures of the process and notifies the parent of the termination by sending a signal. At this point, the process is called a zombie process.
• The parent waits for child process termination with the wait() system call. As soon as the parent is notified of the child's termination, it removes the remaining data structures of the child and releases the process descriptor.
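A minimal sketch of the classic fork/exec/wait sequence described above; the program run by the child, /bin/date, is just an example:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int
    main(void)
    {
        int status;
        pid_t pid = fork();       /* child shares pages, copy-on-write */

        if (pid == -1) {
            perror("fork");
            exit(1);
        }
        if (pid == 0) {
            /* Child: replace the shared address space with a new program. */
            execl("/bin/date", "date", (char *)NULL);
            perror("execl");      /* reached only if exec fails */
            _exit(127);
        }

        waitpid(pid, &status, 0); /* reap the zombie */
        printf("child %ld exited with status %d\n",
            (long)pid, WEXITSTATUS(status));
        return (0);
    }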
Sleeping and Wakeup
• Processes can go to sleep, suspending execution and allowing other processes to run, until some event occurs that wakes them up.
• When a process blocks, it is removed from the CPU run queue and placed on a sleep queue: a list of processes blocked on the same event.
• The scheduler selects the next task to run from the CPU run queue.
• The process wakes up when the resource becomes available.
• Its state is set to RUN and it is placed back on the CPU run queue.
Kernel Threads
• Kernel threads are used to perform kernel-related tasks (or background tasks) such as interrupt handling, file system flushing (fsflush), memory page management (page-out) and other device driver tasks (task queues).
• Kernel threads are needed because it is not efficient to perform the above tasks in the context of an executing process. End-user programs get better response if such tasks are performed asynchronously in the background.
• A kernel thread is similar to the kernel thread that backs a user LWP. Kernel threads are scheduled (in the SYS class) at a higher priority than user processes. They exist only in kernel space and have access to kernel data structures, but no access to user space.
Interrupt Handling
[Diagram: user processes A and B entering kernel mode via system calls; a hardware interrupt delivered to the OS through the driver and physical layer; sleep and wakeup across the user/kernel mode boundary]
Interrupt Handling
• Interrupt handling is one of the highest priority tasks. Interrupts are usually generated by I/O devices such as a network interface card, keyboard, disk controller, serial adapter, and so on.
• An interrupt can also be generated by software. A common scenario is the inter-processor interrupt mechanism, which allows one CPU to send an interrupt to another CPU to force it to run a handler to perform some task. Common tasks are preemption, signal processing, and starting/stopping threads.
• Low-level functions that use soft interrupts include cross calls (xcalls), which cause processors to invalidate translation entries when the virtual address space of a process is unmapped at process exit. This is required as part of cache consistency across an SMP platform.
• An interrupt notifies the kernel of an event (such as keyboard input, Ethernet frame arrival, and so on). It tells the kernel to interrupt process execution and perform interrupt handling as quickly as possible, because some devices require quick responsiveness. This is critical for system stability and scalability.
• When an interrupt notification arrives, the kernel must disrupt the execution of the current thread so that the Interrupt Service Routine (ISR) can run to handle the interrupt.
• While an interrupt is serviced, the interrupted thread is pinned to the processor (the current thread cannot be scheduled on other processors). This allows the interrupt to be processed quickly without performing a full context switch. The CPU uses one of its partially initialized interrupt threads (10 per CPU), borrows the current thread's LWP, and executes the matching ISR. Once the interrupt is serviced, the pinned thread continues.
• In a multi-processor environment, interrupts are bound to CPUs at boot time and the interrupt processing load is distributed across all CPUs to provide balanced system performance.
• In some cases, a flood of interrupts can disrupt the performance of a CPU-bound workload. Interrupt processing not only steals CPU cycles from critical application threads but can also displace the cache by running ISR routines.
• To minimize the effect, one can use psradm(1M) to disable interrupts on selected CPUs (psrinfo(1M) shows processor status) and psrset(1M) to bind the workload to processor sets. If using the resource management framework, one can configure a resource pool and set its properties to achieve a similar effect.
209619 - Busy Interrupts, problems and solutions
Interrupt Handling
• Devices are assigned interrupt levels. Higher-level interrupts can mask out lower-level interrupts by raising the PIL on the local processor.
• Interrupt levels 10 and below are called low-priority interrupts and are handled by interrupt threads. Every CPU maintains a pool of 10 partially initialized threads (one per interrupt level). These kernel threads are used to execute the ISRs of the corresponding interrupts.
• ISR routines of low-priority interrupts can block on a synchronization object or use kernel functions that can block. In case of blocking, the interrupt thread is converted into a full kernel thread and scheduled at priority 160-169.
• NOTE: While an interrupt thread is blocked, the PIL of the processor stays at that level until the blocked interrupt thread finishes executing. That means lower-level interrupts bound to that CPU cannot be serviced during that time. However, user and other kernel threads can run while the interrupt thread is blocked.
• Higher-level interrupts (above level 10) cannot block because they cannot be scheduled. They have to be very quick and efficient to avoid delays.
• ISRs of these interrupts perform minimal work and then defer the real work to low-priority software interrupts.
• Hardware interrupt activity can be monitored using intrstat(1M), which reports the type of interrupts serviced by each CPU and the percentage of CPU time spent servicing them.
• mpstat(1M) reports interrupts taken (intr) and interrupts serviced as threads (ithr).
Clock Processing
• Solaris performs some accounting and bookkeeping activities every clock tick (10 ms). To do this, a cyclic timer is created to fire every clock tick and call the clock handler (clock()). Every tick, tick accounting determines:
  • whether a user thread is running on a CPU, charges it with one tick, and measures the thread's time quantum
  • freemem and anon values for tracking and reporting purposes; updates the lbolt counter
  • CPU run queue sizes to balance the workload; callout tick processing
  • LWP interval timers (virtual and profiling timers), if they have been set
• Traditionally, only one CPU is engaged in tick accounting (single threaded). Tick accounting was enhanced in recent Solaris 10 updates to become multi-threaded, considering that on systems with many CPUs the tick accounting code may not finish within 10 ms.
• The kernel keeps track of the number of times a cyclic has expired and the number of times it has been handled in a pending count. For an expired cyclic, a soft interrupt is posted, which causes the handler to be called until the pending count reaches zero.
• The pending count can be displayed using scat and mdb by typing: cyclic. For the clock cyclic, a non-zero pending value is a sign that the clock handler is not finishing within 10 ms and needs to catch up.
Callout Processing
• The Solaris callout facility is used by drivers and the kernel to schedule time-based events.
• The callout facility allows drivers to submit functions to be executed in the future.
• The timeout(9F) routine provides the interface to the callout facility for drivers and the kernel.
• Functions submitted by drivers and kernel subsystems are linked into the callout table, where each entry contains a pointer to the function, an argument, and the time (in clock ticks) when the function should be executed.
• At each clock interrupt, the tick value is tested and the function is executed when the time interval has expired.
• Two types of callouts are supported:
  • Real-time callouts: functions submitted via the real-time callout interface have low latency requirements. To execute them, a soft interrupt is posted (using the softcall() mechanism), which interrupts the processor so the function executes without incurring scheduling latency.
  • Examples of real-time callouts:
    - polltime: set from the poll(2) system call; wakes up a thread blocked in poll after the poll interval
    - realitexpire: generates SIGALRM to the process for real interval timers
    - setrun: sleep/wakeup condition variables (cv_timedwait(9F)) when a sleep event is submitted with a timeout value; forces a thread wakeup when the sleep time has expired and the condition is still not true
  • Normal callouts: a condition variable signal is posted to wake up one of the callout kernel threads to execute the function. A normal callout can be exposed to some additional latency while the callout thread is scheduled. These kernel threads run in the SYS class at priority 60-99.
  • Examples of normal callouts:
    - schedpaging: manages the page-out rate
    - ts_update: checks the time quantum of timeshare/interactive class (TS/IA) threads and updates their priority as needed
    - sigalarm2proc: alarm(2) system call; generates SIGALRM when the timer expires
• Failure to process expired callouts in a timely manner may result in delays and hang-like symptoms. This can happen when the CPU responsible for servicing callouts is busy servicing higher-level interrupts. To view expired callouts, use scat (callout -r, callout -ts).
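For reference, a hedged sketch of how a driver might submit a normal callout with timeout(9F) and cancel it with untimeout(9F); the function and variable names here are illustrative, not from any particular driver:

    #include <sys/ddi.h>
    #include <sys/sunddi.h>

    static timeout_id_t my_tid;

    /* Callout handler: runs once, 'ticks' after submission;
     * re-arms itself to get periodic behavior. */
    static void
    my_poll(void *arg)
    {
        /* ... poll the device, update statistics, etc. ... */
        my_tid = timeout(my_poll, arg, drv_usectohz(100000)); /* ~100 ms */
    }

    /* In attach(9E): arm the first callout. */
    /*   my_tid = timeout(my_poll, state, drv_usectohz(100000)); */

    /* In detach(9E): cancel any pending callout before tearing down. */
    /*   (void) untimeout(my_tid); */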
CPU Utilization
At any given time, a CPU may be doing any of the following:
• The CPU can be idle, waiting for work.
• The CPU can be running application code: CPU user time.
• The CPU can be running kernel code: CPU system (kernel) time. This is counted as kernel overhead, since it steals CPU cycles from user programs. High kernel overhead can result in slow application performance.
• The CPU can be waiting for IO (network or disk) completion. This activity does not use CPU cycles, so it is counted as idle. It only tells you that the last thread that ran on the CPU was context switched off due to an IO wait.
• The CPU can be busy servicing interrupts.
• The CPU can be busy with process scheduling tasks, such as context switching due to time quantum expiration or preemption.
• The CPU can be busy servicing a system call, such as fork(), read(), write(), etc. CPU time spent servicing a system call is counted as kernel time.
Understanding CPU Utilization and Saturation
Multi-core (CMT) Processors
• Multi-core CMT processors have several strands, or hardware threads. These hardware threads are seen as CPUs by the scheduler.
• Solaris schedules kernel threads onto these CPUs. There is a one-to-one mapping of software threads onto hardware threads.
• A core is a shared resource. Every CPU cycle, the hardware switches between threads within a core, allowing multiple threads to make progress. The CMT architecture is capable of executing multiple threads concurrently. If only one strand is active, the whole core can be utilized by that strand.
• Conventionally, a processor is considered idle by the kernel when there is no runnable thread in the system that can be scheduled on it. On previous-generation SPARC processors, an idle CPU meant the processor pipeline was unused, so the whole processor was idle.
• On multi-core processors, a hardware thread becoming idle doesn't mean the entire core is idle. The core continues to execute instructions as long as there is an active hardware thread.
• When a hardware thread becomes idle, it is parked. A parked thread is taken out of the mix of threads available for scheduling within the core; its time slice is allocated to the next runnable thread on the same core. A hardware thread becoming idle therefore does not necessarily reduce core utilization. Only when all hardware threads in a core are idle is it considered an idle processor, similar to non-CMT processors.
Understanding CPU Utilization and Saturation
Multi-core (CMT) Processors
• Solaris mpstat counts each strand as a CPU and thus reports strand (thread) utilization instead of core utilization.
• That means if mpstat reports %idle of 0 for every CPU in a core, the core is fully utilized. However, mpstat does not tell you whether the core is saturated. If one CPU reports %idle 0 and all other CPUs show %idle 100, it is difficult to know whether the core is already fully utilized and would saturate if more load were added.
• A single strand can keep a whole core busy if the application thread running on it is compute bound. In that situation, adding more load (software threads) steals cycles and causes the compute-bound thread to run at a lower execution speed, since the core is now shared among multiple threads. In real-world workloads, cache misses are frequent, strands stall while they are serviced, and a single thread can rarely keep a whole core busy.
• To measure core utilization and saturation, consider using pgstat(1M) or the unbundled corestat tool. For strand utilization, use mpstat(1M).
• To dedicate more core resources to a single thread on the T4 processor, consider changing the priority of the process to the FX class. The thread will then be recognized as a "critical thread" and can potentially use a full core.
https://blogs.oracle.com/observatory/entry/critical_threads_optimization
Monitoring CPU Utilization
• uptime: reports load averages (the number of runnable and running threads) over the last 1, 5 and 15 minutes. If the value is higher than the number of CPUs configured in the system, it is a sign of CPU saturation, which results in scheduling latencies.
• prstat: reports per-thread CPU usage of a process. Helpful for judging the scalability of an application. prstat -mL (LAT column) also reports per-thread CPU scheduling latency.
• vmstat: reports runnable threads and CPU utilization.
• mpstat: reports CPU utilization statistics per CPU, including kernel lock contention, interrupt and xcall overheads.
• sar: reports CPU utilization over a period of time.
• top: Swiss army knife of performance monitoring: process states, scheduling priorities, affinity, CPU utilization, etc.
CPU Event Tracing
• High CPU time in user mode is a sign that hot functions in the application are burning CPU cycles.
• Application-specific profiling tools are available to capture information about hot functions in the application. For example, Oracle Database and the JVM (java) have profiling tools to capture this information.
• For kernel profiling, lockstat(1M) can be used to enable profiling. This helps identify hot functions in the kernel dominating CPU usage:
# lockstat -o lockstat.out -kIW -i 997 -s10 -D20 -n 75000 sleep 5
• In situations where the CPU is spending its time in user space (running application code), application-specific profiling tools should be used. It may be a simple capacity planning issue where more or faster CPUs are required to meet the workload's computation requirements.
• High CPU time in system (kernel) mode can negatively affect application performance, considering that kernel tasks and threads run at high priorities and can cause scheduling latencies for application processes.
• Factors contributing to high system CPU usage include:
  • Frequent system calls by the application, resulting in high kernel resource usage
  • High paging activity on file systems and the swap device
  • High IO and network load
  • Memory shortages
  • A flood of hardware interrupts
• To find out what activity is causing high kernel CPU usage, use tracing and profiling tools.
How to Determine What is Consuming CPU System Time Using the lockstat Command (Doc ID 1001812.1)
CPU Event Tracing
System call tracing via truss(1)
• The truss command intercepts and records the system calls invoked by a process.
• It prints the return value and error (errno) returned by each system call.
• Dumps the arguments (-a) and environment strings (-e) passed to system calls.
• Displays the contents of structures passed to system calls (-v).
• Reports signals posted to the application (-s).
• Prints the elapsed time of each system call (-E). Useful for capturing system call timing.
• Can count the types of system calls issued and the errors returned by the application (-c).
• Can trace a running process (-p) or a command.
• Can follow the application's child processes (-f) or threads (-l) involved in issuing system calls.
• Can be used as a breakpoint to stop the process when it issues a particular system call (-T). Use prun(1) to restart the process.
• truss can also be used to trace library calls (-u) instead of system calls.
# truss -aeflE -vall -o truss.out -p <PID>
Caveat: While a process is running under truss, its performance suffers due to the frequent intercepts on system call entry and return.
CPU Event Tracing
cpustat(1M)
• Getting full use of the hardware not only saves cost but also provides key competitive benefits.
• Profiling is about finding the hot spots that deserve more analysis and optimization, thus making the system or application more efficient.
• Profiling captures statistics about how the code executes so that we can isolate the hot spots and then dedicate our attention to optimizing them.
• Historically, profiling has been software based, using a timer interrupt that fires at a regular interval; context information is grabbed whenever the timer interrupt occurs. The drawback is low precision and some overhead.
• Hardware-based profiling is preferred because it is more efficient and precise and has lower overhead.
• Hardware profiling uses a specialized set of counters on the CPU, called the PMU (Performance Monitoring Unit). The PMU gives access to data that cannot be captured using software-based profiling (lockstat). Hundreds of hardware events are available for analysis from the PMU.
• In Solaris, cpustat(1M) provides access to these CPU performance counters.
CPU Event Tracing
cpustat(1M)
• T-series processors offer performance counters for counting hardware events such as cache misses, TLB misses, crypto operations, FP operations, loads/stores, etc. These counters are accessed using cpustat.
# cpustat -h : shows the list of available counters
• https://blogs.oracle.com/martinm/entry/t4_performance_counters_explained
• Doc: "Performance Instrumentation" chapter of the OpenSPARC T2 documentation
• Example: one performance counter always reports the instruction count, and the other can be programmed to measure other events such as cache misses or TLB misses:
# cpustat -c pic0=L2_dmiss_ld,pic1=Instr_cnt 1
• cpustat can be used to compute cycles per instruction. If a slowdown is due to memory latency, CPU cycles per instruction will go up:
# cpustat -nc pic0=Cycle_cnt,pic1=Instr_cnt 10 1 | awk '{printf "%s %.2f cpi\n",$0,$4/$5;}'
CPI (cycles per instruction) is something to track while the workload is running.
• The cputrack(1) utility allows the CPU performance counters to be used to monitor process and LWP behavior. cputrack reports performance counters per process:
# cputrack -c pic0=Cycle_cnt,pic1=Instr_cnt -p <pid>
Module 5
Memory and Swap Monitoring
Virtual Memory
• Modern Unix systems offer an abstraction called virtual memory, which acts as a logical layer between application memory requests and the hardware Memory Management Unit (MMU).
• When a process uses a virtual address, the MMU and kernel data structures, called translation tables, cooperate to map the virtual address to a physical address (a physical memory location).
• Virtual memory has several advantages:
  • The programmer does not need to allocate or manage physical memory directly, which allows architecture-independent code.
  • A process always sees linear, contiguous ranges of bytes in its address space, regardless of the fragmentation of physical memory.
  • A program can run without being completely loaded: fast startup.
  • Several programs can execute concurrently with combined memory requirements bigger than the physical memory configured. The kernel can place or relocate application memory pages anywhere in physical memory or on disk (swap), transparently.
  • Processes can share a single memory image of a library or program, saving memory.
  • A process is isolated in its address space and can access only a subset of the total available memory.
Virtual Memory
• Each program has its own virtual address space; virtual addresses are the memory addresses the program sees.
• Virtual addresses are mapped to physical memory by the MMU and kernel translation tables.
• Two programs may use the same virtual address, but these virtual addresses are mapped to different physical memory locations.
Process Address Space
• A page is a group of contiguous linear addresses in physical memory (a page frame) or virtual memory. The kernel manages memory in page units. The base page size is 4 KB on x86 and 8 KB on SPARC.
• To execute a process, the kernel allocates memory areas in units of pages.
• The memory area used by the process to perform its work is called the process address space.
• The process address space is the range of memory addresses presented to the process as its environment.
• Typically the process address space consists of memory segments of these types:
  • Text: application instructions (code)
  • Data: initialized and zero-initialized data (BSS)
  • Heap: dynamic memory allocated on demand using malloc(). The heap segment grows towards higher addresses.
  • Stack: local variables, function parameters and function return addresses are stored here. The stack grows towards lower addresses.
  • mmap: if the application has file data or device memory mapped into user space, it appears as separate segments.
One can view a process address space by running: # pmap -xs <PID>
32-bit vs. 64-bit Applications
• 64-bit processors can address up to 16 exabytes (EB), whereas 32-bit is limited to a maximum of 4 GB of memory.
• 64-bit addressing enables applications like databases to manipulate large data sets in memory.
• The 64-bit instruction set extensions for x86 processors are referred to as AMD64, EM64T, x86-64 or just x64. The architecture has a much improved instruction set and ABI (Application Binary Interface):
  • x64 has abandoned the stack-based calling convention. Function parameters are now passed in registers rather than on the stack, so no loads and stores to/from the stack are required when accessing function parameters. In x86 (32-bit code), when a function was called, all its parameters had to be stored on the stack.
  • x64 has more general-purpose registers (6 usable in 32-bit code, 14 in 64-bit code), which reduces the frequency of register spill and fill events.
• 64-bit executables have more overhead than 32-bit executables:
  • Longs and pointers become 8 bytes (64 bits) rather than 4 bytes (32 bits).
  • The memory footprint of a 64-bit application is larger than the same application compiled for 32 bits.
  • Performance can degrade for certain applications after compiling as 64-bit: the application requires more CPU cache to fit its working set, which may result in higher cache miss rates.
Demand Paging and Page Faults
• In a demand paging system, both memory and the process address space are divided into fixed-size pages, and these pages are brought into and out of memory as required.
• No page is brought into memory until it is needed (referenced); hence the name demand paging.
• When a process attempts to access a page within a mapped memory segment of its address space and the page is not resident in memory, the system generates a fault. The kernel handles the fault by bringing the page into memory.
• Minor or soft page fault: an attempt to access a virtual address that resides within a mapped memory segment where the page is in physical memory but no MMU translation is established for it. A minor page fault is serviced by locating the page in memory. This can happen when the page has already been brought in by another process.
• Major page fault: a page fault that can only be serviced by performing IO to disk. For a major fault, the kernel has to either create a new page (in the case of first-time access) or retrieve the page from the backing store.
NOTE: A page fault on a virtual address that is not mapped into any memory segment results in a segmentation violation, which causes the process to exit.
• vmstat (mf) and mpstat (minf, mjf) report minor and major fault activity.
Virtual Swap
• Solaris can run without physical disk swap configured, thanks to the swapfs abstraction, which acts as if physical swap space were backing the page.
• Solaris works with virtual swap, which is composed of physical memory and physical disk swap:
Virtual Swap = Physical Memory + Disk Swap
• When no physical disk swap is configured, swap reservations are made against physical memory.
• Swap reservation against memory has a drawback: the system cannot malloc() more than the physical memory configured. The advantage of running without physical disk swap is that a malicious program is unable to perform huge mallocs and thus cannot bring the system to a crawl through memory shortages.
• When a process calls malloc()/sbrk(), only virtual swap is reserved. The reservation is made against physical disk swap first; if that is exhausted or not configured, the reservation is made against physical memory. If both are exhausted, malloc() fails.
• To make sure malloc() won't fail due to lack of virtual swap, configure a large amount of physical disk swap in the form of a disk slice or a file.
• To monitor virtual swap reservations, use: swap -s, vmstat (swap) or df -k /tmp
• Caution: swap -l is the wrong tool for monitoring virtual swap usage. swap -l reporting a large value in "free" does not mean that plenty of virtual swap is available; malloc() can still fail. swap -l provides information only about physical disk swap allocation, not about virtual swap usage. On a system with plenty of memory, swap -l reports the same value in the "blocks" and "free" columns. The only time the free column in swap -l output decreases is when the system is short of memory.
How does Solaris Operating System calculate available swap? (Doc ID 1010585.1)
Virtual Swap - Example

# vmstat 5
 kthr     memory              page             disk         faults       cpu
 r b w   swap     free     re   mf    pi po fr de sr s0 s1 s2 s3  in    sy     cs    us sy id
 ...
 0 0 0 3296516 38201892 4321 49454 0 0 0 0 0 0 0  6 0 11521 164084 69372 11 31 59
 0 0 0 3361076 38193196 3034 34037 0 0 0 0 0 0 0 47 0  9639 107575 37481  8 24 68
 0 0 0 3501776 38286380 3325 36763 0 0 0 0 0 0 0  5 0 12679 113673 42466  8 25 67
 0 0 0 3545612 38326200 4935 57916 0 0 0 0 0 0 0 63 0 13688 111744 35804 12 31 56 <<
 ...
Available virtual swap: 3545612 KB =~ 3G

# /usr/sbin/swap -s
total: 61515252k bytes allocated + 54986204k reserved = 116501456k used, 3472532k available
Available virtual swap: 3472532 KB = 3G
Used virtual swap: 116501456 KB = 111G

swap -l reports disk-backed swap usage. It has nothing to do with virtual swap.
Physical disk swap configured:
# /usr/sbin/swap -l
swapfile                    dev    swaplo  blocks     free
/dev/zvol/dsk/uppool/swap   181,3  8       163839992  163839992
Total disk-backed swap: 163839992 x 512 = 78G
Page Cache
• The Solaris kernel uses a page cache to keep disk data and refers to it when reading or writing application data to disk.
• Solaris uses free physical memory to cache recently used data, which avoids subsequent accesses to that file going to disk. When application memory demand increases, memory used by the page cache can be discarded. This helps application performance:
  • A program starts faster the next time it is run.
  • Pages are kept in memory for an indefinite period of time and can be reused by other processes (system libraries, shared memory, etc.) without accessing the disk.
  • File system read-ahead caching: this feature reads more data than the application requested when the IO pattern is sequential. This allows the next adjacent read to be satisfied from the page cache instead of the disk.
• An application does not write data to disk; it writes to the page cache. Data is cached in the page cache and later written to disk. The advantages are:
  • The application writes at memory speed instead of disk speed.
  • Delaying the write to disk allows the process to keep modifying data in memory. This improves performance because several write operations on a page can be satisfied by just one slow physical update.
  • When a file system runs at a 99.9% cache hit rate, only a trickle of data reaches the disk, which reduces physical IO demand.
• To find page cache memory usage, use: # echo "::memstat" | mdb -k. The value reported in "Page cache" is the page cache memory.
Understanding Cyclic Caching and Page Cache on Solaris 8 and Above (Doc ID 1003383.1)
Flushing Dirty Buffers in the Page Cache
• When a process reads data from disk, the data is copied into a user buffer and also cached in memory, in the page cache.
• This process and other processes can later retrieve the same data from the page cache.
• When a process changes the data, it is changed in memory first. At that point the data on disk and in memory are not identical, and the data in the page cache is referred to as a dirty buffer.
• A dirty buffer should be synchronized with the data on disk as soon as possible, or the data in memory could be lost if a sudden outage occurs.
• The synchronization process for a dirty buffer is called a flush. The fsflush kernel thread is responsible for flushing data to disk.
ARC Cache - ZFS
• ZFS manages its file system cache differently from UFS and VxFS. Unlike UFS and VxFS, which use the page cache, ZFS uses the ARC cache for caching file system blocks.
• The ARC cache is part of kernel memory. As a result, systems running ZFS report higher than normal kernel memory allocation.
• ZFS returns memory from the ARC only when there is memory pressure. However, some applications, such as Oracle DB, are sensitive to memory pressure, in which case limiting the ZFS ARC is needed. Also, on systems with large memory configured, reaping memory from the ARC can drive high system utilization at the expense of performance.
• To limit the ZFS ARC cache, set the tunable in /etc/system and reboot:
set zfs:zfs_arc_max=0x100000000 # 4G
• There is no rule of thumb for setting a limit on the ZFS ARC. It is a capacity planning question that differs from workload to workload.
• One can monitor ZFS ARC usage using: # kstat -n arcstats
• Memory allocated to the ZFS ARC can also be monitored using: # echo "::memstat" | mdb -k. The value reported in "ZFS File Data" is the ARC file cache.
How Solaris ZFS Cache Management Differs From UFS and VXFS File Systems (Doc ID 1005367.1)
File System Cache - Write Throttling
• When an application writes, the data is buffered in the file system cache to improve efficiency. However, a large number of dirty (modified) blocks in the file system cache can tie up memory, cause memory shortages, and affect overall system performance negatively.
• File systems like UFS and ZFS use write throttling to avoid having too much memory occupied by dirty buffers. To prevent a single process from dirtying too many pages in the file system cache, the application process is periodically put to sleep in write() to slow the growth of dirty buffers until the storage catches up.
• When ZFS throttling happens, write() is delayed by 1 tick (10 ms). Without throttling, a (p)write() should be near-instantaneous. ZFS throttling is an indication that ZFS has accumulated a great deal of data to be synced and wants to pause writers (application processes) before accumulating even more.
• ZFS uses transaction groups, where all dirty buffers in memory are synced to disk in a single transaction. Normally, a transaction group (txg) is synced every 30 seconds, but ZFS may start syncing txgs more frequently (every 5 seconds) if it sees the storage falling behind. This provides consistent write performance instead of pauses.
https://blogs.oracle.com/roch/entry/the_new_zfs_write_throttle
• The UFS high-water mark limits the dirty pages in the page cache for a single file. ufs_HW (default 16 MB) is used to temporarily put to sleep a process attempting to write to a given file. However, the actual throttling condition is not related to the number of bytes in the write() call but to the number of bytes in flight to the page cache.
• When a process issues write()s, it can dirty pages up to ufs_HW bytes; while the process is dirtying pages, IOs of the UFS cluster size (1 MB, the maxcontig tunefs parameter) are issued behind it. How fast a process can write() is limited by ufs_HW and the UFS cluster size. mmap()'ed files are not affected by ufs_HW and can dirty far more pages than ufs_HW in the page cache.
Write Throttling in UFS and ZFS File Systems (Doc ID 1470681.1)
Reclaiming Memory or Swapping
• A physical resource such as memory has a finite limit. When multiple processes/threads compete for a scarce resource, memory shortages result. To keep up with memory demand, the kernel has to free physical memory pages; otherwise the application or system will hang waiting on memory allocation. Paging can be a serious performance problem during memory shortages, because the system cannot service the application's memory allocation requests immediately and the swap mechanism is invoked to free memory pages.
• The kernel borrows a page from a process and writes it to the swap device. The system can then reuse the page frame to back the new virtual address being accessed.
• The anon structure associated with the old page that migrated to the swap device is updated with the page's disk location. Exactly where the old page is written depends on the kind of swap space used. Solaris supports swap as an entire disk partition or as a file on a UFS file system. NOTE: When ZFS is used as the root file system, the swap device can be configured on a zvol.
• When the freemem counter (number of free pages) drops below the threshold lotsfree, the page daemon starts scanning pages in an attempt to free them to meet memory demand. Not every page scanned can be freed: a page is freed only if it is not referenced between the clearing and checking of its reference bit. The scanner starts at a rate of slowscan (100 pages/second). If free memory continues to drop below the threshold desfree (lotsfree/2), the scan rate increases toward fastscan (the lesser of 64 MB worth of pages or 1/2 of memory, per second).
• Tools: vmstat (sr), vmstat -p (api/apf/apo), kstat -n system_pages, prstat -mL (DFL)
Monitoring Memory and Swap Usage to Avoid A Solaris Hang (Doc ID 1348585.1)
Swap or No Swap
• Solaris can run without physical disk swap.
• Having large swap lets processes malloc() more memory than is configured in the system. Without swap, the virtual memory a process can use is limited to the amount of physical memory.
• A program with memory leaks may last longer with swap configured, since memory not referenced by the process for some time is migrated to the swap device during memory pressure.
• You cannot hibernate the system without configuring a swap device.
• Having no physical disk swap limits one kind of malicious attack: with swap configured, a malicious program can perform a huge malloc() and then touch the memory, causing critical application pages to be swapped out, which can result in slow performance or outages.
Large Pages Out Of the Box (lpoob)
• The large-page-out-of-box (lpoob) feature was added to Solaris 10 to provide larger page sizes for process heap, stack and private anonymous memory. With this feature enabled, process-resident pages can be multiples of a large page size instead of 8 KB, which may consume more memory but provides better performance.
• To find the page sizes supported on a platform, use: # pagesize -a
Large pages can improve performance:
• TLB entries cover a larger part of the address space when large pages are used, so there are fewer TLB misses thanks to the larger reach. The TLB has a fixed number of slots, and each slot can be programmed with different page sizes (up to a 2 GB page size on some platforms), depending on the architecture.
• Having a TLB slot programmed to map large pages means a bigger address space fits in the TLB without incurring TLB misses, which can mean better performance for some workloads.
• Fewer page faults: a single page fault can bring a big chunk of data into memory at once.
• Lower cost of virtual-to-physical address translation: better TLB hit rates eliminate translation altogether.
• Large pages are also a way to reduce memory footprint. This makes a big difference for applications that cache large amounts of data, such as an Oracle Database cache (SGA) that can exceed 100 GB. Large pages reduce the size of translation tables and thus save memory, which can also have a big impact on system performance.
• trapstat(1M) can be used to measure the degree of performance improvement from large pages.
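An application can also request a preferred page size for a range of its address space with memcntl(2) and MC_HAT_ADVISE (this is what the ppgsz(1) wrapper does). A hedged sketch, assuming the platform supports a 4 MB page size (check pagesize -a); the address must be aligned to the requested page size:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/mman.h>

    int
    main(void)
    {
        size_t pgsz = 4 * 1024 * 1024;   /* assumed: listed by 'pagesize -a' */
        size_t len = 16 * pgsz;
        struct memcntl_mha mha;

        /* The allocation must be aligned to the requested page size. */
        char *buf = memalign(pgsz, len);
        if (buf == NULL)
            return (1);

        mha.mha_cmd = MHA_MAPSIZE_VA;    /* advise page size for this VA range */
        mha.mha_flags = 0;
        mha.mha_pagesize = pgsz;

        if (memcntl((caddr_t)buf, len, MC_HAT_ADVISE,
            (caddr_t)&mha, 0, 0) != 0)
            perror("memcntl MC_HAT_ADVISE"); /* falls back to default size */

        buf[0] = 1;                      /* touch: fault in the mapping */
        return (0);
    }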
Page Coalescing
• A request for a large page is serviced by searching the local and remote mnode free lists. If a large page is found, Solaris returns it to the application. This is quick and has little or no overhead.
• When the system runs out of large pages due to high demand for smaller pages (the file system is a major consumer of 4K/8K pages), the kernel attempts to coalesce smaller pages into a large chunk to meet the workload's large page demand.
• Building a large page can be an expensive operation on large NUMA servers, since it requires searching page ranges and relocating pages (if required). This can cause high kernel CPU usage due to lock contention, which can negatively affect application performance.
• To minimize or eliminate the impact of coalescing activity on application performance, consider disabling the page coalescing feature by setting the tunable in /etc/system and rebooting:
set pg_contig_disable=1
• Sample stack showing page coalescing activity, as reported by lockstat during high kernel CPU usage:
unix:page_trylock_cons+0xc(0x7000189e880?, 0x1, 0x2a103790bd8)
unix:page_get_mnode_freelist+0x19c(0x0, 0x3, 0x0, 0x0, 0x0, 0x0)
unix:page_get_replacement_page+0x30c(0x7000448f880, 0x0, 0x0)
unix:page_claim_contig_pages+0x178(0x7000448f800, 0x1, 0x20000, 0x20000, 0x1)
unix:page_geti_contig_pages+0x6a4(0x0, 0xd, 0x1, 0x20000, , 0x1fffff, 0x0)
unix:page_get_contig_pages+0x160(0x0, 0xd, 0x0, 0x1, 0x0)
unix:page_get_freelist+0x428(0x60021525e40, 0x0, 0x300457104f0, 0x1006d0000, 0x10000, 0x0 ...
Shared Memory
• Inter-process communication (IPC) in Solaris is the means of sharing data and synchronizing events among processes.
• Shared memory provides efficient data sharing among multiple processes, since data does not need to be moved across multiple process address spaces. Shared memory is the fastest form of IPC available: one copy of the data in memory can be shared by multiple processes.
• Shared memory allows physical memory pages to be shared by multiple processes. That means multiple processes can attach, or have mappings, to the same physical memory segment.
• Access to shared memory is performed by simple pointer dereference in code. Oracle uses shared memory to cache frequently used data blocks (buffer cache) and to facilitate communication (shared pool) among Oracle processes (pmon, smon, dbwr, lgwr, and the Oracle shadow processes).
• Semaphores are used to control concurrency between processes when data in shared memory needs to be modified.
• To allocate or find a shared memory segment, a process calls shmget(). Attaching to the shared memory segment is done with shmat().
• To see the shared memory segments attached to a process, use: # pmap -x <PID>. Shared memory segments allocated system-wide can be monitored using the ipcs utility.
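A minimal sketch of the shmget()/shmat() sequence; on Solaris, passing SHM_SHARE_MMU to shmat() requests an ISM segment (see the next slide):

    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int
    main(void)
    {
        /* Create a 1 MB System V shared memory segment
         * (IPC_PRIVATE here; real programs use a key from ftok()). */
        int shmid = shmget(IPC_PRIVATE, 1024 * 1024, IPC_CREAT | 0600);
        if (shmid == -1) {
            perror("shmget");
            return (1);
        }

        /* Map the segment into this process's address space. */
        char *p = shmat(shmid, NULL, 0);
        if (p == (char *)-1) {
            perror("shmat");
            return (1);
        }

        strcpy(p, "visible to every process attached to this segment");

        shmdt(p);                          /* unmap */
        shmctl(shmid, IPC_RMID, NULL);     /* mark segment for removal */
        return (0);
    }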
Types of Shared Memory
Solaris supports three types of shared memory:
• Pageable shared memory: the same as any other anonymous memory (heap, stack). A page fault into the address space results in allocation of physical memory. Pages can be swapped out during memory shortages. The kernel needs to acquire locks on pages when writing data cached in them, which can induce kernel lock contention and result in slow performance.
• Intimate Shared Memory (ISM): ISM is optimized shared memory that addresses the shortcomings of pageable shared memory. Unlike non-ISM memory, ISM memory is automatically locked by the kernel and does not require disk swap reservation. This ensures that ISM pages are not paged out during memory shortages. Also, IO performed on pages in an ISM segment uses a fast locking mechanism that reduces kernel lock contention and, in turn, kernel CPU overhead.
• Dynamic Intimate Shared Memory (DISM): DISM is similar to ISM, but it can be dynamically resized according to application demand. With ISM, it is not possible to change the size of the segment once it has been created.
  • For example, an Oracle Database must be restarted if the buffer cache size needs to be increased. DISM was introduced as a RAS feature that allows Oracle Database to handle dynamic reconfiguration events such as adding or removing memory from the system. With this feature, a large DISM segment can be created when the database is started (see doc: Dynamic SGA).
  • The database can selectively lock and unlock sections of this segment as memory requirements change. Unlike ISM, where the kernel locks the shared memory segment, responsibility for locking and unlocking (mlock(3C)) is placed on the database or application. This provides the flexibility to adjust memory requirements dynamically.
  • Once DISM is locked properly by the database, it behaves the same way as ISM, both in functionality and in performance. That means the availability benefits of DISM can be realized without compromising performance.
ISM or DISM Misconfiguration can Slow Down Oracle Database Performance (Doc ID 1472108.1)
mmap - Memory Mapped IO
• Memory-mapped IO offers faster access to files.
• Once a file is mmap'ed, no read(), write() or lseek() calls are needed to access it:
  • The file is mapped into the process address space, so data is accessed by simple pointer dereference in code instead of read or write calls.
  • It takes advantage of the paging mechanism by associating virtual addresses with the data.
• mmap() can be used:
  • To perform memory-mapped IO to a regular file
  • To access device memory from user space
  • To provide shared memory between parent/child or unrelated processes
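A minimal sketch of reading a file through mmap(2): after the mapping is established, the file contents are accessed with plain pointer dereferences and demand paging brings the pages in (/etc/hosts is just a convenient example file):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    int
    main(void)
    {
        struct stat st;
        int fd = open("/etc/hosts", O_RDONLY);

        if (fd == -1 || fstat(fd, &st) != 0) {
            perror("open/fstat");
            return (1);
        }

        /* Map the whole file; pages fault in on first access. */
        char *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
            MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return (1);
        }

        /* Access the file via pointer dereference - no read() needed. */
        fwrite(p, 1, (size_t)st.st_size, stdout);

        munmap(p, (size_t)st.st_size);
        close(fd);
        return (0);
    }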
Kernel Memory Allocation
• The kernel and applications compete for physical memory. Like applications, the kernel uses virtual memory to access physical memory. However, kernel memory allocation does not cause page faults, and most kernel memory is wired down (non-pageable). [Exception: lightweight process thread stacks, part of kernel memory, can be paged out.]
• Kernel drivers and modules are the main consumers of kernel memory:
  • The ZFS ARC uses kernel memory to cache file system blocks
  • The STREAMS subsystem (network) uses kernel memory for allocating message blocks (msgblk)
  • Translation tables used for virtual-to-physical memory translation consume kernel memory
• Slab allocator: kernel drivers request memory via the slab allocator. Since the page allocator works in whole pages, it would be wasteful to hand out an entire page when a driver needs only a few bytes of buffer. The slab allocator manages allocations of different sizes and provides a number of benefits:
  • The kernel commonly relies on small objects that are allocated many times over the lifetime of the system. The slab allocator caches similarly sized objects, avoiding the fragmentation problems that would otherwise commonly occur.
  • The slab allocator also supports initialization of common objects, avoiding the need to repeatedly initialize an object for the same purpose.
  • The slab allocator supports hardware cache alignment and coloring, which allows objects in different caches to occupy the same cache lines for increased cache utilization and better performance.
• To find the size of kernel memory, use: # echo ::memstat | mdb -k or # kstat -n system_pages
• To list memory allocated in the various kernel caches maintained by the slab allocator, run: # echo ::kmastat | mdb -k
• The slab allocator offers auditing and debugging support for isolating memory leaks and driver bugs.
How to Use Solaris kmem_flags to Analyze Kernel Memory Problems for Solaris 8 and Newer (Doc ID 1008944.1)
Module 6
Analyzing Resource Contention and NUMA-related Latencies
SMP Architecture
• The Solaris kernel supports SMP systems.
• SMP is a shared-memory architecture in which a single kernel instance and a single memory address space are shared by all processors.
• To support the SMP architecture, the kernel must synchronize access to critical data to maintain data integrity, coherency and state.
• The kernel synchronizes access by defining a lock for each kernel data structure or variable that requires protection; code reading or writing the data must first acquire the appropriate lock and release it after the update.
• An in-depth understanding of Solaris synchronization primitives is the key to avoiding deadlocks and hangs and to developing scalable and robust solutions.
SMP Architecture
• Multi-threaded applications share the same memory, globals and static variables within the process address space.
• The same scenario applies to the kernel, where all kernel threads and drivers share a single kernel address space and thus access the same memory, globals and static variables. The types of issues that need to be addressed in such an environment:
• Race conditions: when the outcome depends on the relative timing of multiple tasks, resulting in unpredictable results and even data corruption. Preemption can also cause race conditions, even on uniprocessor systems: by preempting one task inside a critical region, we open the door to a race condition, because the preempting thread may run in the same critical region and corrupt the data.
• Critical region: the piece of code containing the concurrency issue. It should be minimized for increased concurrency.
• Deadlock: occurs when two tasks are each waiting for the other to release a lock. Deadlock can occur even with a single CPU if preemption is enabled.
• The solution is to recognize when these simultaneous accesses occur and use locks to make sure that only one process/thread can enter the critical region at any time.
What is a Lock
• Concurrency is the key to scalability. The idea is to minimize locking without sacrificing the integrity of the system.
• A lock is a piece of data at some memory location. In its simplest form, a lock is a single byte in RAM. The lock is considered held when all bits are 1s (0xff) and available when they are all 0s (0x00).
• Testing and setting the bits of a lock requires an atomic operation, and thus hardware support. The current state of the lock must be globally visible to all processors.
• To implement an SMP architecture, the hardware must support cache coherency. This is required so that all processors have a globally consistent view of the lock state when the lock changes.
• A hot lock in the application or kernel can hurt scalability. Once a critical section with high lock contention is identified, consider:
  • Using finer-grained locks: break the large critical region into smaller regions protected by separate locks.
  • Using the correct locking primitive for the job.
Solaris Synchronization Primitives
• Types of locks available in Solaris: spinlock, mutex, semaphore, reader/writer lock
• spinlock: a very simple single-holder lock: if you can't get the spinlock, you keep trying (spinning) until you can. Spinlocks are very small and fast, and can be used anywhere.
• mutex: similar to a spinlock, but you may block while holding a mutex. If you can't lock a mutex, your task suspends itself and is woken up when the mutex is released. This means the CPU can do something else while you are waiting.
• semaphore: can have more than one holder at any time (the number is decided at initialization time), though it is mostly used as a single-holder lock, like a mutex. If you can't get a semaphore, your task is suspended and woken up later
• reader/writer lock: divides lock usage into two groups, readers and writers. Since it is typically safe for multiple threads to read data concurrently, it allows multiple concurrent readers; for updates, however, only a single writer is allowed, and no reader is allowed in the critical section while a writer is active. Useful when data is read frequently and modified only occasionally
Multi-Process Applications - Semaphores
• Applications that use a multi-process model typically use shared memory for data sharing among processes and semaphores for synchronization between them
• Oracle DB, for example, is a multi-process application that uses shared memory for caching data blocks and semaphores to synchronize access among processes. Semaphores come in two flavors:
• Binary semaphores: act much like simple mutexes, protecting a single resource with single process/thread access
• Counting semaphores: can be initialized to any arbitrary value, which should depend on how many units of the shared resource are available. Many threads can obtain the lock simultaneously until the limit is reached. Semaphores are common in multi-process applications such as databases, where they act as a synchronization primitive protecting data in shared memory
• Semaphores are heavily used by Oracle to wake up processes. Typically, a bunch of Oracle processes wait for a database transaction to finish; when the transaction is done, Oracle kicks each of these semaphores individually to wake up all those processes. The Solaris kernel has been enhanced to improve semaphore performance by breaking the kernel locks that implement semaphores into finer-grained locks.
Sleep Queue
• Sleep queues hold threads that need to wait for an event or a system resource. When a process blocks, it is removed from the CPU run queue and placed on a sleep queue.
• A resource can be a mutex lock or a reader/writer lock, or the thread may be blocking in a system call (read, write). NOTE: turnstiles are a type of sleep queue used to support priority inheritance when a thread is blocked on a mutex or reader/writer lock
• A thread checks for a resource and, if it is not available, calls cv_wait(), cv_wait_sig() or cv_timedwait(), which places the thread on the appropriate sleep queue specified by the thread
• t_wchan (wait channel), reported by "ps -efl" (WCHAN), contains the address of the condition variable on which the thread is blocked. NOTE: it is not the address of the resource the thread is blocked on
• When an event occurs, either one thread (cv_signal()) or all threads (cv_broadcast()) blocked on that event are woken up. cv_signal() is used to wake up one thread at a time; this avoids the condition called a thundering herd (all sleepers are woken up, but only one of them can get the resource and the others go back to sleep)
• To find where in the kernel a process/thread is blocked, dump the associated kernel stack: # echo "0t<PID>::pid2proc |::walk thread |::findstack -v" | mdb -k
• To dump a process (application) stack use: pstack <pid>
poll and select
• Applications that deal with multiple input and output streams (file descriptors) use poll()
• poll() is implemented using multiple sleep queues. The poll() system call sleeps until an event occurs on one of the sleep queues and wakes the thread up
• The user application supplies a number of file descriptors to poll and asks the kernel to notify it when a requested event happens, such as data ready for reading on a socket or space available in a buffer for writing
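As an illustrative one-liner (assuming Solaris 10 or later, where poll(2) enters the kernel via the pollsys system call), this counts poll activity per process and per descriptor count:

# dtrace -n 'syscall::pollsys:entry { @polls[execname, arg1] = count(); }'

arg1 is the nfds argument, so the aggregation also shows how many descriptors each poller is watching.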
Multi-threaded Application
• A thread is a sequence of instructions within a program that can be executed independently
• All threads of a process run in the same address space
• A thread contains its own stack, local variables, function arguments and return values, a copy of the CPU registers, and some scheduling data structures. Other data is shared with the process's other threads
• Thread libraries schedule threads on LWPs (Lightweight Processes), the kernel scheduling entity. The mapping of threads to LWPs in Solaris uses a 1:1 model: every user thread is mapped to a Lightweight Process
• Multi-threaded functions/libraries are marked as:
• Thread-safe: shared data is protected with locks
• Reentrant: allows more than one thread to execute it concurrently
• Async-safe: can be called from a signal handler, i.e. reentrant while handling a signal
• Concurrency means multiple tasks can run in parallel in any order; parallelism means simultaneous execution of code (not possible on a uniprocessor system)
[Figure: Single-Threaded and Multi-Threaded Process]
Multi-Threading Principle and Design
• How scalable a multi-threaded application is depends on the nature of the application and the design pattern used to implement it:
• If a program is sequential, adding threads will not provide much performance gain
• If 90% of the program's execution time is spent in 10% of the code, adding parallelism to the other 90% of the code will not provide much benefit
• Careful analysis and profiling of the code may be required to find code paths that would benefit from being broken down into separate independent tasks
• Ideally, threads should operate on disparate data, but they often have to access shared data. Concurrency may suffer if many shared data structures must be accessed by multiple threads too frequently
• A multi-threaded application can be designed as:
• One thread dispatching work to a pool of pre-allocated worker threads
• Each thread working at a different stage of a pipeline: each thread works on data processed by the previous thread and hands it off to the next. The distribution of work should be equal to avoid pipeline stalls
Multi-threaded Application Performance Considerations
Adding threads to an application can substantially improve performance when done properly and in the right problem context. Consider the following when analyzing your program for potential bottlenecks:
• Lock granularity: the more fine-grained your locks, the more concurrency you can gain, but at the cost of more overhead and potential deadlocks. Instead of locking the whole structure, consider locking individual fields of the structure
• Lock ordering: avoid obtaining locks out of order; always acquire locks in an agreed order
• Lock frequency: locking too frequently adds overhead and reduces concurrency. Find out where locks are really needed, and use the right type of lock for the problem
• Critical sections: minimize, where possible, critical sections that can cause high contention
• Thread pool: pre-allocate threads instead of creating them on demand; this processes user requests more efficiently and is less resource intensive
• Number of threads: more threads can improve application throughput if there is little or no contention, but a higher number of threads can hurt performance if contention already exists
Analyzing Latency Issues - Java
• Locks (aka monitors) in Java applications should be examined to find where Java threads are blocking
• As a Java application runs, threads enter and exit monitors, and wait or contend on monitors. Tools are available to monitor:
• wait and notification events
• contended monitor entry and exit events. A contended monitor entry is the situation where a thread attempts to enter a monitor while another thread is already in the monitor
• All monitor events provide the thread ID, a monitor ID, and the class of the object as arguments. The thread and the class help map events back to the Java program
• The Java heap and time-based CPU profiling agents can be used to analyze threads waiting to enter a monitor and CPU utilization. When the process exits or receives a control signal, the profiling agent dumps its results into the java.hprof.txt file:
$ java -agentlib:hprof=monitor=y <Application>
$ java -agentlib:hprof=cpu=times <Application>
• One can measure application pauses due to GC (and the objects whose allocation triggered the collection) by adding options to the java command line: -XX:+PrintGCApplicationStoppedTime -Xloggc:gc.log
https://blogs.oracle.com/jonthecollector/
Analyzing Latency Issues - Java
• Tools available to monitor lock contention in a Java process:
• jstack prints the name, thread ID, full class name, thread state and Java stack trace of every thread in a Java process: $ jstack <pid>
• The jconsole utility can be used to summarize heap utilization, system utilization, class loader activity, etc.
• The jstat utility prints statistics such as memory utilization, class-load activity, HotSpot compiler stats, and the reason a GC (garbage collection) event occurred
• visualgc: a free download that graphically monitors the Java runtime subsystems: class loader, garbage collector and HotSpot compiler activity: $ visualgc <PID>
http://www.oracle.com/technetwork/java/javase/tools6-unix-139447.html
http://prefetch.net/blog/index.php/category/java/
http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html
• Sun Studio has built-in DTrace capabilities for debugging Java programs
http://docs.oracle.com/cd/E19205-01/820-4221/
• There are DTrace probes in the HotSpot JVM that can be used for profiling and debugging:
http://docs.oracle.com/javase/6/docs/technotes/guides/vm/dtrace.html
http://prefetch.net/presentations/DebuggingJavaPerformance.pdf
Analyzing Kernel Contention - lockstat(1M)
• Analyzes lock contention in the kernel
• There are two types of events that can be captured:
-C: contention (block) events
-H: hold events
• It dumps stack traces, caller, lock and number of events. This helps isolate the code path used to acquire the lock, the function name and the kernel overhead
• One can use lockstat's filtering capabilities:
• Watch a particular lock by specifying its symbolic name
• Watch lock events associated with a function (symbolic name)
• lockstat(1M) in Solaris 10 is a DTrace consumer
# lockstat -o lock.out -CH -i 997 -s 10 -D20 -n 75000 sleep 5
A Primer On Lockstat (Doc ID 1005868.1)
[Diagram: lock lifecycle - acquire -> contended (wait) -> acquired (hold) -> release]
Analyzing Application Lock Contention - plockstat(1M)
• Application lock contention should be monitored using application-specific tools:
• Oracle Database: statspack and AWR reports give good detail about the top lock (called latch) contention in the database
• Java application lock contention (on monitors) can be monitored using Sun Studio or by enabling DTrace probes
• plockstat(1M) can also be used to display user-level lock contention
• It collects the same types of events (contention, hold) as lockstat(1M), but applies to application locks
• prstat -mL (LCK) reports application-level lock contention. When prstat shows an application spending 80-100% of its time blocking on application-level locks, recommend that the customer use plockstat(1M). The data collected may not be of any use to Oracle Support if it is a 3rd-party application
# plockstat -C -e 60 -v -s 10 -p <PID>
Using process tools to diagnose common issues (Doc ID 1429797.1)
[Diagram: lock lifecycle - acquire -> contended (wait) -> acquired (hold) -> release]
NUMA Considerations
• To scale vertically, more CPUs and memory must be configured in the system. Large systems that provide uniform memory access to all CPUs are difficult to build
• Large systems are NUMA: some memory is local to a CPU and thus accessible with the lowest latency and highest throughput, while other memory is remote, with higher latency and therefore lower achievable throughput
• In the picture, if the process is running on CPU 3, memory allocation in Node 2 will be local and thus low latency, but memory allocations in Nodes 1, 3 and 4 are remote, with higher latency; Node 4 may exhibit the highest latency
• NUMA lowers hardware costs but increases the work that must be done in software or the OS to optimize performance
http://pollywog.com/paul.echeverri/portfolio/mtpodg/p3.html
NUMA Nodes - lgroup
• CPUs and RAM chips are grouped into local and remote "nodes". Typically, a NUMA system connects CPUs and memory banks as follows:
• Some CPUs are connected directly (locally) to some memory banks
• A bus interconnect provides access to data in (remote) memory banks to which a CPU is not directly connected. Even single-board systems with several cores and an integrated memory controller are considered NUMA
• A ccNUMA (cache-coherent NUMA) system offers more throughput than SMP because aggregate memory bandwidth increases as more NUMA nodes are added to the system
• On a NUMA machine, it is beneficial to have all the required resources (CPU, cache, memory, IO) co-located. This helps minimize memory latencies. To enable co-location, the Solaris kernel uses lgroups
• In the Memory Placement Optimization (MPO) framework used by Solaris, a set of CPUs and its local memory is called a locality group (lgroup). When a thread is created, it is assigned a "home" lgroup, and the Solaris scheduler tries to run the thread on a CPU in its lgroup whenever possible. Thread-private memory (anon, stack, heap) is also allocated from the home lgroup whenever possible. Shared memory is, however, striped across lgroups
• On T-series, the physical address space is interleaved across processors at a 1GB granularity. This mapping information is passed to Solaris so that it can optimize application memory placement. Solaris uses the MPO framework to allocate stack and heap on the same processor where the process/thread was created, taking advantage of the lower latency to local memory
• The Solaris scheduler is both NUMA and multi-core (CMT) aware and tries to spread threads across as many hardware threads (strands) in the processor as possible. The scheduler spreads threads first across cores, 1 thread per core until every core has one, then 2 threads per core until every core has two, and so on. Within each core, the scheduler balances the threads across the core's two pipelines
https://blogs.oracle.com/dave/entry/solaris_scheduling_and_cpuids
Solaris Memory Allocation Policies
• Solaris's default memory allocation policy is to allocate memory from the local node (home lgroup) where the process is running. Once the process is running on a given node, Solaris prefers to keep it on the same lgroup, close to the allocated memory, to keep memory latency low
• This default policy offers reasonable scaling on a NUMA system for short-lived processes with small memory footprints, where individual processes are spread out to run across all nodes, allocate memory locally as they run, and the system load-balances by moving new processes to less loaded nodes
• For a workload consisting of a number of long-running processes using large amounts of shared memory, for example a database, the default memory allocation policy may not scale well
• To deal with the complexity of memory allocation and scheduling for large, long-running workloads, one can change the default allocation policy system-wide, or per application using pmadvise(1) [recommended; see the sketch below] or the madv.so.1 library (less flexible than pmadvise)
• pmadvise(1) applies rules to a process that define how the process uses memory. One can apply advice to a specific memory segment: private, shared, heap, stack, etc.
• The system-wide default memory allocation policy can be changed by setting the tunable lgrp_mem_default_policy in the /etc/system file to any value below and rebooting:
• lgrp_mem_default_policy=0 | LGRP_MEM_POLICY_DEFAULT (default)
• lgrp_mem_default_policy=1 | LGRP_MEM_POLICY_NEXT ; next to the allocating thread's lgroup
• lgrp_mem_default_policy=2 | LGRP_MEM_POLICY_RANDOM_PROC ; randomly across the process
• lgrp_mem_default_policy=3 | LGRP_MEM_POLICY_RANDOM_PSET ; randomly across the processor set
• lgrp_mem_default_policy=4 | LGRP_MEM_POLICY_RANDOM ; randomly across all lgroups
• lgrp_mem_default_policy=5 | LGRP_MEM_POLICY_ROUNDROBIN ; round robin across all lgroups
• lgrp_mem_default_policy=6 | LGRP_MEM_POLICY_NEXT_CPU ; near the next CPU to touch the memory
http://pollywog.com/paul.echeverri/portfolio/mtpodg/p3.html
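For example (a sketch: the PID is a placeholder and the advice value is illustrative), to advise the kernel that a process's heap will be accessed by many threads, so its pages should be spread across lgroups rather than kept next to one home lgroup:

# pmadvise -o heap=access_many <PID>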
NUMA-Aware Tools
• In Solaris, an lgroup is a set of CPUs that have the same view of memory. Solaris 11 is bundled with several NUMA-aware tools:
• lgrpinfo(1): Perl script that displays the lgroup hierarchy, contents, and characteristics
• plgrp(1): proc tool for observing and affecting home lgroups and lgroup affinities
• pmadvise(1): proc tool for applying advice with madvise(3C)
• pmap(1) extensions to display the lgroup containing the physical memory backing a given virtual address in a specified process
• ps(1) extensions for lgroups
• prstat(1) extensions for lgroups
• One can also capture per-lgroup statistics using: # kstat -m lgrp
lgrp tools: http://docs.oracle.com/cd/E19963-01/html/820-1691/gevrx.html
lgrp API: http://docs.oracle.com/cd/E19963-01/html/820-1691/lgroups-1.html
http://dsc.sun.com/solaris/articles/scalable/
https://blogs.oracle.com/dave/entry/thread_placement_policies_on_numa/
Misc.
OLD SLIDES FROM 2008
Guds Analysis
Amer Ather
Senior Staff Engineer
Performance
Performance is the system's ability to meet well-defined criteria. System performance can be evaluated and measured across the following system resources:
• Memory usage
• CPU utilization
• IO pattern
• Network bandwidth
Performance Issues
• Define the problem: slow database query, slow interactive response, batch job taking longer to complete, slow backups, etc.
• Set expectations: Sun Service's role is to identify the performance inhibitor. Resolution may require Sun PS or 3rd-party vendor engagement
See article: http://blogs.sun.com/hippy/entry/what_s_the_answer_to
Questions to Ask
• When was the performance issue first observed?
• How is performance measured? Obtain historical data showing good or expected performance, for comparison purposes
• What has changed: workload, software, patches, tuning, hardware, storage, new install, etc.
• Where is the problem? Identify the application having issues: Oracle, SAP, etc.
Solaris Performance Analysis Tools
vmstat, iostat, mpstat, sar, kstat, prstat, ps, lockstat, mdb, DTrace, trapstat, truss, plockstat, tnf (prex), netstat, nfsstat, ndd, ipcs, proc utilities (pstack, pfiles, etc.)
Note: only the highlighted tools will be discussed
Guds
• Both guds and Explorer data are required when escalating a performance issue to TSC Kernel
• Guds is Sun's standard tool for performance data collection: GUDS - A Script for Gathering Solaris Performance Data (Doc ID 1285485.1)
• Explorer: http://www.oracle.com/us/support/systems/premier/services-tools-bundle-sun-systems-163717.html
Important: run guds while the system is exhibiting the performance degradation.
Guds - What It Collects
vmstat, mpstat, iostat (zpool iostat), kstat, ps, prstat, sar, lockstat, ipcs, netstat, nfsstat, memstat (level 3), kmastat (level 3), threadlist, pbind, psrset, psrinfo and much more...
Guds - Options
Guds is a shell script. When run interactively it prompts you for various options; one can also pass options as arguments:
guds -q -s SR_NUM -c 15 -i 15 -n 5 -w 0 -X 2
Where:
q - quiet; s - directory name; c - count; i - interval; n - sets (each set: 15 samples at 15-second intervals)
w - how long to wait before starting the next set
X - extended option, collecting additional data:
level 0: nothing additional
level 1: lockstat contention and hold events
level 2: level 1 + trapstat, lockstat profiling events, threadlist, tnf tracing (RECOMMENDED)
level 3: level 2 + kmastat, kmauser, memstat
vmstat - Virtual Memory Stats
• Reports system-wide virtual memory and CPU usage statistics
• The data can be helpful in finding:
• Workload profile: CPU, memory or IO bound
• Virtual/physical memory requirements
• System-wide CPU utilization
• Paging activity
Note: due to microstate accounting, CPU utilization may appear inflated on Solaris 10 when compared to previous releases
vmstat: r - threads in the CPU run queue
• Threads ready or waiting to run on a CPU: a measure of CPU saturation!
• Threads are prioritized according to their scheduling priority and class. Threads running in the kernel have higher priority than threads running application code. Exception: RT class
• Guds captures "prstat-ml.out" data. Column "LAT" represents scheduling latency: the percent of time a process/thread spent waiting on a CPU dispatch queue
vmstat: b - blocked for IO
• Threads waiting for IO completion are counted as blocked threads; only disk IO is considered
• Guds captures tnf (prex) traces containing IO probes, useful for analyzing application IO patterns, latency and block events
Note: iostat is a better tool for monitoring IO latencies
vmstat: w - swapped-out threads
• When the VM threshold freemem drops below minfree, pages of idle processes are pushed out (swapped out) to the swap device
• Multiple factors can contribute to memory pressure:
• The memory requirement of the workload has grown
• An application or kernel memory leak, or excessive tuning
• Tools to use:
Application memory allocation: pmap(1), ps(1), prstat(1)
Application memory leak: libumem(3LIB); see the sketch below
Kernel memory allocation: kstat, kmastat, memstat
Kernel memory leak: set kmem_flags in the /etc/system file
NOTE: upgrading to Solaris 10 can increase application memory footprints due to the lpoob (large pages out-of-the-box) feature
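A sketch of the libumem workflow mentioned above (the program name ./app is illustrative): preload the library with debugging enabled, then ask mdb for leaks:

$ LD_PRELOAD=libumem.so.1 UMEM_DEBUG=default UMEM_LOGGING=transaction ./app &
$ mdb -p <PID>
> ::findleaks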
swap - swap space available
• Swap is the backing store for anonymous memory such as process stack, heap and COW pages
• Virtual swap is reserved when a process calls malloc()/sbrk()
• Anon pages are allocated when the virtual page is faulted, resulting in a physical page mapping
• Physical swap is allocated when anon pages are paged out to the swap device: a memory shortage
• vmstat -p - page-in/out activity to the swap device
• vmstat and swap -s - available virtual swap
• swap -l - physical disk swap, not virtual swap
vmstat: free - free memory
• The freemem counter in the kernel keeps track of free pages in the system. Both free-list and cache-list pages are counted:
• Free list: pages that have no identity (no vnode and offset information)
• Cache list: unmapped, non-dirty file system pages with a valid vnode/offset
• memstat reports system memory usage as: kernel, anon, application (libraries), page cache (modified), free-list and cache-list pages
• kstat -n system_pages reports memory usage
vmstat: re - pages reclaimed
• The cache list contains free file system pages that still have a valid vnode/offset mapping. A page is mapped back into a process address space if it is found on the cache list. This column is updated whenever a page is reclaimed from the cache list
vmstat: mf - minor faults
• Minor fault: an attempt by a process/thread to access a virtual memory location that resides within a segment and whose page is in physical memory, but for which no MMU translation is established
• Major fault: an attempt to access a virtual memory location that is mapped by a segment but does not have a physical page of memory mapped to it, and the page does not exist in physical memory. For a major fault, the kernel has to either create a new page (in the case of first access) or retrieve the page from backing store (paged out due to memory shortage)
NOTE: a segmentation violation occurs if the faulted virtual address is not mapped by any segment of the process
vmstat: pi/po - kilobytes paged in and out
• File system and program text pages are paged in from disk
• Anonymous pages, such as process stack and heap, are allocated dynamically and thus have no reference (name) in the file system; anon pages use the swap device as their backing store
• One can monitor page-in and page-out activity per page type using vmstat -p
vmstat: fr - kilobytes freed
• Amount of memory freed by the page daemon. When the freemem counter (number of free pages) drops below the lotsfree threshold, the page daemon starts scanning pages in an attempt to free them to meet memory demand
• Not all scanned pages can be freed: a page is freed only if it is not referenced between the clearing and the checking of its reference bit
• Page scanner:
• Front hand: sets the reference bit
• Back hand: checks the reference bit
• The distance between the scanner's front and back hands is set by handspreadpages (1/2 of memory or 8,192 pages)
• vmstat -p reports the types of pages freed: apf, epf and fpf
vmstat: de - anticipated short-term memory shortfall
• A small buffer factor, normally set to zero or a small number, that allows the scanner to free a few pages above the lotsfree threshold in anticipation of further memory requests. When the scanner starts freeing pages, it stops once free memory (freemem) rises above lotsfree + de. lotsfree is set to 1/64 of total memory
vmstat: sr - pages scanned by the page daemon
• The number of pages scanned by the scanner (page daemon): a sign of memory shortage. The page scanner is activated when freemem < lotsfree (the thresholds can be inspected with kstat; see the sketch below)
• The scanner starts scanning at a rate of slowscan (100 pages/second). If free memory continues to drop below the desfree threshold (lotsfree/2), the scan rate increases toward fastscan (64MB worth of pages or 1/2 of memory)
• Guds captures "prstat-ml.out" data. Column "DFL" represents the percentage of time the process spent processing data page faults: a sign that memory shortage (or sometimes the coalescing of large pages) is affecting application performance
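For example, the scanner thresholds on a running system can be read from the standard system_pages kstat (the field names shown are its documented statistics):

# kstat -p unix:0:system_pages | egrep 'freemem|lotsfree|desfree|minfree|slowscan|fastscan'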
vmstat: in - device interrupts
• Interrupts are normally initiated by hardware devices such as disk, network card, tape, etc. to signal completion of an I/O or some other activity
• Interrupts are assigned a CPU and a level (1-15) at boot time; a higher level means higher priority
• Higher-priority interrupts mask lower-priority ones by setting the PIL
• Tools for displaying interrupts:
• kstat -n intrstat - count of and time spent in interrupts per CPU
• intrstat <interval> - interrupt activity per device
• echo ::interrupts|mdb -k - interrupt bindings/numbers
See article: 209619 - Busy Interrupts, problems and solutions
vmstat: sys - system calls
• This is how an application interfaces with the kernel and requests access to kernel-managed resources and services, such as allocating memory (malloc()) or reading/writing a file (read()/write())
• All system calls invoke a special SPARC trap instruction called Tcc, which changes the processor execution mode from user to kernel mode, among other things
• The syscall() handler passes back the return value (or sets errno) and switches the processor mode from kernel back to user at syscall completion. That returns execution control to the calling process, which then resumes
• truss(1) and tnf (prex) report the type and duration of the system calls generated by an application; DTrace can also be used
vmstat: us - CPU user time
• CPU time spent running user code that doesn't require kernel resources. When an application manages its own memory, the CPU runs in user mode and the CPU time is charged as user time
• Example: Oracle manages its database blocks in a shared memory segment without kernel overhead
• A process is charged kernel time when it makes a system call
• Example: semaphores used to synchronize access to shared memory resources, and reading/writing database blocks to disk, require kernel resources
• Application-level profiling, such as Oracle statspack or AWR reports and DTrace, should be used to identify the user-level routines dominating the CPU
vmstat: sy - CPU system time
• CPU time spent running kernel code. The kernel overhead of running a workload varies from application to application
• Kernel events such as page faults, device interrupts, thread migration, kernel lock contention, system calls, xcalls, etc. are serviced in kernel mode
• lockstat(1M) can be used to enable kernel profiling to find the dominant kernel routines:
lockstat -kIW -i 977 -D 20 -s 40 sleep 10
vmstat: idl
• When there are no runnable threads, the CPU calls the idle routine or runs the idle loop, depending on the platform
• The idle routine peeks into other CPUs' dispatch queues to find any runnable thread that can be migrated
• Benefit: load-balances runnable threads across all CPUs
• Downside: cold cache, resulting in CPU cache misses
• If there is nothing to do, the CPU core/strand is halted/parked to save power. Machine-specific halt code is invoked by idle()
• For CMT servers, idle() may not give a complete picture. Use corestat, which reports CPU core usage: http://blogs.sun.com/roller/resources/travi/corestat_v1.0.tar.gz
mpstat
• Displays per-CPU kernel stats
• Useful to find:
• Whether the workload needs more or faster CPUs: single- vs. multi-threaded application
• The type of kernel overhead generated by the workload
• High kernel lock contention that can hurt system performance and scalability
• High kernel-mode CPU usage can potentially degrade application performance, considering kernel threads have higher priority than user threads
minf - minor faults per CPU; mjf - major faults per CPU
• Same as reported by vmstat (mf column), except these are per-CPU stats instead of system-wide:
• Minor fault: an attempt by a process/thread to access a virtual memory location that resides within a segment and whose page is in physical memory, but for which no MMU translation is established
• Major fault: an attempt to access a virtual memory location that is mapped by a segment but does not have a physical page of memory mapped to it, and the page does not exist in physical memory. For a major fault, the kernel has to either create a new page (first access) or retrieve the page from backing store
xcal - inter-processor cross-calls
• Cross calls in the kernel serve two functions:
• To implement inter-processor interrupts, used for activities like dispatcher preemption and starting/stopping a process/thread on another processor via the proc(1M) interface
• To synchronize the MMUs on all CPUs when a process exits and user or kernel pages in the TLB caches need invalidating
intr - device interrupts; ithr - interrupts as threads
• Same as reported by vmstat (in column), except these are per-CPU stats instead of system-wide
• Tools for displaying interrupts:
• kstat -n intrstat - count of and time spent servicing interrupts per CPU
• intrstat <interval> - monitors interrupt activity per device
• echo ::interrupts|mdb -k - displays device interrupt bindings and interrupt numbers
See knowledge article: 209619 - Busy Interrupts, problems and solutions
csw: voluntary context switches; icsw: involuntary context switches
• The process of moving threads on and off the CPU is called context switching
• A thread context switches voluntarily when it blocks on a condition variable or a synchronization primitive
• Preemption by a higher-priority runnable thread, or time-quantum expiration, triggers an icsw. Frequent icsw can hurt application performance
• Use prstat -mL to see if the process of interest is experiencing high icsw; use nice(1) to change the process priority or priocntl(1) to change its scheduling class (see the sketch below)
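For example (a sketch: the PID and priority values are illustrative), to move a process into the fixed-priority (FX) class so the dispatcher stops adjusting its priority:

# priocntl -s -c FX -m 30 -p 30 -i pid <PID>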
migr - thread migrations
• The scheduler uses thread migration to load-balance runnable threads evenly across all CPUs
• The scheduler's use of soft affinity keeps threads running on the same CPU to achieve better cache hit rates
• A thread is migrated if it has been sitting in a CPU dispatch queue for over 3 ticks, controlled by the rechoose_interval kernel tunable
smtx: spins on mutex locks; srw: spins on reader/writer locks
• Mutex locks are the most commonly used locks in the Solaris kernel for exclusive access to a shared resource or a critical section of code
• Reader/writer locks: multiple-reader, single-writer locks: concurrent access for readers, exclusive access for the writer
• High kernel lock contention hurts scalability. lockstat(1M) is the tool for monitoring kernel lock contention events:
lockstat -CcwP -n 50000 -D 20 -s 40 -o file sleep 10
NOTE: for user application lock contention, use plockstat(1M)
syscl - system calls
• Same as reported by the vmstat sy column, except mpstat reports the number of system calls per CPU instead of system-wide
• Use truss(1), prex(1) and DTrace to capture the number, type and duration of the system calls generated by the workload
CPU usage: usr, sys, wt, idl
• Similar information to that reported by vmstat, except per-CPU instead of system-wide
• The truth about WAIT IO (wt):
• The wt column is updated when a thread running on a CPU blocks for an IO, resulting in a context switch. A blocked thread doesn't consume any CPU, so the wait IO value should be regarded as idle time: it is calculated only when a CPU is idle and there is a pending IO. That is the reason it is hard-wired to zero on Solaris 10. vmstat also counts wait IO as idle time
• See knowledge article: 205017 - How Wait I/O is calculated and what it means
iostat - displays IO stats
• Points out uneven IO distribution
• Converts internal kstat names to useful names, such as:
• sd0 -> c0t0d0
• nfs1 -> hostname:/export
• Can monitor I/O to the swap device; IO to the swap device indicates memory shortage
• Displays DiskSuite metadisk stats
• Can measure the average I/O size used at the sd/ssd layer
• Displays SCSI soft, hard and transport errors down to the partition level
r/s - number of read IOPS; w/s - number of write IOPS
• The number of read/write operations per second completed by the device
• Helpful in monitoring the type of I/O (read/write) generated by the workload. Can identify the typical IO size used by the sd/ssd layer
• To find the application's IO type, access pattern and size, use truss(1), prex(1) and DTrace:
# truss -t read,write,pread,pwrite,kaio -p <pid>
kr/s - kilobytes read per second; kw/s - kilobytes written per second
• Kilobytes read/written per second
• A high number of bytes written is not an issue if the service time stays low: the 10ms - 25ms range
wait - transactions in the driver queue; actv - outstanding transactions
• Transactions waiting in the device driver (sd/ssd) queue. Throttle values (sd_max_throttle) at the LUN and HBA levels limit outstanding transactions. Exceeding this throttle limit forces the driver to queue transactions until the count drops below the threshold
• Non-zero values are an indication of a slow or over-stressed disk subsystem. Consider a balanced data layout to reduce the load
See more about throttling: https://blogs.oracle.com/chrisg/entry/throttling_disks
svc_t - average service time (ms)
• Transaction completion time; depends on the driver (wait) and device (actv) queue lengths
• Service time depends on:
• Residence time: driver/device queue length
• Response time: mechanical head movement to read the requested block
NOTE: the large built-in caches in RAID boxes and high-RPM disks can improve response time
%w - driver queue is non-zero; %b - device queue is non-zero
• If you continue to see:
• %w (% of time the driver queue is occupied) > 70%
• %b (% of time the device queue is occupied) > 70%
consider distributing the load across other disks and LUNs
sar - system activity reporter
• Records historical performance data; valuable in identifying usage patterns
• Enabled via smf(5) and customized by cron(1):
svcadm enable system/sar
crontab -e sys
• It collects some useful data:
• cpu (-u, -q), virtual memory (-g, -p), and IO usage (-d)
• kernel (-k), physical memory (-r) and file system buffer cache usage (-v, -b)
NOTE: several stats reported by sar are outdated; better tools are available!
kstat - display kernel and driver stats
• Stats are reported by kernel drivers and modules
• Each stat is identified by its module, instance number, name and class fields; display kstats by specifying class, instance, module and name (see the example below)
• Monitoring tools such as vmstat(1), mpstat(1), iostat(1) and netstat use the kstat(3KSTAT) library (libkstat) functions to extract these kernel stats
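For example, a single statistic can be addressed with the full module:instance:name:statistic quadruple (the output value shown is illustrative):

# kstat -p cpu_info:0:cpu_info0:clock_MHz
cpu_info:0:cpu_info0:clock_MHz  1200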
kstat - useful kstats
• kstat -n system_pages: reports system memory stats: free memory (freemem), kernel size (pp_kernel), locked pages, etc.
• kstat -n arcstats: ZFS cache stats: the amount of kernel memory allocated by the ZFS cache, cache hits, etc.
• kstat -n eri0: network interface stats: nocanput, norcvbuf, link speed, etc.
NOTE: cpu, virtual memory, nfs, device interrupt, etc. stats are all available via kstat
prstat
• Reports process stats similar to ps(1)
• Improved sorting capabilities: by cpu usage, virtual and resident memory usage, execution time and priority
• Can report only processes bound to a processor set (-C), project (-j) or zone (-z, -Z). Can also report per-LWP statistics (-L)
• Microstate (-m) process accounting: the stats reported are similar to mpstat(1), but per process instead of system-wide
prstat
• The microstate accounting data reported with the -m option is useful to find:
• LAT - % of time the process is waiting in a CPU dispatch queue
• LCK - % of time the process is waiting on user-level locks. Prior to Solaris 10, the only way to monitor user-level lock contention was to profile the libraries and the application; plockstat(1M) in Solaris 10 can be used to monitor lock contention in user code
• DFL - % of time the process is waiting on page faults: a sign of memory shortage
• ICX - number of involuntary context switches, which can hurt application performance. Consider changing the process class (FX, FSS) or increasing its priority. Consider the Solaris 10 resource management features: zones, projects, resource pools, cpu shares
lockstat
• Reports kernel lock and profiling statistics; gathers kernel lock contention, hold and profile events
• When mpstat (smtx) reports values > 1000/cpu, consider analyzing kernel lock contention events. This will show the contended kernel locks and the routines actively trying to acquire them
• Analyze profile events when the cpu %sys value in vmstat/mpstat is constantly over 40%. This will report the dominant kernel routines
See knowledge article: 208158 - A primer on lockstat
trapstat
• Reports processor exceptions due to TLB misses on UltraSPARC systems: the number of TLB misses and the time spent servicing them
• Reports the MMU page size seeing the TLB misses. The objective is to reduce user-mode data TLB (dtlb-miss) misses, which map to the process heap and stack segments. The Solaris 10 lpoob feature, or ppgsz(1) on Solaris 9, can be used to change the process page size
• Reports misses in user and kernel mode (-t); the -T option breaks down TLB misses per page size (see the example below)
• Page size information can be dumped using: pmap -xs <pid>
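For example (the interval and count are illustrative), a per-page-size breakdown of TLB misses over one 10-second sample:

# trapstat -T 10 1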
plockstat
• Reports lock contention in user code and libraries
• Request that the customer collect application lock contention data via plockstat(1M) when the prstat LCK column reports values > 50%: the % of time user threads wait on user-level locks. lockstat(1M) monitors kernel lock contention only
• Example: watch contention and hold events (-A) for a minute (-e 60) and dump the stack (-s):
plockstat -A -s 20 -v -e 60 -p <pid>
truss - system call tracing
• Reports syscall types, counts (-c), duration (-E), the start of a system call (-d) and the time delta (-D) between two events
• Reports the syscall return value, or errno in case of error, for the process being traced
• Displays the contents of structures passed as arguments (-v), where possible
• Example: poll() timeout value, stat structure and read/write buffers
• System calls can be traced across fork() (-f) and per LWP (-l)
• A process can be stopped (-T) on a system call for later debugging
• It can also be used to trace library calls in a user application:
truss -u libc:: <pid>
truss -d -u "*::*malloc*" <pid>
Note: higher overhead! Consider using DTrace and prex(1) instead
proc tools
• Collect process information using proc(4) (/proc)
• Print the process's user-level stack (pstack), address space mappings (pmap), and open file descriptors (pfiles)
• pfiles also prints the peer IP addresses of socket fds
• A process can be stopped (pstop) and resumed (prun) for debugging purposes. ptime can be used for timing a command
• Print the dynamically linked libraries (pldd) and signal dispositions (psig) of the process
• prstat(1) and ps(1) use /proc for collecting process information. The /proc interface has a higher overhead:
http://blogs.sun.com/clive/entry/too_much_proc_is_bad
tnf (prex)
• TNF probes are static probes embedded in the kernel code to aid analysis
• TNF probes can be enabled and traced using prex(1). The pfilter command in prex can be used for PID-specific tracing
• TNF probes can be used to trace:
• System calls, block events, and paging activities
• Dispatcher activities: ONPROC, RUN, SLEEP
See knowledge article: 228769 - Solaris OS: All TNF probes in the kernel
ipcs
• Reports inter-process communication (IPC) statistics for semaphores (-s), shared memory (-m) and message queues (-q)
• The most commonly used IPC mechanisms are shared memory and semaphores. For example:
• Shared memory is used for sharing and caching data blocks among multiple database processes
• Semaphores are used to synchronize the threads/processes updating data blocks in a shared memory segment
• ipcs reports the shared memory key, the size of the shared memory segment, the number of processes attached to the segment, and the segment type (ISM or SHM), among other useful stats
See the knowledge article about setting Oracle shared memory and semaphore tunables: 208623 - Solaris OS: System V IPC resource controls
mdb: memstat and kmastat
• memstat reports system memory usage divided into four major pools:
• Application: anon, executable memory
• Kernel: pages attached to the kernel vnode (kvp)
• File system cache: dirty and mapped pages
• Free memory: free-list and cache-list pages
• kmastat reports the memory allocated in the various kernel caches used by kernel modules
• Collect these stats when high kernel memory allocation or a leak is suspected
• Set kmem_flags to enable kernel memory auditing to find the driver/module allocating kernel memory, and other useful information
Conclusion
• Guds is the first-pass tool for collecting baseline data for a performance issue
• It helps identify latent risks and the system-wide resource usage of the workload
• Guds has some overhead. It is a one-shot attempt to gather potentially useful data
• When analyzing performance issues, use a top-down approach instead of bottom-up. Focus on business metrics: response time, throughput. Ignore cpu utilization, IO service time, and random peaks in performance graphs
Observing the Solaris IO Stack
• Answer one question: where on the object?
• What objects have we got?
• Then we can rule the IO system in or out:
• As the source of poor performance
• As a place where we can invest to improve performance
Section 1
Observing System Calls
What is a.....
• System call? The method by which an application program calls into the kernel
• An LWP? Light Weight Process: the entity every process has that provides a kernel thread for the kernel to schedule. Each process can have more than one
What system calls do IO?
• Synchronous IO: read(), write(), readv(), writev(), pread(), pread64(), pwrite(), pwrite64()
• mmap() allows IO to happen but does not do it itself
• Asynchronous IO: aio_read(), aio_write(), aioread(), aiowrite(), lio_listio()
Observing the system call interface
• Using truss to look at all the LWPs in a process:
# truss -t pread,pwrite -dDEp 1175
Base time stamp: 1223475669.2599 [ Wed Oct 8 15:21:09 BST 2008 ]
/4: 0.0134 0.0134 0.0006 pread(0, " U U U U U U U U0000".., 131072, 14811136) = 131072
/6: 0.0162 0.0162 0.0005 pread(0, "AAAAAAAAAAAAAAAA 2 2 1 1".., 131072, 0x03900000) = 131072
/8: 0.0160 0.0160 0.0003 pread(0, " U U U U U U U U0000".., 131072, 0x03CE0000) = 131072
/10: 0.0248 0.0248 0.0003 pread(0, " U U U U U U U U0000".., 131072, 0x05EC0000) = 131072
/5: 0.0251 0.0251 0.0004 pwrite(0, " U U U U U U U U0000".., 131072, 0x04B80000) = 131072
/4: 0.0269 0.0135 0.0003 pread(0, "AAAAAAAAAAAAAAAA 2 2 1 1".., 131072, 4980736) = 131072
/7: 0.0290 0.0290 0.0003 pwrite(0, " U U U U U U U U0000".., 131072, 0x016E0000) = 131072
/6: 0.0356 0.0194 0.0009 pread(0, " U U U U U U U U0000".., 131072, 0x01C80000) = 131072
• The LWP number is listed in the first column when there is more than one LWP in the process
• The -d flag produces the first time column: the time since the "Base time stamp"
• The -D flag produces the second column: the delta time since the last traced event for that LWP (for LWP 4 above: 0.0269 - 0.0134 = 0.0135)
• The -E flag produces the third column: the elapsed time of the system call itself
Observing the system call interface
• A full truss (tracing all system calls) gives better timing results:
# truss -t all,pread,pwrite -dDEp 1175
Base time stamp: 1223477049.1000 [ Wed Oct 8 15:44:09 BST 2008 ]
/1: 0.0271 0.0271 0.0000 lwp_unpark(6) = 0
/1: 0.0299 0.0028 0.0000 lwp_unpark(7) = 0
/7: 0.0299 0.0299 0.0000 lwp_park(0x00000000, 0) = 0
/7: 0.0305 0.0006 0.0003 pwrite(0, " U U U U U U U U0000".., 131072, 0x02B80000) = 131072
/1: 0.0918 0.0619 0.0000 lwp_unpark(8) = 0
/8: 0.0918 0.0918 0.0000 lwp_park(0x00000000, 0) = 0
/8: 0.0933 0.0015 0.0003 pread(0, " U U U U U U U U0000".., 131072, 0x05B00000) = 131072
/1: 0.0975 0.0057 0.0000 lwp_unpark(10) = 0
/10: 0.0975 0.0975 0.0000 lwp_park(0x00000000, 0) = 0
/6: 0.0981 0.0981 0.0000 lwp_park(0x00000000, 0)
• With all calls traced, the lwp_park/lwp_unpark pairs show how long an LWP waited before issuing its IO: LWP 8 is unparked at 0.0918 and issues its pread at 0.0933
Observing the system call interface with DTrace
• Using the syscall provider allows you to look at the whole system
• Combined with the fds[] array, you can watch individual files:
syscall::read*:entry, syscall::write*:entry,
syscall::pread*:entry, syscall::pwrite*:entry
/fds[arg0].fi_pathname == "/var/tmp/xxxx"/
{
        self->fd = arg0;
        self->start = timestamp;
}

syscall::read*:return, syscall::write*:return,
syscall::pread*:return, syscall::pwrite*:return
/self->start/
{
        @[fds[self->fd].fi_pathname] = quantize(timestamp - self->start);
        self->fd = 0;
        self->start = 0;
}
Observing system calls with DTrace
$ pfexec /usr/sbin/dtrace -s prw.d -n 'tick-10ms { exit(0) }'
dtrace: script 'prw.d' matched 18 probes
dtrace: description 'tick-10ms ' matched 1 probe
CPU ID FUNCTION:NAME
1 78594 :tick-10ms

/var/tmp/xxxx
value ------------- Distribution ------------- count
32768 | 0
65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 9
131072 | 0
KAIO presents a slight challenge
• The system call is kaio, which is not documented
• The script has to store state
• The documented interfaces are: aioread()/aiowrite(), aio_read()/aio_write()
• They can be traced using fbt probes, but this may not be portable
• They can be traced using the DTrace pid provider:
• would be portable
• can only trace a single process (see the sketch below)
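A sketch of the pid-provider approach (the PID is a placeholder; leaving the module field empty matches the aioread/aiowrite entry points wherever they live in your release's libraries):

# dtrace -n 'pid$target::aioread:entry, pid$target::aiowrite:entry { @calls[probefunc] = count(); }' -p <PID>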
KAIO presents a slight challenge
/* The arw() entry probe records the start time, direction and file for each request */
fbt::arw:entry
{
        kaios[pid, args[5]] = timestamp;
        rw[pid, args[5]] = args[6];
        file[pid, args[5]] = fds[args[0]].fi_pathname;
}

/* aiowait() returns a pointer to the completed request's aio_result_t; capture it */
fbt::aiowait:entry
{ self->rval = args[2]; }

fbt::aiowait:return
/self->rval/
{
        this->arg = (aio_result_t *)*(self->rval);
        self->rval = 0;
}

/* Match the completed result back to the stored state and aggregate the latency */
fbt::aiowait:return
/kaios[pid, this->arg] != 0/
{
        this->rw = rw[pid, this->arg] == 0x1 ? "aioread" : "aiowrite";
        @[this->rw, file[pid, this->arg]] = quantize(timestamp - kaios[pid, this->arg]);
        kaios[pid, this->arg] = 0;
        rw[pid, this->arg] = 0;
        file[pid, this->arg] = 0;
}
But the results are pleasing to the eye
aiowrite <none>
value ------------- Distribution ------------- count
8388608 | 0
16777216 | 6
33554432 |@@@ 52
67108864 |@@@@@@ 111
134217728 |@@@@@@@@@@@@@@@ 274
268435456 |@@@@@@@@@@@ 195
536870912 |@@@@ 64
1073741824 | 8
2147483648 | 0
Continued....
aioread <none>
value ------------- Distribution ------------- count
8388608 | 0
16777216 | 16
33554432 |@@@ 93
67108864 |@@@@@@@@ 282
134217728 |@@@@@@@@@@@@@@@@@ 611
268435456 |@@@@@@@@@@ 367
536870912 |@ 34
1073741824 | 0
Trace Normal Form (TNF)
• A good idea in Solaris 2.5
• If you have DTrace, forget TNF!
• However, if you don't, it can offer a view into the kernel:
# cat /tmp/syscall.tnf
buffer alloc 100m
enable name=syscall_end
enable name=syscall_start
trace name=syscall_end
trace name=syscall_start
ktrace on
# prex -k < /tmp/syscall.tnf
Type "help" for help ...
Buffer of size 104857600 bytes allocated
#
TNF
# tnfxtract /tmp/tnfbuffer
# tnfdump -x /tmp/tnfbuffer | head -10
probe tnf_name: "syscall_end" tnf_string: "keys syscall thread;file ../../sparc/os/syscall.c;line 797;"
probe tnf_name: "syscall_start" tnf_string: "keys syscall thread;file ../../sparc/os/syscall.c;line 494;"
---------------- ---------------- ----- ----- ---------- --- ------------------------ - ------------------------
Elapsed (ms) Delta (ms) PID LWPID TID CPU Probe Name Data / Description . . .
---------------- ---------------- ----- ----- ---------- --- ------------------------ - ------------------------
0.000000 0.000000 23129 1 0x30022c49920 0 syscall_end rval1: 0 rval2: 3298581834704 errno: 0
0.030750 0.030750 23129 1 0x30022c49920 0 syscall_start sysnum: 3
0.047741 0.016991 23129 1 0x30022c49920 0 syscall_end rval1: 0 rval2: 8193 errno: 0
0.098648 0.050907 23129 1 0x30022c49920 0 syscall_start sysnum: 6
0.140308 0.041660 23129 1 0x30022c49920 0 syscall_end rval1: 0 rval2: -2168717396 errno: 0
• Syscall numbers have to be matched with values from /etc/name_to_sysnum:
# awk '$NF == 3' /etc/name_to_sysnum
read 3
• Match up syscall_start and syscall_end events by PID and LWP
• TNF
 Remember to clean up:

# prex -k
Type "help" for help ...
prex> ktrace off
prex> disable name=syscall_end
prex> disable name=syscall_start
prex> untrace name=syscall_end
prex> untrace name=syscall_start
prex> buffer dealloc
buffer deallocated
prex> quit
#
• Section 2: Observing Target Drivers
• Observing target drivers
 iostat with per-sample timestamps:

# iostat -xnT d ssd286 1
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   12.7   99.7  754.5  604.0  0.0  1.9    0.0   16.7   0  23 c8t600A0B800019C911000015FB48BF9EA9d0
Mon Oct 13 14:48:24 2008
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.9  343.9   40.2 1331.9  0.0  2.0    0.0    5.7   0  21 c8t600A0B800019C911000015FB48BF9EA9d0
• Some easier maths

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   26.0    0.0 3014.2    0.0  8.2  2.0  315.8   76.9 100 100 c5d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   40.6    0.0 5006.0    0.0  8.2  2.0  202.5   49.3 100 100 c5d0

 100% busy and 100% wait make the maths easier
 The device held 2 active commands for the whole second
 A service time of 49.3 ms means each slot completes 1000 / 49.3 = 20.3 commands/sec
 Two slots give 2 * 20.3 = 40.6 IO/sec, which matches the r/s column
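The same arithmetic as a throwaway one-liner; this is just Little's law (IOPS = outstanding commands * 1000 / service time in ms), with the actv and asvc_t values from above fed in by hand:

$ echo "2.0 49.3" | awk '{ printf("%.1f IO/sec\n", $1 * 1000 / $2) }'
40.6 IO/sec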
• How fast is the IO layer?
 DTrace io provider:

pfexec /usr/sbin/dtrace -n '
io:::start
/ args[1]->dev_statname == "ssd286" /
{
        start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}
io:::done
/ start[args[0]->b_edev, args[0]->b_blkno] /
{
        @[args[1]->dev_statname] =
            quantize(timestamp - start[args[0]->b_edev, args[0]->b_blkno]);
}' -n 'tick-1s { printf("%Y", walltimestamp); printa(@); clear(@) }'
• How fast is the IO layer?
 The same LUN's IO times (nanoseconds):

  0  65784                        :tick-1s 2008 Oct 13 14:48:24
  ssd286
         value  ------------- Distribution ------------- count
        262144 |                                         0
        524288 |@                                        14
       1048576 |@@@@@@                                   52
       2097152 |@@@@@@@@@                                85
       4194304 |@@@@@@@@@@@@@@                           133
       8388608 |@@@@@@@                                  64
      16777216 |@@                                       19
      33554432 |                                         3
      67108864 |                                         4
     134217728 |                                         3
     268435456 |                                         0

 Worst case took between 0.13 and 0.27 seconds!
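Before reaching for quantize, a cheaper variant of the same idea can be useful; this is a sketch that reports one average latency figure per device every ten seconds, keying on the raw buf pointer (arg0), which is the same in io:::start and io:::done:

pfexec /usr/sbin/dtrace -qn '
io:::start { ts[arg0] = timestamp; }
io:::done
/ ts[arg0] /
{
        @[args[1]->dev_statname] = avg(timestamp - ts[arg0]);
        ts[arg0] = 0;
}
tick-10s { printa("%-30s %@d ns avg\n", @); clear(@); }'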
• Wait Qs and Throttles
 LUNs can only handle a finite number of commands in parallel
 The target driver (sd(7D), ssd(7D), cmdk(7D)) throttles the number of commands sent to a LUN
 sd(7D) has a global throttle value and a per-LUN throttle value

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   60.0  232.0  270.0 1136.1 18.3 13.2   62.8   45.4  80  85 c0t1d0

 For 80% of the sample time the driver queued IO
 With an average wait queue length of 18.3
 And an average queue time of 62.8 ms
 Total average latency (wsvc_t + asvc_t) is over 0.1 s
• Maximum queue depth?

$ pfexec /usr/sbin/dtrace -n 'fbt:sd:sd_start_cmds:entry { @[arg0] = max(args[0]->un_ncmds_in_driver) }'
dtrace: description 'fbt:sd:sd_start_cmds:entry ' matched 1 probe
^C
    3298535600128               46
#

 A ready-made script: http://tinyurl.com/max-commands

# dtrace -qCs /var/tmp/max_sd.d -n 'tick-5sec { exit(0) }'
sd2 180
#
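The script behind that shortened URL is not reproduced here; a minimal sketch of the same idea, assuming the sd_lun members un_ncmds_in_driver and un_throttle (member names can differ between Solaris releases), compares the peak outstanding command count against the configured throttle per LUN:

pfexec /usr/sbin/dtrace -qn '
fbt:sd:sd_start_cmds:entry
{
        /* key both aggregations on the per-LUN soft state pointer */
        @peak[arg0]  = max(args[0]->un_ncmds_in_driver);
        @limit[arg0] = max(args[0]->un_throttle);
}
tick-5sec
{
        printa("lun %x: peak %@d of throttle %@d\n", @peak, @limit);
        exit(0);
}'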
• The evils of disksort
 By default, IO on the wait queue is sorted by starting LBA (logical block address)
 Let me show you an example (diagram on the next slide)
• Disk Queues
[Diagram: the target driver's wait queue and the drive's own queue, both holding requests for LBA N through LBA N+6 in ascending order, while a late-arriving request for LBA N-1 keeps being sorted behind them. Now what happens?]
• The danger of disksort
 Disksort can be disabled: http://tinyurl.com/scsi-conf
 And it should be off by default for "intelligent arrays", which reorder commands themselves
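As a sketch of the mechanism on Solaris 10 and later: a per-device entry in the target driver's .conf file turns disksort off. The "SUN     MyArray" string below is a placeholder; it has to match the device's INQUIRY data, with the vendor ID padded to eight characters:

# /kernel/drv/sd.conf (ssd.conf for the ssd(7D) driver)
sd-config-list = "SUN     MyArray", "disksort:false";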
• Why a throttle of 1 is particularly bad
 Each IO has to complete before the next can be transported
 So either the storage is busy or the interconnect is busy, never both
 Low throttles should be avoided where possible
 Where they can't be avoided, set them per LUN, not system-wide
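A sketch of the two knobs, assuming the common sd(7D) tunable names; the global /etc/system setting hits every sd LUN on the system, which is exactly what the slide above warns against, so the per-LUN form is preferred:

# /etc/system: global, applies to ALL sd LUNs (avoid if possible)
set sd:sd_max_throttle = 32

# /kernel/drv/sd.conf: per LUN ("SUN     MyArray" is a placeholder
# for the device's INQUIRY vendor/product strings)
sd-config-list = "SUN     MyArray", "throttle-max:32";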
• TNF?
 Again, avoid if you have DTrace:

# cat /tmp/io.tnf
buffer alloc 100m
enable name=strategy
enable name=biodone
trace name=strategy
trace name=biodone
ktrace on
# prex -k < /tmp/io.tnf
Type "help" for help ...
# tnfxtract /tmp/tnfbuffer
• Section 3: Observing Host Bus Adapters
• SCSA
[Diagram: the Sun Common SCSI Architecture (SCSA); the target driver hands commands to the host bus adapter driver via scsi_transport(), and completions come back through scsi_destroy_pkt().]
• Sun Common SCSI Architecture
 The interface between SCSI target drivers and HBA drivers
 Target drivers: sd(7D), st(7D), ses(7D), sgen(7D), etc.
 HBA (host bus adapter) drivers: isp(7D), qus(7D), mpt(7D), glm(7D), fcp(7D), iscsi(7D), qlc(7D), etc.
 Nexus drivers: scsi_vhci(7D)
 Also used to let USB, FireWire, etc. drivers reuse the SCSI target drivers: scsa2usb(7D), scsa1394(7D)
 See Writing Device Drivers (WDD) on docs.sun.com
• How to trace SCSA
 SCSI commands are sent via scsi_transport(9F)
 There is no single completion hook in SCSA; however, all the completion routines call scsi_destroy_pkt(9F):

fbt::scsi_transport:entry
{
        start_time[args[0]->pkt_scbp] = timestamp;
}

fbt::scsi_destroy_pkt:entry
/ start_time[args[0]->pkt_scbp] /
{
        @["times"] = quantize(timestamp - start_time[args[0]->pkt_scbp]);
        start_time[args[0]->pkt_scbp] = 0;
}
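As written, the fragment only prints its aggregation when dtrace exits; appending a tick probe (a sketch; scsa.d is an arbitrary name) produces the periodic one-minute report shown on the next slide:

/* append to the two clauses above, then run: pfexec dtrace -s scsa.d */
tick-1min
{
        printa(@);
        clear(@);
}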
• How to trace SCSA

CPU     ID                    FUNCTION:NAME
  0  48617                       :tick-1min
  times
         value  ------------- Distribution ------------- count
        524288 |                                         0
       1048576 |@                                        1
       2097152 |                                         0
       4194304 |@@@@@@                                   8
       8388608 |@@@@@@@@                                 11
      16777216 |@@@@@@@@@@@@@                            18
      33554432 |@@@@@@@@                                 11
      67108864 |@@@@@                                    7
     134217728 |                                         0
• scsi.d
 http://blogs.sun.com/chrisg/tags/scsi.d

pfexec /usr/sbin/dtrace -Cs scsi.d -D QUIET -D PERF_REPORT \
    -D REPORT_TARGET -D REPORT_LUN \
    -n 'tick-1m { printa(@); clear(@); exit(0) }'
Hit Control C to interrupt
qus 1
         value  ------------- Distribution ------------- count
        131072 |                                         0
        262144 |@@@@                                     25
        524288 |@@@@@@@@@@@@                             68
       1048576 |@@@@@@                                   34
       2097152 |                                         2
       4194304 |@@@                                      19
       8388608 |@@@@@                                    29
      16777216 |@@@@                                     22
      33554432 |@@@@@@                                   35
      67108864 |                                         1
     134217728 |                                         0
• scsi_vhci aka MPxIO
 scsi_vhci is a SCSA nexus driver that sits below the target driver
 iostat -Y breaks a LUN's traffic down by underlying path; here all the IO is going down one of the two paths:

# iostat -xnY ssd286 1
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   14.2    3.3  640.8   52.6  0.0  0.2    0.0    9.6   0  10 c8t600A0B800019C911000015FB48BF9EA9d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c8t600A0B800019C911000015FB48BF9EA9d0.t200600a0b819c870
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c8t600A0B800019C911000015FB48BF9EA9d0.t200600a0b819c870.fp3
   14.2    3.3  640.9   52.6  0.0  0.0    0.0    0.0   0   0 c8t600A0B800019C911000015FB48BF9EA9d0.t200700a0b819c870
   14.2    3.3  640.9   52.6  0.0  0.0    0.0    0.0   0   0 c8t600A0B800019C911000015FB48BF9EA9d0.t200700a0b819c870.fp0
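On releases that ship mpathadm(1M), the same picture is available directly from the multipathing layer; a sketch (the logical-unit name is copied from the iostat output above, and the s2 slice suffix is a guess):

# list all multipathed logical units with their path counts
pfexec /usr/sbin/mpathadm list lu

# show the state of each path to one logical unit
pfexec /usr/sbin/mpathadm show lu /dev/rdsk/c8t600A0B800019C911000015FB48BF9EA9d0s2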
• scsi_vhci aka MPxIO
 scsi.d on the same LUN:

$ pfexec /usr/sbin/dtrace -Cs scsi.d -D QUIET -D PERF_REPORT \
    -D REPORT_TARGET -D REPORT_LUN \
    -n 'tick-1m { printa(@); clear(@); exit(0) }'
Hit Control C to interrupt
• scsi_vhci aka MPxIO

scsi_vhci 0
         value  ------------- Distribution ------------- count
        131072 |                                         0
        262144 |@                                        588
        524288 |@@@@@                                    4807
       1048576 |@@@@@@                                   5423
       2097152 |@@@@@@@                                  6609
       4194304 |@@@@@@@@@                                8627
       8388608 |@@@@@@@@@                                8641
      16777216 |@@@                                      3289
      33554432 |@                                        1088
      67108864 |                                         239
     134217728 |                                         97
     268435456 |                                         2
     536870912 |                                         0
• Some NLAs (N-letter acronyms)
• SCSI: Small Computer System Interface
• CDB: Command Descriptor Block
• HBA: Host Bus Adapter
• DMA: Direct Memory Access
• WDD: Writing Device Drivers
• TNF: Trace Normal Form