Where to start? - the first 2 hours of performance troubleshooting
• The performance cheat sheet: cover all the basics before you start
• Data collection and mining the logs
• Common techniques to improve performance
2. About me
• Microsoft Premier Field Engineer
• Into SharePoint for ages
• Psychobilly enthusiast
• Eddy Merckx fanatic
3. A big thanks to our sponsors
4. Agenda
• The first minutes
• Common scenarios
• Should I virtualize?
• Hardware considerations
• SQL Server considerations
• Memory leaks for the admin (couldn’t help myself)
• Caching
6. QUESTIONS TO ASK
• Where is a bottleneck?
• Are all pages/sites/Web applications/servers affected?
• Any strange patterns?
• Is the issue intermittent?
• Does the issue occur for a subset of users?
• Any errors or unexpected status codes in any of the logs?
• Are there any customizations in place?
• Have any software boundaries and limits been breached?
• What does analysis of common performance counters show?
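One quick way to answer the status-code question above is to mine the IIS logs directly. A minimal sketch of scanning a W3C-extended-format log for 4xx/5xx responses (the field layout and log lines below are illustrative; real logs carry more columns):

```python
from collections import Counter

def count_error_statuses(lines):
    """Count 4xx/5xx responses per (status, url) in a W3C-format IIS log."""
    fields, errors = [], Counter()
    for line in lines:
        if line.startswith("#Fields:"):
            fields = line.split()[1:]        # e.g. date time cs-uri-stem sc-status ...
            continue
        if line.startswith("#") or not line.strip():
            continue                         # skip other directives and blank lines
        row = dict(zip(fields, line.split()))
        status = row.get("sc-status", "")
        if status.startswith(("4", "5")):
            errors[(status, row.get("cs-uri-stem", "?"))] += 1
    return errors

# Hypothetical sample log lines for illustration
sample = [
    "#Fields: date time cs-uri-stem sc-status time-taken",
    "2012-04-01 10:00:01 /sites/home/default.aspx 200 120",
    "2012-04-01 10:00:02 /sites/home/slow.aspx 500 9000",
    "2012-04-01 10:00:03 /sites/home/slow.aspx 500 8700",
]
print(count_error_statuses(sample))
```

Pages that appear repeatedly with the same error status are a natural starting point for deeper analysis.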
7. COMMON SCENARIOS – SLOW PAGE LOAD
• Issue: A single page is always slow to load, no other pages in the site are slow
• Likely causes:
• Poor custom code/customizations
• Page payload is large or has multiple round-trips
• A custom Web part is performing badly
• Operations involving large lists (most likely throttled)
• Caching is not working correctly for content served on the page
9. COMMON SCENARIOS – SLOW PAGE LOAD
• Issue: Multiple pages are slow to load but the issue is intermittent
• Likely causes:
• Poor custom code/customizations
• Page payload is large or has multiple round-trips
• A custom Web part is performing badly
• Operations involving large lists (most likely throttled)
• Caching is not working correctly
• Load balancer device incorrectly configured or a WFE is experiencing problems
• Load on WFEs is too high (could be NIC, CPU, memory etc.)
11. COMMON SCENARIOS – SLOW SITE
• Issue: A single site is consistently slow
• Likely causes:
• Poor custom code/customizations
• Page payload is large or has multiple round-trips
• A custom Web part is performing badly
• Caching is not working correctly
12. COMMON SCENARIOS – MULTIPLE SLOW SITES
• Issue: Multiple sites are consistently slow
• Likely causes:
• Poor custom code/customizations
• Web Application/Farm scoped customizations
• Caching is not working correctly
• SQL Server blocking due to large lists/databases
• Load balancer device incorrectly configured or a WFE is experiencing problems
• Load on WFEs is too high (could be NIC, CPU, memory etc.)
14. WEB ROLE
• Responsible for rendering of content
• Low amount of disk activity
• Multiple web role servers are common for redundancy and
scalability
• Best Practices
• Be sure to keep all components, applications, and patch levels the
same
• Network Load Balancing (NLB)
• Hardware NLB: offload load balancing to dedicated devices
• Software NLB: consumes CPU and network on the WFEs themselves
• For minimum availability, split your load-balanced virtual web servers over two physical hosts
15. QUERY ROLE
• Process search queries
• Requires propagated copy of the index
• 10%–30% of the total size of documents indexed
• Best Practice
• Large Indexes – Prefer dedicated physical LUN on SAN over dynamic
expanding virtual hard disk
• Don’t put your query and index servers on the same underlying
physical disk
• Combine or split Web/Query role?
• It depends on your environment.
• Web and Query performance requirements
16. INDEX ROLE
• Memory, CPU, Disk I/O and network intensive
• Best Practices
• Allocate it the most RAM of any front-end server
• Potentially keep it as a physical machine in larger environments
• Use the index server as a dedicated crawl server to avoid an extra network hop
• Use fixed-size VHDs or physical LUN on iSCSI SAN for best performance
17. OTHER ROLES
• Excel Services, PerformancePoint Services, Access Services, Visio Services, etc. are good
candidates for virtualization
• Additional servers can simply be added into the farm
• No additional hardware investment required
18. DATABASE ROLE
• SQL Server 2005/2008 virtualization is fully supported
• Memory, CPU, Disk I/O and network intensive
• Assess first using Microsoft Assessment and Planning Toolkit (www.microsoft.com/map).
• SQL Alias flexibility
• Argument for Physical:
• SQL Server is already a consolidation layer
• Disk I/O activity
• Performance, performance, performance!
• Longer response times impact ALL downstream roles in a SharePoint farm
19. DATABASE ROLE
• If you decide to virtualize database layer:
• Assign as much RAM and CPU as possible
• Offload the Disk I/O from the virtual machines
• Use fixed-size VHDs or physical LUN on an iSCSI SAN
• SQL Clustering: When virtualizing, consider making use of Guest
Clustering in Hyper-V
• SQL Database Mirroring: Fully supported in SharePoint 2010 in
physical or virtual database role environments
20. CPU BEST PRACTICES
PHYSICAL
• Performance is governed by processor efficiency, power draw and heat output
• Faster versus efficient processor – hidden power consumption cost
• Beware of built-in processor software such as performance throttling at thermal thresholds
• Prefer higher number of processors and multi core
• Prefer PCI Express to limit bus contention & CPU utilization
21. CPU BEST PRACTICES
VIRTUAL
• Configure a 1-to-1 mapping of virtual CPU to physical
CPU for best performance
• Be aware of the virtual processor limit for different
guest operating systems and plan accordingly
• Beware of “CPU-bound” issues: the ability of the processors to process
information for virtual devices determines the maximum throughput of
those devices. Example: virtual NICs
22. DISK BEST PRACTICES
PHYSICAL
• Ensure you are using the fastest SAN infrastructure: Attempt to provide each virtual
machine with its own IO channel to shared storage using dual or quad ported HBAs and Gigabit
Ethernet adapters.
• Use iSCSI SANs if considering guest clustering
• Ensure your disk infrastructure is as fast as it can be. (RAID 10; 15000 RPM) – Slow disk
causes CPU contention as Disk I/O takes longer to return data.
• Put virtual hard disks on different physical disks than the hard disk that the host operating
system uses
23. DISK BEST PRACTICES
VIRTUAL
• Prefer SCSI controller to IDE controller.
• Prefer fixed size to dynamically expanding
• Prefer direct iSCSI SAN access for disk-bound roles
• Beware of underlying disk read write contention between
different virtual machines to their virtual hard disks
• Ensure SAN is configured and optimized for virtual disk
storage. Understand that a number of LUNs can be provisioned
on the same underlying physical disks
24. NETWORK BEST PRACTICES
PHYSICAL
• Use Gigabit Ethernet adaptors and Gigabit switches
• To increase network capacity, add multiple NICs to the host
25. NETWORK BEST PRACTICES
VIRTUAL
• Ensure that integration components (“enlightenments”) are installed on
the virtual machine
• Use the Network Adapter instead of the Legacy Network Adapter when
configuring networking for a virtual machine
• Prefer synthetic to emulated drivers as they are more efficient, use a
dedicated VMBus to communicate to the Virtual NIC and result in lower
CPU and network latency.
• Use virtual switches and VLAN tagging for security and performance
improvement, and create an internal network between the virtual
machines in your SharePoint farm. Associate SharePoint VMs to the
same virtual switch.
26. IMPORTANT
• Understand the impact of your virtualization vendor feature set!
• Don’t let governance slip in your virtualized SharePoint environment
• Snapshots are not supported
• Beware of over subscribing host servers
• Do not exceed physical server RAM by more than 15% if using Hyper-V’s dynamic memory
• Host is a single point of failure
27. SQL SERVER CONFIGURATION
• Little or no configuration of SQL Server is a common problem that causes performance issues
• Optimize performance by:
• Pre-growing data files
• Setting growth factor to a fixed value not a percentage
• Optimizing storage configuration and RAID levels for databases
• Increasing the number of data files allocated for tempdb and content databases
• Providing a dedicated VLAN for SharePoint to SQL Server communications
• Setting max degree of parallelism (MAXDOP) to 1
• Providing additional SQL Server instances or servers
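The pre-grow, fixed-growth, and MAXDOP items above boil down to a handful of T-SQL statements. A sketch that generates them (the database and file names are hypothetical examples; review the output with a DBA before running it):

```python
# Generates the T-SQL for pre-growing a data file with a fixed (non-percentage)
# growth increment, and for setting max degree of parallelism (MAXDOP).
# All names and sizes below are illustrative assumptions, not product guidance.

def pregrow_statement(db, logical_file, size_mb, growth_mb):
    """ALTER DATABASE to pre-grow a data file and set a fixed FILEGROWTH."""
    return (f"ALTER DATABASE [{db}] MODIFY FILE "
            f"(NAME = N'{logical_file}', SIZE = {size_mb}MB, "
            f"FILEGROWTH = {growth_mb}MB);")

def maxdop_statements(value=1):
    """sp_configure sequence that sets max degree of parallelism."""
    return [
        "EXEC sp_configure 'show advanced options', 1; RECONFIGURE;",
        f"EXEC sp_configure 'max degree of parallelism', {value}; RECONFIGURE;",
    ]

print(pregrow_statement("WSS_Content_Intranet", "WSS_Content_Intranet", 51200, 1024))
for stmt in maxdop_statements():
    print(stmt)
```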
28. SQL SERVER MAINTENANCE
• SharePoint databases require constant maintenance; otherwise performance will degrade
• Performance issues frequently arise due to:
• Out-of-date statistics
• Fragmented indices
• There are Health Analyzer rules that are responsible for updating statistics and reorganizing or
rebuilding indices
• Ensure these are running frequently and set to repair automatically
29. MAXDOP
• SQL Server can use all available processors to execute queries in
parallel
• The product group has tested extensively with various settings and concluded that
suppressing parallelism is the most stable and best-performing option
• To suppress parallel plan generation, set max degree of parallelism (MAXDOP) to 1
31. AUTO_UPDATE_STATISTICS &
AUTO_CREATE_STATISTICS
• For SharePoint 2010, we recommend disabling AUTO_UPDATE_STATISTICS
• In SharePoint 2010, both should be disabled; for SharePoint 2007, it is
recommended to have them both enabled
• The product team introduced a new timer job called “Database statistics” which takes
care of updating the statistics for the databases
32. MEASURING PERFORMANCE
• What is deemed acceptable?
• Are there any agreed upon metrics?
• What are you trying to measure?
• Common examples:
• Requests per second (RPS)
• Page load time – Time-to-Last-Byte (TTLB)
• Measuring specific operations
• Indexing performance
• What are you hoping to prove?
• Are there any agreed upon tools for measuring performance?
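One of the metrics above, requests per second (RPS), can be derived directly from request timestamps in the logs. A minimal sketch (the timestamps are hypothetical):

```python
# Compute the average requests-per-second over the window spanned by a set of
# request timestamps, e.g. pulled from IIS logs. Illustrative only.
from datetime import datetime

def requests_per_second(timestamps):
    """Average RPS over the window spanned by the request timestamps."""
    times = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps)
    window = (times[-1] - times[0]).total_seconds() or 1.0  # avoid divide-by-zero
    return len(times) / window

stamps = ["2012-04-01 10:00:00", "2012-04-01 10:00:01",
          "2012-04-01 10:00:02", "2012-04-01 10:00:10"]
print(round(requests_per_second(stamps), 2))  # 4 requests over 10 seconds
```

Agreeing up front on how such numbers are computed matters as much as the numbers themselves.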
34. THE DETECTION
• Avoid Task Manager
• Track the Private Bytes counter
• A steady increase in the private bytes value indicates a memory leak
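"A steady increase" can be checked programmatically against sampled counter values, e.g. exported from Perfmon. A rough sketch; the threshold and sample values are arbitrary illustrative choices:

```python
# Flag a suspected leak when the Private Bytes counter grows between (almost)
# every consecutive sample, instead of the sawtooth a healthy GC'd process shows.

def looks_like_leak(samples, min_growth_ratio=0.95):
    """True if the counter grows in at least min_growth_ratio of the steps."""
    if len(samples) < 2:
        return False
    ups = sum(1 for a, b in zip(samples, samples[1:]) if b > a)
    return ups / (len(samples) - 1) >= min_growth_ratio

steady_rise = [100, 110, 125, 140, 160, 185]   # climbing private bytes (MB)
sawtooth    = [100, 140, 90, 150, 95, 160]     # GC-like ups and downs
print(looks_like_leak(steady_rise), looks_like_leak(sawtooth))
```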
35. WHERE IS THE $*%µ& MEMORY LEAK?
• Sample DebugDiag report excerpt:
Type: Warning
Description: GdiPlus.dll is responsible for 399.54 KBytes worth of outstanding allocations.
The following are the top 2 memory consuming functions:
GdiPlus!GpMalloc+16: 399.54 KBytes worth of outstanding allocations.
36. WHERE IS THE $*%µ& MEMORY LEAK?
• Monitor memory (# Bytes in all Heaps!)
• DisposeChecker again?
• Tweak ULS logs
• DebugDiag reports
• WinDbg, ADPlus and the SOS.DLL
37. CACHING
• A poor or no caching strategy may impact performance as
usage increases
• Caching will alleviate round-trips to SQL Server, increasing
performance by allowing content to be rendered quickly
• Three types of caches:
• BLOB cache
• Output cache
• Object cache
• Simply enabling caching is not enough, settings will need
tweaking based on planning and monitoring
38. BLOB CACHE
• Tools
• Fiddler/HttpWatch
• Procmon
• Perfmon
• DecodeBlob (2007)
• Avoid flushing the cache at all costs
• It causes performance issues due to write lock held during index writes
• Limit the BLOB cache using a more restrictive regex like:
((?<!_gif)\.gif|(?<!_jpg)\.jpg|(?<!_png)\.png|\.css|\.js)$
which excludes the specific image patterns *_gif.gif, *_jpg.jpg and *_png.png
• Or a regex like [/]shared documents[/].+\.(gif|jpg|png|css|js)$ to limit caching to a certain
library, subweb or site collection
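The lookbehind pattern above is worth sanity-checking before it goes into production. A quick sketch using Python's `re` (assuming case-insensitive matching against the URL path; the sample paths are hypothetical):

```python
import re

# The restrictive BLOB cache pattern from the slide: cache .gif/.jpg/.png/.css/.js,
# but skip files deliberately named *_gif.gif, *_jpg.jpg or *_png.png.
pattern = re.compile(r"((?<!_gif)\.gif|(?<!_jpg)\.jpg|(?<!_png)\.png|\.css|\.js)$",
                     re.IGNORECASE)

def is_cacheable(path):
    """True if the BLOB cache pattern would match (and therefore cache) this path."""
    return bool(pattern.search(path))

for p in ["/sites/a/logo.gif", "/sites/a/logo_gif.gif",
          "/style/main.css", "/pages/default.aspx"]:
    print(p, is_cacheable(p))
```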
39. BLOB CACHE
• ULS logs
• Enable Publishing Cache to Verbose
• 2010 has improved logging
• IIS logs (with time-taken) and client-side traces (Fiddler/HttpWatch)
• Cache-Control: public, max-age=86400
• 304 responses with if-none-match and Etag headers
• For streaming:
• Accept-Ranges: bytes header in the response
• Range: bytes=… header in the request, answered with 206 Partial Content and a Content-Range header
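The 304 revalidation flow above can be sketched as a tiny decision function: a client re-requests a cached file with If-None-Match, and if the ETag still matches, the server answers 304 with no body. The function and variable names are illustrative, not a SharePoint API:

```python
# Conditional GET: skip re-sending the body when the client's cached copy,
# identified by its ETag, is still current.

def conditional_get(request_headers, current_etag, body):
    """Return (status, body) for a conditional GET against a cached resource."""
    if request_headers.get("If-None-Match") == current_etag:
        return 304, b""        # client copy is still valid; send headers only
    return 200, body

print(conditional_get({"If-None-Match": '"v1"'}, '"v1"', b"page"))
print(conditional_get({}, '"v1"', b"page"))
```

A high proportion of 304s in the IIS logs is a sign the cache headers are doing their job.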
40. We need your feedback!
Scan this QR code or visit
http://svy.mk/sps2012be
Editor's Notes
When troubleshooting performance in a SharePoint 2010 environment, the following questions should be asked before you attempt any in-depth analysis with the troubleshooting tools.

Where is the bottleneck? You should identify where the bottleneck is occurring: is a single page, site, or an entire Web application affected? Or is the issue sporadic, indicating a server or disk sub-system issue? Once you have identified the scope of the bottleneck, you can start looking for patterns.

Any strange patterns? Does the issue occur every day at a fixed time? Or is it completely intermittent? Does it only affect a subset of users? Once you know both the scope of the issue and any patterns in its occurrence, you can start looking for the potential cause.

Any errors or unexpected status codes in any of the logs? This seems quite obvious; however, performance problems, especially in SharePoint, will often be masked by error messages that do not clearly explain why an entire site collection, or a single page, is slow.

Are there any customizations in place? As already covered, customizations are one of the key causes of performance issues in SharePoint. If customizations are in place, are other sites/pages with these customizations experiencing the same issues? What happens if you temporarily disable the customizations?

Have any software boundaries and limits been breached? Large lists, content databases, or generally any breached software boundary should be immediate cause for investigation. Large lists and content databases in particular are known to cause performance problems for SharePoint. What happens if you edit list views or split content databases?

What does analysis of common performance counters show? Are there any indicators of issues caused by CPU, memory, disk, or network bottlenecks? If so, what appears to be causing these? What about SQL Server-specific performance counters?
What: We will first try to investigate what type of memory leak it is: a managed memory leak or an unmanaged memory leak. How: What is really causing the memory leak? Is it a connection object, some kind of file whose handle is not closed, etc.? Where: Which function, routine, or piece of logic is causing the memory leak?
So the first thing we need to establish is the type of memory leak: is it a managed leak or an unmanaged leak? To detect which, we need to measure two performance counters. The first is the Private Bytes counter for the application, which we have already seen in the previous session. The second counter we need to add is ‘# Bytes in all Heaps’: select ‘.NET CLR Memory’ as the performance object, from the counter list select ‘# Bytes in all Heaps’, and then select the application that has the memory leak.
On a heavily accessed site, caching frequently accessed pages, objects in a page, and binary large objects for even a short amount of time can result in substantial throughput gains. For example, while a page is cached by the output cache, subsequent requests for that page are served from the output cache without executing the code that created it, for the specified duration of the cache. In the case of binary large objects, when a request for a file that is not cached is handled by a front-end Web server, the disk-based cache gets the file from SQL Server, saves it to disk, and serves the file to the client that requested it. Future requests for the same file that are handled by that front-end Web server are then served from the file that is stored on disk, instead of being served from SQL Server.

A well-planned caching strategy increases performance and available capacity on given sites. However, careful planning and monitoring are required in order to tweak cache settings correctly. For example, you can use the Publishing Cache Hit Ratio performance counter to monitor the cache hit ratio. You should aim for 90% or above and raise the memory allocated to the object cache if it is not meeting the target. However, a site with a lot of read/write activity should expect to have a lower cache hit ratio.