Master tuning

  1. 1. Thomas Kejser thomas@kejser.org http://blog.kejser.org @thomaskejser Super Scaling SQL Server Diagnosing and Fixing Hard Problems
  2. 2. Thomas Kejser • Formerly SQLCAT • Tuning SQL Server since 6.5 • 15+ Years of database experience • http://blog.kejser.org • CTO Fusion-io Europe
  3. 3. Image(s): FreeDigitalPhotos.net VS. VS.
  4. 4. Performance vs. Scalability: Response Time vs. Resource Use. Adding more of a HW resource makes things faster. You can scale without having performance (ex: Hadoop). You can perform without having scalability (ex: in-memory engines)
  5. 5. Our Reasonably Priced Server • 2 Socket Xeon E3645 • 2 x 6 Cores • 2.4 GHz • NUMA enabled, HT off • 12 GB RAM • 1 ioDrive2 Duo • 2.4TB Flash • 4K formatted • 64K AUS • 1 Stripe • Power Save Off • Win 2008R2 • SQL 2012 Image Source: DeviantArt
  6. 6. Between disk and Memory Core Core Core Core L1 L1 L1 L1 L3 L2 L2 L2 L2 1ns 10ns 100ns 100us 10ms10us
  7. 7. The “cache out curve” Data Size Throughput/thread Cache Size Service Time + Wait Time
  8. 8. NUMA Nodes CPU L 3 L 2 L 2 C C CPU L 3 L 2 L 2 C C Can I write? Bus Transfer Bus Transfer
  9. 9. There are several of these curves Throughput Touched Data Size CPU Cache TLB NUMA Remote Storage
  10. 10. Response time = Service Time + Wait Time Algorithms and Data Structures “Bottlenecks”
  11. 11. • DBA tasks • Installation of OS and SQL • Basic Memory Configuration • Basic Perfmon style monitoring • Backup/Restore and HA setup • Basic reading of a Query Plan • Basic understanding of database structures • Adding Indexes to tables • Running a Profiler trace What you ALREADY know
  12. 12. Below the Surface
  13. 13. What we Need • Free tools from MS • Windows SDK • In Win8: The “ADK” • Need .NET 4 to install
  14. 14. Where Did the Time Go? Service Time + Wait Time Xperf –on Base –f Base.etl SELECT TOP 100000 * FROM LINEITEM INNER JOIN ORDERS ON O_ORDERKEY = L_ORDERKEY SQLCMD –E –S. –i “Select.sql” Xperf –stop
  15. 15. BASE profile with xperf Service Time + Wait Time
  16. 16. Right Click – Summary Table Service Time + Wait Time
  17. 17. What exactly is SQLNCLI? Service Time + Wait Time
  18. 18. Quantifying just how stupid XML is SELECT TOP 1000000 * FROM ORDERS JOIN LINEITEM ON L_ORDERKEY = O_ORDERKEY FOR XML RAW ('OUTPUT') Xperf –on Base –f Base.etl With XML “Native” Format
  19. 19. Which CPU cycles are Expensive? “App” tier Web Server Licensing >3K USD Blades Database Tier Core Licensing >10K USD <XML> ? Service Time + Wait Time
  20. 20. • What about the time INSIDE the process? • What if the EXE won’t tell us? Diving even Deeper
  21. 21. What is a Debug Symbol? mov ax,10 mov bx,20 mov cx,3 push ax push bx push cx call <address> <address> push bp mov bp,sp mov ax,[bp+8] mov bx,[bp+6] mov cx,[bp+4] add ax,bx div cx mov dx,ax ret HeaderdoStuff(10,20,3) … int doStuff(int a, int b, int c) { return (a + b) / c } myProg.exe Machine Code <address> = doStuff Symbol table myProg.pdb Service Time + Wait Time
  22. 22. Where do you get PDB files? _NT_SYMBOL_PATH=SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols _NT_SYMCACHE_PATH=C:\SymCache • Public Symbol Server • Configure Environment • Dbghelp.dll Service Time + Wait Time
  23. 23. • Auto Generated by Visual Studio: Your Own Debug Symbols Service Time + Wait Time
  24. 24. • Symbols are indexed. You have to add them yourself Adding and Checking Your Symbols Cd Bin/x64/Release/ symstore add /f *.pdb /s C:\Symbols /t "MyExe" • Validate that the Symbols can resolve Cd Bin/x64/Release/ symchk MyExe.exe /V
  25. 25. • Standard Xperf works fine for your own native code • BUT: Before Windows 8, stack walking is broken for x64 .NET • If you have .NET with 64-bit code, you must NGEN first: Got .NET and x64? Ngen install Bin/x64/Release/MyExe.exe (ngen lives here: %Windir%\Microsoft.NET\Framework64\<Version>\Ngen.exe) Service Time + Wait Time
  26. 26. • Free tool from MS: PerfView • Not to be confused with xperfview • Same trace API and file format • Helps set obscure .NET specific trace flags .NET tracing is a pain, get a tool! Service Time + Wait Time
  27. 27. And Finally, You can do Very Cool Things Did I tell you about interlocked operations?... Whiteboard time! Service Time + Wait Time
  28. 28. • Consider again our LINEITEM table What is SQL Server REALLY doing? • How expensive is it to read from that? • Think ETL code and DW/BI queries CREATE TABLE LINEITEM ( [L_ORDERKEY] [int] NOT NULL, [L_PARTKEY] [int] NOT NULL, [L_SUPPKEY] [int] NOT NULL, [L_LINENUMBER] [int] NOT NULL, [L_QUANTITY] [decimal](15, 2) NOT NULL, [L_EXTENDEDPRICE] [decimal](15, 2) NOT NULL, [L_DISCOUNT] [decimal](15, 2) NOT NULL, [L_TAX] [decimal](15, 2) NOT NULL, [L_RETURNFLAG] [char](1) NOT NULL, [L_LINESTATUS] [char](1) NOT NULL, [L_SHIPDATE] [date] NOT NULL, [L_COMMITDATE] [date] NOT NULL, [L_RECEIPTDATE] [date] NOT NULL, [L_SHIPINSTRUCT] [char](25) NOT NULL, [L_SHIPMODE] [char](10) NOT NULL, [L_COMMENT] [varchar](44) NOT NULL ) BigSmall Small Big OLTP BI/DW Simulation ETL Service Time + Wait Time
  29. 29. SQLCMD – Native code Test SQLCMD.EXE Where does the time go? Service Time + Wait Time
  30. 30. Standard Reading of Data xperf -on base -stackwalk profile -f stackwalk.etl SQLCMD -S. -dSlam –E -Q"SELECT * FROM LINEITEM_tpch" 55sec xperf -stop xperf –merge stackwalk.etl stackwalkmerge.etl Service Time + Wait Time
  31. 31. Details of the Time – Padding? Service Time + Wait Time
  32. 32. More Details – Conversion Work?
  33. 33. An Educated guess about improvements CREATE TABLE [dbo].[LINEITEM_native]( [L_ORDERKEY] [int] NOT NULL, [L_PARTKEY] [int] NOT NULL, [L_SUPPKEY] [int] NOT NULL, [L_LINENUMBER] [int] NOT NULL, [L_QUANTITY] money NOT NULL, [L_EXTENDEDPRICE] money NOT NULL, [L_DISCOUNT] money NOT NULL, [L_TAX] money NOT NULL, [L_RETURNFLAG] int NOT NULL, [L_LINESTATUS] int NOT NULL, [L_SHIPDATE] int NOT NULL, [L_COMMITDATE] int NOT NULL, [L_RECEIPTDATE] int NOT NULL, [L_SHIPINSTRUCT] [char](25) NOT NULL, [L_SHIPMODE] int NOT NULL, [L_COMMENT] char(44) NOT NULL ) CREATE TABLE [dbo].[LINEITEM]( [L_ORDERKEY] [int] NOT NULL, [L_PARTKEY] [int] NOT NULL, [L_SUPPKEY] [int] NOT NULL, [L_LINENUMBER] [int] NOT NULL, [L_QUANTITY] [decimal](15, 2) NOT NULL, [L_EXTENDEDPRICE] [decimal](15, 2) NOT NULL, [L_DISCOUNT] [decimal](15, 2) NOT NULL, [L_TAX] [decimal](15, 2) NOT NULL, [L_RETURNFLAG] [char](1) NOT NULL, [L_LINESTATUS] [char](1) NOT NULL, [L_SHIPDATE] [date] NOT NULL, [L_COMMITDATE] [date] NOT NULL, [L_RECEIPTDATE] [date] NOT NULL, [L_SHIPINSTRUCT] [char](25) NOT NULL, [L_SHIPMODE] [char](10) NOT NULL, [L_COMMENT] [varchar](44) NOT NULL, ) Before After Service Time + Wait Time
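A hedged sketch of how the data might be copied into the native-typed table above; the date epoch, the ASCII/DENSE_RANK integer encodings and the TABLOCK hint are assumptions made for the test, not part of the deck:

     INSERT INTO dbo.LINEITEM_native WITH (TABLOCK)
     SELECT L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER,
            CAST(L_QUANTITY AS money), CAST(L_EXTENDEDPRICE AS money),
            CAST(L_DISCOUNT AS money), CAST(L_TAX AS money),
            ASCII(L_RETURNFLAG), ASCII(L_LINESTATUS),              -- char(1) flags stored as int codes
            DATEDIFF(DAY, '1992-01-01', L_SHIPDATE),               -- dates as day offsets (epoch is arbitrary)
            DATEDIFF(DAY, '1992-01-01', L_COMMITDATE),
            DATEDIFF(DAY, '1992-01-01', L_RECEIPTDATE),
            L_SHIPINSTRUCT,
            CAST(DENSE_RANK() OVER (ORDER BY L_SHIPMODE) AS int),  -- ship mode as a small dictionary code
            L_COMMENT
     FROM dbo.LINEITEM;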
  34. 34. Getting Rid of Useless Work Additional parameters for SQLCMD: -a32767 -W -s";" -f437 x1.5 Service Time + Wait Time
  35. 35. Unicode – 10% overhead? Service Time + Wait Time
  36. 36. Let's try that with Native and Unicode … x5 Service Time + Wait Time
  37. 37. • SQLNCLI is one of these in disguise • ODBC • OLEDB • Pick good data types • MONEY over NUMERIC • UNICODE only if data arrives like that • Native protocols vs. flexibility Summary Moving Data
  38. 38. • Get • Windows 8 ADK • Windows 7 SDK • Set up Symbol Paths • Xperf –on Base • Standard trace for time, narrow to process and DLL/EXE • Xperf –on Base –stackwalk Profile • Get to the call stack, find the offending function(s) • Ease of use for .NET: perfview.exe Summary – Xperf Service Time + Wait Time
  39. 39. Response time = Service Time + Wait Time
  40. 40. Introducing TPC-H Service Time + Wait Time
  41. 41. Loop Join n row B-tree Log(n) reads Complexity: O(m * log(n)) Service Time + Wait Time m row result 1 43 13 7 3
  42. 42. Linked List Tree Linked List vs. Tree Service Time + Wait Time 0 1 2 3 4 5 6 7 8 n 8 134 62 1510 16141197531 Log2(n)
  43. 43. Cluster on O_ORDERKEY Index on O_ORDERKEY Basic argument for Cluster Indexes Service Time + Wait Time CREATE UNIQUE CLUSTERED INDEX CIX_Key ON ORDERS_Cluster (O_ORDERKEY) WITH (FILLFACTOR = 100) SELECT * FROM ORDERS_Cluster WHERE O_ORDERKEY = 3000000 CREATE UNIQUE INDEX IX_Key ON ORDERS_Heap (O_ORDERKEY) WITH (FILLFACTOR = 100) SELECT * FROM ORDERS_Heap WHERE O_ORDERKEY = 3000000 Table 'ORDERS_Heap'. Scan count 0, logical reads 3 , physical reads 0, read-ahead reads 0 Table 'ORDERS_Cluster'. Scan count 0, logical reads 4 , physical reads 0, read-ahead reads 0
  44. 44. Cluster on O_ORDERKEY heap + Index on O_ORDERKEY But what if we do this a lot? CREATE INDEX IX_Customer ON ORDERS_Cluster (O_CUSTKEY) WITH (FILLFACTOR = 100) CREATE INDEX IX_Customer ON ORDERS_Heap (O_CUSTKEY) WITH (FILLFACTOR = 100) SELECT * FROM ORDERS_Heap WHERE O_CUSTKEY = 47480 SELECT * FROM ORDERS_Cluster WHERE O_CUSTKEY = 47480 Table 'ORDERS_Cluster'. Scan count 1 , logical reads 27, physical reads 0 Table 'ORDERS_Heap'. Scan count 1 , logical reads 11, physical reads 0 Service Time + Wait Time
  45. 45. How many LOOP joins/sec/core? 7 Sec Service Time + Wait Time
  46. 46. What did we just measure? Xperf –on Base –stackwalk profile About 40%... Service Time + Wait Time
  47. 47. • The query language itself • Why so many ExecuteStmt? • …With so much CPU use? What is sqllang.dll? Service Time + Wait Time
  48. 48. A different way to Measure Loops 1 Sec Service Time + Wait Time
  49. 49. VS. What does THAT look like? Takeaway: The T-SQL language itself is expensive Service Time + Wait Time
  50. 50. • Sample from LINEITEM • Force loop join with index seeks • Do 1.4M seeks Test: Singleton Row Fetch
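A minimal sketch of such a singleton-seek test, assuming a pre-built sample table of ~1.4M keys and an index on (L_ORDERKEY, L_LINENUMBER) that the seeks can use; names are illustrative:

     SELECT MAX(l.L_EXTENDEDPRICE)
     FROM dbo.LINEITEM_sample AS s                -- ~1.4M sampled keys
     INNER LOOP JOIN dbo.LINEITEM AS l            -- force a loop join: one seek per sample row
         ON l.L_ORDERKEY = s.L_ORDERKEY
        AND l.L_LINENUMBER = s.L_LINENUMBER
     OPTION (MAXDOP 1);                           -- single core, matching the measurements above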
  51. 51. Singleton seeks – Cost of compression Compression Seek (1.4M seeks) CPU Load None - Memory 13 sec 100% one core PAGE - Memory 24 sec 100% one core None – I/O 21 sec 100% one core PAGE – I/O 32 sec 100% one core Function % Weight CDRecord::LocateColumnInternal 0.82% DataAccessWrapper::DecompressColumnValue 0.47% SearchInfo::CompareCompressedColumn 0.28% PageComprMgr::DecompressColumn 0.24% AnchorRecordCache::LocateColumn 0.18% ScalarCompression::AddPadding 0.04% ScalarCompression::Compare 0.11% Additional Runtime of GetNextRowValuesInternal 0.14% Total Compression 2.28% Total CPU (single core) 8.33% Compression % 27.00% xperf –on base –stackwalk profile
  52. 52. Modern CPU CPU L3 Cache 4MB Inst Cache 32KB Core Data Cache 32KB L2 Uni Cache 256K Inst Cache 32KB Core Data Cache 32KB L2 Uni Cache 256K Bus Service Time + Wait Time
  53. 53. The B+ Tree Service Time + Wait Time B+ Tree
  54. 54. Hekaton Style “Loop” Lookup Table (hash) Service Time + Wait Time
  55. 55. Merge Join m row result 1 1 2 3 n row result 1 2 3 4 4 43 43 Sorted Sorted Complexity: O(m + n) Service Time + Wait Time
  56. 56. Merge Join – What is Fastest? Service Time + Wait Time SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE) FROM LINEITEM INNER MERGE JOIN ORDERS ON O_ORDERKEY = L_ORDERKEY …or SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE) FROM ORDERS INNER MERGE JOIN LINEITEM ON O_ORDERKEY = L_ORDERKEY
  57. 57. Comparing the Query Plans Service Time + Wait Time
  58. 58. Digging in Deeper Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0 , lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'ORDERS'. Scan count 1, logical reads 22162, physical reads 0, read-ahead reads 0 , lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'LINEITEM'. Scan count 1, logical reads 104522, physical reads 0, read-ahead reads 0 , lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 3265 ms, elapsed time = 3357 ms. Table 'LINEITEM'. Scan count 1, logical reads 104522, physical reads 0, read-ahead reads 0 , lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'ORDERS'. Scan count 1, logical reads 22162, physical reads 0, read-ahead reads 0 , lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 2469 ms, elapsed time = 2607 ms. Service Time + Wait Time
  59. 59. We can beat SQL Server at this game SELECT MAX(O_ORDERDATE), MAX(MAX_P) FROM (SELECT L_ORDERKEY,MAX(L_PARTKEY) AS MAX_P FROM LINEITEM GROUP BY L_ORDERKEY) b INNER MERGE JOIN ORDERS ON O_ORDERKEY = b.L_ORDERKEY Service Time + Wait Time
  60. 60. Hash Join m row result 1 43 13 7 n row join table Hash(1) n row hash table Complexity: O(m + 2n) 3 Service Time + Wait Time
  61. 61. When Hash Joins hurt you Service Time + Wait Time Chart: runtime (seconds) vs. hash memory (MB), with a marked Spill Zone!
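One way to see whether a hash join is heading for the spill zone is to watch its memory grant while it runs; a hedged sketch (the thresholds are workload specific, the interpretation in the comment is the usual one):

     SELECT session_id, dop, requested_memory_kb, granted_memory_kb,
            used_memory_kb, max_used_memory_kb, query_cost
     FROM sys.dm_exec_query_memory_grants
     ORDER BY granted_memory_kb DESC;
     -- If used_memory_kb keeps pressing against granted_memory_kb, the hash build is
     -- likely to spill to tempdb and the runtime jumps, as in the chart above.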
  62. 62. Hash Joins Don’t Scale in MSSQL
  63. 63. The Bottleneck Curve
  64. 64. ACCESS_METHODS_DATASET_PARENT: “Used to synchronize child dataset access to the parent dataset during parallel operations.” Books Online Story… Image: FreeDigitalPhotos.net
  65. 65. Using XPERF to find documentation xperf –on base+cswitch+dispatcher –stackwalk profile+readythread+cswitch
  66. 66. Let's dig in… xperf -on base -stackwalk profile -f stackwalk.etl
  67. 67. What LATCH pattern do we see? GetNextRangeForChildScan Inside: TableScanNew
  68. 68. • Partition the table by a “random” value • Modulo the Key for example • Use SQL Server partition function/schema The Fix?… 0 1 2 3 4 5 6 253 254 255 hash
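A minimal sketch of the modulo trick with 8 buckets (the slide uses 256); table, column and object names are illustrative, not from the deck:

     ALTER TABLE dbo.LINEITEM
         ADD HashBucket AS CAST(L_ORDERKEY % 8 AS tinyint) PERSISTED;   -- "random enough" bucket

     CREATE PARTITION FUNCTION pf_hash8 (tinyint)
         AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6);                -- 7 boundaries = 8 partitions

     CREATE PARTITION SCHEME ps_hash8
         AS PARTITION pf_hash8 ALL TO ([PRIMARY]);

     CREATE CLUSTERED INDEX CIX_Hash
         ON dbo.LINEITEM (HashBucket, L_ORDERKEY) ON ps_hash8 (HashBucket);

     DROP INDEX CIX_Hash ON dbo.LINEITEM;   -- optional: dropping it leaves a hash-partitioned heap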
  69. 69. Closer…
  70. 70. …But no Cigar
  71. 71. What is the Problem here?
  72. 72. Anti Scale Patterns
  73. 73. CPU Caches Chart: million pages/sec vs. size of accessed memory (MB) for Random Pages, Sequential Pages, Single Page. Service Time + Wait Time
  74. 74. Goals: • Compressed • Prefetch Friendly • Cache Resident Code Example, Column Stores ID Value 1 Beer 2 Beer 3 Vodka 4 Whiskey 5 Whiskey 6 Vodka 7 Vodka ID Customer 1 Thomas 2 Thomas 3 Thomas 4 Christian 5 Christian 6 Alexei 7 Alexei Product Customer ID Date 1 2011-11-25 2 2011-11-25 3 2011-11-25 4 2011-11-25 5 2011-11-25 6 2011-11-25 7 2011-11-25 Date ID Sale 1 2 GBP 2 2 GBP 3 10 GBP 4 5 GBP 5 5 GBP 6 10 GBP 7 10 GBP Sale Service Time + Wait Time
  75. 75. Compression is Easy ID Value 1-2 Beer 3 Vodka 4-5 Whiskey 6-7 Vodka ID Customer 1-3 Thomas 4-5 Christian 6-7 Alexei Product’ Customer’ ID Date 1-7 2011-11-25 Date’ ID Sale 1-2 2 GBP 3 10 GBP 4-5 5 GBP 6-7 10 GBP Sale’ RL Value 2 Beer 1 Vodka 2 Whiskey 2 Vodka RL Customer 3 Thomas 2 Christian 2 Alexei Product’ Customer’ RL Date 7 2011-11-25 Date’ RL Sale 2 2 GBP 1 10 GBP 4 5 GBP 2 10 GBP Sale’ Service Time + Wait Time
  76. 76. Squeezing it even more RL Value 2 Beer 1 Vodka 2 Whiskey 2 Vodka Product’ RL Value 2 1 1 2 2 3 2 2 Product’ Beer = 1 Vodka = 2 Whiskey = 3 ID Value 1-2 Beer 3-3 Vodka 4-5 Whiskey 6-7 Vodka Product’ 4+4+4+2 = 14B + 4+4+5+2 = 15B + 4+4+7+2 = 17B + 4+4+5+2 = 15B = 61B 4+4+2 = 10B + 4+5+2 = 11B + 4+7+2 = 13B + 4+5+2 = 11B = 45B 4+4 = 8B + 4+4 = 8B + 4+4 = 8B + 4+4 = 8B = 32B RL Value 2 0x01 1 0x10 2 0x11 2 0x10 Product’ 4 = 4B + 4 = 4B + 4 = 4B + 4 = 4B + 4 x 2b = 2B = 18B Service Time + Wait Time
  77. 77. RL Value 2 Beer 1 Vodka 2 Whiskey 2 Vodka RL Customer 3 Thomas 2 Christian 2 Alexei Product’ Customer’ 2 steps with Beer 2 steps with Thomas Beer Thomas Beer Thomas SELECT Product, Customer FROM Table 1 step with Vodka 1 step with Thomas Vodka Thomas 2 step with Whiskey 2 step with Christian Whiskey Christian Whiskey Christian 2 step with Vodka (Note: Repeated value) 2 step with Alexei Vodka Alexei Vodka Alexei Service Time + Wait Time
  78. 78. Hash Joining with Column Stores RL Key 2 Beer 1 Vodka 2 Whiskey 2 Vodka Table Key Type Beer Soft Vodka Strong Whiskey Strong Vodka Strong Dim Product SELECT … FROM Table JOIN DimProduct ON Key WHERE Type = ‘Strong’ 1 Compute bloom filter of Keys belonging to ‘strong’ 2 Read RL = 2, Beer from Table 3 Compute bloom value of Beer. 4 Equal to filter value from 1? Yes. Output two rows (RL=2) 5 Compute bloom value for Vodka 6 Equal to filter value from 1? No. Do nothing 7 Compute bloom value for Whiskey 8 Equal to filter value from 1? No. Do nothing Can prefetch data (new RLE) Can calculate match/no match using only local CPU cache Won't work for OLTP! Service Time + Wait Time
  79. 79. Why is it so hard to get joins right? n m Time Loop Join Merge Join Hash Join Service Time + Wait Time
  80. 80. Desired Join Join Hint Query Hint LOOP [INNER | LEFT | CROSS | FULL] LOOP JOIN OPTION (LOOP JOIN) MERGE [INNER | LEFT | CROSS | FULL] MERGE JOIN OPTION (MERGE JOIN) HASH [INNER | LEFT | CROSS | FULL] HASH JOIN OPTION (HASH JOIN) LOOP with Seek WITH FORCESEEK WITH ( INDEX (index = <name>) ) N/A Controlling Joins Note: Join hints force the order of the ENTIRE join tree! Service Time + Wait Time
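Illustrative uses of the three hint forms from the table above (the queries themselves are examples, not from the deck):

     SELECT MAX(O_ORDERDATE) FROM ORDERS INNER MERGE JOIN LINEITEM ON O_ORDERKEY = L_ORDERKEY;        -- join hint
     SELECT MAX(O_ORDERDATE) FROM ORDERS JOIN LINEITEM ON O_ORDERKEY = L_ORDERKEY OPTION (HASH JOIN); -- query hint
     SELECT * FROM ORDERS WITH (FORCESEEK) WHERE O_ORDERKEY = 3000000;                                -- force a seek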
  81. 81. What Type of Workload? Quadrant chart: Data Returned (small/big) vs. Data Touched (small/big), with OLTP, BI/DW, Simulation and ETL quadrants. Service Time + Wait Time
  82. 82. How to Classify? OLTP BI/DW Simulation ETL Full Scan/sec Range Scans/sec Probe Scans/sec Index Search/sec Range Scans/sec Full Scan/sec Range Scans/sec Bulk Copy Rows/sec ?
  83. 83. There should ALWAYS be a fully indexed path to the data. OLTP System Basic Query Pattern BigSmall Small Big OLTP BI/DW Simulation ETL Service Time + Wait Time
  84. 84. 1. Find worst CPU consuming query with sys.dm_exec_query_stats 2. Add OPTION (LOOP JOIN) to offending query 3. Check estimated query plan 4. If table spool found: add index to remedy and GOTO 3 5. Happy? If not, GOTO 1 The Super Quick OLTP Tuning Guide Service Time + Wait Time
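A hedged sketch of step 1; the TOP count and the statement-text extraction are the usual pattern, not prescribed by the deck:

     SELECT TOP (10)
            qs.total_worker_time / qs.execution_count AS avg_cpu,
            qs.execution_count,
            SUBSTRING(st.text, qs.statement_start_offset / 2 + 1,
                      (CASE WHEN qs.statement_end_offset = -1
                            THEN DATALENGTH(st.text)
                            ELSE qs.statement_end_offset END
                       - qs.statement_start_offset) / 2 + 1) AS statement_text
     FROM sys.dm_exec_query_stats AS qs
     CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
     ORDER BY qs.total_worker_time DESC;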
  85. 85. The query will not be (much) worse than a full scan of a fact partition DW/BI System Basic Query Pattern BigSmall Small Big OLTP BI/DW Simulation ETL Service Time + Wait Time
  86. 86. 1. Find offending query 2. Add OPTION (HASH JOIN) to query 3. Do the dimension tables have an indexed path to build the hash? If not, add an index 4. Do you get a fact table scan and hash build of all dimensions? If not, check statistics (especially on facts and skewed columns) 5. Optimize Fact table scans 1. Partition and partition elimination 2. Column store if you have it 3. Aggregate Views 4. Bitmap index pushdown (statistics!) 5. Composite indexes (last resort!) The Super Quick DW tuning Guide Service Time + Wait Time
  87. 87. The expected DW Query Plan Partial Aggregate Fact CSI Scan Dim Scan Dim Seek Batch Build Batch Build Hash Join Hash Join HashStream Aggregate
  88. 88. • At least enough RAM to hold the hash tables of the largest dimension • De-normalisation helps… a LOT • Especially for the large/large joins • Likely: need to scan fast from disk if RAM is not big enough to hold the fact • Compression REALLY matters Things that Follow from desired DW Plan Service Time + Wait Time
  89. 89. Coffee Break
  90. 90. Response time = Service Time + Wait Time
  91. 91. Where EVERY Server wide diagnosis starts SELECT * FROM sys.dm_os_wait_stats WHERE wait_type NOT IN (SELECT wait_type FROM #ignorewaits) AND waiting_tasks_count > 0 ORDER BY wait_time_ms DESC Service Time + Wait Time
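The #ignorewaits table is referenced above but not defined in the deck; one possible way to build it is to pre-load the usual benign background waits (the list below is a common starting point, not exhaustive):

     CREATE TABLE #ignorewaits (wait_type nvarchar(60) NOT NULL);
     INSERT INTO #ignorewaits (wait_type) VALUES
         ('SLEEP_TASK'), ('LAZYWRITER_SLEEP'), ('SQLTRACE_BUFFER_FLUSH'),
         ('REQUEST_FOR_DEADLOCK_SEARCH'), ('XE_TIMER_EVENT'), ('XE_DISPATCHER_WAIT'),
         ('LOGMGR_QUEUE'), ('CHECKPOINT_QUEUE'), ('BROKER_TO_FLUSH'), ('BROKER_TASK_STOP'),
         ('CLR_AUTO_EVENT'), ('DIRTY_PAGE_POLL'), ('FT_IFTS_SCHEDULER_IDLE_WAIT'), ('WAITFOR');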
  92. 92. • Shows up as waits for PAGEIOLATCH • You can dig into details with: Common Problems - PAGEIO Service Time + Wait Time SELECT * FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL) • Can also Xevent your way to it per query CREATE EVENT SESSION [TraceIO] ON SERVER ADD EVENT sqlserver.file_read_completed( ACTION (sqlserver.database_id,sqlserver.session_id))
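A possible completion of the Xevent session sketched above; the file target and file name are assumptions:

     ALTER EVENT SESSION [TraceIO] ON SERVER
         ADD TARGET package0.event_file (SET filename = N'TraceIO.xel');
     ALTER EVENT SESSION [TraceIO] ON SERVER STATE = START;
     -- ...run the workload, then:
     ALTER EVENT SESSION [TraceIO] ON SERVER STATE = STOP;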
  93. 93. • I/O, like memory, is a GLOBAL resource for the machine • When does it make sense to partition a global resource? • When you deeply know the workload • When the workload is ALREADY partitioned • When neither of those are true: DON’T partition • If you have NAND/SSD – Why bother? The general I/O Guidance Service Time + Wait Time
  94. 94. A good way to Think of Spindle I/O
  95. 95. JBOD SAME LUN Seq. LUN Seq. LUN Seq. RAID system Large LUN Seq. Seq. Seq. RANDOM I/O Service Time + Wait Time
  96. 96. Stripe vs. Concatenation RAID 10 RAID 10 Concatenated LUN RAID 10 RAID 10 Striped LUN Service Time + Wait Time
  97. 97. OLTP • One big SAME setup • data files • Tempdb • Dedicate • Transaction log • DRAM: • Enough to hold most of DB Data Warehouse • JBOD setup • Data Files • 1-2 per LUN • SAME setup • Tempdb • Dedicate • Transaction Log • DRAM: • Enough to hold largest partition of largest table Rules of Thumb – Spindle I/O and DRAM Service Time + Wait Time
  98. 98. • Short Stroking • Elevator Sort • Sequential vs. Random • Weaving You can do a bit better… or worse Service Time + Wait Time
  99. 99. • Intentionally use lower % of total space • Tradeoff: • Space for Speed • Test: • 15K rpm • SAS spindle • 300GB Short Stroking Disks Chart: IOPS vs. % capacity used. Service Time + Wait Time
  100. 100. Full Stroked Short Stroked Why does Short Stroking Work? Disks are typically consumed “from the outside in”. If partitions don’t use the full disk size, the disk won’t use the full platter either. The result: less head movement Service Time + Wait Time
  101. 101. Adding Elevator Sorting Chart: 8K random I/O IOPS, avg. latency and max latency for Full Stroke Random, Outer Short, Inner Short, Elevator Sort, Elevator Short Stroked. Bat powered disk!
  102. 102. Why Chase Sequential I/O? Chart: latency (ms) vs. log(IOPS) for an 8K block pattern, Sequential vs. Full Stroke Random (IOPS, avg latency, max latency). Service Time + Wait Time
  103. 103. • One SATA disk • Two partitions • One file on each • Sequential read on each file But all is not well! File1 File2 Service Time + Wait Time
  104. 104. I/O Weaving in action Chart: IOPS and avg latency (ms) for 64K Random vs. 64K Dual Sequential. Source: Michael Anderson Service Time + Wait Time
  105. 105. Storage Pool and Weaving DataLog DataLog DataLog Massive, then Provisioned Pool Seq Ran Seq Ran Seq Ran RANDOM! Service Time + Wait Time
  106. 106. The SAN will properly handle Sharing! Green: Checkpoint, Red: tx/sec, Black: Disk Latency Service Time + Wait Time
  107. 107. Numbers to Remember - Spindles Characteristic Typical Units Throughput / Bandwidth 90-125MB/sec But ONLY if sequential access! Operations per Sec 10K RPM Spindle: 100-130 IOPS 15K RPM Spindle: 150-180 IOPS Can get about 2x if short stroking (more later) Latency 3-5ms (compare DRAM: 100ns) Capacity 100s of GB to single digit TB 2012 numbers, will change in future Service Time + Wait Time
  108. 108. • Few hundreds of IOPS • Faster if short stroked • Trade latency for speed with elevator sort • Sequential is hard to get right Summary so far.. Single Disk Service Time + Wait Time
  109. 109. • Wider Stripes neat • But scale not linear • Very deep queues help • But add latency • Shared Components Why does a big RAID pile not solve this? Service Time + Wait Time
  110. 110. RAID Scale? Your Mileage WILL vary with the hardware
  111. 111. Before After Getting rid of Sharing Switch HBA HBA HBA HBA Storage Port Storage Port Switch LUN LUN Cache Disk CPU Switch HBA HBA HBA HBA Storage Port Storage Port Switch LUN LUN Cache Disk CPU x2
  112. 112. 4K PN N NAND Flash Basics 112 PN N Oxide Layer Floating Gate Electrons trapped Control Gate NAND Die Pack Blocks 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K PN N PN NPN N PN NPN N PN NPN N Pages
  113. 113. NAND Flash Problems • Erase Cycles • Around 100K • Rebalancing and reclaim/trim • Voltage measurement • Gets worse with density • Changes over time • Depends on how you program • Bit Rot • Must refresh even on read • SLC easier to manage than MLC • But much more expensive! 113 Voltage 00 01 10 11
  114. 114. Lessons Learned: Try to Avoid Sharing BAD BETTER BEST Service Time + Wait Time
  115. 115. The Network
  116. 116. • Only partially diagnosed as waits in sys.dm_os_wait_stats • Task Manager gives a bit more information • Need: transparency to the deep level latencies and packets! Common Problems: ASYNC_NETWORK, OLEDB Service Time + Wait Time
  117. 117. A common Wait Type The database is really slow! The code takes forever to run! Service Time + Wait Time
  118. 118. • We may not always have insight into what is going on at the client… Xperf Diagnosing the Network xperf –on latency+network Summary Table Service Time + Wait Time
  119. 119. Timeline of the network Traffic
  120. 120. ASYNC_NETWORK_IO, the typical issue Service Time + Wait Time
  121. 121. Handling network is EXPENSIVE xperf –on latency ? Service Time + Wait Time
  122. 122. Short Story on DPC/ISR handling CPU Core Core L1-L3 Cache PCI BUS IRQ HALT execution Fire ISR Routine if (my interrupt) { <Mark Handled> Queue DPC } NIC Work Done DPC <Do work needed> <Wake Application> Core can run other stuff again Service Time + Wait Time
  123. 123. It looks like this… DPC ISR Service Time + Wait Time
  124. 124. • Option 1: Use the HW vendors tool • Option 2: Use interrupt Affinity Policy Tool from MS Setting Interrupt Affinity Service Time + Wait Time
  125. 125. • Standard Payload Network (MTU): • 1500 B • Jumbo Frames • 9014 B(MTU) Jumbo Frame and SQL Packets • Standard SQL payload • 4096 B • Largest • 32767 B SELECT session_id, net_packet_size FROM sys.dm_exec_connections Server=foo;Packet size=32767 Service Time + Wait Time
  126. 126. Single Threaded
  127. 127. Core Evolution Moore’s “Law”: “The number of transistors per square inch on integrated circuits has doubled every two years since the integrated circuit was invented”
  128. 128. • Never faster than a single core • Smaller servers are faster than bigger ones • Large L2 caches and more clock speed help • The algorithm dictates speed • Latency of Wait Time sets upper limit • Examples from MSSQL land: • Formula Engine in MSAS • Transaction Log Writes • INSERT/UPDATE/DELETE (as we shall see) Single Threaded
  129. 129. VLF files • When switching to new VLF – it has to be “formatted” with 8K sync write • While this happens, transactions are blocked • Too many VLF = Too much blocking • Lesson: Preallocate the database log file in big chunks • Up to 128 Log Buffers per database • Spawned on demand, will not be released once spawned • Transactions will wait for LOGBUFFER if no buffer is available • Think of this like a pipeline of commits waiting… VLF(1) VLF(2) VLF(3) VLF(4) VLF(5) VLF(6) 8K 8K 8K 8K 8K 8K <=60K X 128
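A minimal sketch of the preallocation advice (database and file names are assumptions); growing the log in a few large chunks keeps the VLF count low:

     ALTER DATABASE MyDb MODIFY FILE (NAME = MyDb_log, SIZE = 64GB);
     DBCC LOGINFO;   -- undocumented but widely used: one row per VLF, so you can count them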
  130. 130. Transaction Log Background Buffer Offset (cache line) LOGCACHE ACCESS Alloc Slot in Buffer MemCpy Slot Content Log Writer Writer Queue Async I/O Completion Port Slot 1 LOGBUFFER WRITELOG LOG FLUSHQ Signal thread which issued commit T0 Tn Slot 127 Slot 126
  131. 131. • Speed is determined by Latency and Code Path • Max Log Write Size: 60K Zooming to the Log Writer Log Writer Async I/O Completion Port Signal thread which issued commit Latency Writer Queue
  132. 132. Long Distance Replication… Log Entry Log Entry Network Log Entry Send log Ack Log Primary Secondary Write Write Executive Summary: The speed of light ( c ) is not fast enough!
  133. 133. • Perfmon will only show millisec • What if we want microseconds? Getting to the Real Latency xperf –on latency
  134. 134. It’s in Memory, so it must be fast? VS. Latency: 15-30us Latency: <5us RAM DISK 1.5sec 1.5sec
  135. 135. No, Because… This adds up to one core… it is doing all it can with the CPU it has
  136. 136. The Effect on UPDATE Naïve UPDATE MyBigTable SET c6 = 43 Parallel UPDATE MyBigTable SET c6 = 43 WHERE key BETWEEN 10**9 * n AND 10**9 * (n+1) - 1 Runtime (smaller is faster)
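A hedged sketch of one of the parallel sessions (range size, column and table names are illustrative); each session n = 0..N-1 updates its own key slice:

     DECLARE @n bigint = 0;   -- this session's slice number
     UPDATE dbo.MyBigTable
     SET c6 = 43
     WHERE [key] BETWEEN 1000000000 * @n AND 1000000000 * (@n + 1) - 1;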
  137. 137. Multi Threaded
  138. 138. What is Scalable? Chart: throughput vs. some hardware resource, with Good, So-so and Bad curves. We want to live here (on the Good curve)
  139. 139. Amdahl’s Law of gated speedup Chart: speedup factor vs. number of cores for P = 100%, 95%, 90%, 80%. P = Part of program that can be made Parallel (Note that this may be 0... or 1) N = Number of CPU cores available Speedup = 1 / ((1 - P) + P / N)
  140. 140. Introducing Contention – Locks Table A Table B Table C INSERT TableA … INSERT TableB … INSERT TableC … LCK LCK LCK LCK LCK LCK LCK LCK Wait Stat: LCK_<X>
  141. 141. But those rows have to be stored… Table A Table B Table C LCK LCK LCK LCK LCK LCK LCK LCK Data File File Group
  142. 142. It all Starts with Wait Stats SELECT * FROM sys.dm_os_wait_stats WHERE wait_type NOT IN (SELECT wait_type FROM #ignorewaits) AND waiting_tasks_count > 0 ORDER BY wait_time_ms DESC DBCC PAGE
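Once wait stats point at a specific page (e.g. a PAGELATCH wait whose resource_description reads db_id:file_id:page_id), DBCC PAGE shows what kind of page it is; a hedged sketch with example values:

     DBCC TRACEON (3604);      -- route DBCC PAGE output to the client
     DBCC PAGE (5, 1, 1, 0);   -- database_id, file_id, page_id, print option (0 = header only)
     -- the m_type field in the page header identifies PFS, GAM, SGAM, data pages, etc.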
  143. 143. PFS – Hidden Single Page Contention Data File GAM/ SGAM PFS 64MB PFS PFS 64MB PFS 64MB PFS B B B B B B B B B B B B B B B B 8K 10010010 INSERT TableA … Allocated bit
  144. 144. Data File Data File Data File More Files Table A Table B Table C LCK LCK LCK LCK LCK LCK LCK LCK Data File File Group • Round Robin between files • More files, more structures • No affinity
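A minimal sketch of adding data files so allocations round-robin across them; paths, names and sizes are illustrative:

     ALTER DATABASE MyDb ADD FILE
         (NAME = MyDb_data2, FILENAME = 'D:\Data\MyDb_data2.ndf', SIZE = 10GB, FILEGROWTH = 1GB),
         (NAME = MyDb_data3, FILENAME = 'D:\Data\MyDb_data3.ndf', SIZE = 10GB, FILEGROWTH = 1GB)
     TO FILEGROUP [PRIMARY];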
  145. 145. How many more Files? Chart: runtime and PAGELATCH_UP waits vs. number of data files (0 to 48)
  146. 146. • Shared, physical MEMORY structures can cause bottlenecks (ex: PFS) • SQL Server must sync too… • Understanding where structure resides leads to tuning fix • Theory of engine! Concurrency: What we learned so far
  147. 147. CXPACKET • Commonly misdiagnosed • CXPACKET does NOT (always) mean that your DOP is “too high” Charts: CXPACKET waits vs. throughput (MB/sec), and throughput (MB/sec) vs. DOP
  148. 148. CXPACKET = Issue may be elsewhere…
  149. 149. • What happens when you get things like: LATCH_<x> PAGELATCH_<x> Step 1: Dig into: Diagnosing Latches SELECT * FROM sys.dm_os_latch_stats Service Time + Wait Time
  150. 150. Digging into Latches Again…
  151. 151. Zooming into the Ready Thread
  152. 152. Post Fix Pattern GetNextRangeForChildScan GetNextRangeForChildScan GetNextRangeForChildScan
  153. 153. • Before: 6GB/sec • After: 20GB/sec • This sometimes works on cluster indexes too… …Whiteboard Speedup with Hash Partition of Heap
  154. 154. UPDATE Hotspot Page (8K) ROW ROW ROW LCK_U LCK_U PAGELATCH_EX
  155. 155. Before ALTER TABLE HotUpdates ADD Padding CHAR(5000) NOT NULL DEFAULT ('X') After UPDATE Hack on Small Tables Page (8K) ROW LCK_U PAGELATCH_EX CHAR(5000) Page (8K) ROW ROW ROW LCK_U LCK_U PAGELATCH_EX
  156. 156. Test: Updates of pages Compression Update 1.4M CPU Load None - Memory 13 sec 100% one core PAGE - Memory 54 sec 100% one core None – I/O 17 sec 100% one core PAGE – I/O 59 sec 100% one core L_QUANTITY is NOT NULL i.e. in place UPDATE
  157. 157. Function CPU % qsort 0.86 CDRecord::Resize 0.84 CDRecord::LocateColumnInternal 0.36 perror 0.36 Page::CompactPage 0.36 ObjectMetadata::`scalar deleting destructor' 0.27 SearchInfo::CompareCompressedColumn 0.24 CDRecord::InitVariable 0.19 CDRecord::LocateColumnWithCookie 0.18 memcmp 0.16 PageDictionary::ValueToSymbol 0.16 Record::DecompressRec 0.14 PageComprMgr::DecompressColumn 0.14 CDRecord::InitFixedFromOld 0.1 SOS_MemoryManager::GetAddressInfo64 0.08 AnchorRecordCache::LocateColumn 0.08 CDRecord::GetDataForAllColumns 0.08 ScalarCompression::Compare 0.07 PageComprMgr::CompressColumn 0.07 Record::CreatePageCompressedRecNoCheck 0.06 memset 0.05 PageComprMgr::ExpandPrefix 0.04 PageRef::ModifyColumnsInternal 0.04 Page::ModifyColumns 0.03 DataAccessWrapper::ProcessAndCompressBuffer 0.03 SingleColAccessor::LocateColumn 0.03 CDRecord::BuildLongRegionBulk 0.02 ChecksumSectors 0.02 Page::MCILinearRegress 0.02 DataAccessWrapper::DecompressColumnValue 0.02 SOS_MemoryManager::GetAddressInfo 0.02 CDRecord::FindDiff 0.02 AnchorRecordCache::Init 0.02 PageComprMgr::CombinePrefix 0.01 Total 5.17 UPDATE Compression burners Out of 8.55 … Approx: 60%
  158. 158. Compression and Locks Xevent Trace Lock Acquire/Release High Res Timer
  159. 159. How long are locks held? Chart: CPU kilocycles a lock is held (avg and std dev), PAGE vs. NONE compression
  160. 160. • Sharing is generally bad for scale (but may be good for performance) • PAGELATCH and LATCH diagnosis starts in sys.dm_os_latch_stats • CXPACKET • Only important if throughput drops when DOP goes up • If this happens, look for another wait/latch • Table partitioning can be used to work around concurrency issues Summary Concurrency – So Far..
  161. 161. The Paul Randal INSERT test 160M rows, executing at concurrency Commit every 1K: EASY tuning?
  162. 162. All is as Expected?
  163. 163. But Page Splits are Bad, right? = BAD! = Better!...
  164. 164. WRITELOG gone? Faster? ? ? sys.dm_os_wait_stats
  165. 165. And the Score Is… Chart: time in seconds for newguid(), newsequentialid() and IDENTITY
  166. 166. What is going on here??? Min Min Min Min Min Min Min Min Min Min HOBT_ROOT Max
  167. 167. Tricks to Work Around this 0 -1000 1001 - 2000 2001 - 3000 3001 - 4000 INSERT INSERT INSERT INSERT
  168. 168. All Cores at 100% Chart: runtime in seconds for newguid(), newsequentialid(), IDENTITY, IDENTITY+Unique, IDENTITY+Unique+Hash8, IDENTITY+Hash24, IDENTITY+Hash48, SPID+Offset. 600K Inserts/sec 830K Inserts/sec All Cores at ~100%
  169. 169. • Don’t use Sequential Keys • Page Splitting isn’t so bad • Neither are GUID • Generate keys wisely. Ideally in the app server • For “transparent” speedup, consider our old hash trick Takeaways, INSERT workload
  170. 170. • Minimally Logged • Single, large execution (thousands) • Unsorted data • Concurrent Loaders BULK INSERT Workload Heap Bulk Insert Bulk Insert
  171. 171. Measure: SELECT * FROM sys.dm_os_latch_stats Observe waits on ALLOC_FREESPACE_CACHE Theory (just read BOL): “Used to synchronize the access to a cache of pages with available space for heaps and binary large objects (BLOBs). Contention on latches of this class can occur when multiple connections try to insert rows into a heap or BLOB at the same time. You can reduce this contention by partitioning the object.” When does BULK INSERT scale break? Chart: MB/sec vs. number of concurrent BULK INSERTs
  172. 172. What is Happening here? Free Page information (PFS/GAM/SGAM) HOBT Cache Fat Chunks Alloc new pages! Bulk Insert ALLOC_FREESPACE_CACHE This is in DRAM and L2
  173. 173. • Break Up table by “some key” • Optional: Switch out partitions • Spin up multiple bulks • Linear scale • 3GB/sec • 16M LINEITEM/sec Breaking Through the Bottleneck 425 555 215 200 101 453 666 Area Bulk Insert Bulk Insert Bulk Insert
  174. 174. BULK INSERT - Reloaded • Thomas, you might have gotten 16M rows/sec at 3GB/sec insert speed • But this was on heaps, I have a clustered table • Alright then, let us hit a cluster index 1-1000 Clustered and partitioned 1001-2000 2001-3000 3001-4000 X Lock X Lock X Lock X Lock
  175. 175. Cluster Bulking – It seemed so plausible! 1 2 3
  176. 176. Cluster Bulking – Stage and Switch 1 2 3
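A hedged sketch of stage-and-switch: bulk load an empty staging table that matches the target's schema, indexes and partition-range check constraint, then switch it in; the file path, partition number and names are assumptions:

     BULK INSERT dbo.LINEITEM_stage
     FROM 'D:\load\lineitem_part42.tbl'
     WITH (TABLOCK, FIELDTERMINATOR = '|', ROWTERMINATOR = '\n');

     ALTER TABLE dbo.LINEITEM_stage SWITCH TO dbo.LINEITEM PARTITION 42;   -- metadata-only operation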
  177. 177. Coffee Break
  178. 178. SPIN LOCKS
  179. 179. • Context Switching is expensive • Typically 10K or more CPU cycles • If you expect the resource to be held only shortly, why fall asleep? What is a Spinlock? spin_acquire(int* s) { while (*s == 1) { /* spin */ } *s = 1; } spin_release(int* s) { *s = 0; }
  180. 180. • Acquire can be very expensive • SQL Server implements a backoff mechanism What is a backoff? spin_acquire(int* s) { int spins = 0; while (*s == 1) { spins++; if (spins > threshold) { <Sleep and WaitForResource> } } *s = 1; } SELECT * FROM sys.dm_os_spinlock_stats DBCC SQLPERF(spinlockstats) Backoff
  181. 181. Life at 600K INSERT/sec
  182. 182. WRITELOG is I/O – right? Should be the same as this… or? No! Because:
  183. 183. • Step 1: Copy sqlservr.pdb to the BINN directory • Step 2: DBCC TRACEON (3656, -1) • Step 3: Steal script from: http://www.microsoft.com/en-us/download/details.aspx?id=26666 Note for 2012, you additionally need: • sqlmin.pdb, sqllang.pdb, sqldk.pdb Diagnosing a Spinlock the Cool way!
  184. 184. Spinlock Walkthrough – Extended Events Script --Get the type value for any given spinlock type select map_value, map_key, name from sys.dm_xe_map_values where map_value IN ('SOS_CACHESTORE') --create the event session that will capture the callstacks to a bucketizer create event session spin_lock_backoff on server add event sqlos.spinlock_backoff (action (package0.callstack) where type = 144 --SOS_CACHESTORE) add target package0.asynchronous_bucketizer ( set filtering_event_name='sqlos.spinlock_backoff', source_type=1, source='package0.callstack') with (MAX_MEMORY=50MB, MEMORY_PARTITION_MODE = PER_NODE) --Run this section to measure the contention alter event session spin_lock_backoff on server state=start --wait to measure the number of backoffs over a 1 minute period waitfor delay '00:01:00' --To view the data --1. Ensure the sqlservr.pdb is in the same directory as the sqlservr.exe --2. Enable this trace flag to turn on symbol resolution DBCC traceon (3656, -1) --Get the callstacks from the bucketizer target select event_session_address, target_name, execution_count, cast (target_data as XML) from sys.dm_xe_session_targets xst inner join sys.dm_xe_sessions xs on (xst.event_session_address = xs.address) where xs.name = 'spin_lock_backoff' --clean up the session alter event session spin_lock_backoff on server state=stop drop event session spin_lock_backoff on server
  185. 185. Of course, you can just use 2012…
  186. 186. How to improve a spinlock? CPU Core Core L1-L3 Cache CPU Core Core L1-L3 Cache spin_acquire Int s spin_acquire Int s spin_acquire Int s Transfer cache line Transfer cache line CPU CPU
  187. 187. CoreInfo.Exe – where are my cores? CoreInfo.exe
  188. 188. Revisiting the TLOG Buffer Offset (cache line) LOGCACHE ACCESS Alloc Slot in Buffer MemCpy Slot Content Log Writer Writer Queue Async I/O Completion Port Slot 1 LOGBUFFER WRITELOG LOG FLUSHQ Signal thread which issued commit T0 Tn Slot 127 Slot 126
  189. 189. I/O Affinity Mask! Chart: runtime for SPID + Offset vs. SPID + Affinity. sp_configure 'affinity I/O mask'
  190. 190. Bulking at Concurrency • What’s that spin? xperf –on latency –stackwalk profile xperf –d trace.etl xperfview trace.etl SELECT * FROM sys.dm_os_spinlock_stats ORDER BY spins DBCC SQLPERF (spinlockstats) ?
  191. 191. SOS_OBJECT_STORE at high INSERT • Observed: This Spin happens when inserting • Need: Reduce locking overhead • Fixes that work well here: 8x throughput Bonus
  192. 192. • Lets try something really silly: • Run lots of: EXEC emptyProc • This should be infinitely scalable, right? Diagnosing another Spinlock CREATE PROCEDURE emptyProc AS RETURN
  193. 193. Initial Diagnosis MUTEX ??? … what Mutex?
  194. 194. Using the Spinlock Script gives us Some cache Which one?
  195. 195. Validating the Theory CREATE PROCEDURE emptyProc0 AS RETURN GO CREATE PROCEDURE emptyProc1 AS RETURN GO … CREATE PROCEDURE emptyProc31 AS RETURN
  196. 196. What is the SOS_OBJECT_STORE? Security Check?
  197. 197. Validating the new “fix”…
  198. 198. DECLARE @ParmDef NVARCHAR(500) DECLARE @sql NVARCHAR(500) SET @sql = N'INSERT INTO dbo_<t>.MyBigTable_<t> WITH (TABLOCK) (c1, c2, c3, c4,c5,c6) VALUES (@p1, @p2, @p3, @p4, @p5, @p6)' SET @sql = REPLACE(@sql, '<t>', dbo.ZeroPad(@table, 3)) SET @ParmDef = '@p1 BIGINT, @p2 DATETIME, @p3 CHAR(111), @p4 INT, @p5 INT, @p6 BIGINT' DECLARE @constDate DATETIME = '1974-12-22' DECLARE @i INT WHILE (1=1) BEGIN BEGIN TRAN SET @i = 1 WHILE @i <= 1000 BEGIN EXEC sys.sp_executesql @sql, @ParmDef , @p1 = 1, @p2 = @constDate, @p3 = 'x', @p4 = 42, @p5 = 7, @p6 = 13 SET @i = @i + 1 END COMMIT TRAN END Consider this Test harness code…
  199. 199. Spinning on MUTEX Diagnosing with the trace flag shows the spin-stack offender: CSecurityContext::GetUserTokenFromCache This is REALLY expensive at scale: WHILE @i <= 1000 BEGIN EXEC sys.sp_executesql @sql, SET @i = @i + 1 END It initializes a new execution context on every loop!
  200. 200. Fixing the MUTEX spin • Instead of: WHILE @i <= 1000 BEGIN EXEC sys.sp_executesql @sql, SET @i = @i + 1 END • Write: SET @sql = N' DECLARE @i INT WHILE (1=1) BEGIN BEGIN TRAN SET @i = 1 WHILE @i <= 1000 BEGIN INSERT INTO dbo_<t>.MyBigTable_<t> WITH (TABLOCK) (c1, c2, c3, c4,c5,c6) VALUES (@p1, @p2, @p3, @p4, @p5, @p6) SET @i = @i + 1 END COMMIT TRAN END' EXEC sys.sp_executesql @sql, @ParmDef 4x throughput Bonus
  201. 201. • When all other bottlenecks are gone, sharing happens in the most unlikely places • You can use spinlock Xevents inside SQL Server • Remember symbol files in BINN • Trace flag 3656 • This can also be done in XPERF for non SQL apps • Ex: Analysis Services Concurrency, Spinlock Summary
  202. 202. • Control of buffers and NUMA for Xperf setting • By default: • 4MB mem • Spool to disk at root of C-drive • Can do buffer/file control: • -buffersize and –maxbuffers • -maxfile and –FileMode Circular Xperf controlling buffers
  203. 203. • Round robin between NUMA nodes • Inside the NUMA: Pick the one that looks the least busy • This is NOT a perfect system How SQL Server assigns threads
  204. 204. Xperf -on Latency+CSWITCH+DISPATCHER -stackWalk CSwitch+ReadyThread+ThreadCreate+Profile -BufferSize 1024 -MaxBuffers 1024 -MaxFile 1024 -FileMode Circular REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f Super Xperf
  205. 205. • All the tuning wont help you if your model is wrong • Tunings gets your far, but to really scale, you need a good data model • This is what my other courses are about But does the Data Model Work?
  206. 206. &
  207. 207. Problem Statement Queue Structure Msg Msg Msg Msg Msg Ordered Push Pop 300B msg
  208. 208. The Naïve Approach • Push • Seek First Row • INSERT Row • Pop • Seek Last Row • DELETE/Output Key Max Msg Min Max Msg Min Msg
  209. 209. Why this doesn’t Scale! Min Min Min Min Min Min Min Min Min Min HOBT_ROOT Max
  210. 210. NextPrev Virtual Root LATCH HOBT_VIRTUAL_ROOT LCK PAGELATCH PAGELATCH PAGELATCH B-Tree Root Pages
  211. 211. Summarising the Problem • Hot stuff • Root • Min page • Max page • Intermediate pages • Alloc/Dealloc • BUT: We Must have order!
  212. 212. Cooling it down
  213. 213. What if… • Push • Seek first value page • UPDATE Reference Count • Pop • Seek last value page • UPDATE Reference Count Min Max Msg++ Min Max Msg--
  214. 214. Dissipate the Heat Min Msg-- Max Msg++ Min Msg-- Max Msg++ Min Msg-- Max Msg++ Last Digit = 0 Last Digit = 1 Last Digit = 2
  215. 215. Eliminating Thread Contention Queue Structure Ordered PushSequence++PopSequence++ 87654 VERY fast!
  216. 216. Ring Buffers Queue Structure Ordered PushSequence++ Mod 100 PopSequence++ Mod 100 Slot: 8 Msg: 108 Slot: 7 Msg: 107 Slot: 6 Msg: 106 Slot: 5 Msg: 105 Slot:4 Msg:104
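A minimal sketch of a push against a pre-populated ring of 100 slot rows; the table, columns and sequence object are assumptions, and pop would be the mirror image with its own sequence:

     DECLARE @msg char(300) = 'payload...';
     DECLARE @seq bigint;
     SET @seq = NEXT VALUE FOR dbo.PushSequence;        -- SQL 2012 sequence (hypothetical name)

     UPDATE dbo.MsgRing
     SET Payload = @msg, RefCount = RefCount + 1
     WHERE SlotId = @seq % 100;                         -- UPDATE in place instead of INSERT/DELETE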
  217. 217. Summing Up Message Queue Hack • UPDATE • instead of INSERT/DELETE • More partitions = More B-Trees • Ring buffer using modulo • Find Sweet spot concurrency
