Master tuning

Published in: Technology
  • For a great introductory course I recommend the Paul Randal course found here: http://www.sqlskills.com/T_ImmersionInternalsDesign.asp
  • To get a good runtime, we up the count of rows to 1M
  • Hint: NGEN lives in %Windir%\Microsoft.NET\framework64\<Version>. Doc on NGEN: http://msdn.microsoft.com/en-us/magazine/cc163610.aspx
  • Get perfview here: http://www.microsoft.com/en-us/download/details.aspx?id=28567
  • http://msdn.microsoft.com/en-us/library/6t9t5wcf(v=vs.80).aspx
  • Different data structures have different time complexities that lend themselves to more or less efficient service times.
  • Concurrency of JOIN even when single threaded
  • The B+ tree is a data structure that block-fetches large areas of data (typically, but not always, 8K pages) before seeking through the pages in memory. There are many different ways to lay out the data pages of a B+ tree, some of them friendlier to memory prefetch than others. The B+ tree also lets you scan the leaf nodes in a linear manner, without paying the logarithmic seek price for every row. This gives logarithmic time to seek individual pages while still allowing linear time to range scan. Once the expensive price of fetching a page (I/O) has been paid, parsing the page can also be made cheap by making use of the memory structures.
  • Highlight spill warning
  • The course material includes a query that will help you do step 1 in this list. If you are curious about ways to optimize the BEST index-only plan, I recommend the book “SQL Tuning” by Dan Tow.
  • We will get into WHY the transaction log needs to be dedicated
  • Elevator sort orders the I/O requests before sending them to the spindle. Depending on the buffering, this ordering can increase IOPS per spindle quite significantly. However, it comes at the cost of increased latency.
  • Add the spindle illustration here
  • Hardware vendors have different implementations of RAID. It really depends on the gear you have, and there is only ONE way to get the true, unbiased answer… which leads us to the next slide.
  • http://blogs.msdn.com/sqlcat/archive/2008/09/18/scaling-heavy-network-traffic-with-windows.aspx
  • http://msdn.microsoft.com/en-us/library/windows/hardware/gg463378.aspx
  • The jumbo settings vary by vendor.
  • http://blogs.msdn.com/b/sqlserverstorageengine/archive/2006/07/08/under-the-covers-gam-sgam-and-pfs-pages.aspx
  • In certain scenarios with shallow B-trees (BizTalk Spool), row padding can shift the latch to the internal structure ACCESS_METHODS_HOBT_VIRTUAL_ROOT
  • Root splits are expensive, although each one only affects one partition at a time. The problem arises when many transactions cause page splits. We suggest that partitioning is the better option.
  • Master tuning

    1. 1. Thomas Kejser thomas@kejser.org http://blog.kejser.org @thomaskejser Super Scaling SQL Server Diagnosing and Fixing Hard Problems
    2. 2. Thomas Kejser • Formerly SQLCAT • Tuning SQL Server since 6.5 • 15+ Years of database experience • http://blog.kejser.org • CTO Fusion-io Europe
    3. 3. Image(s): FreeDigitalPhotos.net VS. VS.
    4. 4. Performance vs. Scalability • Response Time • Resource Use • Adding more of a HW resource makes things faster • You can scale without having performance (ex: Hadoop) • You can perform without having scalability (ex: In Memory Engines)
    5. 5. Our Reasonably Priced Server • 2 Socket Xeon E3645 • 2 x 6 Cores • 2.4Ghz • NUMA enabled, HT off • 12 GB RAM • 1 ioDrive2 Duo • 2.4TB Flash • 4K formatted • 64K AUS • 1 Stripe • Power Save Off • Win 2008R2 • SQL 2012 Image Source: DeviantArt
    6. 6. Between disk and Memory [Figure: four cores, each with its own L1 and L2 cache, sharing one L3 cache; latency scale running through 1ns, 10ns, 100ns, 10us, 100us and 10ms]
    7. 7. The “cache out curve” Data Size Throughput/thread Cache Size Service Time + Wait Time
    8. 8. NUMA Nodes CPU L 3 L 2 L 2 C C CPU L 3 L 2 L 2 C C Can I write? Bus Transfer Bus Transfer
    9. 9. There are several of these curves Throughput Touched Data Size CPU Cache TLB NUMA Remote Storage
    10. 10. Response time = Service Time + Wait Time Algorithms and Data Structures “Bottlenecks”
    11. 11. • DBA tasks • Installation of OS and SQL • Basic Memory Configuration • Basic Perfmon style monitoring • Backup/Restore and HA setup • Basic reading a Query Plan • Basic understanding of database structures • Adding Indexes to tables • Running a Profiler trace What you ALREADY know
    12. 12. Below the Surface
    13. 13. What we Need • Free tools from MS • Windows SDK • In Win8: The “ADK” • Need .NET 4 to install
    14. 14. Where Did the Time Go? Xperf -on Base -f Base.etl SELECT TOP 100000 * FROM LINEITEM INNER JOIN ORDERS ON O_ORDERKEY = L_ORDERKEY SQLCMD -E -S. -i "Select.sql" Xperf -stop
    15. 15. BASE profile with xperf Service Time + Wait Time
    16. 16. Right Click – Summary Table Service Time + Wait Time
    17. 17. What exactly is SQLNCLI? Service Time + Wait Time
    18. 18. Quantifying just how stupid XML is SELECT TOP 1000000 * FROM ORDERS JOIN LINEITEM ON L_ORDERKEY = O_ORDERKEY FOR XML RAW ('OUTPUT') Xperf –on Base –f Base.etl With XML “Native” Format
    19. 19. Which CPU cycles are Expensive? “App” tier Web Server Licensing >3K USD Blades Database Tier Core Licensing >10K USD <XML> ? Service Time + Wait Time
    20. 20. • What about the time INSIDE the process? • What if the EXE won’t tell us? Diving even Deeper
    21. 21. What is a Debug Symbol? mov ax,10 mov bx,20 mov cx,3 push ax push bx push cx call <address> <address> push bp mov bp,sp mov ax,[bp+8] mov bx,[bp+6] mov cx,[bp+4] add ax,bx div cx mov dx,ax ret HeaderdoStuff(10,20,3) … int doStuff(int a, int b, int c) { return (a + b) / c } myProg.exe Machine Code <address> = doStuff Symbol table myProg.pdb Service Time + Wait Time
    22. 22. Where do you get PDB files? • Public Symbol Server • Configure Environment: _NT_SYMBOL_PATH=SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols and _NT_SYMCACHE_PATH=C:\SymCache • Dbghelp.dll
    23. 23. • Auto Generated by Visual Studio: Your Own Debug Symbols Service Time + Wait Time
    24. 24. Adding and Checking Your Symbols • Symbols are indexed. You have to add them: Cd Bin/x64/Release/ symstore add /f *.pdb /s C:/Symbols /t "MyExe" • Validate that the symbols can resolve: Cd Bin/x64/Release/ symchk MyExe.exe /V
    25. 25. Got .NET and x64? • Standard Xperf works fine for your own native code • BUT: Before Windows 8, stack walking is broken for x64 .NET • If you have .NET with 64-bit code, you must NGEN first: Ngen install Bin/x64/Release/MyExe.exe (ngen lives here: %Windir%\Microsoft.NET\framework64\<Version>\Ngen.exe)
    26. 26. • Free tool from MS: .NET tracing is a pain, get a tool! • Not to be confused with xperfview • Same trace API and file format • Helps set obscure .NET specific trace flags Service Time + Wait Time
    27. 27. And Finally, You can do Very Cool Things Did I tell you about interlocked operations?... Whiteboard time! Service Time + Wait Time
    28. 28. • Consider again our LINEITEM table What is SQL Server REALLY doing? • How expensive is it to read from that? • Think ETL code and DW/BI queries CREATE TABLE LINEITEM ( [L_ORDERKEY] [int] NOT NULL, [L_PARTKEY] [int] NOT NULL, [L_SUPPKEY] [int] NOT NULL, [L_LINENUMBER] [int] NOT NULL, [L_QUANTITY] [decimal](15, 2) NOT NULL, [L_EXTENDEDPRICE] [decimal](15, 2) NOT NULL, [L_DISCOUNT] [decimal](15, 2) NOT NULL, [L_TAX] [decimal](15, 2) NOT NULL, [L_RETURNFLAG] [char](1) NOT NULL, [L_LINESTATUS] [char](1) NOT NULL, [L_SHIPDATE] [date] NOT NULL, [L_COMMITDATE] [date] NOT NULL, [L_RECEIPTDATE] [date] NOT NULL, [L_SHIPINSTRUCT] [char](25) NOT NULL, [L_SHIPMODE] [char](10) NOT NULL, [L_COMMENT] [varchar](44) NOT NULL ) BigSmall Small Big OLTP BI/DW Simulation ETL Service Time + Wait Time
    29. 29. SQLCMD – Native code Test SQLCMD.EXE Where does the time go? Service Time + Wait Time
    30. 30. Standard Reading of Data xperf -on base -stackwalk profile -f stackwalk.etl SQLCMD -S. -dSlam -E -Q"SELECT * FROM LINEITEM_tpch" (55 sec) xperf -stop xperf -merge stackwalk.etl stackwalkmerge.etl
    31. 31. Details of the Time – Padding? Service Time + Wait Time
    32. 32. More Details – Conversion Work?
    33. 33. An Educated guess about improvements CREATE TABLE [dbo].[LINEITEM_native]( [L_ORDERKEY] [int] NOT NULL, [L_PARTKEY] [int] NOT NULL, [L_SUPPKEY] [int] NOT NULL, [L_LINENUMBER] [int] NOT NULL, [L_QUANTITY] money NOT NULL, [L_EXTENDEDPRICE] money NOT NULL, [L_DISCOUNT] money NOT NULL, [L_TAX] money NOT NULL, [L_RETURNFLAG] int NOT NULL, [L_LINESTATUS] int NOT NULL, [L_SHIPDATE] int NOT NULL, [L_COMMITDATE] int NOT NULL, [L_RECEIPTDATE] int NOT NULL, [L_SHIPINSTRUCT] [char](25) NOT NULL, [L_SHIPMODE] int NOT NULL, [L_COMMENT] char(44) NOT NULL ) CREATE TABLE [dbo].[LINEITEM]( [L_ORDERKEY] [int] NOT NULL, [L_PARTKEY] [int] NOT NULL, [L_SUPPKEY] [int] NOT NULL, [L_LINENUMBER] [int] NOT NULL, [L_QUANTITY] [decimal](15, 2) NOT NULL, [L_EXTENDEDPRICE] [decimal](15, 2) NOT NULL, [L_DISCOUNT] [decimal](15, 2) NOT NULL, [L_TAX] [decimal](15, 2) NOT NULL, [L_RETURNFLAG] [char](1) NOT NULL, [L_LINESTATUS] [char](1) NOT NULL, [L_SHIPDATE] [date] NOT NULL, [L_COMMITDATE] [date] NOT NULL, [L_RECEIPTDATE] [date] NOT NULL, [L_SHIPINSTRUCT] [char](25) NOT NULL, [L_SHIPMODE] [char](10) NOT NULL, [L_COMMENT] [varchar](44) NOT NULL, ) Before After Service Time + Wait Time
    34. 34. Getting Rid of Useless Work Additional parameters for SQLCMD: -a32767 -W -s";" -f437 x1.5 Service Time + Wait Time
    35. 35. Unicode – 10% overhead? Service Time + Wait Time
    36. 36. Let's try that with Native and Unicode … x5
    37. 37. • SQLNCLI is one of these in disguise • ODBC • OLEDB • Pick good data types • MONEY over NUMERIC • UNICODE of data arrives like this • Native protocols vs. flexibility Summary Moving Data
    38. 38. Summary – Xperf • Get • Windows 8 ADK • Windows 7 SDK • Set up Symbol Paths • Xperf -on Base • Standard trace for time, narrow to process and DLL/EXE • Xperf -on Base -stackwalk Profile • Get to the call stack, find the offending function(s) • Ease of use for .NET: perfview.exe
    39. 39. Response time = Service Time + Wait Time
    40. 40. Introducing TPC-H Service Time + Wait Time
    41. 41. Loop Join • n row B-tree: log(n) reads per seek • m row result (example keys: 1, 43, 13, 7, 3) • Complexity: O(m * log(n))
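The O(m * log(n)) cost of the loop join can be sketched in a few lines of Python. This is a minimal illustration, not SQL Server's implementation: a sorted Python list searched with `bisect` stands in for the B-tree (an assumption), and `loop_join` is a hypothetical name.

```python
from bisect import bisect_left

def loop_join(outer_keys, inner_sorted_keys):
    """Index nested-loop join: one binary search (log n) per outer row."""
    matches = []
    for key in outer_keys:                           # m iterations
        pos = bisect_left(inner_sorted_keys, key)    # O(log n) "seek"
        if pos < len(inner_sorted_keys) and inner_sorted_keys[pos] == key:
            matches.append(key)
    return matches

inner = [1, 3, 7, 13, 43, 50]   # sorted stand-in for the n row B-tree (assumed data)
outer = [1, 43, 13, 7, 3]       # the m row input, keys as shown on the slide
result = loop_join(outer, inner)
```

Every outer row pays its own seek, which is why the total grows as m times log(n) even when the inner structure is fully cached.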
    42. 42. Linked List vs. Tree [Figure: a linked list of n nodes, searched in up to n steps, next to a balanced binary tree over the same keys, searched in log2(n) steps]
    43. 43. Cluster on O_ORDERKEY Index on O_ORDERKEY Basic argument for Cluster Indexes Service Time + Wait Time CREATE UNIQUE CLUSTERED INDEX CIX_Key ON ORDERS_Cluster (O_ORDERKEY) WITH (FILLFACTOR = 100) SELECT * FROM ORDERS_Cluster WHERE O_ORDERKEY = 3000000 CREATE UNIQUE INDEX IX_Key ON ORDERS_Heap (O_ORDERKEY) WITH (FILLFACTOR = 100) SELECT * FROM ORDERS_Heap WHERE O_ORDERKEY = 3000000 Table 'ORDERS_Heap'. Scan count 0, logical reads 3 , physical reads 0, read-ahead reads 0 Table 'ORDERS_Cluster'. Scan count 0, logical reads 4 , physical reads 0, read-ahead reads 0
    44. 44. Cluster on O_ORDERKEY heap + Index on O_ORDERKEY But what if we do this a lot? CREATE INDEX IX_Customer ON ORDERS_Cluster (O_CUSTKEY) WITH (FILLFACTOR = 100) CREATE INDEX IX_Customer ON ORDERS_Heap (O_CUSTKEY) WITH (FILLFACTOR = 100) SELECT * FROM ORDERS_Heap WHERE O_CUSTKEY = 47480 SELECT * FROM ORDERS_Cluster WHERE O_CUSTKEY = 47480 Table 'ORDERS_Cluster'. Scan count 1 , logical reads 27, physical reads 0 Table 'ORDERS_Heap'. Scan count 1 , logical reads 11, physical reads 0 Service Time + Wait Time
    45. 45. How many LOOP joins/sec/core? 7 Sec Service Time + Wait Time
    46. 46. What did we just measure? Xperf –on Base –stackwalk profile About 40%... Service Time + Wait Time
    47. 47. • The query language itself • Why so many ExecuteStmt? • …With so much CPU use? What is sqllang.dll? Service Time + Wait Time
    48. 48. A different way to Measure Loops 1 Sec Service Time + Wait Time
    49. 49. VS. What does THAT look like? Takeaway: The T-SQL language itself is expensive Service Time + Wait Time
    50. 50. • Sample from LINEITEM • Force loop join with index seeks • Do 1.4M seeks Test: Singleton Row Fetch
    51. 51. Singleton seeks – Cost of compression Compression Seek (1.4M seeks) CPU Load None - Memory 13 sec 100% one core PAGE - Memory 24 sec 100% one core None – I/O 21 sec 100% one core PAGE – I/O 32 sec 100% one core Function % Weight CDRecord::LocateColumnInternal 0.82% DataAccessWrapper::DecompressColumnValue 0.47% SearchInfo::CompareCompressedColumn 0.28% PageComprMgr::DecompressColumn 0.24% AnchorRecordCache::LocateColumn 0.18% ScalarCompression::AddPadding 0.04% ScalarCompression::Compare 0.11% Additional Runtime of GetNextRowValuesInternal 0.14% Total Compression 2.28% Total CPU (single core) 8.33% Compression % 27.00% xperf –on base –stackwalk profile
    52. 52. Modern CPU CPU L3 Cache 4MB Inst Cache 32KB Core Data Cache 32KB L2 Uni Cache 256K Inst Cache 32KB Core Data Cache 32KB L2 Uni Cache 256K Bus Service Time + Wait Time
    53. 53. The B+ Tree Service Time + Wait Time B+ Tree
    54. 54. Hekaton Style “Loop” Lookup Table (hash) Service Time + Wait Time
    55. 55. Merge Join [Figure: an m row result and an n row result, both sorted, merged on matching keys] Complexity: O(m + n)
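A merge join over two sorted inputs can be sketched as a two-pointer walk, which is where the O(m + n) bound comes from. The sketch below assumes the right input has unique keys (as ORDERS does on O_ORDERKEY) while the left input may repeat keys; `merge_join` and the sample keys are illustrative only.

```python
def merge_join(left, right):
    """Merge join: left is sorted (duplicates allowed), right is sorted and unique."""
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i += 1                      # left key has no partner, skip it
        elif left[i] > right[j]:
            j += 1                      # right key is behind, advance it
        else:
            joined.append((left[i], right[j]))
            i += 1                      # keep j: the next left row may share the key
    return joined

pairs = merge_join([1, 1, 2, 3, 43], [1, 2, 3, 4, 43])
```

Each loop iteration advances one of the two pointers, so the loop runs at most m + n times and never revisits a row.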
    56. 56. Merge Join – What is Fastest? Service Time + Wait Time SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE) FROM LINEITEM INNER MERGE JOIN ORDERS ON O_ORDERKEY = L_ORDERKEY …or SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE) FROM ORDERS INNER MERGE JOIN LINEITEM ON O_ORDERKEY = L_ORDERKEY
    57. 57. Comparing the Query Plans Service Time + Wait Time
    58. 58. Digging in Deeper Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0 , lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'ORDERS'. Scan count 1, logical reads 22162, physical reads 0, read-ahead reads 0 , lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'LINEITEM'. Scan count 1, logical reads 104522, physical reads 0, read-ahead reads 0 , lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 3265 ms, elapsed time = 3357 ms. Table 'LINEITEM'. Scan count 1, logical reads 104522, physical reads 0, read-ahead reads 0 , lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'ORDERS'. Scan count 1, logical reads 22162, physical reads 0, read-ahead reads 0 , lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 2469 ms, elapsed time = 2607 ms. Service Time + Wait Time
    59. 59. We can beat SQL Server at this game SELECT MAX(O_ORDERDATE), MAX(MAX_P) FROM (SELECT L_ORDERKEY,MAX(L_PARTKEY) AS MAX_P FROM LINEITEM GROUP BY L_ORDERKEY) b INNER MERGE JOIN ORDERS ON O_ORDERKEY = b.L_ORDERKEY Service Time + Wait Time
    60. 60. Hash Join • m row result (example keys: 1, 43, 13, 7, 3) • n row join table, hashed (e.g. Hash(1)) into an n row hash table • Complexity: O(m + 2n)
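The build and probe phases of a hash join can be sketched as follows. This is a simplified in-memory illustration using a Python dict as the hash table; `hash_join` and the payload values are made up for the example, and spilling to disk is not modeled.

```python
def hash_join(probe_keys, build_rows):
    """Hash join: build a hash table on the n rows, then probe it with the m rows."""
    table = {}
    for key, payload in build_rows:              # build phase: one pass over n rows
        table.setdefault(key, []).append(payload)
    joined = []
    for key in probe_keys:                       # probe phase: one lookup per m row
        for payload in table.get(key, ()):
            joined.append((key, payload))
    return joined

build = [(1, "a"), (3, "b"), (7, "c"), (13, "d"), (43, "e")]   # assumed sample rows
joined = hash_join([1, 43, 13, 7, 3], build)
```

The probe side is never sorted or indexed; the price is the memory needed to hold the whole build-side table.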
    61. 61. When Hash Joins hurt you [Chart: Runtime (seconds) against Hash Memory (MB), with a region marked "Spill Zone!"]
    62. 62. Hash Joins Don’t Scale in MSSQL
    63. 63. The Bottleneck Curve
    64. 64. ACCESS_METHODS_DATASET_PARENT: “Used to synchronize child dataset access to the parent dataset during parallel operations.” Books Online Story… Image: FreeDigitalPhotos.net
    65. 65. Using XPERF to find documentation xperf -on base+cswitch+dispatcher -stackwalk profile+readythread+cswitch
    66. 66. Let's dig in… xperf -on base -stackwalk profile -f stackwalk.etl
    67. 67. What LATCH pattern do we see? GetNextRangeForChildScan Inside: TableScanNew
    68. 68. • Partition the table by a “random” value • Modulo the Key for example • Use SQL Server partition function/schema The Fix?… 0 1 2 3 4 5 6 253 254 255 hash
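The "modulo the key" idea can be sketched in Python to show how sequential keys spread across partitions. The count of 256 follows the 0 to 255 buckets on the slide; `partition_of` is a hypothetical helper, not a SQL Server partition function.

```python
from collections import Counter

PARTITIONS = 256  # bucket numbers 0..255, as drawn on the slide

def partition_of(key, partitions=PARTITIONS):
    """Map a key to a partition number by taking it modulo the partition count."""
    return key % partitions

# Sequential keys, which would otherwise all hit the same end of the table,
# are spread evenly across every partition.
counts = Counter(partition_of(k) for k in range(2560))
```

With 2560 sequential keys every one of the 256 buckets receives the same number of rows, so concurrent scans or inserts no longer queue on a single structure.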
    69. 69. Closer…
    70. 70. …But no Cigar
    71. 71. What is the Problem here?
    72. 72. Anti Scale Patterns
    73. 73. CPU Caches [Chart: Million Pages/sec against Size of Accessed memory (MB); series: Random Pages, Sequential Pages, Single Page]
    74. 74. Example, Column Stores • Goals: Compressed; Prefetch Friendly; Cache Resident Code • Product (ID, Value): 1 Beer; 2 Beer; 3 Vodka; 4 Whiskey; 5 Whiskey; 6 Vodka; 7 Vodka • Customer (ID, Customer): 1 Thomas; 2 Thomas; 3 Thomas; 4 Christian; 5 Christian; 6 Alexei; 7 Alexei • Date (ID, Date): 2011-11-25 for each of IDs 1 to 7 • Sale (ID, Sale): 1 2 GBP; 2 2 GBP; 3 10 GBP; 4 5 GBP; 5 5 GBP; 6 10 GBP; 7 10 GBP
    75. 75. Compression is Easy • Range form: Product’ (ID, Value): 1-2 Beer; 3 Vodka; 4-5 Whiskey; 6-7 Vodka • Customer’ (ID, Customer): 1-3 Thomas; 4-5 Christian; 6-7 Alexei • Date’ (ID, Date): 1-7 2011-11-25 • Sale’ (ID, Sale): 1-2 2 GBP; 3 10 GBP; 4-5 5 GBP; 6-7 10 GBP • Run-length (RL) form: Product’ (RL, Value): 2 Beer; 1 Vodka; 2 Whiskey; 2 Vodka • Customer’ (RL, Customer): 3 Thomas; 2 Christian; 2 Alexei • Date’ (RL, Date): 7 2011-11-25 • Sale’ (RL, Sale): 2 2 GBP; 1 10 GBP; 2 5 GBP; 2 10 GBP
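The run-length (RL) columns on this slide can be reproduced with a short Python sketch. It uses `itertools.groupby` on the Product and Sale columns from the example; the function names are illustrative and this is not how any particular column store engine is implemented.

```python
from itertools import groupby

def run_length_encode(column):
    """Collapse consecutive equal values into (run length, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(column)]

def run_length_decode(runs):
    """Expand (run length, value) pairs back into the original column."""
    return [value for length, value in runs for _ in range(length)]

product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
sale = ["2 GBP", "2 GBP", "10 GBP", "5 GBP", "5 GBP", "10 GBP", "10 GBP"]

product_rl = run_length_encode(product)
sale_rl = run_length_encode(sale)
```

Note that Vodka appears twice in the encoded Product column: run-length encoding only merges neighbouring rows, so sort order decides how well a column compresses.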
    76. 76. Squeezing it even more RL Value 2 Beer 1 Vodka 2 Whiskey 2 Vodka Product’ RL Value 2 1 1 2 2 3 2 2 Product’ Beer = 1 Vodka = 2 Whiskey = 3 ID Value 1-2 Beer 3-3 Vodka 4-5 Whiskey 6-7 Vodka Product’ 4+4+4+2 = 14B + 4+4+5+2 = 15B + 4+4+7+2 = 17B + 4+4+5+2 = 15B = 61B 4+4+2 = 10B + 4+5+2 = 11B + 4+7+2 = 13B + 4+5+2 = 11B = 45B 4+4 = 8B + 4+4 = 8B + 4+4 = 8B + 4+4 = 8B = 32B RL Value 2 0x01 1 0x10 2 0x11 2 0x10 Product’ 4 = 4B + 4 = 4B + 4 = 4B + 4 = 4B + 4 x 2b = 2B = 18B Service Time + Wait Time
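The dictionary step on this slide (Beer = 1, Vodka = 2, Whiskey = 3) can be sketched in Python. Codes are assigned in order of first appearance, which happens to match the slide; `dictionary_encode` is a hypothetical helper and the byte accounting of the slide is not reproduced here.

```python
def dictionary_encode(runs):
    """Replace each value in (run length, value) pairs with a small integer code."""
    dictionary = {}
    encoded = []
    for length, value in runs:
        # First time a value is seen it gets the next code, starting at 1.
        code = dictionary.setdefault(value, len(dictionary) + 1)
        encoded.append((length, code))
    return dictionary, encoded

runs = [(2, "Beer"), (1, "Vodka"), (2, "Whiskey"), (2, "Vodka")]
dictionary, encoded = dictionary_encode(runs)

# Bits needed to store the largest code (3 distinct values fit in 2 bits).
bits_per_code = max(dictionary.values()).bit_length()
```

Once the values are small integers, each code needs only a couple of bits, which is the last squeeze the slide shows.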
    77. 77. RL Value 2 Beer 1 Vodka 2 Whiskey 2 Vodka RL Customer 3 Thomas 2 Christian 2 Alexei Product’ Customer’ 2 steps with Beer 2 steps with Thomas Beer Thomas Beer Thomas SELECT Product, Customer FROM Table 1 step with Vodka 1 step with Thomas Vodka Thomas 2 step with Whiskey 2 step with Christian Whiskey Christian Whiskey Christian 2 step with Vodka (Note: Repeated value) 2 step with Alexei Vodka Alexei Vodka Alexei Service Time + Wait Time
    78. 78. Hash Joining with Column Stores RL Key 2 Beer 1 Vodka 2 Whiskey 2 Vodka Table Key Type Beer Soft Vodka Strong Whiskey Strong Vodka Strong Dim Product SELECT … FROM Table JOIN DimProduct ON Key WHERE Type = ‘Strong’ 1 Compute bloom filter of Keys belonging to ‘strong’ 2 Read RL = 2, Beer from Table 3 Compute bloom value of Beer. 4 Equal to filter value from 1? Yes. Output two rows (RL=2) 5 Compute bloom value for Vodka 6 Equal to filter value from 1? No. Do nothing 7 Compute bloom value for Whiskey 8 Equal to filter value from 1? No. Do nothing Can pre fetch data (news RLE) Can calculate match/no match using only local CPU cache Wont work for OLTP! Service Time + Wait Time
    79. 79. Why is it so hard to get joins right? n m Time Loop Join Merge Join Hash Join Service Time + Wait Time
    80. 80. Desired Join Join Hint Query Hint LOOP [INNER | LEFT | CROSS | FULL] LOOP JOIN OPTION (LOOP JOIN) MERGE [INNER | LEFT | CROSS | FULL] MERGE JOIN OPTION (MERGE JOIN) HASH [INNER | LEFT | CROSS | FULL] HASH JOIN OPTION (HASH JOIN) LOOP with Seek WITH FORCESEEK WITH ( INDEX (index = <name>) ) N/A Controlling Joins Note: Join hints force the order of the ENTIRE join tree! Service Time + Wait Time
    81. 81. What Type of Workload? BigSmall Small Big DataReturned Data Touched OLTP BI/DW Simulation ETL Service Time + Wait Time
    82. 82. How to Classify? OLTP BI/DW Simulation ETL Full Scan/sec Range Scans/sec Probe Scans/sec Index Search/sec Range Scans/sec Full Scan/sec Range Scans/sec Bulk Copy Rows/sec ?
    83. 83. There should ALWAYS be a fully indexed path to the data. OLTP System Basic Query Pattern BigSmall Small Big OLTP BI/DW Simulation ETL Service Time + Wait Time
    84. 84. 1. Find worst CPU consuming query with sys.dm_exec_query_stats 2. Add OPTION (LOOP JOIN) to offending query 3. Check estimated query plan 4. If table spool found: add index to remedy and GOTO 3 5. Happy? If not, GOTO 1 The Super Quick OLTP Tuning Guide Service Time + Wait Time
    85. 85. The query will not be (much) worse than a full scan of a fact partition DW/BI System Basic Query Pattern BigSmall Small Big OLTP BI/DW Simulation ETL Service Time + Wait Time
    86. 86. The Super Quick DW tuning Guide 1. Find offending query 2. Add OPTION (HASH JOIN) to query 3. Do dimension tables have an indexed path to build the hash? If not, add index 4. Do you get a fact table scan and hash build of all dimensions? If not, check statistics (especially on facts and skewed) 5. Optimize Fact table scans 1. Partition and partition elimination 2. Column store if you have it 3. Aggregate Views 4. Bitmap index pushdown (statistics!) 5. Composite indexes (last resort!)
    87. 87. The expected DW Query Plan Partial Aggregate Fact CSI Scan Dim Scan Dim Seek Batch Build Batch Build Hash Join Hash Join HashStream Aggregate
    88. 88. • At least enough RAM to hold the hash tables of the largest dimension • De-normalisation helps… a LOT • Especially for the large/large joins • Likely: need to scan fast from disk if RAM is not big enough to hold the fact • Compression REALLY matters Things that Follow from desired DW Plan Service Time + Wait Time
    89. 89. Coffee Break
    90. 90. Response time = Service Time + Wait Time
    91. 91. Where EVERY Server wide diagnosis starts SELECT * FROM sys.dm_os_wait_stats WHERE wait_type NOT IN (SELECT wait_type FROM #ignorewaits) AND waiting_tasks_count > 0 ORDER BY wait_time_ms DESC Service Time + Wait Time
    92. 92. • Shows up as waits for PAGEIOLATCH • You can dig into details with: Common Problems - PAGEIO Service Time + Wait Time SELECT * FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL) • Can also Xevent your way to it per query CREATE EVENT SESSION [TraceIO] ON SERVER ADD EVENT sqlserver.file_read_completed( ACTION (sqlserver.database_id,sqlserver.session_id))
    93. 93. • I/O, like memory, is a GLOBAL resource for the machine • When does it make sense to partition a global resource? • When you deeply know the workload • When the workload is ALREADY partitioned • When neither of those are true: DON’T partition • If you have NAND/SSD – Why bother? The general I/O Guidance Service Time + Wait Time
    94. 94. A good way to Think of Spindle I/O
    95. 95. JBOD SAME LUN Seq. LUN Seq. LUN Seq. RAID system Large LUN Seq. Seq. Seq. RANDOM I/O Service Time + Wait Time
    96. 96. Stripe vs. Concatenation RAID 10 RAID 10 Concatenated LUN RAID 10 RAID 10 Striped LUN Service Time + Wait Time
    97. 97. OLTP • One big SAME setup • data files • Tempdb • Dedicate • Transaction log • DRAM: • Enough to hold most of DB Data Warehouse • JBOD setup • Data Files • 1-2 per LUN • SAME setup • Tempdb • Dedicate • Transaction Log • DRAM: • Enough to hold largest partition of largest table Rules of Thumb – Spindle I/O and DRAM Service Time + Wait Time
    98. 98. • Short Stroking • Elevator Sort • Sequential vs. Random • Weaving You can do a bit better… or worse Service Time + Wait Time
    99. 99. • Intentionally use lower % of total space • Tradeoff: • Space for Speed • Test: • 15K rpm • SAS spindle • 300GB Short Stroking Disks 150 200 250 300 350 400 0% 20% 40% 60% 80% 100% IOPS % Capacity Used Service Time + Wait Time
    100. 100. Why does Short Stroking Work? [Figure: Full Stroked vs. Short Stroked] Disks are typically consumed “from the outside in”. If partitions don’t use the full disk size, the disk won’t use the full platter either. The result: less head movement
    101. 101. Adding Elevator Sorting 0 200 400 600 800 1000 1200 0 100 200 300 400 500 600 Full Stroke Random Outer Short Inner Short Elevator Sort Elevator Short Stroked Latency IOPS 8K random I/O IOPS Avg. Latency Max Latency Bat powered disk!
    102. 102. Why Chase Sequential I/O? 0 10 20 30 40 50 60 70 80 1 10 100 1000 10000 100000 Sequential Full Stroke Random Latency(ms) Log(IOPS) 8K Block Pattern IOPS Avg Latency Max Latency Service Time + Wait Time
    103. 103. • One SATA disk • Two partitions • One file on each • Sequential read on each file But all is not well! File1 File2 Service Time + Wait Time
    104. 104. I/O Weaving in action 0 2 4 6 8 10 12 14 16 18 0 50 100 150 200 250 300 64K Random 64K Dual Sequential Latency(ms) IOPS IOPS Avg Latency Source: Michael Anderson Service Time + Wait Time
    105. 105. Storage Pool and Weaving DataLog DataLog DataLog Massive, then Provisioned Pool Seq Ran Seq Ran Seq Ran RANDOM! Service Time + Wait Time
    106. 106. The SAN will properly handle Sharing! Green: Checkpoint, Red: tx/sec, Black: Disk Latency Service Time + Wait Time
    107. 107. Numbers to Remember - Spindles Characteristic Typical Units Throughput / Bandwidth 90-125MB/sec But ONLY if sequential access! Operations per Sec 10K RPM Spindle: 100-130 IOPS 15K RPM Spindle: 150-180 IOPS Can get about 2x if short stroking (more later) Latency 3-5ms (compare DRAM: 100ns) Capacity 100s of GB to single digit TB 2012 numbers, will change in future Service Time + Wait Time
    108. 108. • Few hundreds of IOPS • Faster if short stroked • Trade latency for speed with elevator sort • Sequential is hard to get right Summary so far.. Single Disk Service Time + Wait Time
    109. 109. • Wider Stripes neat • But scale not linear • Very deep queues help • But add latency • Shared Components Why does a big RAID pile not solve this? Service Time + Wait Time
    110. 110. RAID Scale? Your Mileage WILL vary with the hardware
    111. 111. Before After Getting rid of Sharing Switch HBA HBA HBA HBA Storage Port Storage Port Switch LUN LUN Cache Disk CPU Switch HBA HBA HBA HBA Storage Port Storage Port Switch LUN LUN Cache Disk CPU x2
    112. 112. 4K PN N NAND Flash Basics 112 PN N Oxide Layer Floating Gate Electrons trapped Control Gate NAND Die Pack Blocks 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K 4K PN N PN NPN N PN NPN N PN NPN N Pages
    113. 113. NAND Flash Problems • Erase Cycles • Around 100K • Rebalancing and reclaim/trim • Voltage measurement • Gets worse with density • Changes over time • Depends on how you program • Bit Rot • Must refresh even on read • SLC easier to manage than MLC • But much more expensive! 113 Voltage 00 01 10 11
    114. 114. Lessons Learned: Try to Avoid Sharing BAD BETTER BEST Service Time + Wait Time
    115. 115. The Network
    116. 116. • Only partially diagnosed as waits in sys.dm_os_wait_stats • Task Manager gives a bit more information • Need: transparency to the deep level latencies and packets! Common Problems: ASYNC_NETWORK, OLEDB Service Time + Wait Time
    117. 117. A common Wait Type The database is really slow! The code takes forever to run! Service Time + Wait Time
    118. 118. • We may not always have insight into what is going on at the client… Xperf Diagnosing the Network xperf –on latency+network Summary Table Service Time + Wait Time
    119. 119. Timeline of the network Traffic
    120. 120. ASYNC_NETWORK_IO, the typical issue Service Time + Wait Time
    121. 121. Handling network is EXPENSIVE xperf –on latency ? Service Time + Wait Time
    122. 122. Short Story on DPC/ISR handling CPU Core Core L1-L3 Cache PCI BUS IRQ HALT execution Fire ISR Routine if (my interrupt) { <Mark Handled> Queue DPC } NIC Work Done DPC <Do work needed> <Wake Application> Core can run other stuff again Service Time + Wait Time
    123. 123. It looks like this… DPC ISR Service Time + Wait Time
    124. 124. • Option 1: Use the HW vendors tool • Option 2: Use interrupt Affinity Policy Tool from MS Setting Interrupt Affinity Service Time + Wait Time
    125. 125. • Standard Payload Network (MTU): • 1500 B • Jumbo Frames • 9014 B(MTU) Jumbo Frame and SQL Packets • Standard SQL payload • 4096 B • Largest • 32767 B SELECT session_id, net_packet_size FROM sys.dm_exec_connections Server=foo;Packet size=32767 Service Time + Wait Time
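A rough way to see why jumbo frames and the SQL packet size interact is to count how many network frames one SQL packet needs. The sketch below deliberately ignores IP/TCP/Ethernet header overhead and treats the MTU figures from the slide as pure payload, which is a simplifying assumption; `frames_needed` is a hypothetical helper.

```python
def frames_needed(packet_bytes, mtu_bytes):
    """Frames required to carry one packet, ignoring protocol header overhead."""
    return -(-packet_bytes // mtu_bytes)   # integer ceiling division

STANDARD_MTU = 1500     # standard payload, from the slide
JUMBO_MTU = 9014        # jumbo frame size, from the slide (varies by vendor)
DEFAULT_PACKET = 4096   # standard SQL payload
LARGEST_PACKET = 32767  # largest SQL packet size

default_on_standard = frames_needed(DEFAULT_PACKET, STANDARD_MTU)
largest_on_standard = frames_needed(LARGEST_PACKET, STANDARD_MTU)
largest_on_jumbo = frames_needed(LARGEST_PACKET, JUMBO_MTU)
```

Under these assumptions the largest SQL packet is split into far fewer frames on a jumbo network, which means fewer interrupts per byte moved.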
    126. 126. Single Threaded
    127. 127. Core Evolution Moore’s “Law”: “The number of transistors per square inch on integrated circuits has doubled every two years since the integrated circuit was invented”
    128. 128. • Never faster than a single core • Smaller servers are faster than bigger ones • Large L2 caches and more clock speed help • The algorithm dictates speed • Latency of Wait Time sets upper limit • Examples from MSSQL land: • Formula Engine in MSAS • Transaction Log Writes • INSERT/UPDATE/DELETE (as we shall see) Single Threaded
    129. 129. VLF files • When switching to a new VLF, it has to be “formatted” with an 8K sync write • While this happens, transactions are blocked • Too many VLFs = too much blocking • Lesson: Preallocate the database log file in big chunks • Up to 128 Log Buffers per database, each <= 60K • Spawned on demand, will not be released once spawned • Transactions will wait on LOGBUFFER if no buffer is available • Think of this like a pipeline of commits waiting… [Figure: VLF(1) to VLF(6), each starting with an 8K block]
    130. 130. Transaction Log Background Buffer Offset (cache line) LOGCACHE ACCESS Alloc Slot in Buffer MemCpy Slot Content Log Writer Writer Queue Async I/O Completion Port Slot 1 LOGBUFFER WRITELOG LOG FLUSHQ Signal thread which issued commit T0 Tn Slot 127 Slot 126
    131. 131. • Speed is determined by Latency and Code Path • Max Log Write Size: 60K Zooming to the Log Writer Log Writer Async I/O Completion Port Signal thread which issued commit Latency Writer Queue
    132. 132. Long Distance Replication… Log Entry Log Entry Network Log Entry Send log Ack Log Primary Secondary Write Write Executive Summary: The speed of light ( c ) is not fast enough!
    133. 133. • Perfmon will only show millisec • What if we want microseconds? Getting to the Real Latency xperf –on latency
    134. 134. It’s in Memory, so it must be fast? VS. Latency: 15-30us Latency: <5us RAM DISK 1.5sec 1.5sec
    135. 135. No, Because… This adds up to one core… it is doing all it can with the CPU it has
    136. 136. The Effect on UPDATE Naïve UPDATE MyBigTable SET c6 = 43 Parallel UPDATE MyBigTable SET c6 = 43 WHERE key BETWEEN 10**9 * n AND 10**9 * (n+1) -1CX Runtime (smaller is faster)
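The "Parallel" variant splits one big UPDATE into key ranges that separate connections can run at the same time. A small Python sketch can generate those statements; the table and column names come from the slide, while the helper name, the worker count and the bracketed `[key]` (added here because KEY is a reserved word in T-SQL) are assumptions.

```python
def range_updates(workers, chunk=10**9):
    """Build one UPDATE per worker, each covering its own non-overlapping key range."""
    statements = []
    for n in range(workers):
        low = chunk * n
        high = chunk * (n + 1) - 1          # inclusive upper bound for BETWEEN
        statements.append(
            f"UPDATE MyBigTable SET c6 = 43 WHERE [key] BETWEEN {low} AND {high}"
        )
    return statements

statements = range_updates(4)
```

Each statement would then be submitted on its own session; whether that helps depends on the ranges actually mapping to different pages or partitions.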
    137. 137. Multi Threaded
    138. 138. What is Scalable? 0 500 1000 1500 2000 2500 3000 0 4 8 12 16 20 24 Throughput Some Hardware Resource Good So so Bad We want to live here
    139. 139. Amdahl’s Law of gated speedup • P = Part of program that can be made Parallel (Note that this may be 0... or 1) • N = Number of CPU cores available • Speedup = 1 / ((1 - P) + P / N) [Chart: Speedup Factor against Number of cores (0 to 64) for P = 100%, 95%, 90% and 80%]
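Amdahl's law in its standard form is Speedup = 1 / ((1 - P) + P / N), with P and N as defined on the slide. A short Python sketch evaluates it for the P values charted here; `amdahl_speedup` is an illustrative name.

```python
def amdahl_speedup(p, n):
    """Speedup on n cores when a fraction p of the program can run in parallel."""
    return 1.0 / ((1.0 - p) + p / n)

# Speedup at 64 cores for the parallel fractions shown on the chart.
at_64_cores = {p: amdahl_speedup(p, 64) for p in (1.0, 0.95, 0.90, 0.80)}
```

Even with 95% of the work parallel, 64 cores give only about a 15x speedup, and no core count can push it past 1 / (1 - P) = 20x.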
    140. 140. Introducing Contention – Locks Table A Table B Table C INSERT TableA … INSERT TableB … INSERT TableC … LCK LCK LCK LCK LCK LCK LCK LCK Wait Stat: LCK_<X>
    141. 141. But those rows have to be stored… Table A Table B Table C LCK LCK LCK LCK LCK LCK LCK LCK Data File File Group
    142. 142. It all Starts with Wait Stats SELECT * FROM sys.dm_os_wait_stats WHERE wait_type NOT IN (SELECT wait_type FROM #ignorewaits) AND waiting_tasks_count > 0 ORDER BY wait_time_ms DESC DBCC PAGE
    143. 143. PFS – Hidden Single Page Contention Data File GAM/ SGAM PFS 64MB PFS PFS 64MB PFS 64MB PFS B B B B B B B B B B B B B B B B 8K 10010010 INSERT TableA … Allocated bit
    144. 144. Data File Data File Data File More Files Table A Table B Table C LCK LCK LCK LCK LCK LCK LCK LCK Data File File Group • Round Robin between files • More files, more structures • No affinity
    145. 145. How many more Files? 1 10 100 1000 10000 100000 1000000 10000000 260 280 300 320 340 360 380 400 0 8 16 24 32 40 48 PAGELATCH Runtime # Data Files Runtime PAGELATCH_UP
    146. 146. • Shared, physical MEMORY structures can cause bottlenecks (ex: PFS) • SQL Server must sync too… • Understanding where structure resides leads to tuning fix • Theory of engine! Concurrency: What we learned so far
    147. 147. CXPACKET • Commonly misdiagnosed • CXPACKET does NOT (always) mean that your DOP is “too high” [Charts: CXPACKET waits vs. throughput (MB/sec); throughput (MB/sec) vs. DOP]
    148. 148. CXPACKET = Issue may be elsewhere…
    149. 149. • What happens when you get things like: LATCH_<x> PAGELATCH_<x> Step 1: Dig into: Diagnosing Latches SELECT * FROM sys.dm_os_latch_stats Service Time + Wait Time
    150. 150. Digging into Latches Again…
    151. 151. Zooming into the Ready Thread
    152. 152. Post Fix Pattern GetNextRangeForChildScan GetNextRangeForChildScan GetNextRangeForChildScan
    153. 153. Speedup with Hash Partition of Heap • Before: 6GB/sec • After: 20GB/sec • This sometimes works on clustered indexes too… (whiteboard)
    154. 154. UPDATE Hotspot Page (8K) ROW ROW ROW LCK_U LCK_U PAGELATCH_EX
    155. 155. UPDATE Hack on Small Tables ALTER TABLE HotUpdates ADD Padding CHAR(5000) NOT NULL DEFAULT ('X') Before: Page (8K) ROW ROW ROW LCK_U LCK_U PAGELATCH_EX After: Page (8K) ROW CHAR(5000) LCK_U PAGELATCH_EX
    156. 156. Test: Updates of pages (L_QUANTITY is NOT NULL, i.e. in-place UPDATE)
        Compression   | Update 1.4M | CPU Load
        None – Memory | 13 sec      | 100% one core
        PAGE – Memory | 54 sec      | 100% one core
        None – I/O    | 17 sec      | 100% one core
        PAGE – I/O    | 59 sec      | 100% one core
    157. 157. UPDATE Compression burners
        Function | CPU %
        qsort | 0.86
        CDRecord::Resize | 0.84
        CDRecord::LocateColumnInternal | 0.36
        perror | 0.36
        Page::CompactPage | 0.36
        ObjectMetadata::`scalar deleting destructor' | 0.27
        SearchInfo::CompareCompressedColumn | 0.24
        CDRecord::InitVariable | 0.19
        CDRecord::LocateColumnWithCookie | 0.18
        memcmp | 0.16
        PageDictionary::ValueToSymbol | 0.16
        Record::DecompressRec | 0.14
        PageComprMgr::DecompressColumn | 0.14
        CDRecord::InitFixedFromOld | 0.1
        SOS_MemoryManager::GetAddressInfo64 | 0.08
        AnchorRecordCache::LocateColumn | 0.08
        CDRecord::GetDataForAllColumns | 0.08
        ScalarCompression::Compare | 0.07
        PageComprMgr::CompressColumn | 0.07
        Record::CreatePageCompressedRecNoCheck | 0.06
        memset | 0.05
        PageComprMgr::ExpandPrefix | 0.04
        PageRef::ModifyColumnsInternal | 0.04
        Page::ModifyColumns | 0.03
        DataAccessWrapper::ProcessAndCompressBuffer | 0.03
        SingleColAccessor::LocateColumn | 0.03
        CDRecord::BuildLongRegionBulk | 0.02
        ChecksumSectors | 0.02
        Page::MCILinearRegress | 0.02
        DataAccessWrapper::DecompressColumnValue | 0.02
        SOS_MemoryManager::GetAddressInfo | 0.02
        CDRecord::FindDiff | 0.02
        AnchorRecordCache::Init | 0.02
        PageComprMgr::CombinePrefix | 0.01
        Total | 5.17
        Out of 8.55 … Approx: 60%
    158. 158. Compression and Locks Xevent Trace Lock Acquire/Release High Res Timer
    159. 159. How long are locks held? [Chart: lock held cycle count in CPU KCycles (0 to 600), Avg and StdDev, for PAGE vs. NONE compression]
    160. 160. • Sharing is generally bad for scale (but may be good for performance) • PAGELATCH and LATCH diagnosis starts in sys.dm_os_latch_stats • CXPACKET • Only important if throughput drops when DOP goes up • If this happens, look for another wait/latch • Table partitioning can be used to work around concurrency issues Summary Concurrency – So Far..
    161. 161. The Paul Randal INSERT test 160M rows, executing at concurrency Commit every 1K: EASY tuning?
    162. 162. All is as Expected?
    163. 163. But Page Splits are Bad, right? = BAD! = Better!...
    164. 164. WRITELOG gone? Faster? ? ? sys.dm_os_wait_stats
    165. 165. And the Score Is… [Chart: time in seconds (0 to 35,000) for newguid(), newsequentialid(), IDENTITY]
    166. 166. What is going on here??? Min Min Min Min Min Min Min Min Min Min HOBT_ROOT Max
    167. 167. Tricks to Work Around this 0 -1000 1001 - 2000 2001 - 3000 3001 - 4000 INSERT INSERT INSERT INSERT
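One way to read the trick above is that every inserting session is handed its own key range, so concurrent INSERTs land on different pages instead of all hitting the last page of the B-tree. The sketch below illustrates that idea only. The function name `session_key_generator`, the `session_slot` numbering, and the half-open ranges (0 to 999, 1000 to 1999, ...) are assumptions made for this example and are not taken from the deck.

```python
def session_key_generator(session_slot, range_size=1000):
    """Yield keys that stay inside one session's private key range.

    session_slot: hypothetical 0-based number identifying the inserting session
    range_size:   how many keys each session owns before it needs a new range
    """
    if session_slot < 0 or range_size < 1:
        raise ValueError("session_slot must be >= 0 and range_size >= 1")
    start = session_slot * range_size
    # Keys start .. start + range_size - 1 belong to this session only.
    return iter(range(start, start + range_size))


if __name__ == "__main__":
    # Four sessions, each inserting into a different part of the key space.
    generators = [session_key_generator(slot) for slot in range(4)]
    for slot, gen in enumerate(generators):
        print(f"session {slot} inserts keys starting at {next(gen)}")
```

Because the ranges never overlap, no two sessions compete for the same key, which is the property the slide relies on to spread the insert hotspot.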
    168. 168. All Cores at 100% [Chart: runtime in seconds (0 to 35,000) for newguid(), newsequentialid(), IDENTITY, IDENTITY + Unique, IDENTITY + Unique + Hash8, IDENTITY + Hash24, IDENTITY + Hash48, SPID + Offset; annotations: 600K Inserts/sec, 830K Inserts/sec, All Cores at ~100%]
    169. 169. Takeaways, INSERT workload • Don’t use sequential keys • Page splitting isn’t so bad • Neither are GUIDs • Generate keys wisely, ideally in the app server • For “transparent” speedup, consider our old hash trick
    170. 170. • Minimally Logged • Single, large execution (thousands) • Unsorted data • Concurrent Loaders BULK INSERT Workload Heap Bulk Insert Bulk Insert
    171. 171. When does BULK INSERT scale break? Measure: SELECT * FROM sys.dm_os_latch_stats Observe waits on ALLOC_FREESPACE_CACHE Theory (just read BOL): “Used to synchronize the access to a cache of pages with available space for heaps and binary large objects (BLOBs). Contention on latches of this class can occur when multiple connections try to insert rows into a heap or BLOB at the same time. You can reduce this contention by partitioning the object.” [Chart: MB/sec (0 to 250) vs. number of concurrent BULK INSERTs (0 to 30), with markers 1, 2, 3]
    172. 172. What is Happening here? Free Page information (PFS/GAM/SGAM) HOBT Cache Fat Chunks Alloc new pages! Bulk Insert ALLOC_FREESPACE_CACHE This is in DRAM and L2
    173. 173. • Break Up table by “some key” • Optional: Switch out partitions • Spin up multiple bulks • Linear scale • 3GB/sec • 16M LINEITEM/sec Breaking Through the Bottleneck 425 555 215 200 101 453 666 Area Bulk Insert Bulk Insert Bulk Insert
    174. 174. BULK INSERT - Reloaded • Thomas, you might have gotten 16M rows/sec at 3GB/sec insert speed • But this was on heaps, I have a clustered table • Alright then, let us hit a clustered index Clustered and partitioned: 1-1000 1001-2000 2001-3000 3001-4000 X Lock X Lock X Lock X Lock
    175. 175. Cluster Bulking – It seemed so plausible! 1 2 3
    176. 176. Cluster Bulking – Stage and Switch 1 2 3
    177. 177. Coffee Break
    178. 178. SPIN LOCKS
    179. 179. What is a Spinlock? • Context switching is expensive • Typically 10K or more CPU cycles • If you expect the resource to be held only briefly, why fall asleep? spin_acquire(int* s) { while (*s == 1) { /* spin */ } *s = 1; } spin_release(int* s) { *s = 0; }
    180. 180. What is a backoff? • Acquire can be very expensive • SQL Server implements a backoff mechanism spin_acquire(int* s) { int spins = 0; while (*s == 1) { spins++; if (spins > threshold) { <Sleep and WaitForResource> } } *s = 1; } SELECT * FROM sys.dm_os_spinlock_stats DBCC SQLPERF(spinlockstats) Backoff
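The spin-then-back-off pattern from the pseudocode above can be tried out in a few lines. This is a sketch of the idea only, not how SQL Server implements it: the class name `BackoffSpinLock`, the threshold, and the sleep time are all made up, and a non-blocking `threading.Lock.acquire` stands in for the atomic test-and-set that a real spinlock needs.

```python
import threading
import time


class BackoffSpinLock:
    """Spin on a flag, and sleep once the spin count passes a threshold."""

    def __init__(self, spin_threshold=100, sleep_seconds=0.0001):
        # The non-blocking acquire below plays the role of an atomic test-and-set.
        self._flag = threading.Lock()
        self.spin_threshold = spin_threshold
        self.sleep_seconds = sleep_seconds
        self.backoffs = 0  # total backoffs, updated only while the lock is held

    def acquire(self):
        spins = 0
        backoffs = 0
        while not self._flag.acquire(blocking=False):
            spins += 1
            if spins > self.spin_threshold:
                time.sleep(self.sleep_seconds)  # back off instead of burning CPU
                backoffs += 1
                spins = 0
        self.backoffs += backoffs  # safe: we own the lock at this point

    def release(self):
        self._flag.release()


def run_demo(thread_count=4, iterations=500):
    """Increment a shared counter from several threads under the spinlock."""
    lock = BackoffSpinLock()
    total = [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            try:
                total[0] += 1
            finally:
                lock.release()

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0], lock.backoffs
```

Calling `run_demo()` returns the final counter and the number of backoffs taken, which is loosely comparable to the backoff count that the spinlock statistics report.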
    181. 181. Life at 600K INSERT/sec
    182. 182. WRITELOG is I/O – right? Should be the same as this… or? No! Because:
    183. 183. Diagnosing a Spinlock the Cool way! • Step 1: Copy sqlservr.pdb to the BINN directory • Step 2: DBCC TRACEON (3656, -1) • Step 3: Steal script from: http://www.microsoft.com/en-us/download/details.aspx?id=26666 Note for 2012, you additionally need: • sqlmin.pdb, sqllang.pdb, sqldk.pdb
    184. 184. Spinlock Walkthrough – Extended Events Script
        --Get the type value for any given spinlock type
        select map_value, map_key, name
        from sys.dm_xe_map_values
        where map_value IN ('SOS_CACHESTORE')

        --Create the event session that will capture the callstacks to a bucketizer
        create event session spin_lock_backoff on server
        add event sqlos.spinlock_backoff (
            action (package0.callstack)
            where type = 144 --SOS_CACHESTORE
        )
        add target package0.asynchronous_bucketizer (
            set filtering_event_name='sqlos.spinlock_backoff',
                source_type=1,
                source='package0.callstack')
        with (MAX_MEMORY=50MB, MEMORY_PARTITION_MODE = PER_NODE)

        --Run this section to measure the contention
        alter event session spin_lock_backoff on server state=start

        --Wait to measure the number of backoffs over a 1 minute period
        waitfor delay '00:01:00'

        --To view the data
        --1. Ensure the sqlservr.pdb is in the same directory as the sqlservr.exe
        --2. Enable this trace flag to turn on symbol resolution
        DBCC traceon (3656, -1)

        --Get the callstacks from the bucketizer target
        select event_session_address, target_name, execution_count, cast (target_data as XML)
        from sys.dm_xe_session_targets xst
        inner join sys.dm_xe_sessions xs on (xst.event_session_address = xs.address)
        where xs.name = 'spin_lock_backoff'

        --Clean up the session
        alter event session spin_lock_backoff on server state=stop
        drop event session spin_lock_backoff on server
    185. 185. Of course, you can just use 2012…
    186. 186. How to improve a spinlock? CPU Core Core L1-L3 Cache CPU Core Core L1-L3 Cache spin_acquire Int s spin_acquire Int s spin_acquire Int s Transfer cache line Transfer cache line CPU CPU
    187. 187. CoreInfo.Exe – where are my cores? CoreInfo.exe
    188. 188. Revisiting the TLOG Buffer Offset (cache line) LOGCACHE ACCESS Alloc Slot in Buffer MemCpy Slot Content Log Writer Writer Queue Async I/O Completion Port Slot 1 LOGBUFFER WRITELOG LOG FLUSHQ Signal thread which issued commit T0 Tn Slot 127 Slot 126
    189. 189. I/O Affinity Mask! sp_configure 'affinity I/O mask' [Chart: SPID + Offset vs. SPID + Affinity, scale 0 to 250]
    190. 190. Bulking at Concurrency • What’s that spin? xperf -on latency -stackwalk profile xperf -d trace.etl xperfview trace.etl SELECT * FROM sys.dm_os_spinlock_stats ORDER BY spins DBCC SQLPERF (spinlockstats) ?
    191. 191. SOS_OBJECT_STORE at high INSERT • Observed: This Spin happens when inserting • Need: Reduce locking overhead • Fixes that work well here: 8x throughput Bonus
    192. 192. Diagnosing another Spinlock • Let’s try something really silly: • Run lots of: EXEC emptyProc • This should be infinitely scalable, right? CREATE PROCEDURE emptyProc AS RETURN
    193. 193. Initial Diagnosis MUTEX ??? … what Mutex?
    194. 194. Using the Spinlock Script gives us Some cache Which one?
    195. 195. Validating the Theory CREATE PROCEDURE emptyProc0 AS RETURN GO CREATE PROCEDURE emptyProc1 AS RETURN GO … CREATE PROCEDURE emptyProc31 AS RETURN
    196. 196. What is the SOS_OBJECT_STORE? Security Check?
    197. 197. Validating the new “fix”…
    198. 198. Consider this Test harness code…
        DECLARE @ParmDef NVARCHAR(500)
        DECLARE @sql NVARCHAR(500)
        SET @sql = N'INSERT INTO dbo_<t>.MyBigTable_<t> WITH (TABLOCK) (c1, c2, c3, c4, c5, c6) VALUES (@p1, @p2, @p3, @p4, @p5, @p6)'
        SET @sql = REPLACE(@sql, '<t>', dbo.ZeroPad(@table, 3))
        SET @ParmDef = '@p1 BIGINT, @p2 DATETIME, @p3 CHAR(111), @p4 INT, @p5 INT, @p6 BIGINT'
        DECLARE @constDate DATETIME = '1974-12-22'
        DECLARE @i INT
        WHILE (1=1)
        BEGIN
            BEGIN TRAN
            SET @i = 1
            WHILE @i <= 1000
            BEGIN
                EXEC sys.sp_executesql @sql, @ParmDef, @p1 = 1, @p2 = @constDate, @p3 = 'x', @p4 = 42, @p5 = 7, @p6 = 13
                SET @i = @i + 1
            END
            COMMIT TRAN
        END
    199. 199. Spinning on MUTEX Diagnosing with the trace flag shows the offending spin stack: CSecurityContext::GetUserTokenFromCache This is REALLY expensive at scale: WHILE @i <= 1000 BEGIN EXEC sys.sp_executesql @sql, … SET @i = @i + 1 END It initializes a new execution context on every loop!
    200. 200. Fixing the MUTEX spin (4x throughput bonus)
        • Instead of:
            WHILE @i <= 1000 BEGIN EXEC sys.sp_executesql @sql, … SET @i = @i + 1 END
        • Write:
            SET @sql = N'
                DECLARE @i INT
                WHILE (1=1)
                BEGIN
                    BEGIN TRAN
                    SET @i = 1
                    WHILE @i <= 1000
                    BEGIN
                        INSERT INTO dbo_<t>.MyBigTable_<t> WITH (TABLOCK) (c1, c2, c3, c4, c5, c6) VALUES (@p1, @p2, @p3, @p4, @p5, @p6)
                        SET @i = @i + 1
                    END
                    COMMIT TRAN
                END'
            EXEC sys.sp_executesql @sql, @ParmDef, @p1 = 1, @p2 = @constDate, @p3 = 'x', @p4 = 42, @p5 = 7, @p6 = 13
    201. 201. • When all other bottlenecks are gone, sharing happens in the most unlikely places • You can use spinlock Xevents inside SQL Server • Remember symbol files in BINN • Trace flag 3656 • This can also be done in XPERF for non SQL apps • Ex: Analysis Services Concurrency, Spinlock Summary
    202. 202. Xperf controlling buffers • Control of buffers and NUMA for Xperf setting • By default: • 4MB mem • Spool to disk at root of C-drive • Can do buffer/file control: • -buffersize and -maxbuffers • -maxfile and -FileMode Circular
    203. 203. • Round robin between NUMA nodes • Inside the NUMA: Pick the one that looks the least busy • This is NOT a perfect system How SQL Server assigns threads
    204. 204. Super Xperf Xperf -on Latency+CSWITCH+DISPATCHER -stackWalk CSwitch+ReadyThread+ThreadCreate+Profile -BufferSize 1024 -MaxBuffers 1024 -MaxFile 1024 -FileMode Circular REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f
    205. 205. But does the Data Model Work? • All the tuning won’t help you if your model is wrong • Tuning gets you far, but to really scale, you need a good data model • This is what my other courses are about
    206. 206. &
    207. 207. Problem Statement Queue Structure Msg Msg Msg Msg Msg Ordered Push Pop 300B msg
    208. 208. The Naïve Approach • Push • Seek First Row • INSERT Row • Pop • Seek Last Row • DELETE/Output Key Max Msg Min Max Msg Min Msg
    209. 209. Why this doesn’t Scale! Min Min Min Min Min Min Min Min Min Min HOBT_ROOT Max
    210. 210. NextPrev Virtual Root LATCH HOBT_VIRTUAL_ROOT LCK PAGELATCH PAGELATCH PAGELATCH B-Tree Root Pages
    211. 211. Summarising the Problem • Hot stuff • Root • Min page • Max page • Intermediate pages • Alloc/Dealloc • BUT: We Must have order!
    212. 212. Cooling it down
    213. 213. What if… • Push • Seek first value page • UPDATE Reference Count • Pop • Seek last value page • UPDATE Reference Count Min Max Msg++ Min Max Msg--
    214. 214. Dissipate the Heat Min Msg-- Max Msg++ Min Msg-- Max Msg++ Min Msg-- Max Msg++ Last Digit = 0 Last Digit = 1 Last Digit = 2
    215. 215. Eliminating Thread Contention Queue Structure Ordered PushSequence++ PopSequence++ 8 7 6 5 4 VERY fast!
    216. 216. Ring Buffers Queue Structure Ordered PushSequence++ Mod 100 PopSequence++ Mod 100 Slot: 8 Msg: 108 Slot: 7 Msg: 107 Slot: 6 Msg: 106 Slot: 5 Msg: 105 Slot:4 Msg:104
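The ring buffer above maps an ever-increasing sequence number to a fixed slot with a modulo, so message 108 lands in slot 8 when there are 100 slots. The sketch below shows that mapping with an in-memory list standing in for the pre-allocated table rows. The class name `RingQueue` and its methods are hypothetical, and the sketch ignores concurrency entirely.

```python
class RingQueue:
    """Ordered queue over a fixed number of slots, addressed by sequence modulo."""

    def __init__(self, slot_count=100):
        if slot_count < 1:
            raise ValueError("slot_count must be at least 1")
        self.slots = [None] * slot_count  # pre-allocated; overwritten, never resized
        self.push_sequence = 0
        self.pop_sequence = 0

    def slot_for(self, sequence):
        """Map a sequence number to its slot, e.g. 108 -> 8 with 100 slots."""
        return sequence % len(self.slots)

    def push(self, message):
        if self.push_sequence - self.pop_sequence == len(self.slots):
            raise OverflowError("ring buffer is full")
        # Overwrite the slot in place: the analogue of UPDATE instead of INSERT.
        self.slots[self.slot_for(self.push_sequence)] = message
        self.push_sequence += 1

    def pop(self):
        if self.pop_sequence == self.push_sequence:
            raise IndexError("ring buffer is empty")
        slot = self.slot_for(self.pop_sequence)
        message = self.slots[slot]
        self.slots[slot] = None  # clear in place: the analogue of UPDATE instead of DELETE
        self.pop_sequence += 1
        return message
```

Because the slots are reused, the structure never grows or shrinks, which is what removes the allocation and deallocation hotspots listed earlier; the cost is a hard limit on how many messages can be in flight at once.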
    217. 217. Summing Up Message Queue Hack • UPDATE • instead of INSERT/DELETE • More partitions = More B-Trees • Ring buffer using modulo • Find Sweet spot concurrency