This document provides an overview of performance tuning and diagnosing hard problems in SQL Server. It discusses tools like Xperf and PerfView that can be used to analyze performance bottlenecks. Specific techniques include using indexes, avoiding unnecessary data conversions and padding, and understanding how query plan operators such as loop, merge, and hash joins work at a low level. The document demonstrates how to optimize queries through better data modeling, indexing, and rewriting queries to avoid expensive operations.
4. Performance vs. Scalability
(chart: response time vs. resource use)
Scalability: adding more of a HW resource makes things faster.
You can scale without having performance (ex: Hadoop).
You can perform without having scalability (ex: in-memory engines).
5. Our Reasonably Priced Server
• 2-socket Xeon E5645, 2 x 6 cores, 2.4GHz
• NUMA enabled, HT off
• 12 GB RAM
• 1 ioDrive2 Duo: 2.4TB flash, 4K formatted, 64K AUS, 1 stripe, power save off
• Win 2008R2
• SQL 2012
Image Source: DeviantArt
6. Between Disk and Memory
(diagram: four cores, each with its own L1 and L2 cache, sharing an L3; the latency ladder runs 1ns, 10ns, 100ns for the caches, 10us and 100us for flash, 10ms for disk)
7. The "cache out curve"
(chart: throughput per thread vs. touched data size; throughput drops sharply once the data no longer fits in the cache)
9. There are several of these curves
(chart: throughput vs. touched data size, with a drop at each boundary: CPU cache, TLB, NUMA remote, storage)
10. Response Time = Service Time + Wait Time
Service time is governed by algorithms and data structures; wait time by "bottlenecks".
11. What you ALREADY know
• DBA tasks
• Installation of OS and SQL
• Basic memory configuration
• Basic Perfmon-style monitoring
• Backup/restore and HA setup
• Basic reading of a query plan
• Basic understanding of database structures
• Adding indexes to tables
• Running a Profiler trace
13. What we Need
• Free tools from MS
• Windows SDK
• In Win8: the "ADK"
• Need .NET 4 to install
14. Where Did the Time Go?
xperf -on Base -f Base.etl
SQLCMD -E -S. -i "Select.sql"
-- Select.sql:
SELECT TOP 100000 *
FROM LINEITEM
INNER JOIN ORDERS
ON O_ORDERKEY = L_ORDERKEY
xperf -stop
18. Quantifying just how stupid XML is
xperf -on Base -f Base.etl
SELECT TOP 1000000 *
FROM ORDERS
JOIN LINEITEM
ON L_ORDERKEY = O_ORDERKEY
FOR XML RAW ('OUTPUT')
(chart: run time with XML vs. "native" format)
19. Which CPU cycles are Expensive?
• "App" tier: web server licensing, >3K USD blades
• Database tier: core licensing, >10K USD
• <XML> ?
20. Diving even Deeper
• What about the time INSIDE the process?
• What if the EXE won't tell us?
21. What is a Debug Symbol?
The source:
  int doStuff(int a, int b, int c)
  {
      return (a + b) / c;
  }
  ...
  doStuff(10, 20, 3);
The machine code in myProg.exe:
  ; the call site
  mov ax,10
  mov bx,20
  mov cx,3
  push ax
  push bx
  push cx
  call <address>
  ; at <address>
  push bp
  mov bp,sp
  mov ax,[bp+8]
  mov bx,[bp+6]
  mov cx,[bp+4]
  add ax,bx
  div cx
  mov dx,ax
  ret
The symbol table in myProg.pdb records <address> = doStuff, which is how a profiler turns raw addresses back into function names.
22. Where do you get PDB files?
• Public symbol server
• Configure the environment:
_NT_SYMBOL_PATH=SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols
_NT_SYMCACHE_PATH=C:\SymCache
• Dbghelp.dll
23. Your Own Debug Symbols
• Auto-generated by Visual Studio
24. Adding and Checking Your Symbols
• Symbols are indexed; you have to add them:
cd Bin\x64\Release\
symstore add /f *.pdb /s C:\Symbols /t "MyExe"
• Validate that the symbols can resolve:
cd Bin\x64\Release\
symchk MyExe.exe /v
25. Got .NET and x64?
• Standard Xperf works fine for your own native code
• BUT: before Windows 8, stack walking is broken for x64 .NET
• If you have .NET with 64-bit code, you must NGEN first:
ngen install Bin\x64\Release\MyExe.exe
(ngen lives here: %Windir%\Microsoft.NET\Framework64\<Version>\ngen.exe)
26. .NET tracing is a pain, get a tool!
• Free tool from MS: PerfView
• Not to be confused with xperfview
• Same trace API and file format
• Helps set obscure .NET-specific trace flags
27. And Finally, You Can Do Very Cool Things
Did I tell you about interlocked operations?...
Whiteboard time!
28. What is SQL Server REALLY doing?
• Consider again our LINEITEM table
• How expensive is it to read from that?
• Think ETL code and DW/BI queries
CREATE TABLE LINEITEM (
  [L_ORDERKEY] [int] NOT NULL,
  [L_PARTKEY] [int] NOT NULL,
  [L_SUPPKEY] [int] NOT NULL,
  [L_LINENUMBER] [int] NOT NULL,
  [L_QUANTITY] [decimal](15, 2) NOT NULL,
  [L_EXTENDEDPRICE] [decimal](15, 2) NOT NULL,
  [L_DISCOUNT] [decimal](15, 2) NOT NULL,
  [L_TAX] [decimal](15, 2) NOT NULL,
  [L_RETURNFLAG] [char](1) NOT NULL,
  [L_LINESTATUS] [char](1) NOT NULL,
  [L_SHIPDATE] [date] NOT NULL,
  [L_COMMITDATE] [date] NOT NULL,
  [L_RECEIPTDATE] [date] NOT NULL,
  [L_SHIPINSTRUCT] [char](25) NOT NULL,
  [L_SHIPMODE] [char](10) NOT NULL,
  [L_COMMENT] [varchar](44) NOT NULL
)
(quadrant chart, Data Touched vs. Data Returned: OLTP small/small, BI/DW big/small, Simulation small/big, ETL big/big)
29. SQLCMD - Native Code Test
SQLCMD.EXE: where does the time go?
30. Standard Reading of Data
xperf -on base -stackwalk profile -f stackwalk.etl
SQLCMD -S. -dSlam -E -Q"SELECT * FROM LINEITEM_tpch"    (55 sec)
xperf -stop
xperf -merge stackwalk.etl stackwalkmerge.etl
33. An Educated Guess About Improvements
Before:
CREATE TABLE [dbo].[LINEITEM](
  [L_ORDERKEY] [int] NOT NULL,
  [L_PARTKEY] [int] NOT NULL,
  [L_SUPPKEY] [int] NOT NULL,
  [L_LINENUMBER] [int] NOT NULL,
  [L_QUANTITY] [decimal](15, 2) NOT NULL,
  [L_EXTENDEDPRICE] [decimal](15, 2) NOT NULL,
  [L_DISCOUNT] [decimal](15, 2) NOT NULL,
  [L_TAX] [decimal](15, 2) NOT NULL,
  [L_RETURNFLAG] [char](1) NOT NULL,
  [L_LINESTATUS] [char](1) NOT NULL,
  [L_SHIPDATE] [date] NOT NULL,
  [L_COMMITDATE] [date] NOT NULL,
  [L_RECEIPTDATE] [date] NOT NULL,
  [L_SHIPINSTRUCT] [char](25) NOT NULL,
  [L_SHIPMODE] [char](10) NOT NULL,
  [L_COMMENT] [varchar](44) NOT NULL
)
After (cheaper-to-decode types):
CREATE TABLE [dbo].[LINEITEM_native](
  [L_ORDERKEY] [int] NOT NULL,
  [L_PARTKEY] [int] NOT NULL,
  [L_SUPPKEY] [int] NOT NULL,
  [L_LINENUMBER] [int] NOT NULL,
  [L_QUANTITY] money NOT NULL,
  [L_EXTENDEDPRICE] money NOT NULL,
  [L_DISCOUNT] money NOT NULL,
  [L_TAX] money NOT NULL,
  [L_RETURNFLAG] int NOT NULL,
  [L_LINESTATUS] int NOT NULL,
  [L_SHIPDATE] int NOT NULL,
  [L_COMMITDATE] int NOT NULL,
  [L_RECEIPTDATE] int NOT NULL,
  [L_SHIPINSTRUCT] [char](25) NOT NULL,
  [L_SHIPMODE] int NOT NULL,
  [L_COMMENT] char(44) NOT NULL
)
34. Getting Rid of Useless Work
Additional parameters for SQLCMD:
-a32767 -W -s";" -f437
Result: x1.5
36. Let's try that with Native and Unicode…
Result: x5
37. Summary: Moving Data
• SQLNCLI is one of these in disguise
  • ODBC
  • OLEDB
• Pick good data types
  • MONEY over NUMERIC
  • UNICODE if the data arrives like this
• Native protocols vs. flexibility
38. Summary - Xperf
• Get
  • Windows 8 ADK
  • Windows 7 SDK
• Set up symbol paths
• xperf -on Base
  • Standard trace for time; narrow down to process and DLL/EXE
• xperf -on Base -stackwalk Profile
  • Get to the call stack, find the offending function(s)
• Ease of use for .NET: perfview.exe
41. Loop Join
(diagram: for each of the m rows in the outer result, seek into an n-row B-tree at log(n) reads per seek)
Complexity: O(m * log(n))
42. Linked List vs. Tree
(diagram: finding a value in an n-node linked list takes up to n steps; a balanced tree reaches it in log2(n) steps)
43. Basic Argument for Clustered Indexes
Cluster on O_ORDERKEY:
CREATE UNIQUE CLUSTERED INDEX CIX_Key
ON ORDERS_Cluster (O_ORDERKEY)
WITH (FILLFACTOR = 100)
SELECT *
FROM ORDERS_Cluster
WHERE O_ORDERKEY = 3000000
Table 'ORDERS_Cluster'. Scan count 0, logical reads 4, physical reads 0, read-ahead reads 0
Heap + index on O_ORDERKEY:
CREATE UNIQUE INDEX IX_Key
ON ORDERS_Heap (O_ORDERKEY)
WITH (FILLFACTOR = 100)
SELECT *
FROM ORDERS_Heap
WHERE O_ORDERKEY = 3000000
Table 'ORDERS_Heap'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0
44. But what if we do this a lot?
Cluster on O_ORDERKEY:
CREATE INDEX IX_Customer ON ORDERS_Cluster (O_CUSTKEY)
WITH (FILLFACTOR = 100)
SELECT *
FROM ORDERS_Cluster
WHERE O_CUSTKEY = 47480
Table 'ORDERS_Cluster'. Scan count 1, logical reads 27, physical reads 0
Heap + index on O_ORDERKEY:
CREATE INDEX IX_Customer ON ORDERS_Heap (O_CUSTKEY)
WITH (FILLFACTOR = 100)
SELECT *
FROM ORDERS_Heap
WHERE O_CUSTKEY = 47480
Table 'ORDERS_Heap'. Scan count 1, logical reads 11, physical reads 0
45. How many LOOP joins/sec/core?
7 sec
46. What did we just measure?
xperf -on Base -stackwalk profile
About 40%...
47. What is sqllang.dll?
• The query language itself
• Why so many ExecuteStmt?
• ...with so much CPU use?
48. A Different Way to Measure Loops
1 sec
49. What does THAT look like?
Takeaway: the T-SQL language itself is expensive
50. Test: Singleton Row Fetch
• Sample from LINEITEM
• Force loop join with index seeks
• Do 1.4M seeks
51. Singleton Seeks - Cost of Compression
xperf -on base -stackwalk profile
Compression    | Seek (1.4M seeks) | CPU Load
None - Memory  | 13 sec            | 100% one core
PAGE - Memory  | 24 sec            | 100% one core
None - I/O     | 21 sec            | 100% one core
PAGE - I/O     | 32 sec            | 100% one core
Function                                       | % Weight
CDRecord::LocateColumnInternal                 | 0.82%
DataAccessWrapper::DecompressColumnValue       | 0.47%
SearchInfo::CompareCompressedColumn            | 0.28%
PageComprMgr::DecompressColumn                 | 0.24%
AnchorRecordCache::LocateColumn                | 0.18%
ScalarCompression::AddPadding                  | 0.04%
ScalarCompression::Compare                     | 0.11%
Additional runtime of GetNextRowValuesInternal | 0.14%
Total Compression                              | 2.28%
Total CPU (single core)                        | 8.33%
Compression %                                  | 27.00%
(a repro sketch follows below)
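For reproduction, a minimal sketch of how the PAGE-compressed variant of the test table might be produced (the rebuild statement is standard T-SQL; it is not shown in the original deck):
-- Assumed repro step: rebuild the deck's LINEITEM table with PAGE compression
-- before re-running the 1.4M-seek test; use DATA_COMPRESSION = NONE to revert.
ALTER TABLE dbo.LINEITEM REBUILD WITH (DATA_COMPRESSION = PAGE);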
55. Merge Join
(diagram: two sorted inputs, m rows and n rows, advanced in lockstep)
Complexity: O(m + n)
56. Merge Join - What is Fastest?
SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE)
FROM LINEITEM
INNER MERGE JOIN ORDERS
ON O_ORDERKEY = L_ORDERKEY
...or
SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE)
FROM ORDERS
INNER MERGE JOIN LINEITEM
ON O_ORDERKEY = L_ORDERKEY
59. We can beat SQL Server at this game
SELECT MAX(O_ORDERDATE), MAX(MAX_P)
FROM
  (SELECT L_ORDERKEY, MAX(L_PARTKEY) AS MAX_P
   FROM LINEITEM
   GROUP BY L_ORDERKEY) b
INNER MERGE JOIN ORDERS
ON O_ORDERKEY = b.L_ORDERKEY
60. Hash Join
(diagram: build a hash table from the n-row input, then probe it with the m-row input)
Complexity: O(m + 2n)
61. When Hash Joins hurt you
(chart: runtime in seconds vs. hash memory in MB; runtime climbs steeply once the hash table no longer fits in memory - the "Spill Zone!")
67. What LATCH pattern do we see?
GetNextRangeForChildScan, inside TableScanNew
68. The Fix?...
• Partition the table by a "random" value
• Modulo the key, for example
• Use a SQL Server partition function/schema
(diagram: rows hashed into partitions 0, 1, 2, ..., 253, 254, 255)
A sketch follows below.
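A minimal sketch of that fix, assuming a 16-way modulo split on L_ORDERKEY (the slide's 256-partition version follows the same shape; all object names here are hypothetical):
-- Hypothetical hash partitioning via a persisted computed column.
CREATE PARTITION FUNCTION pf_hash16 (tinyint)
AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14);

CREATE PARTITION SCHEME ps_hash16
AS PARTITION pf_hash16 ALL TO ([PRIMARY]);

-- Cut-down column list for illustration only.
CREATE TABLE dbo.LINEITEM_hashed (
    L_ORDERKEY int NOT NULL,
    L_PARTKEY  int NOT NULL,
    HashKey AS CAST(L_ORDERKEY % 16 AS tinyint) PERSISTED NOT NULL
) ON ps_hash16 (HashKey);
Parallel scans then have multiple independent ranges to hand out instead of contending on one.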
74. Example: Column Stores
Goals:
• Compressed
• Prefetch friendly
• Cache-resident code
Product (ID | Value): 1 Beer, 2 Beer, 3 Vodka, 4 Whiskey, 5 Whiskey, 6 Vodka, 7 Vodka
Customer (ID | Customer): 1 Thomas, 2 Thomas, 3 Thomas, 4 Christian, 5 Christian, 6 Alexei, 7 Alexei
Date (ID | Date): 1-7 all 2011-11-25
Sale (ID | Sale): 1 2 GBP, 2 2 GBP, 3 10 GBP, 4 5 GBP, 5 5 GBP, 6 10 GBP, 7 10 GBP
75. Compression is Easy
Range-encoded (ID ranges):
Product' (ID | Value): 1-2 Beer, 3 Vodka, 4-5 Whiskey, 6-7 Vodka
Customer' (ID | Customer): 1-3 Thomas, 4-5 Christian, 6-7 Alexei
Date' (ID | Date): 1-7 2011-11-25
Sale' (ID | Sale): 1-2 2 GBP, 3 10 GBP, 4-5 5 GBP, 6-7 10 GBP
Run-length encoded (RL = run length):
Product' (RL | Value): 2 Beer, 1 Vodka, 2 Whiskey, 2 Vodka
Customer' (RL | Customer): 3 Thomas, 2 Christian, 2 Alexei
Date' (RL | Date): 7 2011-11-25
Sale' (RL | Sale): 2 2 GBP, 1 10 GBP, 4 5 GBP, 2 10 GBP
77. SELECT Product, Customer FROM Table
Product' (RL | Value): 2 Beer, 1 Vodka, 2 Whiskey, 2 Vodka
Customer' (RL | Customer): 3 Thomas, 2 Christian, 2 Alexei
Walk both run-length columns in lockstep:
• 2 steps with Beer, 2 steps with Thomas → Beer/Thomas, Beer/Thomas
• 1 step with Vodka, 1 step with Thomas → Vodka/Thomas
• 2 steps with Whiskey, 2 steps with Christian → Whiskey/Christian, Whiskey/Christian
• 2 steps with Vodka (note: repeated value), 2 steps with Alexei → Vodka/Alexei, Vodka/Alexei
78. Hash Joining with Column Stores
Table (RL | Key): 2 Beer, 1 Vodka, 2 Whiskey, 2 Vodka
DimProduct (Key | Type): Beer Soft, Vodka Strong, Whiskey Strong
SELECT ...
FROM Table
JOIN DimProduct ON Key
WHERE Type = 'Strong'
1. Compute a bloom filter of the keys belonging to 'Strong'
2. Read RL = 2, Beer from Table
3. Compute the bloom value of Beer
4. Match against the filter from 1? No. Do nothing
5. Compute the bloom value for Vodka
6. Match? Yes. Output one row (RL = 1)
7. Compute the bloom value for Whiskey
8. Match? Yes. Output two rows (RL = 2)
(and likewise for the final Vodka run)
• Can prefetch data (new RLE runs)
• Can compute match/no match using only the local CPU cache
• Won't work for OLTP!
79. Why is it so hard to get joins right?
(chart: runtime vs. input sizes n and m; loop, merge, and hash joins each win in a different region)
80. Controlling Joins
Desired Join   | Join Hint                                | Query Hint
LOOP           | [INNER | LEFT | CROSS | FULL] LOOP JOIN  | OPTION (LOOP JOIN)
MERGE          | [INNER | LEFT | CROSS | FULL] MERGE JOIN | OPTION (MERGE JOIN)
HASH           | [INNER | LEFT | CROSS | FULL] HASH JOIN  | OPTION (HASH JOIN)
LOOP with Seek | WITH FORCESEEK / WITH (INDEX (<name>))   | N/A
Note: join hints force the order of the ENTIRE join tree! (examples below)
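For illustration, the hints above applied to the deck's TPC-H tables (a sketch; these exact statements are not in the slides):
-- Join hint: forces the physical join type and the join order.
SELECT MAX(L_PARTKEY)
FROM ORDERS
INNER LOOP JOIN LINEITEM ON L_ORDERKEY = O_ORDERKEY;

-- Query hint: keeps the optimizer's join order but restricts it to hash joins.
SELECT MAX(L_PARTKEY)
FROM ORDERS
JOIN LINEITEM ON L_ORDERKEY = O_ORDERKEY
OPTION (HASH JOIN);

-- FORCESEEK: demand a seek against the clustered table from slide 43.
SELECT *
FROM ORDERS_Cluster WITH (FORCESEEK)
WHERE O_ORDERKEY = 3000000;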
81. What Type of Workload?
(quadrant chart, Data Touched vs. Data Returned: OLTP touches and returns small; BI/DW touches big, returns small; Simulation touches small, returns big; ETL touches and returns big)
82. How to Classify?
(quadrant of perf counters per workload: Index Searches/sec and Probe Scans/sec lean OLTP; Full Scans/sec and Range Scans/sec lean BI/DW; Bulk Copy Rows/sec with Range/Full Scans leans ETL; Simulation: ?)
A sampling sketch follows below.
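One hedged way to sample those counters from T-SQL (counter names as exposed by sys.dm_os_performance_counters; the /sec counters are cumulative, so diff two samples taken a known interval apart):
-- Sketch: read the raw values behind the classification above.
SELECT object_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Full Scans/sec', N'Range Scans/sec',
                       N'Probe Scans/sec', N'Index Searches/sec',
                       N'Bulk Copy Rows/sec');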
83. OLTP System Basic Query Pattern
There should ALWAYS be a fully indexed path to the data.
(quadrant: OLTP sits in the small-touched, small-returned corner)
84. The Super Quick OLTP Tuning Guide
1. Find the worst CPU-consuming query with sys.dm_exec_query_stats (sketch below)
2. Add OPTION (LOOP JOIN) to the offending query
3. Check the estimated query plan
4. If a table spool is found: add an index to remedy, and GOTO 3
5. Happy? If not, GOTO 1
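A minimal sketch of step 1 (a standard plan-cache query; the exact shape is mine, not the deck's):
-- Top CPU consumers since the plan cache was last cleared.
SELECT TOP (10)
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.execution_count,
    SUBSTRING(st.text, qs.statement_start_offset / 2 + 1,
              (CASE qs.statement_end_offset
                 WHEN -1 THEN DATALENGTH(st.text)
                 ELSE qs.statement_end_offset
               END - qs.statement_start_offset) / 2 + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;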
85. DW/BI System Basic Query Pattern
The query will not be (much) worse than a full scan of a fact partition.
(quadrant: BI/DW sits in the big-touched, small-returned corner)
86. The Super Quick DW Tuning Guide
1. Find the offending query
2. Add OPTION (HASH JOIN) to the query
3. Do the dimension tables have an indexed path to build the hash? If not, add an index
4. Do you get a fact table scan and a hash build of all dimensions? If not, check statistics (especially on facts and skewed columns)
5. Optimize fact table scans:
  1. Partitioning and partition elimination
  2. Column store if you have it (sketch below)
  3. Aggregate views
  4. Bitmap index pushdown (statistics!)
  5. Composite indexes (last resort!)
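A sketch of step 5.2 on the deck's SQL Server 2012, where column stores arrive as nonclustered columnstore indexes (the fact table and columns here are hypothetical):
-- Hypothetical fact table: cover the scanned columns with a columnstore.
CREATE NONCLUSTERED COLUMNSTORE INDEX csi_FactSales
ON dbo.FactSales (DateKey, ProductKey, CustomerKey, Amount);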
87. The Expected DW Query Plan
(plan diagram: a Fact CSI Scan feeds hash joins whose build sides are batch hash tables built from dimension scans/seeks; partial aggregates below, a hash/stream aggregate on top)
88. Things that Follow from the Desired DW Plan
• At least enough RAM to hold the hash tables of the largest dimension
• De-normalisation helps... a LOT
  • Especially for the large/large joins
• Likely: need to scan fast from disk if RAM is not big enough to hold the fact table
• Compression REALLY matters
91. Where EVERY Server-wide Diagnosis Starts
SELECT *
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (SELECT wait_type FROM #ignorewaits)
  AND waiting_tasks_count > 0
ORDER BY wait_time_ms DESC
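The query assumes an #ignorewaits table of benign wait types; a minimal sketch (this particular list is illustrative, not the deck's):
-- Hypothetical helper: wait types that are normally idle background noise.
CREATE TABLE #ignorewaits (wait_type nvarchar(60) NOT NULL PRIMARY KEY);
INSERT INTO #ignorewaits (wait_type)
VALUES (N'SLEEP_TASK'), (N'LAZYWRITER_SLEEP'), (N'BROKER_TASK_STOP'),
       (N'REQUEST_FOR_DEADLOCK_SEARCH'), (N'XE_TIMER_EVENT'),
       (N'LOGMGR_QUEUE'), (N'CHECKPOINT_QUEUE'), (N'WAITFOR'),
       (N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP');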
92. Common Problems - PAGEIO
• Shows up as waits for PAGEIOLATCH
• You can dig into the details with:
SELECT *
FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL)
• Can also Xevent your way to it per query:
CREATE EVENT SESSION [TraceIO] ON SERVER
ADD EVENT sqlserver.file_read_completed(
  ACTION (sqlserver.database_id, sqlserver.session_id))
(a latency variant follows below)
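A hedged variant that turns the raw stall counters from that DMF into average latency per file (the derived columns are mine, not the slide's):
-- Average read/write latency per database file, across all databases.
SELECT DB_NAME(vfs.database_id) AS database_name,
       vfs.file_id,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
ORDER BY avg_read_ms DESC;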
93. The General I/O Guidance
• I/O, like memory, is a GLOBAL resource for the machine
• When does it make sense to partition a global resource?
  • When you deeply know the workload
  • When the workload is ALREADY partitioned
• When neither of those is true: DON'T partition
• If you have NAND/SSD - why bother?
97. Rules of Thumb - Spindle I/O and DRAM
OLTP:
• One big SAME setup: data files, tempdb
• Dedicated: transaction log
• DRAM: enough to hold most of the DB
Data Warehouse:
• JBOD setup: data files, 1-2 per LUN
• SAME setup: tempdb
• Dedicated: transaction log
• DRAM: enough to hold the largest partition of the largest table
98. You can do a bit better... or worse
• Short stroking
• Elevator sort
• Sequential vs. random
• Weaving
99. Short Stroking Disks
• Intentionally use a lower % of the total space
• Tradeoff: space for speed
• Test: 15K rpm SAS spindle, 300GB
(chart: IOPS falling from ~400 to ~150 as % capacity used rises from 0% to 100%)
100. Why does Short Stroking Work?
(diagram: full stroked vs. short stroked platter)
Disks are typically consumed "from the outside in". If partitions don't use the full disk size, the disk won't use the full platter either. The result: less head movement.
102. Why Chase Sequential I/O?
0
10
20
30
40
50
60
70
80
1
10
100
1000
10000
100000
Sequential Full Stroke Random
Latency(ms)
Log(IOPS)
8K Block Pattern
IOPS
Avg Latency
Max Latency
Service Time + Wait Time
103. • One SATA disk
• Two partitions
• One file on each
• Sequential read on
each file
But all is not well!
File1 File2
Service Time + Wait Time
104. I/O Weaving in action
[Chart: 64K Random vs. 64K Dual Sequential – two "sequential" streams on one disk weave into a random pattern, with IOPS (0-300) and average latency (0-18 ms) close to the random case]
Source: Michael Anderson. Service Time + Wait Time
105. Storage Pool and Weaving
[Diagram: multiple Data and Log volumes carved from one massive, thin-provisioned storage pool – each stream is sequential on its own, but at the pool they interleave into RANDOM I/O!]
Service Time + Wait Time
106. The SAN will properly handle Sharing!
[Chart legend: green: checkpoint, red: tx/sec, black: disk latency] Service Time + Wait Time
107. Numbers to Remember - Spindles
Characteristic          Typical Units
Throughput / Bandwidth  90-125 MB/sec – but ONLY if sequential access!
Operations per Sec      10K RPM spindle: 100-130 IOPS
                        15K RPM spindle: 150-180 IOPS
                        About 2x if short stroking (more later)
Latency                 3-5 ms (compare DRAM: 100 ns)
Capacity                100s of GB to single-digit TB
2012 numbers, will change in future. Service Time + Wait Time
108. • A few hundred IOPS
• Faster if short stroked
• Trade latency for speed with elevator sort
• Sequential is hard to get right
Summary so far… Single Disk
Service Time + Wait Time
109. • Wider stripes are neat
• But scaling is not linear
• Very deep queues help
• But they add latency
• Shared components
Why does a big RAID pile not solve this?
Service Time + Wait Time
111. Getting rid of Sharing
[Diagram, before/after: the I/O path of HBAs, switch, storage ports, cache, CPU and LUNs – "after" doubles the stack (x2) so each workload gets dedicated components instead of shared ones]
112. NAND Flash Basics
[Diagram: a NAND cell – control gate, floating gate, oxide layer, electrons trapped on the floating gate; cells form 4K pages, pages form blocks, and blocks make up the NAND die/package]
113. NAND Flash Problems
• Erase Cycles
• Around 100K
• Rebalancing and reclaim/trim
• Voltage measurement
• Gets worse with density
• Changes over time
• Depends on how you program
• Bit Rot
• Must refresh even on read
• SLC easier to manage than MLC
• But much more expensive!
[Diagram: MLC voltage windows – one of four voltage levels (00/01/10/11) must be distinguished per cell]
116. • Only partially diagnosed as waits in
sys.dm_os_wait_stats
• Task Manager gives a bit more information
• Need: transparency down to the low-level latencies and packets!
Common Problems: ASYNC_NETWORK, OLEDB
Service Time + Wait Time
117. A common Wait Type
The database is really
slow! The code takes
forever to run!
Service Time + Wait Time
118. • We may not always have insight into
what is going on at the client…
Xperf: Diagnosing the Network
xperf -on latency+network
[Screenshot: Xperf summary table]
Service Time + Wait Time
122. Short Story on DPC/ISR handling
[Diagram: the NIC signals "work done" with an IRQ over the PCI bus; a CPU core HALTs execution and fires the ISR routine:]
if (my interrupt)
{
  <Mark Handled>
  Queue DPC
}
[The queued DPC later runs <Do work needed> and <Wake Application>; once the ISR returns, the core can run other stuff again]
Service Time + Wait Time
123. It looks like this…
[Xperf screenshot: DPC and ISR activity]
Service Time + Wait Time
124. • Option 1: Use the HW vendor's tool
• Option 2: Use the Interrupt-Affinity Policy Tool from Microsoft
Setting Interrupt Affinity
Service Time + Wait Time
125. Jumbo Frames and SQL Packets
• Standard network payload (MTU): 1500 B
• Jumbo frames: 9014 B (MTU)
• Standard SQL payload: 4096 B
• Largest SQL payload: 32767 B
SELECT session_id, net_packet_size
FROM sys.dm_exec_connections
Server=foo;Packet size=32767
Service Time + Wait Time
127. Core Evolution
Moore’s “Law”:
“The number of transistors per
square inch on integrated
circuits has doubled every
two years since the
integrated circuit was
invented”
128. • Never faster than a single core
• Smaller servers are faster than bigger ones
• Large L2 caches and more clock speed help
• The algorithm dictates speed
• Latency of Wait Time sets upper limit
• Examples from MSSQL land:
• Formula Engine in MSAS
• Transaction Log Writes
• INSERT/UPDATE/DELETE (as we shall see)
Single Threaded
129. VLF files
• When switching to new VLF – it has to be ”formatted” with
8K sync write
• While this happens, transactions are blocked
• Too many VLF = Too much blocking
• Lesson: Preallocate the database log file in big chunks
• Up to 128 Log Buffers per database
• Spawned on demand, will not be released once spawned
• Transactions will wait for LOGBUFFER if no buffer is available
• Think of this like a pipeline of commits waiting…
[Diagram: the log file as a chain of VLFs, VLF(1)…VLF(6), each needing an 8K formatted write on switch; up to 128 log buffers of <=60K each feed the active VLF]
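As a hedged illustration of the preallocation lesson (the database name and size are examples, not recommendations from the deck):

-- Count VLFs: each row DBCC LOGINFO returns is one VLF
DBCC LOGINFO;
-- Preallocate the log in big chunks instead of letting autogrow fragment it
ALTER DATABASE MyDb
MODIFY FILE (NAME = MyDb_log, SIZE = 64GB);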
131. • Speed is determined by latency and
code path
• Max log write size: 60K
Zooming in on the Log Writer
[Diagram: commits land in the writer queue; the log writer issues async I/O; the completion port signals the thread that issued the commit – the latency sits in this round trip]
132. Long Distance Replication…
[Diagram: the primary writes a log entry, sends the log over the network, and the secondary writes it and acks back before the commit completes]
Executive Summary:
The speed of light (c)
is not fast enough!
133. • Perfmon will only show milliseconds
• What if we want microseconds?
Getting to the Real Latency
xperf -on latency
134. It's in Memory, so it must be fast?
[Comparison diagram: latency 15-30µs vs. <5µs; RAM DISK; 1.5 sec vs. 1.5 sec]
136. The Effect on UPDATE
Naïve:
UPDATE MyBigTable
SET c6 = 43
Parallel (one key range per worker n):
UPDATE MyBigTable
SET c6 = 43
WHERE key BETWEEN 10^9 * n
          AND 10^9 * (n+1) - 1
[Chart: runtime comparison, smaller is faster – the range-parallel version wins]
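A minimal sketch of the parallel pattern (the table and column come from the slide; the per-session numbering scheme is my assumption):

-- Run one session per worker; each session sets its own @n (0, 1, 2, ...)
DECLARE @n BIGINT = 0;
UPDATE MyBigTable
SET c6 = 43
WHERE [key] BETWEEN 1000000000 * @n
            AND     1000000000 * (@n + 1) - 1;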
139. Amdahl's Law of gated speedup
[Chart: speedup factor vs. number of cores (0-64) for P = 100%, 95%, 90% and 80%]
P = part of the program that can be made parallel (note that this may be 0… or 1)
N = number of CPU cores available
Speedup = 1 / ((1 - P) + P / N)
Example: with P = 95% and N = 32 cores, speedup = 1 / (0.05 + 0.95/32) ≈ 12.5 – nowhere near 32.
141. But those rows have to be stored…
[Diagram: Tables A, B and C, each covered by row locks (LCK), all funnel into one data file in one filegroup]
142. It all Starts with Wait Stats
SELECT *
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (SELECT wait_type FROM #ignorewaits)
  AND waiting_tasks_count > 0
ORDER BY wait_time_ms DESC
DBCC PAGE
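The deck names DBCC PAGE without showing the call; a hedged usage sketch (the database name and page coordinates are placeholders):

DBCC TRACEON (3604);            -- route DBCC output to the client
DBCC PAGE ('MyDb', 1, 42, 3);   -- (database, file id, page id, print option 3 = full dump)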
143. PFS – Hidden Single Page Contention
[Diagram: a data file starts with GAM/SGAM and PFS pages, and one PFS page tracks every 64MB interval. Each 8K PFS page is a bitmap of allocation status – every INSERT INTO TableA must flip the allocated bit on the single PFS page covering its interval]
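The classic mitigation is more data files, so allocations spread across more PFS chains; a sketch (file name, path and size are hypothetical):

-- Each file gets its own PFS/GAM/SGAM chain, spreading allocation contention
ALTER DATABASE MyDb
ADD FILE (NAME = MyDb_data2, FILENAME = 'D:\data\MyDb_2.ndf', SIZE = 32GB);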
145. How many more Files?
[Chart: runtime (260-400) and PAGELATCH_UP waits (log scale, 1 to 10,000,000) vs. number of data files (0-48) – waits drop by orders of magnitude as files are added, and runtime falls with them]
146. • Shared, physical MEMORY structures
can cause bottlenecks (ex: PFS)
• SQL Server must synchronize too…
• Understanding where a structure resides
leads to the tuning fix
• Theory of the engine!
Concurrency: What we learned so far
147. • Commonly misdiagnosed
• CXPACKET does NOT (always) mean
that your DOP is "too high"
CXPACKET
[Chart 1: CXPACKET waits (0-200,000,000) vs. throughput (10-40 MB/sec) – waits climb into the hundreds of millions even as throughput rises]
[Chart 2: throughput (0-50 MB/sec) vs. DOP (1-41) – throughput keeps increasing with DOP despite the CXPACKET waits]
149. • What happens when you get things like:
LATCH_<x>
PAGELATCH_<x>
Step 1: Dig into:
Diagnosing Latches
SELECT *
FROM sys.dm_os_latch_stats
Service Time + Wait Time
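A common refinement of that first dig (the filter and ordering are my addition, not the slide's):

-- BUFFER-class latches surface as PAGELATCH_* waits, so rank the rest
SELECT latch_class, waiting_requests_count, wait_time_ms
FROM sys.dm_os_latch_stats
WHERE latch_class <> 'BUFFER'
ORDER BY wait_time_ms DESC;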
156. UPDATE Hack on Small Tables
Before:
[Diagram: one 8K page holding several rows – multiple LCK_U row locks all queue behind the same PAGELATCH_EX]
After:
ALTER TABLE HotUpdates
ADD Padding CHAR(5000)
NOT NULL DEFAULT ('X')
[Diagram: with the CHAR(5000) padding only one row fits per 8K page, so each LCK_U gets its own PAGELATCH_EX]
157. Test: Updates of pages
Compression     Update 1.4M rows    CPU Load
None - Memory   13 sec              100% one core
PAGE - Memory   54 sec              100% one core
None - I/O      17 sec              100% one core
PAGE - I/O      59 sec              100% one core
L_QUANTITY is NOT NULL, i.e. an in-place UPDATE
160. How long are locks held?
[Chart: lock-held cycle count, avg and std dev, in CPU kilocycles (0-600) for PAGE compression vs. NONE – page compression holds locks for far more cycles]
161. • Sharing is generally bad for scale (but
may be good for performance)
• PAGELATCH and LATCH diagnosis starts
in sys.dm_os_latch_stats
• CXPACKET
• Only important if throughput drops when
DOP goes up
• If this happens, look for another wait/latch
• Table partitioning can be used to work
around concurrency issues
Summary Concurrency – So Far..
162. The Paul Randal INSERT test
160M rows, executed at concurrency,
committing every 1K rows:
EASY tuning?
166. And the Score Is…
[Chart: runtime in seconds (0-35,000) for newguid(), newsequentialid() and IDENTITY keys]
167. What is going on here???
[Diagram: a B-tree under HOBT_ROOT where sequential keys send every insert to the single Max edge, while the Min pages sit idle – all sessions fight over the same right-most page]
168. Tricks to Work Around this
[Diagram: the key space split into ranges (0-1000, 1001-2000, 2001-3000, 3001-4000) with concurrent INSERTs landing in each range – see the sketch below]
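A minimal sketch of the hash trick behind the "IDENTITY+Hash8" bars on the next slide (all names are hypothetical):

-- Prefixing the clustered key with a small persisted hash spreads the
-- inserts over 8 B-tree hot spots instead of one
CREATE TABLE dbo.MyInserts (
    id      BIGINT IDENTITY NOT NULL,
    bucket  AS CAST(id % 8 AS TINYINT) PERSISTED NOT NULL,
    payload CHAR(100) NOT NULL,
    CONSTRAINT PK_MyInserts PRIMARY KEY CLUSTERED (bucket, id)
);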
169. All Cores at ~100%
[Chart: runtime in seconds (0-35,000) for newguid(), newsequentialid(), IDENTITY, IDENTITY+Unique, IDENTITY+Unique+Hash8, IDENTITY+Hash24, IDENTITY+Hash48 and SPID+Offset – the hashed/SPID schemes reach 600K-830K inserts/sec with all cores at ~100%]
170. • Don't use sequential keys
• Page splitting isn't so bad
• Neither are GUIDs
• Generate keys wisely – ideally in the app server
• For a "transparent" speedup, consider our old hash trick
Takeaways, INSERT workload
171. • Minimally Logged
• Single, large
execution
(thousands)
• Unsorted data
• Concurrent Loaders
BULK INSERT Workload
[Diagram: multiple concurrent Bulk Inserts into a single heap]
172. When does BULK INSERT scale break?
Measure:
SELECT * FROM sys.dm_os_latch_stats
Observe waits on ALLOC_FREESPACE_CACHE
Theory (just read BOL):
"Used to synchronize the access to a cache of pages with available space for heaps and binary large objects (BLOBs). Contention on latches of this class can occur when multiple connections try to insert rows into a heap or BLOB at the same time. You can reduce this contention by partitioning the object."
[Chart: MB/sec (0-250) vs. concurrent BULK INSERT streams (0-30) – throughput flattens after a handful of streams]
173. What is Happening here?
[Diagram: every Bulk Insert allocates new pages through the HOBT cache of free-page information (PFS/GAM/SGAM), grabbing "fat chunks" under the ALLOC_FREESPACE_CACHE latch – and this structure lives in DRAM and L2]
174. • Break up the table by "some key"
• Optional: switch out partitions
• Spin up multiple bulks
• Linear scale:
• 3GB/sec
• 16M LINEITEM rows/sec
Breaking Through the Bottleneck
[Diagram: rows routed by an Area key into separate partitions, one Bulk Insert per partition – see the sketch below]
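A hedged sketch of that layout (partition boundaries, names and file paths are invented for illustration):

-- One heap partition per loader; each BULK INSERT stream owns one Area range
CREATE PARTITION FUNCTION pfArea (INT) AS RANGE LEFT FOR VALUES (100, 200, 300);
CREATE PARTITION SCHEME psArea AS PARTITION pfArea ALL TO ([PRIMARY]);
CREATE TABLE dbo.StagingLineItem (Area INT NOT NULL, payload CHAR(200) NOT NULL)
ON psArea (Area);
-- then, per loader session, something like:
-- BULK INSERT dbo.StagingLineItem FROM 'C:\load\area_1.dat' WITH (TABLOCK);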
175. BULK INSERT - Reloaded
• "Thomas, you might have gotten 16M rows/sec at 3GB/sec insert speed –
but that was on heaps; I have a clustered table"
• Alright then, let us hit a clustered index
[Diagram: a clustered, partitioned table with ranges 1-1000, 1001-2000, 2001-3000, 3001-4000 – each loader takes an X lock on its own partition]
180. • Context switching is expensive
• Typically 10K or more CPU cycles
• If you expect the resource to be held
only shortly, why fall asleep?
What is a Spinlock?
spin_acquire(int* s)
{
    while (*s == 1)
        ;        /* spin until free (a real implementation tests and sets atomically) */
    *s = 1;
}
spin_release(int* s)
{
    *s = 0;
}
181. • Acquire can be very expensive
• SQL Server implements a backoff mechanism
What is a backoff?
spin_acquire(int* s)
{
    int spins = 0;
    while (*s == 1)
    {
        spins++;
        if (spins > threshold)
        {
            <Sleep and WaitForResource>   /* this is the backoff */
        }
    }
    *s = 1;
}
SELECT *
FROM sys.dm_os_spinlock_stats
DBCC SQLPERF (spinlockstats)
183. WRITELOG is I/O – right?
Should be the same as this… or?
No! Because:
184. • Step 1: Copy sqlservr.pdb to the BINN directory
• Step 2: DBCC TRACEON (3656, -1)
• Step 3: Steal the script from:
http://www.microsoft.com/en-us/download/details.aspx?id=26666
Note: for 2012 you additionally need:
• sqlmin.pdb, sqllang.pdb, sqldk.pdb
Diagnosing a Spinlock the Cool way!
185. Spinlock Walkthrough – Extended Events Script
--Get the type value for any given spinlock type
SELECT map_value, map_key, name
FROM sys.dm_xe_map_values
WHERE map_value IN ('SOS_CACHESTORE')

--Create the event session that will capture the callstacks to a bucketizer
CREATE EVENT SESSION spin_lock_backoff ON SERVER
ADD EVENT sqlos.spinlock_backoff (
    ACTION (package0.callstack)
    WHERE type = 144 /* SOS_CACHESTORE */)
ADD TARGET package0.asynchronous_bucketizer (
    SET filtering_event_name = 'sqlos.spinlock_backoff',
        source_type = 1,
        source = 'package0.callstack')
WITH (MAX_MEMORY = 50MB, MEMORY_PARTITION_MODE = PER_NODE)

--Run this section to measure the contention
ALTER EVENT SESSION spin_lock_backoff ON SERVER STATE = START

--Wait to measure the number of backoffs over a 1 minute period
WAITFOR DELAY '00:01:00'

--To view the data:
--1. Ensure the sqlservr.pdb is in the same directory as the sqlservr.exe
--2. Enable this trace flag to turn on symbol resolution
DBCC TRACEON (3656, -1)

--Get the callstacks from the bucketizer target
SELECT event_session_address, target_name, execution_count,
       CAST(target_data AS XML)
FROM sys.dm_xe_session_targets xst
INNER JOIN sys.dm_xe_sessions xs
    ON xst.event_session_address = xs.address
WHERE xs.name = 'spin_lock_backoff'

--Clean up the session
ALTER EVENT SESSION spin_lock_backoff ON SERVER STATE = STOP
DROP EVENT SESSION spin_lock_backoff ON SERVER
187. How to improve a spinlock?
[Diagram: two CPU packages, each with cores and an L1-L3 cache, both running spin_acquire on the same int s – every acquire forces a cache-line transfer between the packages]
191. Bulking at Concurrency
• What's that spin?
xperf -on latency -stackwalk profile
xperf -d trace.etl
xperfview trace.etl
SELECT * FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC
DBCC SQLPERF (spinlockstats)
192. SOS_OBJECT_STORE at high INSERT
• Observed: this spin happens when inserting
• Need: reduce locking overhead
• Fixes that work well here: 8x throughput bonus
193. • Let's try something really silly:
• Run lots of: EXEC emptyProc
• This should be infinitely scalable, right?
Diagnosing another Spinlock
CREATE PROCEDURE emptyProc
AS
RETURN
199. DECLARE @ParmDef NVARCHAR(500)
DECLARE @sql NVARCHAR(500)
SET @sql = N'INSERT INTO dbo_<t>.MyBigTable_<t> WITH (TABLOCK)
  (c1, c2, c3, c4, c5, c6)
  VALUES (@p1, @p2, @p3, @p4, @p5, @p6)'
SET @sql = REPLACE(@sql, '<t>', dbo.ZeroPad(@table, 3))
SET @ParmDef = '@p1 BIGINT, @p2 DATETIME, @p3 CHAR(111), @p4 INT, @p5 INT, @p6 BIGINT'
DECLARE @constDate DATETIME = '1974-12-22'
DECLARE @i INT
WHILE (1=1) BEGIN
  BEGIN TRAN
  SET @i = 1
  WHILE @i <= 1000 BEGIN
    EXEC sys.sp_executesql @sql, @ParmDef
      , @p1 = 1, @p2 = @constDate, @p3 = 'x', @p4 = 42, @p5 = 7, @p6 = 13
    SET @i = @i + 1
  END
  COMMIT TRAN
END
Consider this Test harness code…
200. Spinning on MUTEX
Diagnosing with the trace flag shows the spin's
stack offender:
CSecurityContext::GetUserTokenFromCache
This is REALLY expensive at scale:
WHILE @i <= 1000 BEGIN
  EXEC sys.sp_executesql @sql, …
  SET @i = @i + 1
END
It initializes a new execution context on every loop!
201. Fixing the MUTEX spin
• Instead of:
WHILE @i <= 1000 BEGIN
  EXEC sys.sp_executesql @sql, …
  SET @i = @i + 1
END
• Write:
SET @sql = N'
DECLARE @i INT
WHILE (1=1) BEGIN
  BEGIN TRAN
  SET @i = 1   -- re-arm the inner loop (missing on the original slide)
  WHILE @i <= 1000 BEGIN
    INSERT INTO dbo_<t>.MyBigTable_<t> WITH (TABLOCK)
      (c1, c2, c3, c4, c5, c6)
    VALUES (@p1, @p2, @p3, @p4, @p5, @p6)
    SET @i = @i + 1
  END
  COMMIT TRAN
END'
EXEC sys.sp_executesql @sql, @ParmDef, …
4x throughput bonus
202. • When all other bottlenecks are
gone, sharing happens in the most
unlikely places
• You can use spinlock Xevents inside SQL
Server
• Remember symbol files in BINN
• Trace flag 3656
• This can also be done in XPERF for non-SQL apps
• Ex: Analysis Services
Concurrency, Spinlock Summary
203. • Control of buffers and NUMA for Xperf settings
• By default:
• 4MB of memory
• Spools to disk at the root of the C: drive
• Can do buffer/file control:
• -BufferSize and -MaxBuffers
• -MaxFile and -FileMode Circular
Xperf: controlling buffers
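For example, a hedged invocation combining those switches (the sizes are arbitrary, not the deck's):

xperf -on latency -BufferSize 1024 -MaxBuffers 256 -MaxFile 2048 -FileMode Circular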
204. • Round robin between NUMA nodes
• Inside the NUMA: Pick the one that
looks the least busy
• This is NOT a perfect system
How SQL Server assigns threads
206. • All the tuning won't help you if your
model is wrong
• Tuning gets you far, but to really
scale, you need a good data model
• This is what my other courses are about
But does the Data Model Work?
214. What if…
• Push:
• Seek the first-value page
• UPDATE the reference count
• Pop:
• Seek the last-value page
• UPDATE the reference count
[Diagram: a Min-Max range where push does Msg++ at one end and pop does Msg-- at the other]
218. Summing Up the Message Queue Hack
• UPDATE instead of INSERT/DELETE
• More partitions = more B-trees
• Ring buffer using modulo (see the sketch below)
• Find the concurrency sweet spot
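A minimal sketch of the ring-buffer idea under stated assumptions (the table, columns and modulo size are all invented for illustration):

-- Fixed set of pre-allocated slots: push/pop become in-place UPDATEs,
-- so the B-tree never splits and never shrinks
CREATE TABLE dbo.MsgRing (
    slot     INT NOT NULL PRIMARY KEY CLUSTERED,
    payload  VARBINARY(256) NULL,
    occupied BIT NOT NULL DEFAULT 0
);
-- push message @msg at head position @head
DECLARE @head BIGINT = 0, @msg VARBINARY(256) = 0x01;
UPDATE dbo.MsgRing
SET payload = @msg, occupied = 1
WHERE slot = @head % 1024 AND occupied = 0;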
Editor's Notes
For a great introductory course I recommend the Paul Randal course found here: http://www.sqlskills.com/T_ImmersionInternalsDesign.asp
To get a good runtime, we up the count of rows to 1M
Hint: NGEN lives in %Windir%\Microsoft.NET\framework64\<Version>. Doc on NGEN: http://msdn.microsoft.com/en-us/magazine/cc163610.aspx
Get perfview here: http://www.microsoft.com/en-us/download/details.aspx?id=28567
Different data structures have different time complexities that lend themselves to more or less efficient service times.
Concurrency of JOIN even when single threaded
The B+ tree is a data structure that seeks to block-fetch large areas of data (typically, but not always, 8K) before seeking through the pages in memory. There exist many different ways to lay out the data pages of a B-tree, some of them more friendly to memory prefetch than others. The B-tree also lets you walk the leaf nodes linearly, without paying the log-proportional price to seek. This allows logarithmic time to seek individual pages while still allowing linear time for range scans. Once the expensive price of fetching a page (I/O) has been paid, parsing the page can also be made cheap by exploiting the in-memory structures.
Highlight spill warning
In the course material I have a query that will help you do #1 in this list. If you are curious about ways to optimize the best index-only plan, I recommend the book by Dan Tow called "SQL Tuning".
We will get into WHY the transaction log needs to be dedicated
Elevator sorting orders the I/O requests before sending them to the spindle. Depending on the buffering, this ordering can increase IOPS per spindle quite significantly. However, it comes at the cost of increased latency.
Add the spindle illustration here
Hardware vendors have different implementation of RAID. It really depends on the gear you have and there is really only ONE way to get the true, unbiased answer… Which leads us to the next slide
In certain scenarios with shallow B-trees (e.g. the BizTalk spool), row padding can shift the latch to the internal structure ACCESS_METHODS_HOBT_VIRTUAL_ROOT.
Root splits are expensive, although a split only affects one partition at a time; the trouble comes when many concurrent transactions cause page splits. We are suggesting that partitioning handles this better.