Parallelism in SQL Server
Enrique Catala Bañuls, Technical Leader at @SolidQ and Microsoft Data Platform MVP
Mar. 20, 2013
In this session we will discuss parallelism in SQL Server: configuration parameters, parallel execution plans, parallel operators, and more. We will also cover common problems and best practices.
1. Parallelism in SQL Server
Enrique Catala Bañuls
Mentor, SolidQ
ecatala@solidq.com
Twitter: @enriquecatala
2. Enrique Catala Bañuls
Computer engineer
Mentor at SolidQ in the relational engine
team
Microsoft Technical Ranger
Microsoft Active Professional since 2010
Microsoft Certified Trainer
4. Volunteers:
They spend their FREE time to give you this
event. (2 months per person)
Because they are crazy.
Because they want YOU
to learn from the BEST IN THE WORLD.
If you see a guy with “STAFF” on their back –
buy them a beer, they deserve it.
9. Objectives of this session
Basics on parallelism
Settings to adjust parallelism
Exchange operators
Enemies of parallelism
Best practices
10. Parallelism
“Parallelism is the action of executing a
single task across several CPUs”
It enhances performance by taking advantage of
modern hardware configurations
11. Parallelism benefits
SQL Server uses all CPUs by default
Generally, the queries that qualify for parallelism are
high-I/O queries
12. SMP
Symmetric multiprocessing (SMP) system
All the CPUs share the same main memory
No hardware partitioning for memory access
Typically used in smaller computers
[Diagram: SMP architecture, with multiple CPUs attached to shared main memory over a single system bus (FSB)]
13. NUMA
Non-Uniform Memory Access
Nodes connected by shared bus, cross-bar,
ring
Typically used in high-end computers
[Diagram: NUMA architecture, where each node groups CPUs with a local memory controller and memory, and the node controllers are linked by a shared bus]
14. NUMA
SQL Server is NUMA aware
Automatically detects NUMA configuration
Minimizes the memory latency by using local
memory in each node
SQL Server must be properly configured to
gain the best performance in NUMA systems
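As a quick check of what SQL Server has detected, the NUMA layout can be inspected through the sys.dm_os_nodes DMV. This is a minimal sketch; the DMV is documented and the columns shown are only a subset of what it exposes:

-- List the NUMA nodes SQL Server has detected, with schedulers and CPU affinity per node
SELECT node_id,
       memory_node_id,
       cpu_affinity_mask,
       online_scheduler_count,
       active_worker_count
FROM sys.dm_os_nodes
WHERE node_state_desc <> 'ONLINE DAC';   -- exclude the dedicated admin connection node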
15. SQL Server Execution Model
SQLOS creates a scheduler for each logical CPU
A scheduler is like a logical CPU used by SQL Server
Only one worker can be executed by a scheduler at the same time
The unit of work for a worker is a task
[Diagram: SQLOS hierarchy, from memory node to CPU node to scheduler to workers to tasks]
16. Schedulers and concurrency
Pre-emptive scheduler (Windows)
Windows uses pre-emptive scheduling because of its general
operating system nature
It uses a priority-driven architecture
Each thread executes in a predetermined time slice
A thread can be preempted by a higher priority thread
Cooperative scheduler (SQL Server)
Each task puts itself in the waiting list every time it needs a
resource
A task keeps running on the same scheduler until it completes or voluntarily yields
This voluntary yielding by workers prevents context switching
and improves performance
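To relate this model to a running instance, the schedulers and their workers can be observed in sys.dm_os_schedulers. A minimal sketch, filtering to the visible user schedulers (one per logical CPU):

-- One row per scheduler; runnable_tasks_count shows tasks waiting for CPU time
SELECT scheduler_id,
       cpu_id,
       status,
       current_tasks_count,
       runnable_tasks_count,
       current_workers_count,
       active_workers_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';   -- user schedulers only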
17. Objectives of this session
Basics on parallelism
Settings to adjust parallelism
Exchange operators
Enemies of parallelism
Best practices
18. Settings to adjust parallelism
Hardware level
NUMA
Instance level
Soft-NUMA (affinity mask)
Degree of parallelism
Cost threshold for parallelism
Max worker threads
-P parameter
Connection level
Resource Governor by configuring MAXDOP
Query level
MAXDOP clause
T-SQL patterns
CROSS APPLY
Functions…
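Most of the instance-level settings in this list are exposed through sp_configure. A hedged example follows; the option names are the documented ones, but the values are purely illustrative and not recommendations:

-- Instance-level parallelism settings (example values only)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 4;   -- cap DOP for the whole instance
EXEC sp_configure 'max worker threads', 0;          -- 0 = let SQL Server size the worker pool
RECONFIGURE;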
19. CPU Affinity Mask
• Used to set which processor(s) can be used by the SQL
Server instance.
• Setting a processor affinity will tie the threads to a particular
processor
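As a sketch of how this is typically done, SQL Server 2008 R2 and later expose processor affinity through ALTER SERVER CONFIGURATION (the CPU range below is only an example):

-- Tie the instance to CPUs 0 through 3 (illustrative range)
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 0 TO 3;

-- Revert to the default and let SQL Server use all CPUs
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = AUTO;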
20. Affinity I/O Mask
Used to affinitize CPU usage for I/O operations
Each I/O operation needs to be finalized: byte checksum,
number of transferred bytes, page number check, etc.
This finalization consumes CPU
Can be used to specify the lazy writer (in a
new hidden scheduler)
[Diagram: "Bad" vs. "Good" affinity configurations; the affinity mask and affinity I/O mask should not share the same scheduler]
22. Threshold for parallelism
Instance level configuration
Statistically changes how often parallel execution occurs
Changes the cost boundary at which a serial plan should be
replaced by a parallel plan
if (best_plan_for_now.cost < 1)
    return best_plan_for_now
else if (MAXDOP > 0
         and best_plan_for_now.cost > cost_threshold_for_parallelism)
    return cheaper_of(create_parallel_plan(), best_plan_for_now)
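The threshold itself is just another sp_configure option; raising it means fewer plans qualify for parallelism. A sketch with an illustrative value (the documented default is 5):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 25;   -- example value, not a recommendation
RECONFIGURE;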
23. Demonstration 1
Affinity mask, cost threshold for
parallelism
24. Degree of parallelism (DOP)
Max degree of parallelism
o Instance setting that affects the whole instance
o Can be configured at resource governor´s
workload level
o Enforces the maximum number of CPUs that a
single query can use
MAXDOP hint
o Can be used at query level
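A hedged sketch of the three levels mentioned above: the instance setting, a Resource Governor workload group, and a per-query hint. The workload group name is hypothetical and the table is an AdventureWorks-style example:

-- Instance level (0 = use all available CPUs)
EXEC sp_configure 'max degree of parallelism', 8;
RECONFIGURE;

-- Resource Governor workload group (name is hypothetical)
CREATE WORKLOAD GROUP ReportingGroup
    WITH (MAX_DOP = 2)
    USING "default";
ALTER RESOURCE GOVERNOR RECONFIGURE;

-- Query level
SELECT COUNT(*)
FROM Sales.SalesOrderDetail        -- example table
OPTION (MAXDOP 4);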
26. Objectives of this session
Basics on parallelism
Settings to adjust parallelism
Exchange operators
Enemies of parallelism
Best practices
27. Exchange operators
Operators dedicated to moving rows between
workers, distributing individual rows among them
28. Distribute streams operator
Row distribution based on
Hash
A hash value is computed for each row, and each thread works only with the rows
that map to its hash value
Round-robin
Each row is sent to the next thread in round-robin order
Broadcast
All rows are sent to all threads
Range
Each row is sent to a thread based on a range computation over a column
Rare and used in some parallel index creation operations
Demand
Pull mode
It sends the row to the operator that is calling it
It appears with partitioned tables
29. Repartition streams operator
Takes rows from multiple sources and sends rows
to multiple destinations (threads)
Doesn't modify the contents of any row
30. Gather streams operator
It takes rows from multiple sources and sends them
to a single destination (thread)
Typically increases CXPACKET waits
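To see these exchange operators in practice, a large aggregate usually produces a parallel plan containing Repartition Streams and Gather Streams. A minimal sketch against an AdventureWorks-style table (the table name is an assumption):

-- With the actual execution plan enabled, look for the Parallelism
-- (Repartition Streams / Gather Streams) operators
SELECT ProductID, COUNT(*) AS order_lines, SUM(LineTotal) AS total
FROM Sales.SalesOrderDetail        -- assumed demo table
GROUP BY ProductID
OPTION (MAXDOP 4);                 -- allow up to 4 threads for the example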
32. Objectives of this session
Basics on parallelism
Settings to adjust parallelism
Exchange operators
Enemies of parallelism
Best practices
33. Enemies of parallelism
Makes the whole plan serial
Modifying the contents of a table variable (reading is fine)
Any T-SQL scalar function
CLR scalar functions marked as performing data access (normal ones
are fine)
Some intrinsic functions, including OBJECT_NAME,
ENCRYPTBYCERT, and IDENT_CURRENT
System table access (e.g. sys.tables)
Serial zones
TOP
Sequence project (e.g. ROW_NUMBER, RANK)
Multi-statement T-SQL table-valued functions
Backward range scans (forward is fine)
Global scalar aggregates
Common sub-expression spools
Recursive CTEs
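As a small illustration of the first two items, both of the following keep the plan serial; the scalar function dbo.fn_TaxRate is hypothetical and the tables are AdventureWorks-style examples:

-- A T-SQL scalar UDF in the SELECT list forces the whole plan to run serially
SELECT soh.SalesOrderID,
       dbo.fn_TaxRate(soh.TerritoryID) AS tax_rate    -- hypothetical scalar function
FROM Sales.SalesOrderHeader AS soh;

-- Inserting into a table variable makes that plan serial
-- (reading from @t afterwards can still go parallel)
DECLARE @t TABLE (SalesOrderID int PRIMARY KEY);
INSERT INTO @t (SalesOrderID)
SELECT SalesOrderID FROM Sales.SalesOrderHeader;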
37. Objectives of this session
Basics on parallelism
Settings to adjust parallelism
Exchange operators
Enemies of parallelism
Best practices
38. Best practices
Never trust the default configuration for the
degree of parallelism
By default, MAXDOP = 0
As a general rule
Pure OLTP should use MAXDOP = 1
MAXDOP not to exceed the number of physical cores
If NUMA architecture,
MAXDOP <= #physical_cores_numa_node
Wait type           | Wait time (ms) | Requests
CXPACKET            | 786556034      | 128110444
LATCH_EX            | 255701441      | 155553913
ASYNC_NETWORK_IO    | 129888217      | 19083082
PAGEIOLATCH_SH      | 83672746       | 2813207
WRITELOG            | 70634742       | 48398646
SOS_SCHEDULER_YIELD | 47697175       | 176871743
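Wait statistics like the table above come from sys.dm_os_wait_stats, which is cumulative since the last restart (or since the stats were cleared). A minimal sketch:

-- Top waits by accumulated wait time
SELECT TOP (10)
       wait_type,
       wait_time_ms,
       waiting_tasks_count AS requests,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;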
39. Best practices
When to apply MAXDOP?
ALTER INDEX operations
Typically set MAXDOP = #_physical_cores
When to set max degree of parallelism?
When you see high CXPACKET waits
OLTP pure systems should set its value to 1
When to set cost threshold for parallelism?
When you want to statistically change how often
queries run in parallel
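For the ALTER INDEX case, MAXDOP can be specified directly on the statement. The index and table names below are only examples, and the value follows the physical-cores guideline from the slide:

-- Rebuild an index using up to 8 cores
ALTER INDEX IX_SalesOrderDetail_ProductID
    ON Sales.SalesOrderDetail
    REBUILD WITH (MAXDOP = 8);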
40. Objectives of this session
Basics on parallelism
Settings to adjust parallelism
Exchange operators
Enemies of parallelism
Best practices
42. Parallelism in SQL Server
Enrique Catala Bañuls
Mentor, SolidQ
ecatala@solidq.com
Twitter: @enriquecatala
Editor's Notes
There are a lot of topics in this area, and I tried to concentrate some of the most important parts into a one-hour session.
A quick example: if I have 200 different coins and I want to know how much money I have, I can add them one by one, I can give 100 coins to my partner to get a partial result, or I can split my coins between 10 partners to get 10 partial results and then combine them into the total... after some "special fee", you know. The real time spent getting the result will not be the same, and the more partners I use to get partial results, the more quickly I'll get the final result... but this is not always true.
Typical benefit: the more CPUs, the more performance... but is that true? It is typical for index rebuilds, aggregations, table scans, ... (chart, graphic)
The server is composed of multiple NUMA nodes, typically 2-4 in standard configurations. Each NUMA node has its own CPUs and memory. The server sees the sum of CPUs and memory, and all of it is accessible from SQL Server.
The images show the detection of three-node NUMA hardware by SQL Server and the three lazy writer threads (one per NUMA node). SQL Server is able to get the best performance on NUMA hardware by doing some special automatic configurations, such as having dedicated threads for some internal components in each NUMA node. Note: Mention that, as a common rule, you should configure the MAXDOP value no higher than the number of physical cores in each NUMA node. With this configuration, if a query is executed in parallel, all the threads will be in the same node.
SQLOS is a thin user-mode layer that sits between SQL Server and Windows. It is used for low-level operations such as scheduling, I/O completion, memory management, and resource management. When an execution request is made within a session, SQL Server divides the work into one or more tasks and then associates a worker thread with each task for its duration. It runs in user mode, reduces context switching, gives better resource usage, and enhances multiprocessing. A task uses the same scheduler most of the time, multiple tasks can be executed at the same time, data locality is enforced, and scalability on NUMA hardware is better. SQLOS works the same on each OS host (w2k3, w2k8r2, w2k12, etc.).
Why would I do that?
This configuration is mainly for: systems with more than one SQL Server instance, and systems with more than 32 heavily used CPUs on which you have detected specific I/O congestion problems. When you don't use I/O affinity, the SQL Server worker handles (posts) the I/O and takes care of the I/O completion on the scheduler the worker was assigned to. The SQL Server GUI in SQL Server 2012 doesn't let you make mistakes. QUESTION: Why is the "Bad" configuration wrong? REASON: By setting both on the same scheduler they will compete for resources, which is exactly what you want to avoid.
ENHANCE DATA LOCALITY. On large systems, by doing this kind of affinity, you can obtain a performance gain of 20%. QUESTION: Why? ANSWER: Because statistically, when a scheduler "touches" a data page, the page is stored in the memory of NUMA node X. If a scheduler coming from another NUMA node needs to read that specific page, it takes 3-4 times as long to get that page from outside its NUMA node. So by doing this, we can force specific applications to work with specific NUMA nodes and thereby increase the chance of reading and writing data pages on the same NUMA node.
Degree of parallelism (DOP) is assigned at each parallel step of the execution plan. All CPUs can be used by the schedulers, so threads can use all available CPUs. There is no special consideration for hyperthreaded CPUs. By limiting DOP, you can limit the number of threads available to solve a query. DOP is determined when the execution plan is retrieved from the plan cache.
This operator takes a single input stream of records and produces multiple output streams. The record contents and format are not changed. Each record from the input stream appears in one of the output streams. This operator automatically preserves the relative order of the input records in the output streams. Usually, hashing is used to decide to which output stream a particular input record belongs.
This operator consumes multiple streams and produces multiple streams of records. The record contents and format are not changed. If the query optimizer uses a bitmap filter, the number of rows in the output stream is reduced. Each record from an input stream is placed into one output stream. If this operator is order preserving, all input streams must be ordered and merged into several ordered output streams. If you look at the execution plan in detail, the calculation is given by an expression that is already available at the hash match; at the gather streams step the value 6 is obtained. You can examine this closely in the demo 2-exchange_operators.sql.
This operator consumes several input streams and produces a single output stream of records by combining the input streams. The parallel page supplier divides rows across threads in batches.
There are several enemies of parallelism, and these are some of them.
Parallel operations must be synchronized before serializing. So if some worker ends its execution while another is still executing, it raises a CXPACKET wait to SQL Server, announcing that it has finished its execution and is waiting. CXPACKET is not a problem in itself, but it is an indicator of a bad parallel SQL Server configuration if we see lots of wait signals of this type.
Here are typical scenarios involving CXPACKET wait statistics. Note: It is very unusual to have a pure OLTP system, because most customers use their SQL Server instances for applications, reports, BI data loading solutions, and more. In the example at the bottom, note that 9 days of CPU time is wasted on CXPACKET (786556034 ms = 13109 minutes = 218 hours = 9 days) in thread synchronization due to a bad configuration. (This is a real example from one of SolidQ's customers.) Important: It is very important that the students really understand the degree of parallelism setting. It is very common for students to confuse MAXDOP with CPU affinity. Furthermore, make sure that students understand what a pure OLTP system is.