• An independent SQL consultant
• A user of SQL Server from version 2000 onwards, with 12+ years' experience
• A DBA / developer hybrid
Techniques for scaling out the data flow and how well they scale
A look into the inner workings of the dataflow engine using Xperf
How ‘Elastic’ scalability might be achieved
A wrap-up with some key ‘Takeaway’ points
No parallel ‘On’ switch
Parallelism has to be implemented by design, at:
  Package level
  In the execution flow
  In the data flow, by hand and / or through:
    Transforms that come with SSIS
    Third-party components
  Separating out synchronous transforms
This flow helps determine:
1. Maximum data flow performance <= source extract speed
   Does the source need to be parallelised?
2. CPU and I/O profile of the source when no back pressure is taking place
   Does this swamp the available hardware resources?
Good parallel throughput requires:
  An even distribution of work between child threads (data flows)
  Hardware configured such that it is “Hot spot free”
  SQL Server and SSIS configured such that hardware resources are utilised evenly
In other words, the SSIS equivalent of bad CXPACKET waits is to be avoided.
Four different ways of extracting data from the source will be looked at
(T-SQL sketches of three of these follow below):
  NTILE
  DELETE statement with an OUTPUT clause
  Hash partitioning the source table
  SELECT statement to ‘Partition’ the source by TransactionID
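As rough illustrations, minimal T-SQL sketches of the NTILE, destructive read
and range scan approaches might look like the following (the column list is
assumed to match Adam Machanic's Big Adventure script; the bucket count, batch
size and range boundary are illustrative assumptions):

    -- NTILE (sketch): carve the table into N equal-row buckets,
    -- one bucket per data flow / package instance.
    WITH Buckets AS
    (
        SELECT TransactionID,
               NTILE(12) OVER (ORDER BY TransactionID) AS BucketNo
        FROM   dbo.bigTransactionHistory
    )
    SELECT TransactionID
    FROM   Buckets
    WHERE  BucketNo = 1;  -- each parallel source reads a different bucket

    -- Destructive read (sketch): a DELETE with an OUTPUT clause lets multiple
    -- package instances co-operatively 'drain' the table.
    DELETE TOP (10000) bth
    OUTPUT deleted.TransactionID, deleted.ProductID, deleted.Quantity
    FROM   dbo.bigTransactionHistory AS bth;

    -- Range scan (sketch): 'partition' the source into contiguous
    -- TransactionID ranges, one range per data flow.
    SELECT TransactionID, ProductID, Quantity
    FROM   dbo.bigTransactionHistory
    WHERE  TransactionID >= 1
    AND    TransactionID <  2600000;  -- illustrative range boundary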
• SQL Server 2012 SP1
• Windows Server 2008 R2
• Adam Machanic's “Big Adventure” database
• Hardware
  • Intel Core i7, 6 cores, 12 logical threads, 3.2 GHz
  • 22 GB memory
  • 2 x 80 GB Fusion-io (Gen 1) IO drives
• Scaling beyond three threads was initially hampered by PAGELATCH_EX,
  LCK_M_X, LCK_M_IX and SOS_SCHEDULER_YIELD waits.
• The ‘Winning’ approach (sketched below):
  • Partition bigTransactionHistory evenly across twelve file groups, one per
    logical processor.
  • Assign specific threads to specific partitions.
  • Turn page and row locking off on the clustered primary key and set lock
    escalation on the table to auto in order to force partition-level locking.
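A minimal sketch of forcing partition-level locking, assuming the clustered
primary key is named pk_bigTransactionHistory (the index name is an
assumption):

    -- Disable row and page locks on the clustered primary key, leaving
    -- partition level (via escalation) as the finest practical granularity.
    ALTER INDEX pk_bigTransactionHistory
    ON dbo.bigTransactionHistory
    SET (ALLOW_ROW_LOCKS = OFF, ALLOW_PAGE_LOCKS = OFF);

    -- With AUTO, lock escalation on a partitioned table goes to the
    -- partition rather than the whole table.
    ALTER TABLE dbo.bigTransactionHistory
    SET (LOCK_ESCALATION = AUTO);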
Test                                        Execution  CPU          IO          % Improvement
                                            Time (s)   Consumption  Throughput  From Baseline
                                                       (%)          (MB/s)
Baseline                                    57         40           130         -
Forced partition-level locking              33         46           215         42
OLE DB provider for SQL Server used         28         50           240         51
  instead of SQL Native Client
Packet size changed from 4K default to 8K   22         50           275         61
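For reference, the packet size is a property of the connection rather than the
package; a hedged example of requesting an 8K packet in an OLE DB connection
string (the server and database names are placeholders):

    Provider=SQLOLEDB.1;Data Source=MyServer;Initial Catalog=BigAdventure;
    Integrated Security=SSPI;Packet Size=8192;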
[Chart: Execution Time (s) per data flow (thread) count, for 1, 2, 3, 4 and 6
data flows; series: Destructive Read, Partition Scan, Range Scan, NTILE.]
[Chart: Average percentage CPU consumption per data flow (thread) count, for
1, 2, 3, 4 and 6 data flows; series: Destructive Read, Partition Scan, Range
Scan, NTILE.]
[Chart: Wait event breakdown (percentage) for Destructive Read, Range Scan,
Partition Scan and NTILE; wait types: ASYNC_NETWORK_IO,
PREEMPTIVE_OS_WAITFORSINGLEOBJECT, ASYNC_IO_COMPLETION, SOS_SCHEDULER_YIELD,
WRITELOG, LOGBUFFER, PAGEIOLATCH_SH.]
NTILE is clearly the slowest approach.
The range scan and partition scan can only be separated by CPU consumption.
Wait activity is dominated by ASYNC_NETWORK_IO and
PREEMPTIVE_OS_WAITFORSINGLEOBJECT.
The source is outperforming the rest of the flow.
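A breakdown of this kind can be obtained from the waits DMV (a sketch; the
excluded wait types are illustrative, and in practice the counters would be
snapshotted before and after each test run):

    SELECT TOP (10)
           wait_type,
           wait_time_ms,
           waiting_tasks_count
    FROM   sys.dm_os_wait_stats
    WHERE  wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP')  -- trim idle waits
    ORDER  BY wait_time_ms DESC;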
Use a heap version of the bigTransactionHistory table, partitioned across
twelve file groups on (TransactionID % 12) + 1 (a DDL sketch follows below).
Compare the scalability of the balanced data distributor versus the
conditional split.
The source is a single straight select from the bigTransactionHistory table.
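A minimal sketch of that hash-partitioned heap, assuming file groups named FG1
to FG12 already exist (the file group names and the trimmed-down column list
are assumptions):

    -- Partition on a persisted computed column holding the hash bucket.
    CREATE PARTITION FUNCTION pf_hash12 (int)
    AS RANGE LEFT FOR VALUES (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11);

    CREATE PARTITION SCHEME ps_hash12
    AS PARTITION pf_hash12
    TO (FG1, FG2, FG3, FG4, FG5, FG6, FG7, FG8, FG9, FG10, FG11, FG12);

    CREATE TABLE dbo.bigTransactionHistoryHeap
    (
        TransactionID int NOT NULL,
        ProductID     int NOT NULL,
        Quantity      int NOT NULL,
        HashBucket    AS (TransactionID % 12) + 1 PERSISTED NOT NULL
    ) ON ps_hash12 (HashBucket);   -- a heap: no clustered index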
Synchronous
  • Non-blocking
  • Rows in = rows out
Asynchronous
  • Rows out usually <> rows in
  • Semi-blocking
  • Blocking
  • “Magic” virtual buffer ;-)
[Chart: Execution Time (s) per output count, for 1 to 6 outputs; series:
Balanced Data Distributor, Conditional Split. Annotation: saturation point,
time to scale out.]
[Chart: IO Throughput (MB/s) per output thread count, for 1 to 6 threads;
series: Balanced Data Distributor, Conditional Split.]
Note: the two Fusion-io cards are capable of more throughput than appears on
any of the graphs in this material. What is presented is sustained throughput;
during the actual tests, ‘Spikes’ of much higher throughput were observed at
checkpoints.
[Chart: Average CPU consumption (%) per thread count, for 1 to 6 threads;
series: Balanced Data Distributor, Conditional Split.]
A transform-level view of the CPU can be obtained via xperf, as per the next
slide . . .
TxBDD.dll weight = 79,997,966
TxSplit.dll weight = 13,004,998.777
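For reference, a sampled CPU trace of this kind can be captured with the
Windows Performance Toolkit along the following lines (a sketch; the kernel
flags shown are the usual ones for sampled profiling, and the output file name
is arbitrary):

    rem Start a sampled CPU profile with stack walking enabled.
    xperf -on PROC_THREAD+LOADER+PROFILE -stackwalk Profile

    rem . . . run the SSIS package under test . . .

    rem Stop tracing and merge to an .etl file for analysis in xperfview / WPA.
    xperf -d ssis_trace.etl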
Too few threads = CPU starvation.
Too many threads = context switching.
The “Sweet spot” is somewhere in between.
Elements in the dataflow that can create new threads:
  Execution paths
  Conditional splits, multicasts and the balanced data distributor, which
  create threads for their outputs
  Synchronous transforms
An execution path is a section in the dataflow starting with an asynchronous
component and ending with a transform or destination with no synchronous
output . . . as the next slide will help illustrate.
[Diagram: a dataflow annotated with its two execution paths, Execution Path 1
and Execution Path 2.]
[Chart: Execution Time (s) per thread count, for 1 to 5 threads; series:
Union, Pass Through.]
[Chart: CPU consumption (%) per data flow (thread) count, for 1 to 6 threads;
series: Union, Pass Through.]
[Chart: IO throughput (MB/s) per data flow (thread) count, for 1 to 6 threads;
series: Union, Pass Through.]
One execution path = 37,039 context switches
Two execution paths = 69,986 context switches
Most of the demos so far have achieved data flow scale-out via “Copy and
paste”.
Service Broker is highly elastic: the number of readers associated with a
queue can be increased via the ALTER QUEUE command (see the sketch after this
list).
SSIS has no “Out of the box” equivalent to this.
However, the work pile pattern can be adapted in order to achieve ‘Elastic’
style scale-out, as the next slide will illustrate.
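For comparison, the Service Broker dial referred to above (the queue name is a
placeholder):

    -- Raise the number of concurrent activation readers on a queue.
    ALTER QUEUE dbo.WorkQueue
    WITH ACTIVATION (MAX_QUEUE_READERS = 10);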
“WORK PILE”: SSIS “Server Farm”

[Diagram: each server in the farm runs one instance of the package,
parameterised by thread number.]

Package 1 on SSIS Server 1:
  DTExec . . . /set Package.variables[MaxThreads].Value;3
               /set Package.variables[ThreadNumber].Value;1

Package 2 on SSIS Server 2:
  DTExec . . . /set Package.variables[MaxThreads].Value;3
               /set Package.variables[ThreadNumber].Value;2

Package N on SSIS Server N:
  DTExec . . . /set Package.variables[MaxThreads].Value;3
               /set Package.variables[ThreadNumber].Value;3
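Inside each package instance, the source query can then claim that instance's
share of the work pile with a modulo filter driven by the two variables (a
sketch; mapping the package variables onto the positional ? parameters of an
OLE DB source is the assumed mechanism):

    -- Parameter 1 = MaxThreads, parameter 2 = ThreadNumber.
    SELECT TransactionID, ProductID, Quantity
    FROM   dbo.bigTransactionHistory
    WHERE  (TransactionID % ?) + 1 = ?;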
With dedicated server hardware for both SSIS and SQL Server, how does the
resource utilisation on each vary as the various parallelisation-based scale
out techniques are used?
How does SSIS perform with hyper-threading turned on and off?
L2/L3 cache is touted as the “New flash memory”:
  How does the “Performance curve” behave in relation to L2/L3 misses?
  What can be done to influence L2/L3 cache misses?
The performance and scalability of extracting from the source is paramount;
the only wait events you want to see are ASYNC_NETWORK_IO and
PREEMPTIVE_OS_WAITFORSINGLEOBJECT.
When deleting from partitions (and inserting into them), significant
performance gains can be had by forcing partition-level locking.
Packages with fewer execution paths will tend to incur fewer context switches
and scale better.
Seek out opportunities to scale out synchronous transforms by splitting them
up as much as possible.
Look to leverage the work pile pattern for ‘Elastic’ scale-out.
Integration Services: Performance Tuning Techniques
Elizabeth Vitt, Intellimentum and Hitachi Corporation
SQL Server Integration Services Performance Design Patterns
Matt Masson, Senior Program Manager, Microsoft
Increasing Throughput of Pipelines by Splitting Synchronous
Transformations into Multiple Tasks
Sedat Yogurtcuoglu, Henk van der Valk, and Thomas Kejser
Resources for SSIS Performance Best Practices
Matt Masson and others
ChrisAdkin8
chris1adkin@yahoo.co.uk
http://uk.linkedin.com/in/wollatondba
Speaker               Title                                                       Room
Jan Pieter Posthuma   ETL with Hadoop and MapReduce                               Theatre
Phil Quinn            XML: The Marmite of SQL Server                              Exhibition B
Laerte Junior         The Posh DBA: Troubleshooting SQL Server with PowerShell    Suite 3
James Skipwith        Table-Based Database Object Factories                       Suite 1
Neil Hambly           SQL Server 2012 Memory Management                           Suite 2
Matija Lah            SQL Server 2012 Statistical Semantic Search                 Suite 4
#SQLBITS


Editor's Notes

  • #6 In the case of the environment set up for this presentation, scanning the whole bigTransactionHistory table and feeding it into a row count takes 21.653 seconds. The Row Count destination might be one of the best free tuning tools there is for SSIS: not only can it be used to test source extract speed, but if it is substituted for the destination it can be used to detect whether the rest of the upstream flow is being throttled back due to back pressure.
  • #7 Of the resources that SSIS uses, IO will probably be the one most prone to “Hot spots”. Yes, SSIS is an in-memory ETL engine, but at some stage you will interact with one or more databases, i.e. it is not an in-memory pipeline ‘Island’. Storage subsystems, more so those that are spinning-disk based, can be prone to “Hot spots”: I have seen tier 1 SANs with latencies anywhere between 10 and 100 ms, and NAS arrays where a single database consumed 85% of the total IO across all of the databases but was only allocated half of the disks available; the rest could have quite happily run on a laptop. Ideally each thread should get an equal amount of CPU, memory, network and storage resource.
  • #9 If we try to parameterize certain types of queries we get the error on the right. The workaround with SQL Server 2012 is to put the query into a stored procedure and invoke it with a WITH RESULT SETS clause; this new functionality provides tremendous flexibility in the way in which the source can be partitioned. It will be used in the NTILE test and the range scan test, and is sketched below.
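    A minimal sketch of that workaround (the procedure name, parameter and
    column list are assumptions):

        -- SQL Server 2012: invoke the wrapped query with an explicit result shape.
        EXEC dbo.usp_GetTransactionRange @BucketNo = 1
        WITH RESULT SETS
        (
            (
                TransactionID int,
                ProductID     int,
                Quantity      int
            )
        );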
  • #11 This demonstration will illustrate what the different test packages look like in SQL Server Data Tools.
  • #12 The motivation for a “Destructive read” source was the belief that, if the source could ‘Drain’ the table it reads from, multiple instances of the package could be executed in parallel. The work I carried out looking into this could have filled a slide deck on its own and would have made a good blog posting along the lines of a Thomas Kejser “Grade of the Steel” post. The next slide will summarise some of the more interesting findings.
  • #13 This table represents the results of tuning the “Low hanging fruit”. Forcing partition-level locking is achieved by disabling page and row level locking on the clustered primary key and setting lock escalation on the table to auto. The other key point to note is that most people, myself included until very recently, would think that the SQL Native Client offers the best performance through memory-to-memory transfers; it transpires that the OLE DB provider for SQL Server has undergone a lot more tuning and optimisation at Microsoft than its native counterpart.
  • #14 The observant may point out that it is standard practice to stage data into a heap as opposed to a table with a clustered index, and that the figures do not reflect the time taken to build the clustered index; on the ‘Lab’ hardware this takes 17 seconds, which still makes the range scan more efficient than the method that uses NTILE.
  • #16 The most salient point to note about this slide is that most of the wait time is accounted for by the ASYNC_NETWORK_IO and PREEMPTIVE_OS_WAITFORSINGLEOBJECT wait events.
  • #17 The ASYNC_NETWORK_IO wait occurs when the client (SSIS) cannot acknowledge receipt of the data fast enough; PREEMPTIVE_OS_WAITFORSINGLEOBJECT represents SQLOS ‘Parking’ threads when there is no work for them to do. Balanced data flow design is required: there is a finite amount of CPU resource, and it is a waste to extract with too many threads and burn up most of the CPU if the destinations cannot consume the data flow fast enough.
  • #19 The third asynchronous transform may be a surprise to a few people, myself included not too long ago; however, according to the SSIS team at Microsoft it does actually exist. It refers to the multicast, conditional split and balanced data distributor transforms: each output is assigned a separate thread and a ‘Magic’ / ‘Virtual’ buffer, i.e. input buffers are copied to the output by reference and not by value. This is an optimisation which came in with SQL Server 2008.
  • #20 This demonstration will show what the “Scaling out the destination” demo packages look like.
  • #27 Prior to SQL Server 2008, execution paths were referred to as trees; the two differ in name only. Execution paths matter for two main reasons. Firstly, any given server will only be able to handle a certain number of threads: too many and context switching will take place, too few and CPU core starvation will take place. Secondly, buffer creation is important: an execution path can handle a maximum of five buffers, and copying memory (main memory, not on-CPU cache) is expensive; a CPU cache miss can cost 200 CPU cycles, and potentially more if there is a TLB miss on top of this.
  • #28 Hypothesis: the approach on the left should scale better than that on the right because it incurs fewer context switches.
  • #33 This is for six threads (data flows).