SQL Server 2005 vs 2008 Integration Services: World Record Performance! Henk van der Valk, Workload Performance Architect, Unisys ES7000 Performance Centers. [email_address]
Agenda. Performance: SSIS 2005 vs SSIS 2008 performance study; optimizing tricks for SSIS bulk inserts. And if time permits: SQL 2008 storage/IO tuning; Windows 2008 tuning.
About the speaker: deals with the largest SQL environments in the world; co-founder of the ES7000 Performance Centers (2001); performance optimizer and troubleshooter; host of the Dutch SQLPass Performance SIG; Windows Datacenter Edition certified.
Performance Study Goals
- Work with the Microsoft SSIS dev team to test the improvements of the reworked SSIS pipeline in SQL 2008 versus SQL 2005.
- Document performance and scalability for loading and transforming data sets on the Unisys ES7000/one (x64) servers.
- Live demos: Windows 2008 DC build 6.0.6001 / SQL 2008 EE (Katmai) 10.0.1300.04.
- SSIS test history:
  - Sept. 2004: SQL 2005 Beta 2, build 9.00.954
  - Feb. 2005: SQL 2005 IDW13, build 9.00.1094
  - Dec. 2006: Katmai build 9.0.9086.2
  - Katmai SSIS build 10.0.1075.7 (SQL_PreRelease).070927-0159 versus SQL 2005 post-SP2, version 9.0.3175
What's the big deal with the optimized data flow engine in SSIS 2008? A quick overview.
The SSIS Pipeline [diagram: sources and destinations include XML, DB, flat file, raw, and custom; data source and data destination adapters are OLE DB, ODBC, flat file, raw, and custom; transformations shown are Derived Column, Conditional Split, Aggregate, Fuzzy Lookup, and Merge Join]
OS Platform memory support
Fast shared memory connection when running SSIS and SQL Server on the same system!
32-bit: each SSIS package can use 3 GB RAM (up to 20 in parallel).
64-bit: each SSIS package can use up to 2 TB, practically "unlimited for now"!

                                          IA32 (32 CPUs)   IA64 (64 cores)   x64 (64 cores)
Windows total virtual memory              4 GB             16 TB             16 TB
Per-process virtual addressable memory    2 or 3 GB        8 TB              8 TB
Supported physical memory                 64 GB            2 TB              2 TB
Hardware configuration
- ES7000/540: 16-way / 16 GB, 32-bit 3.0 GHz Xeon MP
- ES7000/420: 16-way / 64 GB, 64-bit 1.5 GHz Itanium 2
- Unisys ES7000/one: 64 cores / 256 GB, 3.4 GHz x64
Lab infrastructure: both ES7000/one systems are identically configured (32 cores / 128 GB).
Test approach
- The starting point: the TPC-H schema (decision support benchmark schema).
- A random data generation utility generates the input files.
- Data is loaded from flat files (16 columns of data).
- For each LineItem file: measure duration and resource utilization.
- Increase the amount of parallelism by loading more files simultaneously.
- Increase hardware resources (more CPUs, different CPU types).
- Compare Yukon vs Katmai execution times and maximum total handling capacity.

Size of flat files    Number of rows
10 x 21.6 GByte       10 x 100 million
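The sizing in the table above can be checked with a little arithmetic. This is a minimal sketch, assuming "GByte" means 10^9 bytes (the slide does not say whether decimal or binary units are meant); the variable names are illustrative only.

```python
# Assumption: 1 GByte = 10**9 bytes; the slide does not specify the unit base.
FILES = 10
FILE_SIZE_BYTES = 21_600_000_000   # 21.6 GByte per flat file
ROWS_PER_FILE = 100_000_000        # 100 million LineItem rows per file

total_rows = FILES * ROWS_PER_FILE
total_bytes = FILES * FILE_SIZE_BYTES
bytes_per_row = FILE_SIZE_BYTES / ROWS_PER_FILE

print(f"total rows:     {total_rows:,}")
print(f"total size:     {total_bytes / 10**9:.0f} GByte")
print(f"avg bytes/row:  {bytes_per_row:.0f}")
```

Under that assumption the full test set is 1 billion rows in 216 GByte, or about 216 bytes per LineItem row on average.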
Introduction to Aggregations
Star schema [diagram]: a fact table holding numeric performance measurements, surrounded by dimension tables.
- Sales_Fact: TimeKey, EmployeeKey, ProductKey, CustomerKey, LocationKey, Sales, ...
- Employee_Dim: EmployeeKey, EmployeeID, ...
- Time_Dim: TimeKey, TheDate, ...
- Product_Dim: ProductKey, ProductID, ...
- Customer_Dim: CustomerKey, CustomerID, ...
- Location_Dim: LocationKey, LocationID, ...
Package details
- Number of distinct values: Col1: 1-1000; Col2: 1-20; Col3: 1-1000; Col4: 1-1000.
- The packages created 8 aggregations using group-bys.
- Each aggregation had 8 (up to 10) measures, made up of mins, maxes, sums, averages, and counts.
- [Table: Agg1 through Agg10 against Col1 to Col4; the marks showing which columns each aggregation groups by were lost in extraction.]
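To make the shape of one such aggregation concrete, here is a minimal Python sketch of a group-by that produces count, sum, min, max, and average for one measure. It is an illustration only, not how the SSIS Aggregate transform is implemented; the column name "Amount" and the sample rows are hypothetical.

```python
from collections import defaultdict

def aggregate(rows, group_cols, measure_col):
    """Group rows by group_cols and compute count/sum/min/max/avg of measure_col."""
    groups = defaultdict(lambda: {"count": 0, "sum": 0.0, "min": None, "max": None})
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        value = row[measure_col]
        g = groups[key]
        g["count"] += 1
        g["sum"] += value
        g["min"] = value if g["min"] is None else min(g["min"], value)
        g["max"] = value if g["max"] is None else max(g["max"], value)
    # Averages can only be finalized once every input row has been seen.
    return {k: {**g, "avg": g["sum"] / g["count"]} for k, g in groups.items()}

# Hypothetical sample rows; "Amount" is an illustrative measure name.
rows = [
    {"Col1": 1, "Col2": 1, "Amount": 10.0},
    {"Col1": 1, "Col2": 1, "Amount": 30.0},
    {"Col1": 2, "Col2": 1, "Amount": 5.0},
]
result = aggregate(rows, ["Col1", "Col2"], "Amount")
print(result[(1, 1)])
```

Note that no group can be emitted until the last input row has arrived, which is why an aggregation behaves as a blocking step in a data flow.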
Aggregation package: 1st design
1st package / base design execution
Aggregation results, 1st package
To process 1 file with 100 million LineItems:
- Feb. 2005, on Itanium 2 64-bit with SQL 2005: 2 hours, 10 minutes.
- After the hardware upgrade to ES7000/one (x64), on both SQL 2005 and SQL 2008: 1 hour, 33 minutes.
Conclusion:
- New x64 hardware: 35% gain.
- No Katmai performance improvement for this basic aggregation (a blocking component).
Katmai data flow engine: a new worker thread for each execution tree.
Package with Multicast: aggregating up to 1 billion LineItems. 8 aggregations, 8 transforms with Multicast. Test scenario: run from 1 up to 10 packages of 100 million lines each, executed in parallel.
Base aggregation: 1 hour 4 min. to aggregate 100 million rows / 22 GB.
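The two elapsed times on this slide can be compared in more than one way. The sketch below shows two common conventions (reduction in elapsed time, and increase in throughput); it makes no claim about which definition the slide's 35% figure is based on.

```python
def minutes(hours, mins):
    """Convert an hours/minutes pair to total minutes."""
    return hours * 60 + mins

itanium_2005 = minutes(2, 10)   # 130 minutes (Itanium 2, SQL 2005)
x64_both = minutes(1, 33)       # 93 minutes (ES7000/one x64, SQL 2005 and 2008)

time_reduction = (itanium_2005 - x64_both) / itanium_2005
throughput_gain = itanium_2005 / x64_both - 1

print(f"elapsed time reduced by {time_reduction:.1%}")
print(f"throughput increased by {throughput_gain:.1%}")
```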
Optimization 1: Conditional Split. SQL 2005: 27 min. 48 sec; SQL 2008: 10 min. 55 sec.
Katmai data flow engine @ work. Yukon: 27 min. 48 sec; Katmai: 10 min. 55 sec.
Yukon optimization: use "Union All" transforms to create parallelism. SQL 2005 elapsed time drops from 27 min. 48 sec to 10 min. 57 sec; up to 6 CPUs are fully utilized.
Katmai data flow engine @ work. Katmai elapsed time: 10 min. 55 sec. Conclusion: no need for "Union All" transforms in SQL 2008.
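The elapsed times quoted on the last few slides translate into speedup factors as follows. This is plain arithmetic on the slide figures; the variable names are illustrative.

```python
def seconds(mins, secs):
    """Convert a minutes/seconds pair to total seconds."""
    return mins * 60 + secs

yukon_base = seconds(27, 48)        # SQL 2005, Conditional Split package
yukon_union_all = seconds(10, 57)   # SQL 2005 with the "Union All" workaround
katmai = seconds(10, 55)            # SQL 2008, same package without the workaround

speedup_katmai = yukon_base / katmai
speedup_union_all = yukon_base / yukon_union_all

print(f"Katmai vs Yukon base:          {speedup_katmai:.2f}x")
print(f"Yukon Union All vs Yukon base: {speedup_union_all:.2f}x")
```

Both come out at roughly 2.5x, which is consistent with the slide's conclusion that SQL 2008 reaches the same result without the manual workaround.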
Aggregating 1 billion LineItems, SQL 2005 with Multicast. 10 packages in parallel: avg. CPU load 37%; 140 MByte/sec read disk IO from flat files.
Aggregating 1 billion LineItems, SQL 2008 with Multicast. 10 packages in parallel: avg. CPU load 100%; 270 MByte/sec read disk IO from flat files.
2nd package with Multicast, aggregating 1 billion LineItems: using the Multicast in Katmai provides a significant increase in throughput. Processing 8+ packages in parallel uses all 32 available cores.
Basic flat file throughput
Demo: flat file input source throughput
Reading from a 100 million row / 22 GB flat file, with 15 / 5 / 1 columns:

                               15 col.   5 col.   1 col.
Itanium 2 / Yukon              20        35       55      MB/sec
x64, both Katmai and Yukon     72        92       130     MB/sec (new hardware)
Data flow engine threads. Yukon: # engine threads = 5; Katmai: # engine threads = 10. Tools: Sysinternals.com Process Explorer (look at the Threads tab: CPU and context switches); Pslist.exe dtsdebughost /d; Tlist.exe dtsdebughost.
"World record" TPC-H data loading on the Unisys ES7000/one with Windows 2008 and SQL 2008: Bulk Inserts
Agenda: SQL 2008 tuning. Optimizations / configuration; filegroups / files vs write IO size.
1 file per filegroup? SQL 2008: 256 KB write IOs. 1 filegroup gives variable block sizes (64 KB to 256 KB IOs). Check for PageIOLatch_UP.
Initial: maximum performance?
Limit hit at  400 MB/sec read
64 cores / 64 Bulk Inserts with -x. Minidump analysis shows a lot of perf logging overhead. Starting SQL Server with the -x option boosts throughput:
SQL Server with startup parameter -x
Tip: Soft NUMA on 64 cores
Assign BULK INSERT tasks to dedicated CPUs (both SQL 2005 and 2008):

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\100\NodeConfiguration\Node63]
"CpuMask"=hex:00,00,00,00,00,00,00,80

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQLServer\SuperSocketNetLib\Tcp]
"ListenOnAllIPs"=dword:00000001

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQLServer\SuperSocketNetLib\Tcp\IPAll]
"TcpPort"="2000[0x00000001],2001[0x00000002],
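The masks in these registry entries follow a simple bit pattern, which the sketch below reproduces. It assumes one CPU per soft-NUMA node (node N bound to CPU N), as the Node63 entry suggests, and that each TCP port is mapped to a single node by a bitmask; the helper names are hypothetical, and the sketch does not attempt to complete the truncated TcpPort value from the slide.

```python
def cpu_mask_reg_value(cpu_index, width_bytes=8):
    """Render a single-CPU affinity mask as a .reg 'hex:' value (little-endian bytes)."""
    if not 0 <= cpu_index < width_bytes * 8:
        raise ValueError("cpu_index does not fit in the mask width")
    raw = (1 << cpu_index).to_bytes(width_bytes, "little")
    return "hex:" + ",".join(f"{b:02x}" for b in raw)

def tcp_port_entry(port, node_index):
    """Render one 'port[node bitmask]' entry in the style shown on the slide."""
    return f"{port}[0x{1 << node_index:08x}]"

# CPU 63 sets the top bit of the last byte, matching the Node63 entry.
print(cpu_mask_reg_value(63))
# The first two port entries, matching the start of the TcpPort value.
print(",".join(tcp_port_entry(2000 + n, n) for n in range(2)))
```

With this layout, a BULK INSERT session that connects to port 2000 + N would be serviced by node N and therefore by CPU N, which is the effect the slide describes.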
Tip: sharpen data types. Money type (13% improvement): use the money type instead of decimal columns. Money is stored as an 8-byte integer with 4 implied decimal digits. TDS (Tabular Data Stream) is the format SQL Server uses to transfer data over the wire, and it does not support decimal or numeric. (Both Yukon and Katmai.)
T-SQL Bulk Inserts of LineItems: 64,000 rows/sec per CPU.
World record: TPC-H data loading with SSIS 2008 Bulk Inserts
Agenda: now that SQL 2008 is fully optimized, the focus shifts to SSIS. Infrastructure: Windows 2008 network parameters; Interrupt Affinity tool settings. SSIS package optimizations.
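The "8-byte integer with 4 implied decimal digits" representation can be illustrated in a few lines. This is a sketch of the idea only: the rounding mode and the little-endian packing are assumptions made for the example, and the byte layout shown is not claimed to be SQL Server's actual storage or wire format.

```python
import struct
from decimal import Decimal, ROUND_HALF_UP

SCALE = 10_000  # four implied decimal digits

def to_money_int(value):
    """Scale a decimal amount to an integer count of 1/10000 units."""
    scaled = (Decimal(value) * SCALE).quantize(Decimal(1), rounding=ROUND_HALF_UP)
    n = int(scaled)
    if not -(2**63) <= n < 2**63:
        raise OverflowError("value does not fit in a signed 8-byte integer")
    return n

def pack_money(value):
    """Pack the scaled integer into 8 bytes (illustrative layout, little-endian)."""
    return struct.pack("<q", to_money_int(value))

def from_money_int(n):
    """Recover the decimal amount by shifting the decimal point four places."""
    return Decimal(n).scaleb(-4)

n = to_money_int("12.3456")
print(n, len(pack_money("12.3456")), from_money_int(n))
```

A fixed-width integer like this can be copied and compared without parsing a variable-precision value, which is one plausible reason for the improvement the slide reports.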
 
SSIS Base package – Control Flow
SSIS Base package: Data Flow (data types sharpened)
IntPolicy tool / Interrupt Affinity: on the ES7000 SQL Server, assign NIC interrupts and DPCs to CPUs.
Interrupt Affinity set for 8 network cards
Intel Pro/1000 MT
Apply changes to each of the (16) network cards:
0) Adaptive Inter-Frame Spacing disabled
1) Flow control = Tx & Rx enabled
2) Jumbo Packet = 9014 bytes enabled
3) Client and server Interrupt Moderation = Medium; Coalesce buffers = 256
4) Set server Rx buffers to 512 and server Tx buffers to 512
5) Set client Rx buffers to 512 and client Tx buffers to 256
6) Link speed 1000 Mbps full duplex
Other flat file best practices
- Use the Fast Parse option when possible:
  - Flat file sources / destinations
  - Data Conversion and Derived Column transformations
  - Integer data types and date/time formats
- Reduce volume where possible:
  - Don't push unneeded columns
  - Use a Conditional Split for filtering rows
- Do not parse or convert columns unnecessarily:
  - In a fixed-width format you can combine adjacent unneeded columns into one
  - Leave unneeded columns as strings
SSIS Bulk Insert package: some basic optimizations found. Elapsed time: 2 min 56 sec. DtsDebugHost.exe: 1.7 GB flat file read; 1.9 GB written to SQL Server; approx. 200 MB RAM.
IO throughput: average read bytes/sec 10 MB/sec; average write bytes/sec 12 MB/sec.
SSIS IO tuning. Observation: SSIS: 14K reads vs 465K writes (128 KB IO read); SQL: 465K reads vs 8,800 writes (256 KB IO write). Time to tune the data transport between SSIS and SQL!
Tip: increase the packet size. SSIS Connection Manager: change Packet Size from 0 to 32K.
Result: 465K writes down to 58K write IOs. Elapsed time: 2 min 36 sec (= 20 sec less).
SQL Server 2008 startup options. Use SQL startup flags: -x (do not collect perfmon data); -E (256K allocations per file; default 64K). Use network packet size 32767. Grant the Lock Pages in Memory privilege.
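The drop from 465K to 58K write IOs is what one would expect if each write carries one network packet. The sketch below checks that, assuming that Packet Size 0 falls back to SQL Server's default network packet size of 4096 bytes and that "32K" means the 32767-byte maximum; both are assumptions about the test setup, not figures stated on this slide.

```python
DEFAULT_PACKET_BYTES = 4096    # assumption: Packet Size 0 means the 4096-byte default
TUNED_PACKET_BYTES = 32767     # assumption: "32K" is the 32767-byte maximum

writes_before = 465_000        # write IOs observed before tuning (from the slide)

bytes_sent = writes_before * DEFAULT_PACKET_BYTES
expected_writes_after = bytes_sent / TUNED_PACKET_BYTES

elapsed_before = 2 * 60 + 56   # 2 min 56 sec
elapsed_after = 2 * 60 + 36    # 2 min 36 sec

print(f"data sent:       {bytes_sent / 10**9:.2f} GB")
print(f"expected writes: {expected_writes_after:,.0f}")
print(f"time saved:      {elapsed_before - elapsed_after} sec "
      f"({(elapsed_before - elapsed_after) / elapsed_before:.1%})")
```

Under these assumptions the estimate lands at about 58K writes, and the implied data volume of about 1.9 GB agrees with the amount written to SQL Server reported earlier for this package.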
 
 
Result: packages completed in less than 1800 seconds!
Windows 2008 optimizations
Change DEP (Windows 2008). Tip from the Unisys TPC-C benchmark team: bcdedit /set nx OptIn (the W2K8 DEP policy is OptOut by default).
Windows 2008 optimization: enable MPIO, discover Multi-Path IO.
Seeing Today -  Securing tomorrow [email_address]
 

HeroLympics Eng V03 Henk Vd Valk
