Slide 1 - Michigan DB2 Users Group -- Home Page



  1. 1. Session: E05 DB2 Design Patterns – Solutions to Problems Rob Williams MHC Inc. 14 October 2008 • 9:30 – 10:45 Platform: Linux, Unix, and Windows
  2. 2. Agenda • Definition of design patterns • Problems and Solutions to common design patterns • Hardware Layout • Application Development • Database Design • Database Configuration • Application Architecture 2
  3. 3. What Are Design Patterns? • Design patterns are commonly used in the software development field • General reusable solution to commonly occurring problems • You may know many of these patterns • Can be a solution to a problem • When you see this, do this • Or a specific way to implement something • i.e., a pattern for DB2 configuration 3
  5. 5. Hard Disk/File System Layout Pattern • RAID Stripe Size = Extent Size * DB2 Page Size • File System Block Size = DB2 Page Size • Have seen a lot of 4 KB RAID stripe sizes. This causes dramatically higher disk activity and can lower throughput • Want to prevent disk thrashing • Hard disks are good at sequential reading, but bad at seeking 5
  6. 6. RAID Pattern • People tend to get very uptight about their RAID mode • Use RAID 1 as a bare minimum for any new installation • RAID 5 • Slow rebuild times that affect production performance • Slower insert/update/delete because of parity updates • RAID 10 for anything else 6
  7. 7. Tablespace Layout Pattern • Create a data tablespace and an index tablespace per table • Can group small tables into one tablespace • Be wary of the new initial size default of 32 MB if doing this across all tables • With tablespace level recovery in v9 this makes life much easier • Pesky not logged drop of tables • Prevents logical fragmentation • Con: extra administration 7
  8. 8. Fragmentation Testing Pattern • Tablespaces can be fragmented both logically and physically! • Depending on your IO patterns this can have a huge impact on reporting queries • How can you tell if you are suffering from fragmentation issues? • Clear the bufferpool and file system cache • Perform a SELECT count(*) FROM table WHERE not_indexed_column = 0 • This causes a table scan. Make sure the column isn't indexed, otherwise DB2 may use an index • Look at vmstat and check the read speed (22 MB/s in this example) • Compare this to cat tablespace_file > /dev/null (70 MB/s in this example) • On an unfragmented tablespace the two read speeds should be very close 8
  9. 9. File System Frag Testing • Overlooked these days but can be an issue in hybrid data marts over a long period of time • How do you test if file system overhead is an issue? • cat /dev/sda1 > /dev/null • 200 MB/s • This reads the actual data off the hard drive in a completely sequential manner. • Allows you to estimate the file system overhead in reporting situations • Can be substantial • After new tablespace and running defrag output = 160MB/s 9
  11. 11. Problem: Loop Processing Pattern • Problem: Very common for developers to write looping logic as it is natural for them • Performance is typically poor for even 30,000 rows • resultset = SELECT * FROM XYZ while(resultset->fetch_row){ SELECT something EXEC UPDATE …. } • Context: DB2 has a large number of facilities to typically process such logic in a single statement • Some solutions are presented on the following slides 11
  12. 12. Solution 1: Delta Pattern • To merge the differences into another database • Common activity in ETL processes and data warehouses • Deltas are typically implemented in some form of loop • Solution: MERGE INTO archive ar USING (SELECT activity, description FROM activities) ac ON (ar.activity = ac.activity) WHEN MATCHED AND (cond 1) THEN UPDATE SET description = ac.description WHEN NOT MATCHED THEN INSERT (activity, description) VALUES (ac.activity, ac.description) • Useful in other programming situations and more efficient than looking for SQL exception cases • Be careful with locks and unit of work size 12
  13. 13. Solution 2: Hierarchical SQL Pattern • Many developers are unfamiliar or uncomfortable with recursive SQL • Typically implemented in loop logic or recursively calling application functions that issue SQL • Context WITH temptab(deptid, empcount, superdept) AS ( SELECT root.deptid, root.empcount, root.superdept FROM departments root WHERE deptname='Production' UNION ALL SELECT sub.deptid, sub.empcount, sub.superdept FROM departments sub, temptab super WHERE sub.superdept = super.deptid ) SELECT sum(empcount) FROM temptab 13
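The recursive query on this slide can be tried end to end with a small sketch. It uses Python's built-in sqlite3 module rather than DB2, and the sample rows and the `headcount` helper are assumptions made up for illustration. SQLite is given the `RECURSIVE` keyword here, whereas the slide's DB2 form is written without it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (
    deptid    INTEGER PRIMARY KEY,
    deptname  TEXT,
    empcount  INTEGER,
    superdept INTEGER
);
-- Hypothetical sample hierarchy: Production > Assembly > Welding, Production > Painting
INSERT INTO departments VALUES
    (1, 'Production', 5, NULL),
    (2, 'Assembly',  10, 1),
    (3, 'Painting',   4, 1),
    (4, 'Welding',    6, 2),
    (5, 'Sales',      7, NULL);
""")

HEADCOUNT_SQL = """
WITH RECURSIVE temptab(deptid, empcount, superdept) AS (
    -- Anchor member: the root department
    SELECT root.deptid, root.empcount, root.superdept
    FROM departments root
    WHERE root.deptname = ?
    UNION ALL
    -- Recursive member: every department reporting to one already found
    SELECT sub.deptid, sub.empcount, sub.superdept
    FROM departments sub, temptab super
    WHERE sub.superdept = super.deptid
)
SELECT sum(empcount) FROM temptab
"""

def headcount(root_name):
    """Total employees in a department and all of its sub-departments."""
    return conn.execute(HEADCOUNT_SQL, (root_name,)).fetchone()[0]

production_total = headcount("Production")
print(production_total)
```

One statement walks the whole hierarchy, so the application issues a single round trip instead of one query per level.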
  14. 14. Solution 3: Loop Insert Pattern • To insert more than a few records for(int i = 0; i < arr_size; i++){ insert into table values (….) } • Bind arrays, either column or row based, to a prepared statement • Very low network overhead • Extremely fast 14
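As a rough illustration of binding a whole array of rows to one prepared statement, here is a sketch using Python's sqlite3 `executemany`. This is an analogue, not the DB2 CLI/JDBC array-binding API the slide refers to, and SQLite runs in-process, so it shows the shape of the call rather than the network savings. The `readings` table and its data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id INTEGER, value REAL)")

# Hypothetical data: build the full array of rows up front.
rows = [(i, i * 0.5) for i in range(1000)]

# One prepared statement, one call, one transaction for all rows,
# instead of issuing an INSERT per loop iteration.
with conn:
    conn.executemany(
        "INSERT INTO readings (sensor_id, value) VALUES (?, ?)", rows
    )

inserted = conn.execute("SELECT count(*) FROM readings").fetchone()[0]
print(inserted)
```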
  15. 15. Solution 4: Highest Balance / Moving Average Pattern • Many programmers have built ugly solutions to analyzing trends and linear data • Typically implemented in nested loops • select date_timestamp, stock_price, avg(stock_price) over (order by date_timestamp range between 90 preceding and current row) as spending_pattern from stock_prices 15
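A moving average computed by a window function, rather than nested loops, can be sketched as follows. The sketch runs on Python's sqlite3 (window functions need SQLite 3.25 or later) and, as a simplifying assumption, uses a three-row `ROWS` frame over an integer `day` column instead of the slide's 90-unit `RANGE` frame over a timestamp. The sample prices are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock_prices (day INTEGER, stock_price REAL)")
conn.executemany(
    "INSERT INTO stock_prices VALUES (?, ?)",
    [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)],  # hypothetical prices
)

# Average of the current row and the two rows before it, in day order.
rows = conn.execute("""
    SELECT day,
           stock_price,
           avg(stock_price) OVER (
               ORDER BY day
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS moving_avg
    FROM stock_prices
    ORDER BY day
""").fetchall()

moving_averages = [r[2] for r in rows]
print(moving_averages)
```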
  16. 16. Solution 5: Paging Through A Result Set Pattern • We typically see paging poorly implemented in applications using DB2, as it does not have OFFSET like open source databases and only has FETCH FIRST x ROWS ONLY • Paging is typically done with a for loop and sql->fetchrow. Lots of network traffic • The row number must be computed in a nested select, because a column alias cannot be referenced in the WHERE clause of the same select • SELECT * FROM (SELECT row_number() OVER (ORDER BY name) AS rn, other FROM employee) AS t WHERE rn BETWEEN 5 AND 10 FETCH FIRST 10 ROWS ONLY 16
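The row-number paging idea can be exercised with a small sketch on Python's sqlite3 (SQLite 3.25 or later for window functions). The `employee` rows and the `fetch_page` helper are assumptions for illustration. The row number is computed in an inner select and filtered in the outer one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?)",
    [(f"emp{i:02d}",) for i in range(1, 21)],  # hypothetical names emp01..emp20
)

PAGE_SQL = """
SELECT name FROM (
    SELECT name, row_number() OVER (ORDER BY name) AS rn
    FROM employee
) AS numbered
WHERE rn BETWEEN ? AND ?
ORDER BY rn
"""

def fetch_page(first_row, last_row):
    """Return only the requested slice; the database discards the rest."""
    return [r[0] for r in conn.execute(PAGE_SQL, (first_row, last_row))]

page = fetch_page(5, 10)
print(page)
```

Only the six requested rows cross the connection, instead of the application fetching and skipping the first four.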
  17. 17. Problem: OR and AND Simplification • SELECT * FROM t1 WHERE ((x = 'a' OR x = 'b') AND (y = 'c' OR j = 'e')) OR ((x = 'a' OR x = 'c') AND (y = 'c' OR j = 'e')) OR ((x = 'b' OR x = 'c') AND (y = 'c' OR j = 'e')) • Assume high selectivity of the predicates and full distribution statistics • One index that includes all the columns • Only a small set of rows returned • Problem: • Large amount of index space used • DB2 9 has a tendency to avoid index ANDing in the 20 – 100 million row range when OR and AND are chained • This has caused us some grief in migrations • Extra processing on select, insert, update, and delete 17
  18. 18. Solution: OR and AND Simplification • Solution - Use a generated column with an index • May reduce number of columns indexed. Increased performance through reduced index writing, simpler index access paths. • *Sometimes you do not need to rewrite queries • Tip – Have a standard prefix so developers know not to update generated columns • Consider triggers/views/LBAC to enforce development policy on generated columns 18
  19. 19. LIKE %% • SELECT * FROM TABLE WHERE COLUMN LIKE '%SOMETHING%' • Problem: • A % at the front of a pattern causes at the bare minimum a full index scan and most likely a table scan • Can potentially have problems even with 'ASDF%', as many strings may start with 'ASDF' • Solution: • Use the DB2 text extender (free), Apache Lucene, or have a word map table 19
  20. 20. Prepared Statements and Flags/Skewed Data Patterns • In general we believe prepared statements are great • Problem: Occasionally seeing a large spike in read IO on a table. Captured all the SQL and didn't see any abnormal queries. • Noticed prepared statements being used that filtered on flag columns or data with a skewed distribution. This can be an issue as the access path is only generated once. • Solution: Switched to dynamic SQL in the bean; if using stored procedures, use the REOPT(ALWAYS) bind option 20
  21. 21. Concurrency Patterns • Always use CS unless truly necessary • Don't create artificial constraints • Pessimistic locking should not be considered unless it is critical to functionality • Favor optimistic locking • Consider DB2_EVALUNCOMMITTED and DB2_SKIPDELETED • When having concurrency issues • Can be a result of denormalization of data • FOR UPDATE is typically not understood by developers. • They often think it's equivalent to CS • Can kill concurrency • Slows down runstats • COMMIT select statements or close result sets as soon as possible • Can possibly hold a row level lock longer than needed • Be careful with WITH HOLD. Can leave behind locks till the cursor is closed. 21
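One common way to implement the optimistic locking favored above is a version column that every update checks and increments. The sketch below shows that approach on Python's sqlite3. The `account` table, the `version` column, and the helper names are assumptions for illustration, not something the slides prescribe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO account VALUES (1, 100, 1)")
conn.commit()

def read_account(account_id):
    """Read without holding a lock; remember the version that was seen."""
    return conn.execute(
        "SELECT balance, version FROM account WHERE id = ?", (account_id,)
    ).fetchone()

def try_update(account_id, new_balance, expected_version):
    """Apply the update only if nobody changed the row since it was read."""
    cur = conn.execute(
        "UPDATE account SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, account_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows means a concurrent writer got there first

balance, version = read_account(1)
first = try_update(1, balance - 30, version)   # succeeds, version becomes 2
second = try_update(1, balance - 50, version)  # stale version, rejected
print(first, second, read_account(1))
```

A rejected update is the caller's cue to re-read the row and retry, so no lock is held while the user or application is thinking.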
  22. 22. MQT Federated Caching • Often overlooked: use MQTs to cache data from a federated source • Refresh nightly • Reduces network round-trip time • Allows for better optimization of access paths • Much simpler and faster than other caching implementations 22
  23. 23. DB2 Java Driver Pattern • Developers are typically confused over which Java driver to use. Normally they take the driver of the first example they find • JDBC type 2 for a local DB2 connection. Runs much faster than type 4 as the driver does system level calls instead of network calls • JDBC type 4 for a remote DB connection. Easier portability, similar performance to type 2 in this setup. It communicates with DB2 through TCP/IP 23
  24. 24. Splitting Table Pattern • General tendency to have huge central tables with a large number of flag columns, text data, and infrequently accessed data • Splitting core data, preferably fixed width, from other columns • Can speed up reporting • Reduces CPU overhead • When you need access to the other tables, ensure that they are in clustering order for a merge join to be used. That way little overhead is realized 24
  25. 25. DATABASE DESIGN 25
  26. 26. Problem: Flag Pattern • Flag columns that are typically selected and processed based on their values • Generally have reapers that run based on flag values • Requires larger indexes and slows update/insert • Context: Wasting a lot of disk space and memory on a flag index • Cardinality issues can slow updates of flags • Solution: Use MDCs on flag columns unless there are large amounts of sequential IO • Additional license is potentially required for MDCs 26
  27. 27. MDC Indexes • Dimension • “Block” index column • eg, year, region, itemId • Slice • Key value in one dimension, eg. year = 1997 • Cell • Unique set of dimension values, eg, year = 1997, region = Canada, AND itemId = 1 27
  28. 28. Statistics and Access Path Patterns • Developers use the selectivity predicate to influence the optimizer • Certain “experts” recommend bogus predicates to change access paths • Has short-term functionality but fails in the long run • IBM employs lots of smart people who work on the optimizer • Rather than hacking a solution with selectivity, instead, inform the optimizer. 28
  29. 29. Statistical View Pattern • In base DB2, statistics are on the base table and do not have information on the cardinality of the join relationship. • Statistical views allow the optimizer to compute more accurate results • Problems with correlated and skewed data across complex relationships • On larger tables poor access paths mean dramatically more CPU, IO, and elapsed time • Tendency for the optimizer to be overly optimistic on join selectivity, particularly when distributions change over time. 29
  30. 30. Statistical View Pattern • Create statistical views for common filtering columns on fact tables that are used in large joins to flakes. • Ex: • SALE_FACT (store_id, item_id, …….) • STORE(store_id, store_name, manager) • ITEM(item_id, item_class, item_desc, …..) CREATE VIEW sv_salesfact_store AS (SELECT sf.* FROM store s, sale_fact sf WHERE s.store_id = sf.store_id) ALTER VIEW sv_salesfact_store ENABLE QUERY OPTIMIZATION RUNSTATS on table sv_salesfact_store WITH DISTRIBUTION 30
  31. 31. Data Correlation Pattern • Problem: Poor performing SQL running against a large fact table with multiple filter predicates • SELECT item.* FROM item, supplier WHERE item.type = supplier.type AND item.type = 'TOOL' AND item.material = supplier.material AND item.material = 'STEEL' • Context: Looking at the explain output we noticed the use of a nested loop join when a merge/hash join should have been used • By default the optimizer assumes predicates are independent. So selectivity is calculated as: • SELECTIVITY(item.type) * SELECTIVITY(item.material) • 0.25 * 0.01 = 0.0025 • If the two columns are fully correlated (for example, every STEEL item is a TOOL), the true combined selectivity is min(0.25, 0.01) = 0.01, four times the independent estimate • 0 <= SELECTIVITY() <= 1 • Means the optimizer overestimates how much the predicates filter when data is correlated 31
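The gap between the independence assumption and the real selectivity can be checked on a toy data set. The 100 rows below are invented so that the per-column selectivities match the slide's 0.25 and 0.01, with the assumption that the single STEEL item is also a TOOL (fully correlated columns).

```python
# Hypothetical item table: 25 of 100 rows are TOOL, 1 of 100 is STEEL,
# and the STEEL row is one of the TOOL rows.
rows = [
    ("TOOL" if i < 25 else "OTHER", "STEEL" if i < 1 else "WOOD")
    for i in range(100)
]

def selectivity(predicate):
    """Fraction of rows that satisfy the predicate."""
    return sum(1 for row in rows if predicate(row)) / len(rows)

sel_type = selectivity(lambda r: r[0] == "TOOL")
sel_material = selectivity(lambda r: r[1] == "STEEL")

# What an optimizer assuming independent predicates would estimate.
independent_estimate = sel_type * sel_material

# What the data actually contains.
actual = selectivity(lambda r: r[0] == "TOOL" and r[1] == "STEEL")

print(independent_estimate, actual, actual / independent_estimate)
```

Here the independent estimate predicts a quarter of the rows that really qualify, which is the kind of underestimate that can make a nested loop join look cheaper than it is.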
  32. 32. Data Correlation Pattern Cont • Solution: Either create a multicolumn index with both the columns and collect statistics, or run runstats with column grouping. • RUNSTATS ON TABLE item ((type, material)) WITH DISTRIBUTION • How to test if you should do this? Compute the Pearson correlation coefficient of the two columns, shown here for comm and salary: • db2 "select (count(*)*sum(comm*salary) - sum(comm)*sum(salary)) / sqrt((count(*)*sum(power(comm,2)) - power(sum(comm),2)) * (count(*)*sum(power(salary,2)) - power(sum(salary),2))) from employee" • If the result is >= .3 or <= -.3 collect statistics 32
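The correlation test above is the standard Pearson coefficient, r = (nΣxy − ΣxΣy) / sqrt((nΣx² − (Σx)²)(nΣy² − (Σy)²)). A minimal Python sketch of the same calculation and the ±0.3 rule of thumb follows. The sample `salary` and `comm` values and the helper names are assumptions for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation, using the same sums as the SQL on the slide."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator

def should_collect_group_stats(r, threshold=0.3):
    """Slide's rule of thumb: collect column group statistics when |r| >= 0.3."""
    return abs(r) >= threshold

# Hypothetical data: commission is always 10% of salary, so r is 1.
salary = [1000, 2000, 3000, 4000]
comm = [100, 200, 300, 400]

r = pearson(comm, salary)
print(r, should_collect_group_stats(r))
```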
  33. 33. Fact Table Cluster Pattern • Problem: Poor performance on large joins against a large fact table even though statistics are perfect • Context: DB2 was using a nested loop join, or a hash join that overflowed, instead of a merge join, because the data was not in clustered order on both tables. • Solution: Avoid clustering central fact tables on a unique id column. Cluster on columns that will have large joins against them. Can use MDC for finer granularity 33
  34. 34. MQT OLAP Pattern • Problem: Customers trying to use MQTs make them too granular, causing either no matches or a large number of MQTs • People sometimes take design advisor recommendations without modifying or analyzing them • Can get great recommendations, other times very poor • Context: We don't want too many MQTs as that slows down overall query optimization. It also has a heavy cost on insert/update/delete. How can we make MQT matches more general? 34
  36. 36. MQT OLAP Pattern Cont • Using ROLLUP • select substr(tabschema,1,20) as SCHEMA, substr(tabname,1,30) as TABLE, count(*) as NUM_TABLES, sum(npages) as Pages from syscat.tables group by rollup(tabschema,tabname)
      SCHEMA          TABLE           NUM_TABLES  PAGES
      --------------- --------------- ----------- --------------------
      -               -                       119                 3948
      SYSIBM          -                       105                  948
      SYSTOOLS        -                         6                    3
      EATON           -                         8                 2997
      SYSIBM          SYSTABLES                 1                   41
      SYSIBM          SYSCOLUMNS                1                  233
  38. 38. Refactoring without SQL change pattern • Problem: Customer designed the system to use 64 byte (not bit) identifiers for everything. Worked fine in test, but after billions of transactions the system had huge storage requirements and was slow. • Solution: INSTEAD OF triggers were introduced in DB2 8.1. They allow all views to be updatable. • Turn the 5 largest tables into views and create a mapping table from 64 byte identifiers to bigint. • Utilized statistical views • No performance difference noticed in the application and much higher throughput in reporting 38
  39. 39. Money Column Pattern • Never allow nulls on any column that has a dollar value • Never use a float value to represent money • Loses accuracy as you move farther away from 0 • (a + b) + c is not necessarily equal to a + (b + c) • With 7 significant digits: 1234.567 + 45.67844 = 1280.245 • 1280.245 + 0.0004 = 1280.245 • but 45.67844 + 0.0004 = 45.67884 • 45.67884 + 1234.567 = 1280.246 • In new development on 9.5 always use the DECFLOAT type. • New IEEE Standard • Multiple rounding modes • ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_UP, ROUND_HALF_EVEN, ROUND_DOWN 39
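The order-of-addition effect can be reproduced with Python's `decimal` module by limiting the working precision to 7 significant digits. This is only a stand-in for a limited-precision float (binary floats round differently in detail), and the 7-digit precision is an assumption chosen to match the figures on the slide.

```python
from decimal import Decimal, localcontext

a = Decimal("1234.567")
b = Decimal("45.67844")
c = Decimal("0.0004")

with localcontext() as ctx:
    ctx.prec = 7              # keep only 7 significant digits per operation
    left = (a + b) + c        # small addend is rounded away twice
    right = a + (b + c)       # small addends are combined before rounding

# With ample precision (the default 28 digits) the sum is exact.
exact = a + b + c

print(left, right, exact)
```

The two groupings differ in the last retained digit, which is exactly the kind of discrepancy that shows up when money is summed in a different order by different reports.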
  40. 40. Data Quality Pattern • Validate IN, batch yearly • With web services these days externally validating data has become extremely simple and cheap • Address Validation • Phone Number Validation • SIN Validation • Consider batch validating or adding checks to data going in • Catching input errors right away 40
  41. 41. User Patterns • Always use a connection pool • Use trusted contexts in 9.5 LUW to be able to audit usage based on "real" user. • Segment application modules with different user IDs • Becomes extremely easy to isolate modules with performance issues/problems • Can use event monitor trace based on authid • Add permissions only as required. When refactoring, being able to tell which applications and modules access a given table is extremely useful 41
  42. 42. Constraint Pattern • Problem: Companies do not use check constraints to verify data going in and out of their database • Context: Fails the concept of not duplicating business logic. Could have the logic in the ESB but does not guarantee accuracy. • People also do not realize you can use UDFs in a CHECK constraint. • What about Java/C validation routines? • Solution: Design all check constraints as scalar functions and enforce at the dbms level. That way data is always clean. 42
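A small sketch of enforcing validation at the DBMS level follows, using Python's sqlite3. It uses plain CHECK expressions rather than the UDF-based constraints the slide recommends for DB2, and the `invoice` table, its columns, and the `try_insert` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE invoice (
        id           INTEGER PRIMARY KEY,
        amount_cents INTEGER NOT NULL CHECK (amount_cents >= 0),
        status       TEXT    NOT NULL CHECK (status IN ('OPEN', 'PAID'))
    )
""")

def try_insert(amount_cents, status):
    """Return True if the database accepted the row, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO invoice (amount_cents, status) VALUES (?, ?)",
                (amount_cents, status),
            )
        return True
    except sqlite3.IntegrityError:
        return False

results = [try_insert(1500, "OPEN"), try_insert(-1, "OPEN"), try_insert(10, "LOST")]
stored = conn.execute("SELECT count(*) FROM invoice").fetchone()[0]
print(results, stored)
```

Because the rule lives in the table definition, every application, batch job, and ad hoc session is held to it without duplicating the logic.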
  43. 43. BP Layout Pattern • Separate out bufferpool by OLTP and DW tablespaces • Then separate out by index, xda, and data • Can go for finer granularity although you risk wasting memory • Allows you to control the most critical components of what's in memory • Consider disabling STMM to make use of block based areas 43
  45. 45. Block Based Buffer Pool Pattern • Always use block based bufferpools unless you have sequential I/O upon which performance is critical • STMM doesn't support block based bufferpools - you need to enable it manually • Prevents large searches / sequential I/O from evicting frequently used pages from memory • Better reliability and average response time 45
  46. 46. Testing Parameter Pattern • Put your lock timeout as close to 1 as possible. • Find any concurrency issues in testing • Disable STMM • Its idea is to have optimal performance. Not helpful when you want to find bugs • Make your bufferpools as small as possible to simulate larger sets of data than expected • Copy production statistics to test machines • Reduce sort heap to as small a value as possible • Can help find cases of bad clustering and improper indexes • Create random network delays to simulate real world situations and create different locking patterns 46
  47. 47. Load Patterns • Use IXF over DEL and ASC • Dramatically less CPU usage and runs faster • Contains table DDL information as a plus • Don't forget the disk parallelism option when working with RAID drives • Default = number of containers • Use statistics profiles to collect stats on your load instead of doing a runstats afterwards • In 9.5 compress data during load. Not as effective as running a reorg compression but still useful • As long as the network is reasonably fast, load over network drives typically does not slow down total load time 47
  48. 48. Upgrade Pattern • People are too quick to upgrade hardware/software licenses when there are performance problems as they don't like to blame their own code • Review indexes • Ensure: • the system is tuned • you have identified the limiting resource in the system. Ex. CPU, memory • Identify the exact statements and processes that are causing the problem and validate that they are optimal 48
  49. 49. Monitoring Pattern • Health Center works but isn't great • Check out Hyperic HQ, an open source monitoring tool that allows you to easily monitor DB2 with the new SQL views • It's free and can interface with the operating system, disk subsystems, network controllers • Has support built in for Oracle, SQL Server, operating systems, etc. • Huge thanks to Fred Sabotka for letting me know about this software 49
  50. 50. Monitoring Pattern Cont • What do you want to monitor? • CPU Usage • Hard Disk Utilization • BP Hit Ratios • Hash/Sort Overflows • Deadlocks • Lock timeouts • Rollbacks • Sync Read Percentage • Average Transaction Time • Statements per minute 50
  52. 52. Active Record Pattern • From Martin Fowler in his book "Patterns of Enterprise Application Architecture" • Still see a lot of companies hard coding SQL in the presentation layer. Typically caused by quick web scripting languages • Database schema changes become a nightmare • Acquisition integration • Map relational data to classes then use the classes for presentation • CREATE TABLE student (id, first_name, last_name) • CLASS Student {int id; string first_name; string last_name;} • Very important to follow at the most basic level as it prevents tight coupling to the database • Many technologies help automate this, such as Hibernate, PureQuery, etc. 52
  53. 53. PureXML Abstraction Pattern • Even having data manager classes issue SQL to a database causes a database dependency • XML message based architecture removes any dependence on the logical/physical schema. • SOA, ESB? • XML message in/XML message out, all onto an ESB • Great when acquired, consolidating, reasonable performance • XSLT • Application only needs to be worried about the XML schema • Don't rush in, prototype, start slow • Very good tools to manage such infrastructure 53
  54. 54. Data Analysis Patterns • When people believe there is magic in some product they are willing to pay money for it • Particularly in the BI field we see many customers spending crazy amounts of money for fairly trivial algorithms • Design several summary tables in DB2 to serve as the basis for end user recommendations • Don't need to export all your data to a 3rd party product • Surprisingly trivial algorithms for fairly reasonable results • Product recommendations / similarity recommendations • Euclidean Distance • Pearson correlation • K-clustering • Price Models • Weighted KNN 54
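To show how small such an algorithm can be, here is a sketch of a Euclidean-distance similarity score used to find the customer most similar to a given one. The ratings data, the 1 / (1 + distance) scaling, and the function names are all assumptions for illustration. In practice the input would come from a summary table.

```python
from math import sqrt

# Hypothetical per-customer product ratings, e.g. loaded from a summary table.
ratings = {
    "alice": {"widget": 5.0, "gadget": 3.0, "gizmo": 4.0},
    "bob":   {"widget": 5.0, "gadget": 3.0, "gizmo": 4.0},
    "carol": {"widget": 1.0, "gadget": 5.0, "gizmo": 2.0},
}

def euclidean_similarity(a, b):
    """Score in (0, 1]: 1.0 means identical ratings on the shared products."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    distance = sqrt(sum((a[k] - b[k]) ** 2 for k in shared))
    return 1.0 / (1.0 + distance)

def most_similar(user):
    """Return the other customer whose ratings are closest to this user's."""
    others = (u for u in ratings if u != user)
    return max(others, key=lambda u: euclidean_similarity(ratings[user], ratings[u]))

print(most_similar("alice"))
```

Products rated highly by the most similar customers, and not yet bought by the target customer, are then natural recommendation candidates.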
  55. 55. Data Analysis Patterns • Grouping • K-Means • Hierarchical Clustering • Not difficult and well documented online • Typically more flexible than bundled solutions and faster • Easy to prototype 55
  56. 56. Processor Evaluation Pattern • When upgrading systems you are faced with higher clock speed or more cores • Clock Speed: • Favor clock speed if you are looking for elapsed time improvement • Cheaper • Number of cores: • # of concurrent transactions • Note we have an unhappy customer on the Niagara core due to low clock speed per core 56
  57. 57. Questions? 57
  58. 58. Session E05 DB2 Design Patterns – Solutions to Problems Rob Williams & Martin Hubel MHC Inc. / 58