Multi-Dimensional Clustering: A High-Level Overview


High-level overview of the multi-dimensional table clustering feature of DB2.

Published in: Technology
  • Multi-dimensional clustering (MDC) provides a way to cluster data along multiple dimensions. MDC is primarily intended for data warehousing and decision support systems, but it can also be used in OLTP environments. As the name implies, MDC lets users physically group data into clusters based on the values of one or more dimensions, so that rows related by their dimension values are stored together. When data is inserted into an MDC table, records with different dimension values are placed in separate blocks. Each block therefore contains data for exactly one combination of dimension values, and that combination is found only in the blocks assigned to it.
  • Block: A block is the smallest allocation unit of an MDC table. A block is equivalent to an extent (a consecutive set of pages on disk). The block size determines how many pages are in a block.
    Block index: The structure of a block index is almost identical to that of a regular index. The main difference is that regular index entries point to data rows, while block index entries point to extents. For that reason, block indexes are much smaller than regular indexes while still covering the same number of rows. Block indexes, however, cannot enforce the uniqueness of rows; that requires a regular index on the column. Block indexes are system-generated and cannot be dropped. (If a block index becomes corrupted, it has to be marked for re-creation using the db2dart /MI option.)
  • Dimension block indexes are created automatically when the table is created. They cannot be dropped, but they can be renamed, because they are assigned system-generated names. They contain block identifiers (BIDs) that point to blocks of data, whereas regular indexes contain row identifiers (RIDs) that point to individual rows. A composite block index containing all columns across all dimensions is also created automatically. It is used to maintain the clustering of data during insert and update operations.
  • The block map is a control structure that records the usage status of each block (for example, free or in use). One block map is created per table or partition. Status bits include:
    In-use: the block is assigned to a cell.
    System: the block is reserved.
    Load: the block was recently loaded and is not yet visible to scans.
    Constraint: the block was recently loaded and constraint checking is required.
    Refresh: the block was recently loaded and MQTs need to be refreshed.
    The block map can be dumped using the db2dart /DM option.
  • A dimension is an axis along which data is organized in an MDC table. A dimension is an ordered set of one or more columns, which you can think of as one of the clustering keys of the table. Data having particular dimension values can be found via that dimension's axis in the grid. This cube is a way of presenting a logical view of how the data is organized in an MDC table having 3 dimensions.
  • A slice is the portion of the table that contains all the rows that have a specific value for one of the dimensions.
  • The highlighted cell contains all records that match the key (1997,Canada,yellow).
  • All you have to do to create an MDC table is to define the dimension columns (or key definitions) using the ORGANIZE BY clause.
  • The DB2_MDC_ROLLOUT registry variable determines what kind of rollout deletion is used for MDC tables.
  • Improved query performance: Block indexes are much smaller than row-level indexes because they point to extents rather than rows, and smaller indexes mean faster index lookups. Also, because block indexes are smaller and simpler, they are less likely to become corrupted than row-level indexes. Data is guaranteed to be clustered: rows with particular dimension values are found in a set of blocks that contain only records having those values. Clustering is maintained automatically over time by inserting each record into an existing block that already holds the same dimension values. Reorg is not required to recluster data, although it can still be used to reclaim space. A feature new in V9.7 (RECLAIM EXTENTS ONLY) allows users to reclaim space without taking the table offline and doing a full reorg. Prefetching is also more efficient with MDC tables. Because the data is clustered by the dimension column values, all the rows in the fetched blocks pass the predicates on the dimension columns. Compared to a non-MDC table, a higher percentage of the rows in the fetched blocks are selected, which means less I/O and better query performance.
    Reduced logging: Block indexes carry much less logging and maintenance overhead because they need to be updated only when the first record is added to a block or the last record is removed from it. A table insert does not update the dimension block indexes unless no space is available in a corresponding block. This results in fewer log entries than for a table with row-level indexes, which are updated on every insert. When a mass delete on an MDC table specifies a WHERE clause containing only dimension columns (so that entire cells are deleted), less data is logged than on a non-MDC table because only a few bytes in the blocks of each deleted cell are updated. Logging individual deleted rows in this case is unnecessary.
  • Reduced table maintenance: Because blocks in MDC tables contain only rows with identical values for the dimension columns, every inserted row is guaranteed to be clustered appropriately. Unlike a table with a clustering index, the clustering effect remains as strong after the table has been heavily updated as when it was first loaded, so there is no need to reorganize data for the purpose of reclustering.
    Reduced application dependence on index structures: Because the dimensions in an MDC table are indexed separately, the application does not need to reference them in a specific order for optimum usage. With a multi-column clustering index, in comparison, the application needs to reference the columns at the beginning of the index to use it optimally.
  • Performance: Although MDC is suitable for both DSS and OLTP environments, the DSS environment usually realizes greater performance gains, because the volume of data processed is generally larger and queries are more complex and longer running. MDC is very effective there because it reduces the amount of I/O required to satisfy queries. The OLTP environment is dominated by intensive insert/update activity and rapid query response. MDC can improve query response times, but it can slow update activity when the value in a dimension column is changed. In that case, the row must be removed from the cell in which it currently resides and placed in a suitable cell, which might involve creating a new cell and updating the dimension indexes.
    Disk space: An MDC table takes more space than the same table without MDC because extents may be only partly filled (each block holds rows for a single cell). In addition, more indexes are created.
    Table design: To design an MDC table properly, one must analyze the SQL that is used to query the table. Improper design may waste a great deal of table space and lead to much larger space requirements. For example, a key that yields too many distinct values may produce many sparsely used blocks (because a whole block is allocated for each value even when it holds only a few rows) and no performance improvement.
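
The notes above describe how inserts are routed to the blocks of a cell, why block indexes stay small, and how a poorly chosen key wastes space. The Python sketch below is a toy in-memory model of those ideas and not DB2's implementation: the class `ToyMdcTable`, the `rows_per_block` parameter (a stand-in for extent capacity), and the counters are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


class ToyMdcTable:
    """Toy model of an MDC table; every name and size here is illustrative."""

    def __init__(self, dimensions, rows_per_block):
        self.dimensions = dimensions          # e.g. ("year", "nation", "colour")
        self.rows_per_block = rows_per_block  # assumed stand-in for extent capacity
        self.cells = defaultdict(list)        # cell key -> list of blocks (lists of rows)
        self.block_index_entries = 0          # grows only when a block is allocated
        self.row_count = 0                    # a row-level index would need this many entries

    def insert(self, row):
        # The cell is identified by the row's combination of dimension values.
        key = tuple(row[d] for d in self.dimensions)
        blocks = self.cells[key]
        # A new block (and a new block index entry) is needed only when the
        # cell has no block yet or its last block is full.
        if not blocks or len(blocks[-1]) >= self.rows_per_block:
            blocks.append([])
            self.block_index_entries += 1
        blocks[-1].append(row)
        self.row_count += 1

    def block_count(self):
        return sum(len(blocks) for blocks in self.cells.values())

    def utilization(self):
        # Fraction of allocated row slots that actually hold a row.
        capacity = self.block_count() * self.rows_per_block
        return self.row_count / capacity if capacity else 0.0

    def rows_in_slice(self, dimension, value):
        # A slice is every row whose value for one dimension matches.
        i = self.dimensions.index(dimension)
        return [row
                for key, blocks in self.cells.items() if key[i] == value
                for block in blocks
                for row in block]


# Four cells (2 years x 2 nations, one colour) with five rows each.
table = ToyMdcTable(("year", "nation", "colour"), rows_per_block=4)
for year in (1997, 1998):
    for nation in ("Canada", "Mexico"):
        for n in range(5):
            table.insert({"year": year, "nation": nation, "colour": "yellow", "id": n})

print(table.block_index_entries, "block index entries for", table.row_count, "rows")
# prints: 8 block index entries for 20 rows
print(table.utilization())                              # prints: 0.625
print(len(table.rows_in_slice("nation", "Canada")))     # prints: 10

# A dimension on a unique column puts every row in its own block.
sparse = ToyMdcTable(("id",), rows_per_block=4)
for n in range(10):
    sparse.insert({"id": n})
print(sparse.block_count(), sparse.utilization())       # prints: 10 0.25
```

In the first table, each cell needs two blocks for its five rows, so 20 rows produce only 8 block index entries. In the second, a key with one distinct value per row allocates ten blocks that are each one quarter full, which is the sparse-block situation the table design note warns about.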

    1. Multi-Dimensional Clustering: A High-Level Overview. Zoran Kulina, DB2 CE Kernel Development
    2. MDC Purpose
       • One of the three methods for partitioning data in DB2 (the others being range partitioning and database partitioning).
       • Allows flexible, continuous and automatic clustering of data along multiple dimensions.
       • Primarily intended for data warehousing and large database systems; can also be used in OLTP environments.
       • Enables a table to be physically clustered on more than one key (or dimension) simultaneously.
    3. MDC Concepts
       • Block
         • MDC version of an extent
         • Consecutive set of pages on the disk
         • The smallest allocation unit of an MDC table
       • Block index
         • Automatically created
         • Points to blocks of data rather than individual rows
         • Cannot enforce uniqueness
         • Cannot be dropped
    4. MDC Concepts
       • Dimension block index
         • One per dimension
         • Used to access dimension data
       • Composite block index
         • One per table or partition
         • Contains all dimension columns
         • Used to maintain clustering of data during insert or update
    5. MDC Concepts
       • Block map
         • Maintains usage status information for blocks (extents)
         • Facilitates quick lookup of empty blocks in MDC tables
       [Figure: block map for a table clustered by region and year. Each extent in the table is marked X (reserved), F (free, no bits set), or U (in use, data assigned to a cell).]
    6. MDC Concepts
       • Dimension
         • Ordered set of one or more columns (clustering keys) of the table
         • Axis along which data is organized in an MDC table
         • Example: dimensions for nation, colour, and year
       [Figure: cube of cells with axes for the year, colour, and nation dimensions; cells include (1997, Canada, blue), (1997, Canada, yellow), (1997, Mexico, blue), (1997, Mexico, yellow), (1998, Canada, yellow), and (1998, Mexico, yellow).]
    7. MDC Concepts
       • Slice
         • Portion of the table that contains all the rows that have a specific dimension value (e.g. nation = 'Canada')
       [Figure: the same cube of cells along the year, colour, and nation dimensions, with the Canada slice highlighted.]
    8. MDC Concepts
       • Cell
         • Portion of the table that contains rows having the same unique set of dimension values
         • Intersection of slices from each dimension (e.g. all records where year=2002, country='Canada', and color='yellow')
       • Each cell contains one or more blocks.
       [Figure: the same cube along the year, colour, and nation dimensions, with the cell for (1997, Canada, yellow) highlighted.]
    9. MDC Syntax
       • ORGANIZE BY clause in CREATE TABLE:
           CREATE TABLE mdctable (
               Year INT,
               Nation CHAR(25),
               Colour VARCHAR(10),
               ...
           ) ORGANIZE BY (Year, Nation, Colour)
       • This MDC table will have four block indexes:
         • Three dimension block indexes: Year, Nation and Colour
         • One composite block index: (Year, Nation, Colour)
    10. MDC Syntax
        • DB2_MDC_ROLLOUT registry variable
          • 1, TRUE, ON, YES, IMMEDIATE (default)
          • 0, FALSE, OFF, NO
          • DEFER
        • Special register for DELETE statements
          • SET CURRENT MDC ROLLOUT MODE IMMEDIATE (immediate cleanup)
          • SET CURRENT MDC ROLLOUT MODE NONE
          • SET CURRENT MDC ROLLOUT MODE DEFERRED (deferred cleanup)
    11. MDC Benefits
        • Improved query performance
          • Block indexes are much smaller than row-level indexes
          • Data is guaranteed to be clustered
          • Prefetching is more efficient with MDC tables
        • Reduced logging
          • Block index updates are logged on insert only when a new block is needed
          • Mass deletes (rollouts) of entire cells log less data than regular deletes
    12. MDC Benefits
        • Reduced table maintenance
          • Clustering maintained automatically
          • No need for reorg except to reclaim space
        • Reduced application dependence on clustering indexes
          • No need to reference columns in a particular order for optimum usage
    13. MDC Usage Considerations
        • Performance
          • Best suited for data warehouses where queries are complex and long-running
          • Good for OLTP environments, but some update operations on MDC tables may take longer than on regular tables
        • Disk space
          • MDC tables take more space than equivalent regular tables
        • Table design
          • Poor selection of clustering key may lead to wasted disk space and no performance gain