DENORMALISATION: PROS AND CONS Presented by Aliya Saldanha
OBJECTIVES Define terms. Describe the denormalization design process. Survey denormalization strategies. Present a comparative case study. Weigh the pros and cons of denormalization. Examine the Dangerous Illusion. Conclude.
Introduction RDBMS design spans conceptual and physical modeling levels. Conceptual diagrams are a precursor to designing relational tables. The critical issue is the level of system performance, reflected in system response time.
Normalization The normalized model is a cornerstone of every database system. It is the process of decomposing large, inefficiently structured tables into smaller, better-structured tables without losing any data in the process. Even so, there are times when we denormalize a database to enhance performance.
What is normalization? A series of steps followed to obtain a database that is consistent and avoids duplication. The process passes through fulfilling normal forms: a table is said to be in a certain normal form if it satisfies certain constraints. KEY POINTS: each table represents a single subject; redundancy is kept to a minimum; all attributes are dependent on the primary key; the stability and integrity of the E-R diagram are checked; insert, update, and delete anomalies are removed. The ladder runs from the relational db model to the normalized relational db model: 1st Normal Form, 2nd Normal Form, 3rd Normal Form, BCNF, 4th Normal Form, 5th Normal Form.
As normalization progresses… The number of relations required to represent the data of the application being normalized increases. The increased number of tables requires multiple JOINs to combine data from different tables (the more joins, the worse it gets). Queries with many complex joins require more CPU and adversely affect performance.
Practically speaking Queries run slowly. Reports take too long to print. On-screen forms take time to populate. Web pages take too long to populate. More complicated SQL is required for multi-table queries and joins. In short, extra work for the DBMS can mean slower applications.
Other issues… No calculated values. Calculated values are a fact of life for all applications, but a normalized database lacks them. Non-reproducible calculations. The application must generate them on the fly as needed; if your application changes over time, you risk not being able to reproduce prior results. Join jungles. When each fact is stored in exactly one place, it is daunting to pull together everything needed for a certain query, making queries hard to code, hard to debug, and dangerous to alter. Performance. When you face a join jungle, you almost always face performance problems.
Before denormalizing, ask: Can the system achieve acceptable performance without denormalizing? Will the performance of the system after denormalizing still be unacceptable? Will the system be unreliable due to denormalization? If the answer to any of these is "yes," avoid denormalization, because any benefit accrued will not exceed the cost.
Denormalization and Why? Frequently, performance needs dictate very quick retrieval capability for data stored in relational databases. To accomplish this, sometimes the decision is made to denormalize the physical implementation.  Denormalization is the process of putting one fact in numerous places. This speeds data retrieval at the expense of data modification.
Does it mean un-normalization? 'Denormalization' does not mean that anything goes; denormalization does not mean chaos. An un-normalized data model is one on which little or no analysis has been performed. In short, only denormalize a data model that has already been normalized.
DENORMALIZATION PROCESS 1. Develop the E-R model. 2. Refine and normalize. 3. Identify candidates for denormalization. 4. Determine the effects on data integrity. 5. Identify the form of the denormalized entity. 6. Map the conceptual schema to the physical schema.
Step 1: Development of the conceptual data model. E-R modeling aims at identifying the entities that are part of the system, the attributes that make up these entities, and the dependencies between entities. No dependency among the attributes: normalization resolves the functional dependencies between attributes. Shows data at rest: denormalization, by contrast, considers the types of queries and their frequency.
Step 2: Refinement and normalization. The ERD is further refined in order to resolve the functional dependencies between the attributes of an entity; this may lead to splitting tables to reduce data redundancy. Step 3: Identifying candidates for denormalization, based on: application performance criteria; the type of queries to be executed (update/retrieve); the frequency of queries; the number of rows accessed by each transaction; cardinality (1:1, 1:M); derived data and lookup data.
Step 4: Determine the effect on data integrity. The effect of denormalization is reviewed: denormalizing may lead to performance degradation or unacceptable consistency issues, in which case the denormalization decision must be reconsidered.
Step 5: Form for the denormalized entity. Identify what form the denormalized entity may take; we move back down the ladder of normal forms. Step 6: Map the conceptual scheme to the physical scheme. Once the scheme is tested and verified, it is implemented.
DENORMALIZATION STRATEGIES Pre-joined Tables, Report Tables, Mirror Tables, Split Tables, Redundant Data, Repeating Groups, Derivable Data, Speed Tables.
Pre-joined tables Two or more tables are joined and the result is stored as another table, used when the cost of joining is prohibitive. Example: retail store databases. The pre-joined table should contain only those columns absolutely necessary for the application to meet its processing needs, and must be re-created periodically using SQL to join the normalized tables, as sketched below.
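A minimal sketch of this strategy, using the SALES and SALES_DETAIL tables from the case study later in this deck (Oracle-style SQL assumed; product_id is an illustrative column, not taken from the deck):

drop table sales_and_details;

create table sales_and_details as
select s.sale_id,
       s.sale_date,
       d.product_id,      -- assumed column, for illustration
       d.product_qty
from   sales s
join   sales_detail d on d.sale_id = s.sale_id;

The drop-and-recreate pair is what "created periodically" amounts to in practice: the pre-joined copy is refreshed on a schedule, not maintained row by row.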
1:1 Relationships
M:M Relationship
The normalised tables
Denormalised tables
Report Tables When specialized critical reports are too costly to generate on demand, create a table that contains the report, to be viewed in online environments. The heavy formatting and data manipulation are done once, up front. A sketch follows.
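A sketch of a report table, assuming the SALES_AND_DETAILS table above and a hypothetical monthly summary report (monthly_sales_report and its columns are illustrative names):

create table monthly_sales_report as
select to_char(sale_date, 'YYYY-MM') as sale_month,  -- pre-formatted for the report
       count(distinct sale_id)       as num_sales,
       sum(product_qty)              as total_qty
from   sales_and_details
group  by to_char(sale_date, 'YYYY-MM');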
Mirror tables When tables are required concurrently by two different types of environments, for example when online processing and decision support access the same table, the table can be duplicated and the second copy used for read-only access. Example: heavy online traffic. Care must be taken to periodically migrate the foreground data to the background tables. Performance bottlenecks are thereby resolved.
Split tables When distinct groups use different parts of a table, it can be split vertically or horizontally. The original table must remain available for certain transactions.
Vertical Split Attributes are divided between the two tables, with the primary key placed in both. Particularly useful if one group of applications accesses some columns and another group accesses different columns. Example: many columns of the customer table contain data specific to credit limit assessment, whereas others contain more general contact and customer profiling information. Split the table vertically, one partition containing the credit limit information, the other the more general customer details, as sketched below.
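A sketch of the vertical split; the CUSTOMER column names are assumptions for illustration:

-- credit-assessment columns, keyed by the shared primary key
create table customer_credit as
select customer_no, credit_limit, credit_rating
from   customer;

-- general contact and profiling columns
create table customer_profile as
select customer_no, cust_name, address, phone
from   customer;

Note that the primary key, customer_no, appears in both partitions so the original rows can be reassembled by a join when needed.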
Horizontal Split Rows are divided between two tables, usually by range of key values. A UNION ALL applied later should not yield more rows than the original, un-split table contained. Example: a large customer table might be split into two tables, one for home-based customers and the other for overseas customers, as sketched below.
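A sketch of the horizontal split, assuming a region column as the discriminator (an illustrative assumption):

create table customer_home as
select * from customer where region = 'HOME';

create table customer_overseas as
select * from customer where region <> 'HOME';

-- sanity check: the reassembled rows must match the original row count
select count(*)
from   (select * from customer_home
        union all
        select * from customer_overseas);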
Redundant Data Some columns of another table are made redundant in a given table to reduce the number of table joins. Use when one or more columns from one table are accessed whenever data from another table is accessed. The original column must not be removed from its table. Best for data that is not updated often. Example: consider the DEPARTMENT and EMPLOYEE tables; if queries always require the name of the employee's department, the department name column can be carried as redundant data in the EMP table, as in the sketch below.
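A sketch using the DEPARTMENT and EMP tables named in the example (the column names are assumptions):

alter table emp add (dept_name varchar2(50));   -- redundant copy of the department name

update emp e
set    e.dept_name = (select d.dept_name
                      from   department d
                      where  d.dept_no = e.dept_no);

-- the join to DEPARTMENT is now avoided for this common query
select emp_name, dept_name from emp;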
Repeating Groups Another table is created that contains a column for every element of the group. Example: A (Customer_No, Balance_period, Balance) becomes B (Customer_No, Balance_period1, Balance_period2, Balance_period3, Balance_period4, Balance_period5). Points to remember: the data is rarely or never aggregated, averaged, or compared within the row; the data has a stable number of occurrences; the data is usually accessed collectively; the data has a predictable pattern of insertion and deletion. A DDL sketch follows.
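A DDL sketch of the two designs from the example (data types are assumptions):

-- A: normalized, one row per balance period
create table customer_balance_a (
  customer_no    number,
  balance_period number,
  balance        number(12,2),
  primary key (customer_no, balance_period)
);

-- B: denormalized, the repeating group folded into a single row
create table customer_balance_b (
  customer_no     number primary key,
  balance_period1 number(12,2),
  balance_period2 number(12,2),
  balance_period3 number(12,2),
  balance_period4 number(12,2),
  balance_period5 number(12,2)
);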
Derivable data Derived data is not stored directly in the database but is instead calculated from data that is stored there. When the cost of deriving data using complicated formulae is prohibitive, consider storing the derived data in a column instead of calculating it. Example: score calculation, sketched below. The stored derived data must be updated whenever the underlying data it is based on changes.
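One way to keep stored derived data in step with its inputs is a trigger; this sketch uses a hypothetical exam-score table (all names are assumptions):

create table exam_result (
  student_id  number,
  part1_score number,
  part2_score number,
  total_score number            -- derived: part1_score + part2_score
);

create or replace trigger trg_exam_total
before insert or update on exam_result
for each row
begin
  :new.total_score := :new.part1_score + :new.part2_score;
end;
/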
Speed tables A speed table is a denormalized version of a hierarchy: every parent has a row for every child that reports to it at any level, directly or indirectly. A speed table optionally carries information such as the level within the hierarchy and whether or not the child is at the detail-most level (the bottom of the tree). Used when a tree-like hierarchy is to be stored in the database. Data is replicated within a speed table to increase the speed of retrieval; see the sketch below.
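A sketch of a speed table for a hypothetical department hierarchy (all names are assumptions): every (ancestor, descendant) pair is materialized, so a whole subtree can be fetched in one scan.

create table dept_speed (
  parent_id   number,    -- ancestor, at any level
  child_id    number,    -- descendant, at any level
  lvl         number,    -- depth of the child below the parent (LEVEL is reserved in Oracle)
  detail_flag char(1)    -- 'Y' if the child is at the bottom of the tree
);

-- all departments under department 42, at any depth, without recursion:
select child_id
from   dept_speed
where  parent_id = 42;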
(Figure: the normalised hierarchy table compared with its denormalised speed table.)
CASE STUDY: Prejoin. A simplified retail example. Before denormalization: SALES (1) to SALES_DETAIL (M), one sale to many detail rows.
Prejoin Denormalization A simplified retail example... After denormalization: a single SALES_AND_DETAILS table.
SAMPLE QUERY Q) What was my total volume between '06-AUG-08' and '06-AUG-09'? BEFORE denormalization:
select sum(sales_detail.product_qty)
from   sales, sales_detail
where  sales.sale_id = sales_detail.sale_id
and    sales.sale_date between TO_DATE('06-AUG-08','DD-MON-YY')
                           and TO_DATE('06-AUG-09','DD-MON-YY');
Sample Query 2 Q) What was my total volume between '06-AUG-08' and '06-AUG-09'? AFTER denormalization:
select sum(product_qty)
from   sales_and_details
where  sales_and_details.sale_date between TO_DATE('06-AUG-08','DD-MON-YY')
                                       and TO_DATE('06-AUG-09','DD-MON-YY');
Sample Query 3 What happens if we ask about the number of "sales" rather than the quantity transacted? BEFORE denormalization:
select count(*)
from   sales
where  sales.sale_date between TO_DATE('06-AUG-08','DD-MON-YY')
                           and TO_DATE('06-AUG-09','DD-MON-YY');
Sample Query 4 What happens if we ask about the number of "sales" rather than the quantity transacted? AFTER denormalization:
select count(distinct sale_id)
from   sales_and_details
where  sales_and_details.sale_date between TO_DATE('06-AUG-08','DD-MON-YY')
                                       and TO_DATE('06-AUG-09','DD-MON-YY');
Note that the denormalized version must use DISTINCT to avoid over-counting sales that are repeated across their detail rows.
PROS Convenience: with stored calculated values it is far easier for programmers to generate reports, without having to write code to calculate them; it also saves CPU time. Simple queries: each eliminated JOIN yields a simpler query that is easier to get right the first time, easier to debug, and easier to keep correct when changed.
PROS The performance argument: we improve performance (speed) because we need fewer JOINs to retrieve the same number of facts. The storage argument: data is made available at the locations where it will be used; the number of foreign keys is reduced (foreign keys record how separate tables are related), and the number of indexes is reduced (foreign keys are frequently indexed).
CONS Leads to data duplication and increases the storage requirements of the database. Adds overhead for documenting design decisions, ensuring valid data, and migrating data. Having multiple copies leads to synchronization issues. Increases update time.
Physically speaking… Performance is determined entirely at the physical database level: storage and access methods, hardware, physical design, DBMS implementation details, and the degree of concurrent access.
AN ILLUSION The case for denormalization rests on a chain of claims: 1. The higher the normalization, the greater the number of tables. 2. A greater number of tables requires more joins. 3. Joins slow performance. 4. Denormalization reduces the number of tables, hence fewer joins and improved performance. The problem is that points 2 and 3 are not necessarily true, in which case point 4 does not hold; and even when they do hold, the performance gain must be weighed against the integrity cost discussed next.
It is claimed that, from the integrity perspective, there are two database design options: fully normalize the database, thereby maximizing the simplicity of integrity enforcement; or denormalize the database and complicate integrity enforcement. According to the illusion argument, the first choice is the better option. Why, then, the prevailing insistence on the second choice? The argument for denormalization is, of course, based on performance considerations.
Conclusion In a real-life project, you may have to bring back some data redundancy for performance reasons. Database design is about efficient data engineering: trade-offs in design choices, and choosing the right design for the performance requirements. As most database practitioners note, denormalization may or may not result in better performance or a more flexible data structure for users; selective denormalization is usually required. Weigh whether the perceived benefits are worth the effort to maintain the database properly. Weighing these pros and cons carefully is therefore of vital importance.
References [1] G. Lawrence Sanders & Seung Kyoon Shin, Denormalization Effects on Performance of RDBMS, Proceedings of the 34th Hawaii International Conference on System Sciences, 2001. [2] Seung Kyoon Shin & G. Lawrence Sanders, Denormalization Strategies for Data Retrieval from Data Warehouses. [3] Marsha Hanus, To Normalize or Denormalize, That is the Question, Candle Corporation. [4] Craig S. Mullins, Denormalization Guidelines, PLATINUM technology, inc., June 1, 1997. [5] Douglas B. Bock & John F. Schrage, Department of Computer Management and Information Systems, Southern Illinois University Edwardsville, in the 1996 Proceedings of the Decision Sciences Institute, Orlando, Florida, November 1996. [6] Fabian Pascal, The Dangerous Illusion: Denormalization, Performance and Integrity, Parts 1 and 2, DM Review Magazine, July 2002. [7] Zhou Wei (Tsinghua University, Beijing), Jiang Dejun (Tsinghua University), Guillaume Pierre (Vrije Universiteit Amsterdam), Chi-Hung Chi (Tsinghua University) & Maarten van Steen (Vrije Universiteit Amsterdam), Service-Oriented Data Denormalization for Scalable Web Applications, April 21-25, 2008, Beijing, China. [8] Michael J. Hernandez, Understanding Normalisation, 2001-2003. [9] Morteza Zaker, Somnuk Phon-Amnuaisuk & Su-Cheng Haw, Hierarchical Denormalizing: A Possibility to Optimize the Data Warehouse Design. [10] Eghosa Ugboma, How Valuable is Planned Data Redundancy in Maintaining the Integrity of an Information System through its Database, Florida Memorial University. [11] Zornitsa Zaharieva, Introduction to Databases, Database Design and SQL, CERN. [12] The Data Administration Newsletter, TDAN.com.
THANK YOU
Anomalies Anomalies are inconsistencies in data that occur due to unnecessary redundancy. Update anomaly: some copies of a data item are updated, but others are not. Insertion anomaly: "real" data can't be inserted without also inserting unrelated or "made-up" data. Deletion anomaly: some data can't be deleted without also deleting other, unrelated data.
First Normal Form (1NF) If a table of data meets the definition of a relation, it is in first normal form. Every relation has a unique name. Every attribute value is atomic (single-valued). Every row is unique. Attributes in tables have unique names. The order of the columns is irrelevant. The order of the rows is irrelevant.
Second Normal Form (2NF) 1NF and no partial functional dependencies. Partial functional dependency : when one or more non-key attributes are functionally dependent on part of the primary key. Every non-key attribute must be defined by the entire key, not just by part of the key. If a relation has a single attribute as its key, then it is automatically in 2NF.
Second Normal Form (2NF): a relation that is not in 2NF.
ACTIVITY (Student_ID, Activity, Fee)
Key: Student_ID, Activity
Activity → Fee (Fee is determined by Activity alone, a partial dependency)
Divide the relation into two relations that now meet 2NF:
STUDENT_ACTIVITY (Student_ID, Activity), Key: Student_ID, Activity
ACTIVITY_COST (Activity, Fee), Key: Activity; Activity → Fee
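A DDL sketch of this 2NF decomposition (data types are assumptions):

create table activity_cost (
  activity varchar2(30) primary key,
  fee      number(8,2)                 -- Activity → Fee is now a full-key dependency
);

create table student_activity (
  student_id number,
  activity   varchar2(30) references activity_cost,
  primary key (student_id, activity)
);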
Third Normal Form (3NF) 2NF and no transitive dependencies. Transitive dependency: a functional dependency between two or more non-key attributes.
A relation with a transitive dependency:
HOUSING (Student_ID, Building, Fee)
Key: Student_ID
Student_ID → Building → Fee (Building → Fee is the transitive dependency)
Divide the relation into two relations that now meet 3NF:
STUDENT_HOUSING (Student_ID, Building), Key: Student_ID; Student_ID → Building
BUILDING_COST (Building, Fee), Key: Building; Building → Fee
Third Normal Form (3NF): in 2NF, every non-key column must be mutually independent, which rules out stored calculations. Solution: put calculations in queries and forms.
OrderDetails (OrderID, Item, Quantity, Price). Put the expression in a text control or in the query: =Quantity * Price

Item     Quantity   Price   Total
Hammer   2          $10     $20
Saw      5          $40     $200
Nails    8          $1      $8
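In SQL the same rule looks like this: compute Total in the query rather than storing it (table and column names follow the slide):

select item,
       quantity,
       price,
       quantity * price as total   -- derived on the fly, never stored
from   order_details;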
BCNF 3NF and every determinant is a candidate key.
A relation where a determinant is not a candidate key.
Note: students can have a double major, with an advisor for each major; an advisor works only with students in their assigned area.
STUDENT_ADVISOR (Student_ID, Advisor, Major)
Primary Key: Student_ID, Major
Candidate Key: Student_ID, Advisor
Advisor → Major (Advisor is a determinant but not a candidate key)
Divide the relation into two relations that meet BCNF:
STUDENT_ADVISOR (Student_ID, Advisor), Key: Student_ID, Advisor
ADVISOR_MAJOR (Advisor, Major), Key: Advisor; Advisor → Major
Speed Tables