Refactoring database



database refactoring presentation

  • Multipurpose column. If a column is being used for several purposes, it is likely that extra code exists to ensure that the source data is being used the "right way," often by checking the values of one or more other columns. An example is a column used to store either someone's birth date if he or she is a customer or the start date if that person is an employee. Worse yet, you are likely constrained in the functionality that you can now support. For example, how would you store the birth date of an employee?
    Multipurpose table. Similarly, when a table is being used to store several types of entities, there is likely a design flaw. An example is a generic Customer table that is used to store information about both people and corporations. The problem with this approach is that the data structures for people and corporations differ: people have a first, middle, and last name, for example, whereas a corporation simply has a legal name. A generic Customer table would have columns that are NULL for some kinds of customers but not others.
    Redundant data. Redundant data is a serious problem in operational databases because when data is stored in several places, the opportunity for inconsistency occurs. For example, it is quite common to discover that customer information is stored in many different places within your organization. In fact, many companies are unable to put together an accurate list of who their customers actually are. The problem is that in one table John Smith lives at 123 Main Street, and in another table at 456 Elm Street. In this case, this is actually one person who used to live at 123 Main Street but who moved last year; unfortunately, John did not submit two change of address forms to your company, one for each application that knows about him.
    Tables with too many columns. When a table has many columns, it is indicative that the table lacks cohesion, that is, it is trying to store data from several entities. Perhaps your Customer table contains columns to store three different addresses (shipping, billing, seasonal) or several phone numbers (home, work, cell, and so on). You likely need to normalize this structure by adding Address and PhoneNumber tables.
    Tables with too many rows. Large tables can be indicative of performance problems. For example, it is time-consuming to search a table with millions of rows. You may want to split the table vertically by moving some columns into another table, or split it horizontally by moving some rows into another table. Both strategies reduce the size of the table, potentially improving performance.
    "Smart" columns. A smart column is one in which different positions within the data represent different concepts. For example, if the first four digits of the client ID indicate the client's home branch, then client ID is a smart column because you can parse it to discover more granular information (for example, home branch ID). Another example is a text column used to store XML data structures; clearly, you can parse the XML data structure for smaller data fields. Smart columns often need to be reorganized into their constituent data fields at some point so that the database can easily deal with them as separate elements.
    Fear of change. If you are afraid to change your database schema because you are afraid to break something (for example, the 50 applications that access it), that is the surest sign that you need to refactor your schema. Fear of change is a good indication that you have a serious technical risk on your hands, one that will only get worse over time.
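The smart-column smell described above can be illustrated with a small sketch. It assumes a hypothetical Client table whose client_id is text and whose first four characters hold the home branch; the column name home_branch_id and the sample rows are made up for illustration, and SQLite (through Python's sqlite3 module) stands in for whatever database you actually run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table with a "smart" client_id: the first four
    -- characters encode the client's home branch.
    CREATE TABLE Client (client_id TEXT PRIMARY KEY, name TEXT);
    INSERT INTO Client VALUES ('1001000042', 'John Smith');
    INSERT INTO Client VALUES ('2002000007', 'Jane Doe');

    -- Give the embedded concept its own column so the database can
    -- treat it as a separate element.
    ALTER TABLE Client ADD COLUMN home_branch_id TEXT;
    UPDATE Client SET home_branch_id = substr(client_id, 1, 4);
""")

rows = conn.execute(
    "SELECT client_id, home_branch_id FROM Client ORDER BY client_id"
).fetchall()
print(rows)
```

Once the branch lives in its own column, queries can filter or join on it directly instead of parsing client_id each time.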
  • *Introduce referential integrity. You may want to introduce a referential integrity constraint on an existing Address.State to ensure the quality of the data.
    *Provide code lookup. Many times you want to provide a defined list of codes in your database instead of having an enumeration in every application. The lookup table is often cached in memory.
    *Replace a column constraint. When you introduced the column, you added a column constraint to ensure that a small number of correct code values persisted. But, as your application(s) evolved, you needed to introduce more code values, until you got to the point where it was easier to maintain the values in a lookup table instead of updating the column constraint.
    *Provide detailed descriptions. In addition to defining the allowable codes, you may also want to store descriptive information about the codes. For example, in the State table, you may want to relate the code CA to California.
    1. Determine the table structure. You must identify the column(s) of the lookup table (State).
    2. Introduce the table. Create State in the database via the CREATE TABLE command.
    3. Determine lookup data. You have to determine what rows are going to be inserted into State.
    4. Introduce referential constraint. To enforce referential integrity constraints from the code column in the source table(s) to State, you must apply the Add Foreign Key refactoring.
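The four steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the column names (address_id, street, state_code, name), the temporary table Address_new, and the sample rows are assumptions. SQLite cannot add a foreign key to an existing table, so the sketch rebuilds Address in step 4; many other database products let you add the constraint with an ALTER TABLE statement instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    -- Existing source table (hypothetical columns).
    CREATE TABLE Address (address_id INTEGER PRIMARY KEY, street TEXT, state TEXT);
    INSERT INTO Address (street, state) VALUES ('123 Main Street', 'CA');

    -- Steps 1-2: determine the structure and introduce the lookup table.
    CREATE TABLE State (state_code TEXT PRIMARY KEY, name TEXT NOT NULL);

    -- Step 3: load the lookup data.
    INSERT INTO State VALUES ('CA', 'California'), ('NY', 'New York');

    -- Step 4: add the foreign key. SQLite needs a table rebuild for this.
    CREATE TABLE Address_new (
        address_id INTEGER PRIMARY KEY,
        street TEXT,
        state TEXT REFERENCES State (state_code)
    );
    INSERT INTO Address_new SELECT address_id, street, state FROM Address;
    DROP TABLE Address;
    ALTER TABLE Address_new RENAME TO Address;
""")

# An unknown state code is now rejected by the database itself.
try:
    conn.execute("INSERT INTO Address (street, state) VALUES ('456 Elm Street', 'ZZ')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# The lookup table also supplies the descriptive name for a code.
state_name = conn.execute(
    "SELECT s.name FROM Address a JOIN State s ON s.state_code = a.state"
).fetchone()[0]
print(rejected, state_name)
```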
  • *Identifying a true default can be difficult. When many applications share the same database, they may have different default values for the same column, often for good reasons. Or it may simply be that your business stakeholders cannot agree on a single value, in which case you need to work closely with them to negotiate the correct value.
    *Unintended side effects. Some applications may assume that a null value within a column actually means something and will therefore exhibit different behavior now that columns in new rows that formerly would have been null are not.
    *Confused context. When a column is not used by an application, the default value may introduce confusion over the column's usage with the application team.
    1. Invariants are broken by the new value. For example, a class may assume that the value of a color column is red, green, or blue, but the default value has now been defined as yellow.
    2. Code exists to apply default values. There may now be extraneous source code that checks for a null value and introduces the default value programmatically. This code should be removed.
    3. Existing source code assumes a different default value. For example, existing code may look for the default value of none, which was set programmatically in the past, and if found it gives users the option to change the color. Now the default value is yellow, so this code will never be invoked.
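The color example above can be made concrete with a short sketch. The Product table, its columns, and the two application-side checks are hypothetical. Because SQLite cannot change the default of an existing column, the table is created with the default already in place, standing in for the schema as it would look after the refactoring.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema after the default value has been introduced.
    CREATE TABLE Product (
        product_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        color TEXT DEFAULT 'yellow'
    );
    -- A program that never sets color now gets the default, not NULL.
    INSERT INTO Product (name) VALUES ('widget');
""")
color = conn.execute(
    "SELECT color FROM Product WHERE name = 'widget'"
).fetchone()[0]

# Item 1: an invariant assumed by existing code no longer holds.
ALLOWED_COLORS = {"red", "green", "blue"}
invariant_holds = color in ALLOWED_COLORS

# Item 3: legacy code keyed on the old programmatic default 'none'
# (or on NULL) is never triggered for rows created from now on.
def offers_color_choice(stored_color):
    return stored_color is None or stored_color == "none"

print(color, invariant_holds, offers_color_choice(color))
```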
  • 1. Similar RI code. Some external programs will implement the RI business rule that will now be handled via the foreign key constraint within the database. This code should be removed.
    2. Different RI code. Some external programs will include code that enforces different RI business rules than what you are about to implement. The implication is that you either need to reconsider adding this foreign key constraint, because there is no consensus within your organization regarding the business rule that it implements, or you need to rework the code to work based on this new version (from its point of view) of the business rule.
    3. Nonexistent RI code. Some external programs will not even be aware of the RI business rule pertaining to these data tables.
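As a rough illustration of the first case (similar RI code), the sketch below contrasts an application-side existence check with simply relying on the foreign key. The Customer and Account tables and both function names are hypothetical, and SQLite stands in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE Customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Account (
        account_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES Customer (customer_id)
    );
    INSERT INTO Customer VALUES (1, 'John Smith');
""")

def open_account_with_app_check(conn, customer_id):
    """Before: the program enforces the RI rule itself (now redundant)."""
    found = conn.execute(
        "SELECT 1 FROM Customer WHERE customer_id = ?", (customer_id,)
    ).fetchone()
    if found is None:
        return False
    conn.execute("INSERT INTO Account (customer_id) VALUES (?)", (customer_id,))
    return True

def open_account(conn, customer_id):
    """After: rely on the foreign key and handle the error it raises."""
    try:
        conn.execute("INSERT INTO Account (customer_id) VALUES (?)", (customer_id,))
        return True
    except sqlite3.IntegrityError:
        return False

ok = open_account(conn, 1)        # customer exists
refused = open_account(conn, 99)  # no such customer
print(ok, refused)
```

Programs in the third case (no RI code at all) would start seeing the same integrity error, so they need at least the error handling shown in open_account.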
  • *Improve query performance. Querying a given set of tables may be very slow because of the requisite joins; therefore, a prepopulated table may improve overall performance.
    *Summarize data for reporting. Many reports require summary data, which can be prepopulated into a read-only table and then used many times over.
    *Create redundant data. Many applications query data in real time from other databases. A read-only table containing this data in your local database reduces your dependency on these other database(s), providing a buffer for when they go down or are taken down for maintenance.
    *Replace redundant reads. Several external programs, or stored procedures for that matter, often implement the same retrieval query. These queries can be replaced by a common read-only table or a new view.
    *Data security. A read-only table enables end users to query the data but not update it.
    *Improve database readability. If you have a highly normalized database, it is usually difficult for users to navigate through all the tables to get to the required information. By introducing read-only tables that capture common, denormalized data structures, you make your database schema easier to understand because people can start by focusing just on the denormalized tables.
  • *Periodic refresh. Use a scheduled job that refreshes your read-only table. The job may refresh all the data in the read-only table or it may just update the changes since the last refresh. Note that the amount of time taken to refresh the data should be less than the scheduled interval time of the refresh. This technique is particularly suited to data warehouse environments, where data is generally summarized and used the next day. Hence, stale data can be tolerated; also, this approach provides you with an easier way to synchronize the data.
    *Materialized views. Some database products provide a feature where a view is no longer just a query; instead, it is actually a table based on a query. The database keeps this materialized view current based on the options you choose when you create it. This technique enables you to use the database's built-in features to refresh the data in the materialized view, with the major downside being the complexity of the view SQL. When the view SQL gets more complicated, the database products tend not to support automated synchronization of the view.
    *Use trigger-based synchronization. Create triggers on the source tables so that source data changes are propagated to the read-only table. This technique enables you to custom code the data synchronization, which is desirable when you have complex data objects that need to be synchronized; however, you must write all of the triggers, which could be time-consuming.
    *Use real-time application updates. You can change your application so that it updates the read-only table, making the data current. This can only work when you know all the applications that are writing data to your source database tables. This technique allows the application to update the read-only table, and hence it is always kept current, and you can make sure that the data is not used by the application. The downside of the technique is that you must write your information twice, first to the original table and second to the denormalized read-only table; this could lead to duplication and hence bugs.
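Of the strategies above, trigger-based synchronization is the easiest to show in a few lines. The sketch below assumes a hypothetical Account source table and a CustomerBalance summary table that applications would treat as read-only; all table, column, and trigger names are made up, and it assumes an account's customer_id never changes. SQLite stands in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Source table (hypothetical).
    CREATE TABLE Account (
        account_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        balance INTEGER NOT NULL
    );
    -- Denormalized summary table that applications only read.
    CREATE TABLE CustomerBalance (
        customer_id INTEGER PRIMARY KEY,
        total_balance INTEGER NOT NULL
    );

    -- Propagate every change on the source table to the summary table.
    CREATE TRIGGER account_after_insert AFTER INSERT ON Account
    BEGIN
        INSERT OR IGNORE INTO CustomerBalance VALUES (NEW.customer_id, 0);
        UPDATE CustomerBalance
           SET total_balance = total_balance + NEW.balance
         WHERE customer_id = NEW.customer_id;
    END;

    CREATE TRIGGER account_after_update AFTER UPDATE OF balance ON Account
    BEGIN
        UPDATE CustomerBalance
           SET total_balance = total_balance - OLD.balance + NEW.balance
         WHERE customer_id = NEW.customer_id;
    END;

    CREATE TRIGGER account_after_delete AFTER DELETE ON Account
    BEGIN
        UPDATE CustomerBalance
           SET total_balance = total_balance - OLD.balance
         WHERE customer_id = OLD.customer_id;
    END;
""")

conn.execute("INSERT INTO Account VALUES (1, 1, 100)")
conn.execute("INSERT INTO Account VALUES (2, 1, 50)")
conn.execute("UPDATE Account SET balance = 70 WHERE account_id = 1")

total = conn.execute(
    "SELECT total_balance FROM CustomerBalance WHERE customer_id = 1"
).fetchone()[0]
print(total)
```

Even this small case needs three triggers, which reflects the cost noted above: every kind of change on every source table has to be covered by hand.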