Successful Data Warehousing Techniques
By: Vikas.K.Jain
Now, if the estimates made before a battle indicate victory, it is because careful calculations show that your conditions are more favorable than those of your enemy; if they indicate defeat, it is because careful calculations show that the favorable conditions for a battle are fewer. With more careful calculations one can win; with less one cannot. How much chance of victory has one who makes no calculations at all!
- Sun Tzu, The Art of War

Business these days is war minus the shooting.
- Anonymous
What is a Data Warehouse?
A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions.
- W. H. Inmon, regarded as the father of data warehousing
Necessity is the mother of invention Why Data Warehouse?
Scenario 1
ABC Pvt Ltd is a company with branches at Mumbai, Delhi, Chennai and Bangalore. The Sales Manager wants a quarterly sales report. Each branch has a separate operational system.
Scenario 1: ABC Pvt Ltd.
[Figure: the separate branch systems in Mumbai, Delhi, Chennai and Bangalore feed the Sales Manager's report: sales per item type per branch for the first quarter.]
Solution 1: ABC Pvt Ltd.
• Extract sales information from each database.
• Store the information in a common repository at a single site.
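The two steps of Solution 1 can be sketched in a few lines of Python. This is only an illustration: the branch rows, the item types, the field layout, and the function names `extract_to_repository` and `quarterly_sales` are assumptions made for the example, not part of the original slides.

```python
from collections import defaultdict

# Hypothetical extracts from each branch's operational system:
# (item_type, month_of_sale, amount). All values are made up.
branch_databases = {
    "Mumbai": [("books", 1, 120.0), ("toys", 2, 80.0)],
    "Delhi": [("books", 3, 200.0), ("books", 5, 60.0)],
    "Chennai": [("toys", 1, 50.0)],
    "Bangalore": [("books", 2, 75.0), ("toys", 3, 25.0)],
}


def extract_to_repository(databases):
    """Copy every branch's rows into one list, tagging each row with its branch."""
    repository = []
    for branch, rows in databases.items():
        for item_type, month, amount in rows:
            repository.append(
                {"branch": branch, "item_type": item_type,
                 "month": month, "amount": amount}
            )
    return repository


def quarterly_sales(repository, quarter):
    """Total sales per (branch, item_type) for the given quarter (1 to 4)."""
    first_month = 3 * (quarter - 1) + 1
    months = range(first_month, first_month + 3)
    totals = defaultdict(float)
    for row in repository:
        if row["month"] in months:
            totals[(row["branch"], row["item_type"])] += row["amount"]
    return dict(totals)


repository = extract_to_repository(branch_databases)
q1_report = quarterly_sales(repository, quarter=1)
```

Once the rows sit in one repository, the manager's report becomes a single aggregation instead of four separate branch queries.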
Solution 2
• Extract the data needed for analysis from the operational database.
• Store it in the warehouse.
• Refresh the warehouse at regular intervals so that it contains up-to-date information for analysis.
• The warehouse will contain data with a historical perspective.
Solution 2
[Figure: data entry operators post transactions to the operational database; data is extracted from it into the data warehouse, which produces the report for the manager.]
Scenario 3
Cakes & Cookies is a small, new company. The President of the company wants the company to grow. He needs information so that he can make correct decisions.
Solution 3
• Improve the quality of data before loading it into the warehouse.
• Perform data cleaning and transformation before loading the data.
• Use query analysis tools to support ad hoc queries.
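A minimal sketch of the cleaning and transformation step, assuming the raw rows arrive as text fields. The field names `branch` and `amount`, the sample rows, and the rule of dropping rows with a missing amount are illustrative assumptions, not something the slides prescribe.

```python
# Hypothetical raw rows as they might arrive from an operational system.
raw_rows = [
    {"branch": " mumbai ", "amount": "120.50"},
    {"branch": "Delhi", "amount": ""},       # missing amount
    {"branch": "DELHI", "amount": "80"},
]


def clean(rows):
    """Trim and standardize branch names, convert amounts, drop incomplete rows."""
    cleaned = []
    for row in rows:
        amount_text = row["amount"].strip()
        if not amount_text:
            continue  # assumption: rows without an amount are rejected
        cleaned.append(
            {"branch": row["branch"].strip().title(),
             "amount": float(amount_text)}
        )
    return cleaned


clean_rows = clean(raw_rows)
```

Only `clean_rows` would be loaded into the warehouse, so every later query sees one spelling per branch and numeric amounts.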
Solution 3
[Figure: the President uses a query and analysis tool on the data warehouse; the accompanying chart plots sales over time, with the labels "Expansion" and "Improvement".]
Need for Data Warehousing
• Better business intelligence for end-users
• Reduction in time to locate, access, and analyze information
• Consolidation of disparate information sources
• Strategic advantage over competitors
• Faster time-to-market for products and services
• Replacement of older, less-responsive decision support systems
• Reduction in demand on IS to generate reports
Business Queries
Typical Business Queries
• Which product generated maximum revenue over the last two quarters in a chosen geographical region, city-wise, relative to the previous version of the product, compared with the plan?
• What percentage of customers procure product A with B in a chosen region, broken down by city, season, and income group?
Evolution of Data Warehousing
1960 - 1985: MIS Era
• Unfriendly
• Slow
• Dependent on IS programmers
• Inflexible
• Analysis limited to defined reports
Focus on Reporting
Evolution of Data Warehousing
1985 - 1990: Querying Era
Ad hoc queries: queries that are formulated by the user on the spur of the moment
• Ad hoc, unstructured access to corporate data
• SQL as interface not scalable
• Cannot handle complex analysis
Focus on Online Querying
Evolution of Data Warehousing
1990 - 20xx: Analysis Era
• Trend analysis
• What if?
• Cross-dimensional comparisons
• Statistical profiles
• Automated pattern and rule discovery
Focus on Online Analysis
Warehouse Architecture - 1: Enterprise Data Warehouse
[Figure: operational systems/data pass through data preparation (select, extract, transform, integrate, maintain) into the data warehouse and its metadata; middleware/API connects the warehouse to EIS/DSS, query tools, OLAP/ROLAP, web browsers, and data mining.]
Warehouse Architecture - 2: Single Department Data Mart
[Figure: operational systems/data pass through data preparation (select, extract, transform, integrate, maintain) into separate data marts, each with its own metadata; middleware/API connects the data marts to EIS/DSS, query tools, OLAP/ROLAP, web browsers, and data mining.]
Warehouse Architecture - 3: Multi-tiered Data Warehouse
[Figure: operational systems/data pass through data preparation (select, extract, transform, integrate, maintain) into an operational data store and the data warehouse with its metadata; the warehouse feeds data marts; middleware/API connects them to EIS/DSS, query tools, OLAP/ROLAP, web browsers, and data mining.]
Benefits of DWH
These capabilities empower the corporation...
• To formulate effective business, marketing and sales strategies.
• To precisely target promotional activity.
• To discover and penetrate new markets.
• To successfully compete in the marketplace from a position of informed strength.
• To build predictive rather than retrospective models.
OLTP Systems Vs Data Warehouse
Between OLTP and Data Warehouse systems:
• users are different
• data content is different
• data structures are different
• hardware is different
Understanding the differences is the key
OLTP vs. OLAP
OLTP
• Mostly updates
• Many small transactions
• Mb-Gb of data
• Raw data
• Clerical users
• Up-to-date data
• Consistency, recoverability critical
OLAP
• Mostly reads
• Queries long, complex
• Tb-Pb of data
• Summarized, consolidated data
• Decision-makers, analysts as users
OLTP vs. OLAP
Characteristic | OLTP | OLAP
User | Clerk, IT professional | Knowledge worker
Function | Day-to-day operations | Decision support
DB design | Application-oriented (E-R based) | Subject-oriented (star, snowflake)
Data | Current, isolated | Historical, consolidated
View | Detailed, flat relational | Summarized, multidimensional
Usage | Structured, repetitive | Ad hoc
Unit of work | Short, simple transaction | Complex query
Access | Read/write | Read mostly
Operations | Index/hash on primary key | Lots of scans
# Records accessed | Tens | Millions
# Users | Thousands | Hundreds
DB size | 100 MB-GB | 100 GB-TB
Metric | Transaction throughput | Query throughput, response
OLTP Vs Warehouse
Operational System | Data Warehouse
Transaction processing | Query processing
Predictable CPU usage | Random CPU usage
Time sensitive | History oriented
Operator view | Managerial view
Normalized, efficient design for TP | Denormalized design for query processing
OLTP Vs Warehouse
Operational System | Data Warehouse
Designed for Atomicity, Consistency, Isolation and Durability | Designed for a quiet or static database
Organized by transactions (Order, Input, Inventory) | Organized by subject (Customer, Product)
Relatively smaller database | Large database size
Many concurrent users | Relatively few concurrent users
Volatile data | Non-volatile data
OLTP Vs Warehouse
Operational System | Data Warehouse
Stores all data | Stores relevant data
Performance sensitive | Less sensitive to performance
Not flexible | Flexible
Efficiency | Effectiveness
Capacity Planning
[Figure: processing power plotted against time of day.]
Processing load peaks during the beginning and end of day.
Data Marts
Subject or application oriented business view of the warehouse
» Finance, Manufacturing, Sales etc.
» Smaller amount of data used for analytic processing
» Address a single business process
A logical subset of the complete data warehouse
Different kinds of Information Needs
• Current: Is this medicine available in stock?
• Recent: What are the tests this patient has completed so far?
• Historical: Has the incidence of tuberculosis increased in the last 5 years in the Southern region?
Warehouse Models
• Modeling data warehouses: dimensions, measures
– Star schema: a fact table in the middle connected to a set of dimension tables
– Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
– Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called a galaxy schema or fact constellation
Snowflake schema
A snowflake schema is often referred to as a normalized star schema. The dimensions are normalized to avoid data redundancy.
[Figure: a central fact table (Time Key, Product Key, Store Key, Customer Key, Unit Sales, Gross Sales) links to the Product dimension (Product Key, Product ID, Product Desc, Category), the Store dimension (Store Key, Store, City) and the Customer dimension (Customer Key, Customer ID, Name, City). The time dimension is normalized into Day, Month and Year tables, and location is normalized into City and State tables.]
Star Schema
– A single fact table may be connected to multiple dimension tables. Each dimension is represented by one table.
– It is an un-normalized form.
[Figure: a central fact table (Time Key, Product Key, Store Key, Customer Key, Unit Sales, Gross Sales) surrounded by four dimension tables: Time (Time Key, Date, Day, Month, Year), Product (Product Key, Product ID, Product Desc, Category), Store (Store Key, Store ID, City, Country, Region) and Customer (Customer Key, Customer ID, Name, City, Country).]
Star

product
prodId | name | price
p1 | bolt | 10
p2 | nut | 5

store
storeId | city
c1 | nyc
c2 | sfo
c3 | la

sale (qty and amt are the measures)
orderId | date | custId | prodId | storeId | qty | amt
o100 | 1/7/97 | 53 | p1 | c1 | 1 | 12
o102 | 2/7/97 | 53 | p2 | c1 | 2 | 11
105 | 3/8/97 | 111 | p1 | c3 | 5 | 50

customer
custId | name | address | city
53 | joe | 10 main | sfo
81 | fred | 12 main | sfo
111 | sally | 80 willow | la
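To show how a star schema is queried, the sketch below loads the product, store and sale rows from the example into an in-memory SQLite database and joins the fact table to two dimensions. The choice of SQLite, the column types, and the particular query (total quantity and amount per city and product) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE store (storeId TEXT PRIMARY KEY, city TEXT);
-- Fact table: one row per sale, with foreign keys into the dimensions.
CREATE TABLE sale (
    orderId TEXT, "date" TEXT, custId INTEGER,
    prodId TEXT, storeId TEXT, qty INTEGER, amt REAL
);
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO sale VALUES
    ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12),
    ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11),
    ('105', '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# Join the fact table to its dimensions and aggregate the measures.
rows = conn.execute("""
SELECT st.city, p.name, SUM(s.qty) AS total_qty, SUM(s.amt) AS total_amt
FROM sale AS s
JOIN store AS st ON s.storeId = st.storeId
JOIN product AS p ON s.prodId = p.prodId
GROUP BY st.city, p.name
ORDER BY st.city, p.name
""").fetchall()

for city, name, total_qty, total_amt in rows:
    print(city, name, total_qty, total_amt)
```

Every dimension is one join away from the fact table, which is what keeps star-schema queries short.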
Star schema example: Sales Fact Table
Sales fact table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales (units_sold, dollars_sold and avg_sales are the measures)
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
Dimension Hierarchies
Hierarchies: store → city → region, and store → sType (snowflake schema, constellations)

sType
tId | size | location
t1 | small | downtown
t2 | large | suburbs

store
storeId | cityId | tId | mgr
s5 | sfo | t1 | joe
s7 | sfo | t2 | fred
s9 | la | t1 | nancy

city
cityId | pop | regId
sfo | 1M | north
la | 5M | south

region
regId | name
north | cold region
south | warm region
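A dimension hierarchy lets a measure be rolled up from one level to the next. The sketch below uses the store → city → region mapping from the tables above; the per-store sales figures and the helper name `roll_up` are hypothetical and exist only to make the roll-up concrete.

```python
# Hierarchy taken from the store and city tables.
store_city = {"s5": "sfo", "s7": "sfo", "s9": "la"}
city_region = {"sfo": "north", "la": "south"}

# Hypothetical measure at the lowest level (sales per store).
store_sales = {"s5": 100, "s7": 40, "s9": 60}


def roll_up(values, parent_of):
    """Sum each child's value into its parent in the hierarchy."""
    totals = {}
    for child, value in values.items():
        parent = parent_of[child]
        totals[parent] = totals.get(parent, 0) + value
    return totals


city_sales = roll_up(store_sales, store_city)
region_sales = roll_up(city_sales, city_region)
```

Applying the same function twice climbs two levels: store totals become city totals, then region totals.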
Example of Fact Constellation
Sales fact table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales (measures: units_sold, dollars_sold, avg_sales)
Shipping fact table: time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped (measures: dollars_cost, units_shipped)
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type
What Happens without Normalization
A non-normalized database can suffer from data anomalies:
• A non-normalized database may store data representing a particular referent in multiple locations. An update to such data in some but not all of those locations results in an update anomaly, yielding inconsistent data. A normalized database prevents such an anomaly by storing such data (i.e. data other than primary keys) in only one location.
• A non-normalized database may have inappropriate dependencies, i.e. relationships between data with no functional dependencies. Adding data to such a database may require first adding the unrelated dependency. A normalized database prevents such insertion anomalies by ensuring that database relations mirror functional dependencies.
• Similarly, such dependencies in non-normalized databases can hinder deletion. That is, deleting data from such databases may require deleting data from the inappropriate dependency. A normalized database prevents such deletion anomalies by ensuring that all records are uniquely identifiable and contain no extraneous information.
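The update anomaly described above can be reproduced with plain Python structures. The order rows and customer data below are invented for the illustration; the point is only to contrast one fact stored in several places with the same fact stored once.

```python
# Non-normalized: the customer's city is repeated on every order row.
orders_flat = [
    {"order_id": 1, "customer": "joe", "city": "sfo"},
    {"order_id": 2, "customer": "joe", "city": "sfo"},
]
# Updating only one of the copies leaves the data inconsistent.
orders_flat[0]["city"] = "la"
flat_cities = {row["city"] for row in orders_flat if row["customer"] == "joe"}

# Normalized: the city is stored once, and orders refer to the customer.
customers = {"joe": {"city": "sfo"}}
orders = [
    {"order_id": 1, "customer": "joe"},
    {"order_id": 2, "customer": "joe"},
]
customers["joe"]["city"] = "la"
normalized_cities = {customers[o["customer"]]["city"] for o in orders}
```

After the update, `flat_cities` holds two different cities for the same customer, while `normalized_cities` holds exactly one.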
Warehouse or Mart First?
Data Warehouse First | Data Mart First
Expensive | Relatively cheap
Long development cycle | Delivered in < 6 months
Change management is difficult | Easy to manage change
Difficult to obtain continuous corporate support | Can lead to independent and incompatible marts
Technical challenges in building large databases | Cleansing, transformation, modeling techniques may be incompatible
Operational Data Store - Definition
A subject oriented, integrated, volatile, current valued data store containing only corporate detailed data.
• Subject oriented: "Can I see the credit report from Accounts, sales from Marketing, and the open order report from Order Entry for this customer?"
• Integrated: data from multiple sources is integrated for a subject.
• Volatile: identical queries may give different results at different times. Supports analysis requiring current data.
• Current valued: data is stored only for the current period. Old data is either archived or moved to the data warehouse.
OLTP Vs ODS Vs DWH
Characteristic | OLTP | ODS | Data Warehouse
Audience | Operating personnel | Analysts | Managers and analysts
Data access | Individual records, transaction driven | Individual records, transaction or analysis driven | Set of records, analysis driven
Data content | Current, real-time | Current and near-current | Historical
Data granularity | Detailed | Detailed and lightly summarized | Summarized and derived
Data organization | Functional | Subject-oriented | Subject-oriented
Data quality | All application-specific detailed data needed to support a business activity | All integrated data needed to support a business activity | Data relevant to management information needs
OLTP Vs ODS Vs DWH
Characteristic | OLTP | ODS | Data Warehouse
Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy
Data stability | Dynamic | Somewhat dynamic | Static
Data update | Field by field | Field by field | Controlled batch
Data usage | Highly structured, repetitive | Somewhat structured, some analytical | Highly unstructured, heuristic or analytical
Database size | Moderate | Moderate | Large to very large
Database structure stability | Stable | Somewhat stable | Dynamic
OLTP Vs ODS Vs DWH
Characteristic | OLTP | ODS | Data Warehouse
Development methodology | Requirements driven, structured | Data driven, somewhat evolutionary | Data driven, evolutionary
Operational priorities | Performance and availability | Availability | Access flexibility and end user autonomy
Philosophy | Support day-to-day operation | Support day-to-day decisions and operational activities | Support managing the enterprise
Predictability | Stable | Mostly stable, some unpredictability | Unpredictable
Response time | Sub-second | Seconds to minutes | Seconds to minutes
Return set | Small amount of data | Small to medium amount of data | Small to large amount of data
Slowly Changing Dimension (SCD): Type 1
Customer Key | Name | State
1001 | Christina | Illinois
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.
Usage:
About 50% of the time.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.
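A minimal sketch of the Type 1 approach, assuming the dimension is held as a dictionary keyed by customer key and that Christina's new state is California (as in the later SCD examples). The function name `apply_type1` is hypothetical.

```python
# Customer dimension keyed by customer key.
customer_dim = {1001: {"name": "Christina", "state": "Illinois"}}


def apply_type1(dim, customer_key, new_state):
    """Type 1: overwrite the attribute in place; the old value is not kept."""
    dim[customer_key]["state"] = new_state


apply_type1(customer_dim, 1001, "California")
```

After the call the row count is unchanged and nothing in `customer_dim` records that the state was ever Illinois, which is exactly the disadvantage listed above.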
Slowly Changing Dimension (SCD): Type 2
Customer Key | Name | State
1001 | Christina | Illinois
1005 | Christina | California
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.
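A minimal sketch of the Type 2 approach, in which a change adds a new row under a new customer key instead of overwriting the old one. The list-of-dictionaries layout and the function name `apply_type2` are assumptions; matching history by name is a simplification for this example, since a real dimension would usually carry a separate business key.

```python
# Customer dimension as a list of rows; each row has its own customer key.
type2_dim = [{"customer_key": 1001, "name": "Christina", "state": "Illinois"}]


def apply_type2(dim, name, new_state, new_key):
    """Type 2: keep the old row and append a new row with a new key."""
    dim.append({"customer_key": new_key, "name": name, "state": new_state})


apply_type2(type2_dim, "Christina", "California", new_key=1005)

# Full history is still available (matched by name here for simplicity).
history = [row["state"] for row in type2_dim if row["name"] == "Christina"]
```

The table grows by one row per change, which is the storage and ETL cost noted in the disadvantages.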
Slowly Changing Dimension (SCD): Type 3
Customer Key | Name | Original State | Present State | Date
1001 | Christina | Illinois | California | 15-Jan-03
Advantages:
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.
Usage:
Type 3 is rarely used in actual practice.
When to use Type 3:
Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
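A minimal sketch of the Type 3 approach, assuming one column for the original state and one for the present state plus an effective date. The field names and the function `apply_type3` are hypothetical; the dates follow the example in the text.

```python
type3_row = {
    "customer_key": 1001,
    "name": "Christina",
    "original_state": "Illinois",
    "present_state": "Illinois",
    "effective_date": None,
}


def apply_type3(row, new_state, effective_date):
    """Type 3: overwrite only the present-state column and its date."""
    row["present_state"] = new_state
    row["effective_date"] = effective_date


apply_type3(type3_row, "California", "15-Jan-03")
after_first_move = dict(type3_row)  # copy of the row after the first change

apply_type3(type3_row, "Texas", "15-Dec-03")
```

After the second move the row still shows Illinois as the original state and Texas as the present state, but California no longer appears anywhere, matching the disadvantage described above.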
What Is Metadata?
• Data about data
• It is important to know what data is available and where it lies for analysis
• Describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
• Covers the data being captured and loaded into the warehouse
• Documented in IT tools that improve both business and technical understanding of data and data-related processes
Consumers of Metadata
[Figure: metadata at the center, consumed by developers, users, DBAs, and software tools (ETL, modelling, OLAP, etc.).]
• What data and pre-built queries exist
• Impact analysis: impact of changes in the operational system on the data warehouse and data marts
• Support development of the data warehouse and data marts
Importance Of Metadata
Integrating information
• How do the various data perspectives connect together?
• How much time is spent trying to figure that out?
• How much do the inefficiency and lack of metadata affect decision making?
Importance Of Metadata
Locating information
• How much time is spent looking for information?
• How often is the information found?
• What poor decisions were made based on incomplete information?
• How much money was lost or earned as a result?
What Is Metadata?
Defining Metadata
Simplest definition: data about data.

Database table columns: SALE_ID, CUST_ID, ITEM, DATE, QUANTITY, UNIT_PRICE, LOCATION, PROMOTION

Metadata
TABLE_OWNER: MIS_OWNER
CREATE DATE: 25-OCT-2002 21:54:00
LAST MODIFIED DATE: 03-MAR-2003 09:30:00
LAST MODIFIED BY: MIS OWNER
PURPOSE: This table tracks customer sales
LINKS TO RELATED REPORTS: TOTAL SALES, CUSTOMER PROFILES
Granularity in Fact Table
• Granularity is a measure of the level of detail addressed by an individual entry in the fact table.
• Business needs, rather than physical implementation considerations, must determine the minimum granularity of the fact table.
• It is better to keep the data as granular as possible, even if current business needs do not require it; the additional detail might be critical for tomorrow's business analysis.
• Do not add summary records to the fact table that include detail facts already in the fact table.
• Do not mix granularities in the fact table. If several are needed, create one table for each level of granularity.
Granularity
Granularity is usually mentioned in the context of dimensional data structures (i.e., facts and dimensions) and refers to the level of detail in a given fact table. The more detail there is in the fact table, the higher its granularity, and vice versa. Another way to look at it is that the higher the granularity of a fact table, the more rows it will have.
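The link between granularity and row count can be seen by aggregating a small fact table from daily to monthly grain. The fact rows, the ISO date format, and the function name `to_monthly_grain` are assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical fact rows at daily grain: (date, product, units_sold).
daily_facts = [
    ("2003-01-05", "p1", 3),
    ("2003-01-20", "p1", 2),
    ("2003-02-01", "p1", 4),
    ("2003-01-05", "p2", 1),
]


def to_monthly_grain(facts):
    """Aggregate daily facts to one row per (month, product)."""
    totals = defaultdict(int)
    for day, product, units in facts:
        month = day[:7]  # "YYYY-MM" prefix of the ISO date
        totals[(month, product)] += units
    return dict(totals)


monthly_facts = to_monthly_grain(daily_facts)
```

The monthly table has fewer rows than the daily one, and the daily detail cannot be recovered from it, which is why the finer grain is the safer one to store. Keeping `monthly_facts` as its own table, rather than appending it to `daily_facts`, also follows the rule against mixing granularities.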
Terminology and definitions (contd.)
Data Mining
The extraction of hidden predictive information from large databases.
A class of database applications that look for hidden patterns in a group of data that can be used to predict future behavior.
Ex: data mining software can help retail companies find customers with common interests.
The term is commonly misused to describe software that presents data in new ways. True data mining software doesn't just change the presentation, but actually discovers previously unknown relationships among the data.