D presentation
  • We invent something only when there is a need for it. Today we are going to see what data warehousing is. The data warehouse evolved to satisfy certain needs, and we will look at some of those needs now.
  • Granularity is usually mentioned in the context of dimensional data structures (i.e., facts and dimensions) and refers to the level of detail in a given fact table. The more detail there is in the fact table, the higher its granularity and vice versa. Another way to look at it is that the higher the granularity of a fact table, the more rows it will have. Let me illustrate with the following example: Say we have a data mart with a single fact (Sales) and three dimensions (Time, Organization and Product). The fact table contains three metrics (Unit Price, Units Sold and Total Sale Amount). The Time dimension consists of four hierarchical elements (Year, Quarter, Month and Day). The Organization dimension consists of three hierarchical elements (Region, District and Store). The Product dimension consists of two hierarchical elements (Product Family and SKU). As always, the metrics in the Sales fact table must be stored at some intersection of the dimensions (i.e., Time, Organization and Product). Hence, in this data mart, the highest granularity that we can store Sales metrics is by Day/Store/SKU (i.e., the lowest level in each dimensional hierarchy). Conversely, the lowest granularity that we can aggregate Sales metrics to in this data mart is by Year/Region/Product Family (i.e., the highest level in each dimensional hierarchy). We may also (for a variety of performance reasons) choose to store Sales metrics at some intermediate level of granularity (e.g., by Month/District/SKU

Presentation Transcript

  • Successful Data Warehousing Techniques. By: Vikas.K.Jain
  • Now, if the estimates made before a battle indicate victory, it is because careful calculations show that your conditions are more favorable than those of your enemy; if they indicate defeat, it is because careful calculations show that the favorable conditions for a battle are fewer. With more careful calculations one can win; with less one cannot. How much chance of victory has one who makes no calculations at all! --- Sun Tzu, The Art of War
    Business these days is war minus the shooting. - Anonymous
  • What is a Data Warehouse? A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. - W. H. Inmon (regarded as the father of data warehousing)
  • Why Data Warehouse? Necessity is the mother of invention.
  • Scenario 1: ABC Pvt Ltd is a company with branches at Mumbai, Delhi, Chennai and Bangalore. The Sales Manager wants a quarterly sales report. Each branch has a separate operational system.
  • Scenario 1: ABC Pvt Ltd. [Diagram: the Sales Manager needs sales per item type per branch for the first quarter, but the data sits in separate systems at Mumbai, Delhi, Chennai and Bangalore.]
  • Solution 1: ABC Pvt Ltd.
    • Extract sales information from each database.
    • Store the information in a common repository at a single site.
  • Solution 1: ABC Pvt Ltd. [Diagram: the Mumbai, Delhi, Chennai and Bangalore systems feed a Data Warehouse; query and analysis tools produce the report for the Sales Manager.]
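As a rough illustration of Solution 1, the sketch below copies records from four per-branch sources into one common repository and then builds the quarterly report from that single place. The branch data, the item types and the function names are all made up for the example; a real extract would read from each branch's operational database.

```python
from collections import defaultdict

# Hypothetical first-quarter extracts, one list per branch operational system.
# Each record is (item_type, sale_amount); every value here is invented.
BRANCH_SYSTEMS = {
    "Mumbai": [("grocery", 120), ("electronics", 80), ("grocery", 30)],
    "Delhi": [("grocery", 60)],
    "Chennai": [("electronics", 40)],
    "Bangalore": [("grocery", 70), ("electronics", 20)],
}


def extract_to_repository(branch_systems):
    """Copy every branch's records into one list, tagging each row with its branch."""
    repository = []
    for branch, records in branch_systems.items():
        for item_type, amount in records:
            repository.append(
                {"branch": branch, "item_type": item_type, "amount": amount}
            )
    return repository


def sales_per_item_type_per_branch(repository):
    """Total the sale amounts for each (branch, item_type) pair."""
    totals = defaultdict(int)
    for row in repository:
        totals[(row["branch"], row["item_type"])] += row["amount"]
    return dict(totals)


repository = extract_to_repository(BRANCH_SYSTEMS)
report = sales_per_item_type_per_branch(repository)
for (branch, item_type), total in sorted(report.items()):
    print(f"{branch:10} {item_type:12} {total}")
```

Once the rows share one repository, the manager's report is a single pass over one data set instead of four separate queries against four systems.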
  • Scenario 2: One Stop Shopping Super Market has a huge operational database. Whenever executives want a report, the OLTP system becomes slow and data entry operators have to wait for some time.
  • Scenario 2: One Stop Shopping. [Diagram: data entry operators wait on the operational database while management runs a report against it.]
  • Solution 2
    • Extract data needed for analysis from the operational database.
    • Store it in the warehouse.
    • Refresh the warehouse at regular intervals so that it contains up-to-date information for analysis.
    • The warehouse will contain data with a historical perspective.
  • Solution 2. [Diagram: data entry operators send transactions to the operational database; data is extracted into the data warehouse, and the manager's report is produced from the warehouse.]
  • Scenario 3: Cakes & Cookies is a small, new company. The President of the company wants his company to grow. He needs information so that he can make correct decisions.
  • Solution 3
    • Improve the quality of data before loading it into the warehouse.
    • Perform data cleaning and transformation before loading the data.
    • Use query analysis tools to support ad hoc queries.
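The cleaning and transformation step in Solution 3 might look something like the following sketch. The rules used here (trim and title-case city names, map known misspellings, reject rows whose amount is not a number) are assumptions chosen for illustration, as are the field names and sample rows; the slides do not say which rules the company would apply.

```python
# Hypothetical alias table mapping known variants to one canonical city name.
CITY_ALIASES = {"Bombay": "Mumbai", "Banglore": "Bangalore"}

# Invented raw rows as they might arrive from an operational system.
RAW_ROWS = [
    {"city": " mumbai ", "amount": "100"},
    {"city": "BANGLORE", "amount": "55.5"},
    {"city": "Delhi", "amount": "n/a"},   # amount is not a number
    {"city": "", "amount": "10"},         # city is missing
]


def clean_record(raw):
    """Return a cleaned copy of one raw row, or None if it cannot be repaired."""
    city = raw.get("city", "").strip().title()
    city = CITY_ALIASES.get(city, city)
    try:
        amount = float(raw.get("amount", ""))
    except (TypeError, ValueError):
        return None
    if not city or amount < 0:
        return None
    return {"city": city, "amount": amount}


# Only rows that survive cleaning are loaded into the warehouse.
cleaned = [row for row in map(clean_record, RAW_ROWS) if row is not None]
print(cleaned)
```

Rejected rows are simply dropped in this sketch; in practice they would more likely be logged or routed to someone for correction.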
  • Solution 3. [Diagram: the Data Warehouse feeds a query and analysis tool used by the President; a chart of sales against time is annotated "Expansion" and "Improvement".]
  • Need for Data Warehousing
    • Better business intelligence for end-users
    • Reduction in time to locate, access, and analyze information
    • Consolidation of disparate information sources
    • Strategic advantage over competitors
    • Faster time-to-market for products and services
    • Replacement of older, less-responsive decision support systems
    • Reduction in demand on IS to generate reports
  • Typical Business Queries
    • Which product generated maximum revenue over the last two quarters in a chosen geographical region, city wise, relative to the previous version of the product, compared with the plan?
    • What percent of customers procure product A with B in a chosen region, broken down by city, season, and income group?
  • Evolution of Data Warehousing. 1960 - 1985: MIS Era
    • Unfriendly
    • Slow
    • Dependent on IS programmers
    • Inflexible
    • Analysis limited to defined reports
    Focus on Reporting
  • Evolution of Data Warehousing. 1985 - 1990: Querying Era
    Queries that are formulated by the user on the spur of the moment
    • Ad hoc, unstructured access to corporate data
    • SQL as interface not scalable
    • Cannot handle complex analysis
    Focus on Online Querying
  • Evolution of Data Warehousing. 1990 - 20xx: Analysis Era
    • Trend Analysis
    • What If?
    • Cross Dimensional Comparisons
    • Statistical profiles
    • Automated pattern and rule discovery
    Focus on Online Analysis
  • Warehouse Architecture - 1: Enterprise Data Warehouse. [Diagram: operational systems/data pass through data preparation (select, extract, transform, integrate, maintain) into the data warehouse and its metadata; a middleware/API layer serves EIS/DSS, query tools, OLAP/ROLAP, web browsers and data mining.]
  • Warehouse Architecture - 2: Single Department Data Mart. [Diagram: operational systems/data pass through data preparation (select, extract, transform, integrate, maintain) into several data marts, each with its own metadata; a middleware/API layer serves EIS/DSS, query tools, OLAP/ROLAP, web browsers and data mining.]
  • Warehouse Architecture - 3: Multi-tiered Data Warehouse. [Diagram: operational systems/data pass through data preparation (select, extract, transform, integrate, maintain) into an operational data store, a data warehouse with metadata, and data marts; a middleware/API layer serves EIS/DSS, query tools, OLAP/ROLAP, web browsers and data mining.]
  • Benefits of DWH
    These capabilities empower the corporation...
    • To formulate effective business, marketing and sales strategies.
    • To precisely target promotional activity.
    • To discover and penetrate new markets.
    • To successfully compete in the marketplace from a position of informed strength.
    • To build predictive rather than retrospective models.
  • OLTP Systems Vs Data Warehouse
    Between OLTP and data warehouse systems:
    • users are different
    • data content is different
    • data structures are different
    • hardware is different
    Understanding the differences is the key.
  • OLTP vs. OLAP
    OLTP:
    • Mostly updates
    • Many small transactions
    • MB-GB of data
    • Raw data
    • Clerical users
    • Up-to-date data
    • Consistency, recoverability critical
    OLAP:
    • Mostly reads
    • Queries long, complex
    • TB-PB of data
    • Summarized, consolidated data
    • Decision-makers, analysts as users
  • OLTP vs. OLAP
    Characteristic | OLTP | OLAP
    User | Clerk, IT professional | Knowledge worker
    Function | Day-to-day operations | Decision support
    DB design | Application-oriented (E-R based) | Subject-oriented (star, snowflake)
    Data | Current, isolated | Historical, consolidated
    View | Detailed, flat relational | Summarized, multidimensional
    Usage | Structured, repetitive | Ad hoc
    Unit of work | Short, simple transaction | Complex query
    Access | Read/write | Read mostly
    Operations | Index/hash on primary key | Lots of scans
    # Records accessed | Tens | Millions
    # Users | Thousands | Hundreds
    DB size | 100 MB-GB | 100 GB-TB
    Metric | Transaction throughput | Query throughput, response
  • OLTP Vs Warehouse
    Operational System | Data Warehouse
    Transaction processing | Query processing
    Predictable CPU usage | Random CPU usage
    Time sensitive | History oriented
    Operator view | Managerial view
    Normalized, efficient design for TP | Denormalized design for query processing
  • OLTP Vs Warehouse
    Operational System | Data Warehouse
    Designed for Atomicity, Consistency, Isolation and Durability | Designed for a quiet or static database
    Organized by transactions (Order, Input, Inventory) | Organized by subject (Customer, Product)
    Relatively smaller database | Large database size
    Many concurrent users | Relatively few concurrent users
    Volatile data | Non-volatile data
  • OLTP Vs Warehouse
    Operational System | Data Warehouse
    Stores all data | Stores relevant data
    Performance sensitive | Less sensitive to performance
    Not flexible | Flexible
    Efficiency | Effectiveness
  • Capacity Planning. [Chart: processing power against time of day.] Processing load peaks during the beginning and end of the day.
  • Examples Of Some Applications
    Manufacturers, Retailers, Customers:
    • Target Marketing
    • Market Segmentation
    • Budgeting
    • Credit Rating Agencies
    • Financial Reporting and Consolidation
    • Market Basket Analysis - POS Analysis
    • Fraud Management
    • Profitability Management
    • Event tracking
  • Data Marts
    Subject or application oriented business view of the warehouse
    » Finance, Manufacturing, Sales etc.
    » Smaller amount of data used for analytic processing
    » Address a single business process
    A logical subset of the complete data warehouse
  • Different kinds of Information Needs
    • Current: Is this medicine available in stock?
    • Recent: What are the tests this patient has completed so far?
    • Historical: Has the incidence of tuberculosis increased in the last 5 years in the Southern region?
  • Warehouse Models
    • Modeling data warehouses: dimensions, measures
      - Star schema: a fact table in the middle connected to a set of dimension tables
      - Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
      - Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called a galaxy schema or fact constellation
  • Snowflake schema
    A snowflake schema is referred to as a normalized star schema. The dimensions are normalized to avoid data redundancy.
    [Diagram: a fact table (Time Key, Product Key, Store Key, Customer Key, Unit sales, Gross sales) linked to Product (Product Key, Product ID, Product Desc, Category), Store, Customer and time dimensions; the time dimension is split into Day, Month and Year tables, and the Store and Customer locations are split into City, State and Country tables.]
  • Star Schema
    - A single fact table may be connected to multiple dimension tables. Each dimension is represented by one table.
    - It is an unnormalized form.
    [Diagram: fact table (Time Key, Product Key, Store Key, Customer Key, Unit sales, Gross sales) surrounded by four dimension tables:
      Time: Time Key, Date, Day, Month, Year
      Product: Product Key, Product ID, Product Desc, Category
      Store: Store Key, Store ID, City, Country, Region
      Customer: Customer Key, Customer ID, Name, City, Country]
  • Star
    product:
        prodId | name | price
        p1 | bolt | 10
        p2 | nut | 5
    store:
        storeId | city
        c1 | nyc
        c2 | sfo
        c3 | la
    sale (fact table; qty and amt are the measures):
        orderId | date | custId | prodId | storeId | qty | amt
        o100 | 1/7/97 | 53 | p1 | c1 | 1 | 12
        o102 | 2/7/97 | 53 | p2 | c1 | 2 | 11
        105 | 3/8/97 | 111 | p1 | c3 | 5 | 50
    customer:
        custId | name | address | city
        53 | joe | 10 main | sfo
        81 | fred | 12 main | sfo
        111 | sally | 80 willow | la
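To see how a star schema is queried, the sketch below loads the product, store and sale rows from the slide above into an in-memory SQLite database and joins the fact table to two of its dimensions. The customer table is left out to keep the example short, and the query itself is an illustration rather than something taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store   (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE sale    (orderId TEXT, date TEXT, custId INTEGER,
                      prodId TEXT, storeId TEXT, qty INTEGER, amt INTEGER);

INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store   VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO sale    VALUES
    ('o100', '1/7/97', 53,  'p1', 'c1', 1, 12),
    ('o102', '2/7/97', 53,  'p2', 'c1', 2, 11),
    ('105',  '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# Join the fact table to its dimensions and total the measures
# for each store city and product name.
rows = conn.execute("""
    SELECT store.city, product.name, SUM(sale.qty), SUM(sale.amt)
    FROM sale
    JOIN store   ON store.storeId  = sale.storeId
    JOIN product ON product.prodId = sale.prodId
    GROUP BY store.city, product.name
    ORDER BY store.city, product.name
""").fetchall()

for city, name, qty, amt in rows:
    print(city, name, qty, amt)
```

The fact table holds only keys and measures; descriptive attributes such as the city or the product name come from the dimension tables through the joins.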
  • [Diagram: star schema around a Sales Fact Table.
      Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
      time: time_key, day, day_of_the_week, month, quarter, year
      item: item_key, item_name, brand, type, supplier_type
      branch: branch_key, branch_name, branch_type
      location: location_key, street, city, state_or_province, country]
  • Dimension Hierarchies
    Hierarchy: store rolls up to city and then region; store also rolls up to sType.
    sType:
        tId | size | location
        t1 | small | downtown
        t2 | large | suburbs
    store:
        storeId | cityId | tId | mgr
        s5 | sfo | t1 | joe
        s7 | sfo | t2 | fred
        s9 | la | t1 | nancy
    city:
        cityId | pop | regId
        sfo | 1M | north
        la | 5M | south
    region:
        regId | name
        north | cold region
        south | warm region
    Related: snowflake schema, constellations
  • Example of Fact Constellation
    [Diagram: two fact tables sharing dimension tables.
      Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
      Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
      time: time_key, day, day_of_the_week, month, quarter, year
      item: item_key, item_name, brand, type, supplier_type
      branch: branch_key, branch_name, branch_type
      location: location_key, street, city, province_or_state, country
      shipper: shipper_key, shipper_name, location_key, shipper_type]
  • Cube
    Fact table view (sale):
        prodId | storeId | amt
        p1 | c1 | 12
        p2 | c1 | 11
        p1 | c3 | 50
        p2 | c2 | 8
    Multi-dimensional cube:
             c1   c2   c3
        p1   12        50
        p2   11    8
    dimensions = 2
    Recall counters in Apriori
  • 3-D Cube
    Fact table view (sale):
        prodId | storeId | date | amt
        p1 | c1 | 1 | 12
        p2 | c1 | 1 | 11
        p1 | c3 | 1 | 50
        p2 | c2 | 1 | 8
        p1 | c1 | 2 | 44
        p1 | c2 | 2 | 4
    Multi-dimensional cube:
        day 1    c1   c2   c3
        p1       12        50
        p2       11    8
        day 2    c1   c2   c3
        p1       44    4
        p2
    dimensions = 3
  • Aggregates
    • Add up amounts for day 1
    • In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
    sale:
        prodId | storeId | date | amt
        p1 | c1 | 1 | 12
        p2 | c1 | 1 | 11
        p1 | c3 | 1 | 50
        p2 | c2 | 1 | 8
        p1 | c1 | 2 | 44
        p1 | c2 | 2 | 4
    Result: 81
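The query on this slide can be run as-is against the six sale rows. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the choice of SQLite and the column types are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12),
        ("p2", "c1", 1, 11),
        ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),
        ("p1", "c1", 2, 44),
        ("p1", "c2", 2, 4),
    ],
)

# The aggregate from the slide: add up amounts for day 1.
(day1_total,) = conn.execute(
    "SELECT sum(amt) FROM sale WHERE date = 1"
).fetchone()
print(day1_total)  # 12 + 11 + 50 + 8 = 81
```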
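  • Cube Aggregation. Example: computing sums
    Starting cube (amt by product and store, one slice per day):
        day 1    c1   c2   c3
        p1       12        50
        p2       11    8
        day 2    c1   c2   c3
        p1       44    4
        p2
    Summed over days:
                 c1   c2   c3
        p1       56    4   50
        p2       11    8
    Summed over days and products:
                 c1   c2   c3
        sum      67   12   50
    Summed over days and stores: p1 = 110, p2 = 19
    Grand total: 129
    Moving from the detail toward these sums is a rollup; moving back toward the detail is a drill-down.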
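The sums in this example can be reproduced with a small grouping function. This is a plain-Python sketch rather than what an OLAP engine does internally; `rollup` and its `keep` argument are names invented for the illustration.

```python
from collections import defaultdict

# The six fact rows from the 3-D cube: (prodId, storeId, date, amt).
SALES = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]
DIMENSIONS = ("prodId", "storeId", "date")


def rollup(rows, keep):
    """Sum amt over every dimension that is not named in keep."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[3]
    return dict(totals)


by_product_and_store = rollup(SALES, ("prodId", "storeId"))  # days summed away
by_store = rollup(SALES, ("storeId",))
by_product = rollup(SALES, ("prodId",))
grand_total = rollup(SALES, ())

print(by_product_and_store)
print(by_store)
print(by_product)
print(grand_total)
```

Each call keeps fewer dimensions than the one before, which is the rollup direction; going back from `by_store` to `by_product_and_store` corresponds to a drill-down.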
  • Cube Operators
    The notation sale(store, product, day) addresses a cell of the cube, where * means "summed over every value of that dimension". Using the same cube and sums as the cube aggregation example:
    • sale(c1,*,*) = 67
    • sale(c2,p2,*) = 8
    • sale(*,*,*) = 129
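The cube operator notation can be mimicked with a function that treats "*" as a wildcard. This is only a sketch over the six fact rows from the earlier slides; the function name `sale` and its argument order (store, product, day) follow the slide's notation, while the implementation is an assumption.

```python
# The six fact rows from the 3-D cube: (prodId, storeId, date, amt).
SALES = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]


def sale(store="*", product="*", day="*"):
    """Sum amt over matching rows; "*" matches every value of that dimension."""
    total = 0
    for prod_id, store_id, date, amt in SALES:
        if (
            store in ("*", store_id)
            and product in ("*", prod_id)
            and day in ("*", date)
        ):
            total += amt
    return total


print(sale("c1", "*", "*"))
print(sale("c2", "p2", "*"))
print(sale("*", "*", "*"))
```

This version rescans the fact rows on every call; storing the `*` totals ahead of time, as in an extended cube, trades space for faster lookups.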
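  • Extended Cube
    The cube is extended with a * entry on every dimension that holds the sum over that dimension:
        day 1    c1   c2   c3    *
        p1       12        50   62
        p2       11    8        19
        *        23    8   50   81
        day 2    c1   c2   c3    *
        p1       44    4        48
        p2
        *        44    4        48
        day *    c1   c2   c3    *
        p1       56    4   50  110
        p2       11    8        19
        *        67   12   50  129
    Example: sale(*,p2,*) = 19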
  • What Happens without Normalization
    A non-normalized database can suffer from data anomalies:
    • A non-normalized database may store data representing a particular referent in multiple locations. An update to such data in some but not all of those locations results in an update anomaly, yielding inconsistent data. A normalized database prevents such an anomaly by storing such data (i.e. data other than primary keys) in only one location.
    • A non-normalized database may have inappropriate dependencies, i.e. relationships between data with no functional dependencies. Adding data to such a database may require first adding the unrelated dependency. A normalized database prevents such insertion anomalies by ensuring that database relations mirror functional dependencies.
    • Similarly, such dependencies in non-normalized databases can hinder deletion. That is, deleting data from such databases may require deleting data from the inappropriate dependency. A normalized database prevents such deletion anomalies by ensuring that all records are uniquely identifiable and contain no extraneous information.
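A small sketch of the update anomaly described above. The tables are plain Python structures and the values (store c1, the cities "nyc" and "boston") are made up for illustration; the point is only that a fact stored in two places can be changed in one of them and left stale in the other.

```python
# Non-normalized: the store's city is repeated on every sale row.
denormalized_sales = [
    {"order": "o100", "store": "c1", "store_city": "nyc"},
    {"order": "o102", "store": "c1", "store_city": "nyc"},
]

# An update reaches only one of the two copies (hypothetical new value).
denormalized_sales[0]["store_city"] = "boston"
cities_seen = {
    row["store_city"] for row in denormalized_sales if row["store"] == "c1"
}
print(cities_seen)  # two different cities for the same store: inconsistent

# Normalized: the city is stored once, and sales refer to the store by key.
stores = {"c1": {"city": "nyc"}}
sales = [
    {"order": "o100", "store": "c1"},
    {"order": "o102", "store": "c1"},
]

# A single update is seen by every sale that references the store.
stores["c1"]["city"] = "boston"
cities_seen_normalized = {stores[row["store"]]["city"] for row in sales}
print(cities_seen_normalized)
```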
  • Warehouse or Mart First?
    Data Warehouse First | Data Mart First
    Expensive | Relatively cheap
    Large development cycle | Delivered in < 6 months
    Change management is difficult | Easy to manage change
    Difficult to obtain continuous corporate support | Can lead to independent and incompatible marts
    Technical challenges in building large databases | Cleansing, transformation, modeling techniques may be incompatible
  • Operational Data Store - Definition
    A subject oriented, integrated, volatile, current valued data store containing only corporate detailed data.
    • Example request: "Can I see the credit report from Accounts, sales from Marketing and the open order report from Order Entry for this customer?"
    • Data from multiple sources is integrated for a subject.
    • Identical queries may give different results at different times. Supports analysis requiring current data.
    • Data is stored only for the current period. Old data is either archived or moved to the data warehouse.
  • OLTP Vs ODS Vs DWH
    Characteristic | OLTP | ODS | Data Warehouse
    Audience | Operating personnel | Analysts | Managers and analysts
    Data access | Individual records, transaction driven | Individual records, transaction or analysis driven | Set of records, analysis driven
    Data content | Current, real-time | Current and near-current | Historical
    Data granularity | Detailed | Detailed and lightly summarized | Summarized and derived
    Data organization | Functional | Subject-oriented | Subject-oriented
    Data quality | All application specific detailed data needed to support a business activity | All integrated data needed to support a business activity | Data relevant to management information needs
  • OLTP Vs ODS Vs DWH
    Characteristic | OLTP | ODS | Data Warehouse
    Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy
    Data stability | Dynamic | Somewhat dynamic | Static
    Data update | Field by field | Field by field | Controlled batch
    Data usage | Highly structured, repetitive | Somewhat structured, some analytical | Highly unstructured, heuristic or analytical
    Database size | Moderate | Moderate | Large to very large
    Database structure stability | Stable | Somewhat stable | Dynamic
  • OLTP Vs ODS Vs DWH
    Characteristic | OLTP | ODS | Data Warehouse
    Development methodology | Requirements driven, structured | Data driven, somewhat evolutionary | Data driven, evolutionary
    Operational priorities | Performance and availability | Availability | Access flexibility and end user autonomy
    Philosophy | Support day-to-day operation | Support day-to-day decisions & operational activities | Support managing the enterprise
    Predictability | Stable | Mostly stable, some unpredictability | Unpredictable
    Response time | Sub-second | Seconds to minutes | Seconds to minutes
    Return set | Small amount of data | Small to medium amount of data | Small to large amount of data
  • SCD (Slowly Changing Dimension): Type 1
    Customer Key | Name | State
    1001 | Christina | Illinois
    Advantages:
    - This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
    Disadvantages:
    - All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.
    Usage: About 50% of the time.
    When to use Type 1: Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.
  • SCD (Slowly Changing Dimension): Type 2
    Customer Key | Name | State
    1001 | Christina | Illinois
    1005 | Christina | California
    Advantages:
    - This allows us to accurately keep all historical information.
    Disadvantages:
    - This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
    - This necessarily complicates the ETL process.
    Usage: About 50% of the time.
    When to use Type 2: Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.
  • SCD (Slowly Changing Dimension): Type 3
    Customer Key | Name | Original State | Present State | Date
    1001 | Christina | Illinois | California | 15-Jan-03
    Advantages:
    - This does not increase the size of the table, since new information is updated.
    - This allows us to keep some part of history.
    Disadvantages:
    - Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.
    Usage: Type 3 is rarely used in actual practice.
    When to use Type 3: Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
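The three slowly changing dimension types can be compared side by side on the Christina example. The sketch below uses plain dictionaries in place of dimension rows; the function and field names are invented for the illustration, and the date strings are kept in the slide's format.

```python
def apply_type1(row, new_state):
    """Type 1: overwrite the attribute in place; the old value is lost."""
    return {**row, "state": new_state}


def apply_type2(rows, name, new_state, new_key):
    """Type 2: add a new row under a new key; existing rows are kept."""
    return rows + [{"key": new_key, "name": name, "state": new_state}]


def apply_type3(row, new_state, effective_date):
    """Type 3: keep the original value in one column, the present value in another."""
    return {**row, "present_state": new_state, "effective_date": effective_date}


original = {"key": 1001, "name": "Christina", "state": "Illinois"}

# Christina moves from Illinois to California.
type1 = apply_type1(original, "California")
type2 = apply_type2([original], "Christina", "California", new_key=1005)

type3_start = {
    "key": 1001,
    "name": "Christina",
    "original_state": "Illinois",
    "present_state": "Illinois",
    "effective_date": None,
}
type3 = apply_type3(type3_start, "California", "15-Jan-03")
# A second move overwrites the present state, so California is lost.
type3_later = apply_type3(type3, "Texas", "15-Dec-03")

print(type1)
print(type2)
print(type3)
print(type3_later)
```

Type 1 keeps one row and no history, Type 2 keeps full history at the cost of extra rows, and Type 3 keeps one row with only the original and the present value.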
  • What Is Metadata?
    • Data about data
    • It is important to know what data is available and where it lies for analysis
    • Describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse
    • Covers the data being captured and loaded into the warehouse
    • Documented in IT tools that improve both business and technical understanding of data and data-related processes
  • Consumers of Metadata
    [Diagram: metadata at the center, used by the developer, the user, the DBA and software tools (ETL, modelling, OLAP etc.).]
    • Impact analysis: impact of changes in the operational system on the data warehouse and data mart
    • What data and pre-built queries exist
    • Support development of the data warehouse and data mart
  • Importance Of MetadataIntegrating information How various data perspectives connect together? How much time is spent trying to figure out that? How much does the inefficiency and lack of metadata affect decision making
  • Importance Of MetadataLocating Information Time spent in looking for information. How often information is found? What poor decisions were made based on the incomplete information? How much money was lost or earned as a result?
  • What Is Metadata?Defining Metadata  Simplest definition – Data about data. Data base table Metadata SALE_ID TABLE_OWNER: MIS_OWNER CUST_ID CREATE DATE: 25 –OCT – 2002 21:54:00 ITEM LAST MODIFIED DATE 03-MAR-2003 09:30:00 LAST MODIFIED BY :MIS OWNER DATE PURPOSE: This table tracks customer sales QUANTITY LINKS TO RELATED REPORTS: TOTAL SALES, CUSTOMER PROFILES UNIT_PRICE LOCATION PROMOTION
  • Granularity in Fact Table
    • Granularity is a measure of the level of detail addressed by an individual entry in the fact table.
    • Business needs, rather than physical implementation considerations, must determine the minimum granularity of the fact table.
    • It is better to keep the data as granular as possible, even if current business needs do not require it: the additional detail might be critical for tomorrow's business analysis.
    • Do not add summary records to the fact table that include detail facts already in the fact table.
    • Do not mix granularities in the fact table. If it's needed, create one table for each level of granularity.
  • Granularity
    Granularity is usually mentioned in the context of dimensional data structures (i.e., facts and dimensions) and refers to the level of detail in a given fact table. The more detail there is in the fact table, the higher its granularity and vice versa. Another way to look at it is that the higher the granularity of a fact table, the more rows it will have.
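To make the link between granularity and row count concrete, the sketch below aggregates a handful of Day/Store/SKU facts to Month/District/SKU. The rows, identifiers and dates are invented for the example, and only an additive measure (units sold) is summed; a measure such as unit price could not simply be added up this way.

```python
from collections import defaultdict

# Hypothetical facts at Day/Store/SKU granularity:
# (date, district, store, sku, units_sold).
DETAIL = [
    ("2024-01-03", "D1", "S1", "SKU-A", 2),
    ("2024-01-17", "D1", "S1", "SKU-A", 3),
    ("2024-01-17", "D1", "S2", "SKU-A", 1),
    ("2024-02-05", "D1", "S2", "SKU-A", 4),
    ("2024-02-05", "D2", "S3", "SKU-B", 6),
]


def to_month_district_sku(rows):
    """Aggregate units_sold from Day/Store/SKU up to Month/District/SKU."""
    totals = defaultdict(int)
    for date, district, _store, sku, units in rows:
        month = date[:7]  # "YYYY-MM" prefix of the ISO date
        totals[(month, district, sku)] += units
    return dict(totals)


summary = to_month_district_sku(DETAIL)
print(len(DETAIL), "detail rows ->", len(summary), "summary rows")
for key, units in sorted(summary.items()):
    print(key, units)
```

The summary has fewer rows than the detail, but the individual days and stores can no longer be recovered from it, which is why the slides recommend storing the most granular data and keeping summaries in separate tables.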
  • Terminology and definitions (contd.): Data Mining
    • The extraction of hidden predictive information from large databases.
    • A class of database applications that look for hidden patterns in a group of data that can be used to predict future behavior.
    • Ex: data mining software can help retail companies find customers with common interests.
    • The term is commonly misused to describe software that presents data in new ways. True data mining software doesn't just change the presentation, but actually discovers previously unknown relationships among the data.