# Vikas

• We invent something only when there is a need for it. Today we are going to see what data warehousing is. The data warehouse evolved to satisfy certain needs, and we will look at some of those needs now.
• In a few short years data warehousing has gone from wild theory to conventional wisdom. Data warehousing is found around the world and is used throughout EVERY industry that has a need for information. Occasionally it is worthwhile to ask why data warehousing is as widespread a phenomenon as it is. This presentation discusses why organizations build data warehouses.
• Granularity is usually mentioned in the context of dimensional data structures (i.e., facts and dimensions) and refers to the level of detail in a given fact table. The more detail there is in the fact table, the higher its granularity and vice versa. Another way to look at it is that the higher the granularity of a fact table, the more rows it will have. Let me illustrate with the following example: Say we have a data mart with a single fact (Sales) and three dimensions (Time, Organization and Product). The fact table contains three metrics (Unit Price, Units Sold and Total Sale Amount). The Time dimension consists of four hierarchical elements (Year, Quarter, Month and Day). The Organization dimension consists of three hierarchical elements (Region, District and Store). The Product dimension consists of two hierarchical elements (Product Family and SKU). As always, the metrics in the Sales fact table must be stored at some intersection of the dimensions (i.e., Time, Organization and Product). Hence, in this data mart, the highest granularity that we can store Sales metrics is by Day/Store/SKU (i.e., the lowest level in each dimensional hierarchy). Conversely, the lowest granularity that we can aggregate Sales metrics to in this data mart is by Year/Region/Product Family (i.e., the highest level in each dimensional hierarchy). We may also (for a variety of performance reasons) choose to store Sales metrics at some intermediate level of granularity (e.g., by Month/District/SKU

2. 2. Successful Data Warehousing Techniques. By: Vikas.K.Jain
3. 3. "Now, if the estimates made before a battle indicate victory, it is because careful calculations show that your conditions are more favorable than those of your enemy; if they indicate defeat, it is because careful calculations show that the favorable conditions for a battle are fewer. With more careful calculations one can win; with less one cannot. How much chance of victory has one who makes no calculations at all!" - Sun Tzu, The Art of War. "Business these days is war minus the shooting." - Anonymous
4. 4. What is a Data Warehouse? "A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions." - W. H. Inmon, regarded as the father of data warehousing
5. 5. Why Data Warehouse? Necessity is the mother of invention.
6. 6. Scenario 1: ABC Pvt Ltd is a company with branches at Mumbai, Delhi, Chennai and Bangalore. The Sales Manager wants a quarterly sales report. Each branch has a separate operational system.
7. 7. Scenario 1: ABC Pvt Ltd. [Diagram: the Mumbai, Delhi, Chennai and Bangalore branch systems and the Sales Manager, who needs sales per item type per branch for the first quarter.]
8. 8. Solution 1: ABC Pvt Ltd.
    • Extract sales information from each database.
    • Store the information in a common repository at a single site.
9. 9. Solution 1: ABC Pvt Ltd. [Diagram: the Mumbai, Delhi, Chennai and Bangalore systems feed a data warehouse; query and analysis tools produce the report for the Sales Manager.]
10. 10. Scenario 2: One Stop Shopping Super Market has a huge operational database. Whenever executives want a report, the OLTP system becomes slow and the data entry operators have to wait for some time.
11. 11. Scenario 2: One Stop Shopping. [Diagram: data entry operators wait on the operational database while it produces a report for management.]
12. 12. Solution 2
    • Extract the data needed for analysis from the operational database.
    • Store it in the warehouse.
    • Refresh the warehouse at regular intervals so that it contains up-to-date information for analysis.
    • The warehouse will contain data with a historical perspective.
13. 13. Solution 2. [Diagram: data entry operators send transactions to the operational database; data is extracted from it into the data warehouse, which produces the report for the manager.]
14. 14. Scenario 3: Cakes & Cookies is a small, new company. The president of the company wants the company to grow. He needs information so that he can make correct decisions.
15. 15. Solution 3
    • Improve the quality of data before loading it into the warehouse.
    • Perform data cleaning and transformation before loading the data.
    • Use query analysis tools to support ad hoc queries.
16. 16. Solution 3. [Diagram: the data warehouse feeds a query and analysis tool used by the President; a chart of sales over time is labelled "Expansion" and "Improvement".]
17. 17. Over time, as applications grew, so did frustration: "I know the information is there. I just can't put my hands around it." Users needing information: sales, finance, marketing, engineering, human resources.
18. 18. Need for Data Warehousing
    • Better business intelligence for end-users
    • Reduction in time to locate, access, and analyze information
    • Consolidation of disparate information sources
    • Strategic advantage over competitors
    • Faster time-to-market for products and services
    • Replacement of older, less-responsive decision support systems
    • Reduction in demand on IS to generate reports
19. 19. Typical Business Queries
    • Which product generated the maximum revenue over the last two quarters in a chosen geographical region, city-wise, relative to the previous version of the product and compared with the plan?
    • What percentage of customers procure product A together with product B in a chosen region, broken down by city, season, and income group?
20. 20. Evolution of Data Warehousing, 1960 - 1985: MIS Era. Focus on reporting.
    • Unfriendly
    • Slow
    • Dependent on IS programmers
    • Inflexible
    • Analysis limited to defined reports
21. 21. Evolution of Data Warehousing, 1985 - 1990: Querying Era. Focus on online querying: queries that are formulated by the user on the spur of the moment.
    • Ad hoc, unstructured access to corporate data
    • SQL as interface not scalable
    • Cannot handle complex analysis
22. 22. Evolution of Data Warehousing, 1990 - 20xx: Analysis Era. Focus on online analysis.
    • Trend analysis
    • What if?
    • Cross-dimensional comparisons
    • Statistical profiles
    • Automated pattern and rule discovery
23. 23. Warehouse Architecture - 1: Enterprise Data Warehouse. [Diagram components: operational systems/data; data preparation (select, extract, transform, integrate, maintain); data warehouse with metadata; middleware/API; front ends: EIS/DSS, query tools, OLAP/ROLAP, web browsers, data mining.]
24. 24. Warehouse Architecture - 2: Single Department Data Mart. [Diagram components: operational systems/data; data preparation (select, extract, transform, integrate, maintain); several data marts, each with its own metadata; middleware/API; front ends: EIS/DSS, query tools, OLAP/ROLAP, web browsers, data mining.]
25. 25. Warehouse Architecture - 3: Multi-tiered Data Warehouse. [Diagram components: operational systems/data; operational data store; data preparation (select, extract, transform, integrate, maintain); data warehouse with metadata; data marts; middleware/API; front ends: EIS/DSS, query tools, OLAP/ROLAP, web browsers, data mining.]
26. 26. Benefits of DWH. These capabilities empower the corporation:
    • To formulate effective business, marketing and sales strategies.
    • To precisely target promotional activity.
    • To discover and penetrate new markets.
    • To successfully compete in the marketplace from a position of informed strength.
    • To build predictive rather than retrospective models.
27. 27. OLTP Systems vs. Data Warehouse. Between OLTP and data warehouse systems, the users are different, the data content is different, the data structures are different, and the hardware is different. Understanding the differences is the key.
28. 28. OLTP vs. OLAP

    | | OLTP | OLAP |
    |---|---|---|
    | User | Clerk, IT professional | Knowledge worker |
    | Function | Day-to-day operations | Decision support |
    | DB design | Application-oriented (E-R based) | Subject-oriented (star, snowflake) |
    | Data | Current, isolated | Historical, consolidated |
    | View | Detailed, flat relational | Summarized, multidimensional |
    | Usage | Structured, repetitive | Ad hoc |
    | Unit of work | Short, simple transaction | Complex query |
    | Access | Read/write | Read mostly |
    | Operations | Index/hash on primary key | Lots of scans |
    | # Records accessed | Tens | Millions |
    | # Users | Thousands | Hundreds |
    | DB size | 100 MB-GB | 100 GB-TB |
    | Metric | Transaction throughput | Query throughput, response |
29. 29. Capacity Planning. [Chart: processing power against time of day.] Processing load peaks during the beginning and end of day.
30. 30. Examples Of Some Applications. Groups shown: Manufacturers, Retailers, Customers. Applications listed:
    • Target marketing
    • Market segmentation
    • Budgeting
    • Credit rating agencies
    • Financial reporting and consolidation
    • Market basket analysis
    • Fraud management
    • Profitability management
    • Event tracking
31. 31. Data Marts: a logical subset of the complete data warehouse.
    • Subject- or application-oriented business view of the warehouse (Finance, Manufacturing, Sales, etc.)
    • Smaller amount of data used for analytic processing
    • Address a single business process
32. 32. Different kinds of Information Needs
    • Current: Is this medicine available in stock?
    • Recent: What are the tests this patient has completed so far?
    • Historical: Has the incidence of tuberculosis increased in the last 5 years in the southern region?
33. 33. Warehouse Models. Modeling data warehouses: dimensions, measures.
    • Star schema: a fact table in the middle connected to a set of dimension tables.
    • Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
    • Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, and therefore called a galaxy schema or fact constellation.
34. 34. Snowflake is referred to as a normalized star schema. The dimensions are normalized to avoid data redundancy. [Diagram: a fact table (Time Key, Product key, Store key, Customer key, Unit sales, Gross sales) linked to Time, Product, Store and Customer dimensions, with hierarchy levels such as Day, Month, Year and City, State, Country split out into their own tables.]
35. 35. Star Schema
    • A single fact table may be connected to multiple dimension tables. Each dimension is represented by one table.
    • It is an un-normalized form.

    [Diagram: a fact table (Time Key, Product key, Store key, Customer key, Unit sales, Gross sales) surrounded by four dimension tables: Time (Time Key, Date, Day, Month, Year), Product (Product Key, Product ID, Product Desc, Category), Store (Store Key, Store ID, City, Country, Region) and Customer (Customer Key, Customer ID, Name, City, Country).]
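36. 36. Star

    product:

    | prodId | name | price |
    |---|---|---|
    | p1 | bolt | 10 |
    | p2 | nut | 5 |

    store:

    | storeId | city |
    |---|---|
    | c1 | nyc |
    | c2 | sfo |
    | c3 | la |

    sale (qty and amt are the measures):

    | orderId | date | custId | prodId | storeId | qty | amt |
    |---|---|---|---|---|---|---|
    | o100 | 1/7/97 | 53 | p1 | c1 | 1 | 12 |
    | o102 | 2/7/97 | 53 | p2 | c1 | 2 | 11 |
    | 105 | 3/8/97 | 111 | p1 | c3 | 5 | 50 |

    customer:

    | custId | name | address | city |
    |---|---|---|---|
    | 53 | joe | 10 main | sfo |
    | 81 | fred | 12 main | sfo |
    | 111 | sally | 80 willow | la |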
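
A minimal sketch of how the star on this slide might be queried, assuming the tables are loaded into an in-memory SQLite database. The table contents come from the slide (with the sale key column spelled `orderId`); the function name `sales_by_city_and_product` and the choice of SQLite are illustrative assumptions, not part of the deck.

```python
import sqlite3

# In-memory copy of the slide's tables (customer is omitted to keep the sketch short).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store   (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE sale    (orderId TEXT, date TEXT, custId INTEGER,
                      prodId TEXT, storeId TEXT, qty INTEGER, amt INTEGER);
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store   VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO sale VALUES
  ('o100', '1/7/97', 53,  'p1', 'c1', 1, 12),
  ('o102', '2/7/97', 53,  'p2', 'c1', 2, 11),
  ('105',  '3/8/97', 111, 'p1', 'c3', 5, 50);
""")


def sales_by_city_and_product(connection):
    """Join the fact table to two dimensions and sum the amt measure."""
    return connection.execute("""
        SELECT s.city, p.name, SUM(f.amt)
        FROM sale AS f
        JOIN product AS p ON p.prodId = f.prodId
        JOIN store   AS s ON s.storeId = f.storeId
        GROUP BY s.city, p.name
        ORDER BY s.city, p.name
    """).fetchall()


star_result = sales_by_city_and_product(conn)
```

The fact table carries only keys and measures; descriptive attributes such as `city` and `name` are reached through the joins.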
37. 37. [Diagram: star schema for sales.]
    • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
    • time: time_key, day, day_of_the_week, month, quarter, year
    • item: item_key, item_name, brand, type, supplier_type
    • branch: branch_key, branch_name, branch_type
    • location: location_key, street, city, state_or_province, country
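38. 38. Dimension Hierarchies. Hierarchy: store rolls up to city and then region; store also rolls up to sType. Related schemas: snowflake schema, constellations.

    sType:

    | tId | size | location |
    |---|---|---|
    | t1 | small | downtown |
    | t2 | large | suburbs |

    store:

    | storeId | cityId | tId | mgr |
    |---|---|---|---|
    | s5 | sfo | t1 | joe |
    | s7 | sfo | t2 | fred |
    | s9 | la | t1 | nancy |

    city:

    | cityId | pop | regId |
    |---|---|---|
    | sfo | 1M | north |
    | la | 5M | south |

    region:

    | regId | name |
    |---|---|
    | north | cold region |
    | south | warm region |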
39. 39. Example of Fact Constellation. [Diagram: two fact tables sharing dimension tables.]
    • Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
    • Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
    • time: time_key, day, day_of_the_week, month, quarter, year
    • item: item_key, item_name, brand, type, supplier_type
    • branch: branch_key, branch_name, branch_type
    • location: location_key, street, city, province_or_state, country
    • shipper: shipper_key, shipper_name, location_key, shipper_type
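40. 40. Cube (dimensions = 2)

    Fact table view (sale):

    | prodId | storeId | amt |
    |---|---|---|
    | p1 | c1 | 12 |
    | p2 | c1 | 11 |
    | p1 | c3 | 50 |
    | p2 | c2 | 8 |

    Multi-dimensional cube:

    | | c1 | c2 | c3 |
    |---|---|---|---|
    | p1 | 12 | | 50 |
    | p2 | 11 | 8 | |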
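41. 41. 3-D Cube (dimensions = 3)

    Fact table view (sale):

    | prodId | storeId | date | amt |
    |---|---|---|---|
    | p1 | c1 | 1 | 12 |
    | p2 | c1 | 1 | 11 |
    | p1 | c3 | 1 | 50 |
    | p2 | c2 | 1 | 8 |
    | p1 | c1 | 2 | 44 |
    | p1 | c2 | 2 | 4 |

    Multi-dimensional cube, day 1:

    | | c1 | c2 | c3 |
    |---|---|---|---|
    | p1 | 12 | | 50 |
    | p2 | 11 | 8 | |

    Multi-dimensional cube, day 2:

    | | c1 | c2 | c3 |
    |---|---|---|---|
    | p1 | 44 | 4 | |
    | p2 | | | |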
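
The relationship between the fact-table view and the cube view can be sketched in a few lines of Python. The rows are the six facts from this slide; representing the cube as a dictionary keyed by (prodId, storeId, date), and the helper names `build_cube` and `slice_day`, are assumptions made for illustration.

```python
from collections import defaultdict

# Fact rows from the slide: (prodId, storeId, date, amt)
SALES = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]


def build_cube(rows):
    """Map each (prodId, storeId, date) cell to its amt; empty cells are absent."""
    cube = defaultdict(int)
    for prod, store, day, amt in rows:
        cube[(prod, store, day)] += amt
    return dict(cube)


def slice_day(cube, day):
    """Fix the date dimension to get a 2-D (prodId, storeId) view."""
    return {(p, s): amt for (p, s, d), amt in cube.items() if d == day}


cube = build_cube(SALES)
day1 = slice_day(cube, 1)
```

Each fact row becomes one cell, so a sparse cube stores only the combinations that actually occurred.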
42. 42. Aggregates
    • Add up amounts for day 1.
    • In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
    • Result: 81

    sale:

    | prodId | storeId | date | amt |
    |---|---|---|---|
    | p1 | c1 | 1 | 12 |
    | p2 | c1 | 1 | 11 |
    | p1 | c3 | 1 | 50 |
    | p2 | c2 | 1 | 8 |
    | p1 | c1 | 2 | 44 |
    | p1 | c2 | 2 | 4 |
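
As a quick check, the slide's query can be run against the same six rows. This sketch assumes an in-memory SQLite database; the variable name `day1_total` is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
    ],
)

# The query from the slide: add up amounts for day 1.
day1_total = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]
print(day1_total)  # 12 + 11 + 50 + 8 = 81
```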
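43. 43. Cube Aggregation. Example: computing sums. Rollup moves toward the totals; drill-down moves back toward the detail.

    Day 1:

    | | c1 | c2 | c3 |
    |---|---|---|---|
    | p1 | 12 | | 50 |
    | p2 | 11 | 8 | |

    Day 2:

    | | c1 | c2 | c3 |
    |---|---|---|---|
    | p1 | 44 | 4 | |
    | p2 | | | |

    Summed over days, with row and column totals:

    | | c1 | c2 | c3 | sum |
    |---|---|---|---|---|
    | p1 | 56 | 4 | 50 | 110 |
    | p2 | 11 | 8 | | 19 |
    | sum | 67 | 12 | 50 | 129 |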
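
The sums on this slide can be reproduced with a small rollup helper. This is a sketch, not the deck's own method: the function `rollup` and the dimension-name tuple `DIMS` are assumptions introduced here, and the data is the six-row sale table used in the cube slides.

```python
from collections import defaultdict

DIMS = ("prodId", "storeId", "date")
SALES = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]


def rollup(rows, keep):
    """Sum amt over every dimension that is not listed in keep."""
    positions = [DIMS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in positions)] += row[3]
    return dict(totals)


by_prod_store = rollup(SALES, ("prodId", "storeId"))  # days summed out
by_store = rollup(SALES, ("storeId",))                # products and days summed out
by_prod = rollup(SALES, ("prodId",))                  # stores and days summed out
grand_total = rollup(SALES, ())                       # everything summed out
```

Keeping fewer dimensions is a rollup; going from `by_prod` back to `by_prod_store` is the corresponding drill-down.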
44. 44. Cube Operators. Using the same day 1 and day 2 cubes and their sums, a cell is addressed as sale(store, product, day), with * meaning "summed over this dimension":
    • sale(c1,*,*) = 67
    • sale(c2,p2,*) = 8
    • sale(*,*,*) = 129
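45. 45. Extended Cube. Every dimension gains a * value holding the sum over that dimension. Example: sale(*,p2,*) = 19.

    Day 1:

    | | c1 | c2 | c3 | * |
    |---|---|---|---|---|
    | p1 | 12 | | 50 | 62 |
    | p2 | 11 | 8 | | 19 |
    | * | 23 | 8 | 50 | 81 |

    Day 2:

    | | c1 | c2 | c3 | * |
    |---|---|---|---|---|
    | p1 | 44 | 4 | | 48 |
    | p2 | | | | |
    | * | 44 | 4 | | 48 |

    All days (*):

    | | c1 | c2 | c3 | * |
    |---|---|---|---|---|
    | p1 | 56 | 4 | 50 | 110 |
    | p2 | 11 | 8 | | 19 |
    | * | 67 | 12 | 50 | 129 |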
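
One way to materialize the extended cube is to let every fact contribute to each cell in which a dimension is either its own value or `*`. The sketch below assumes the sale(store, product, day) argument order used in the slide's cell notation; `extended_cube` is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import product

# Fact rows: (prodId, storeId, date, amt)
SALES = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]


def extended_cube(rows):
    """Return cells keyed by (store, product, day), where any part may be '*'."""
    cells = defaultdict(int)
    for prod, store, day, amt in rows:
        # Each fact feeds 2**3 cells: every dimension is its value or the wildcard.
        for key in product((store, "*"), (prod, "*"), (day, "*")):
            cells[key] += amt
    return dict(cells)


ext = extended_cube(SALES)
print(ext[("*", "p2", "*")])  # sale(*,p2,*): 11 + 8 = 19
```

Precomputing every `*` combination trades storage for lookup speed, which is the usual motivation for storing aggregates alongside the detail.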
46. 46. What Happens without Normalization. A non-normalized database can suffer from data anomalies:
    • Update anomaly: a non-normalized database may store data representing a particular referent in multiple locations. An update to such data in some but not all of those locations yields inconsistent data. A normalized database prevents such an anomaly by storing such data (i.e. data other than primary keys) in only one location.
    • Insertion anomaly: a non-normalized database may have inappropriate dependencies, i.e. relationships between data with no functional dependencies. Adding data to such a database may require first adding the unrelated dependency. A normalized database prevents such insertion anomalies by ensuring that database relations mirror functional dependencies.
    • Deletion anomaly: similarly, such dependencies in non-normalized databases can hinder deletion. That is, deleting data from such databases may require deleting data from the inappropriate dependency. A normalized database prevents such deletion anomalies by ensuring that all records are uniquely identifiable and contain no extraneous information.
47. 47. Warehouse or Mart First?

    | Data Warehouse First | Data Mart First |
    |---|---|
    | Expensive | Relatively cheap |
    | Large development cycle | Delivered in < 6 months |
    | Change management is difficult | Easy to manage change |
    | Difficult to obtain continuous corporate support | Can lead to independent and incompatible marts |
    | Technical challenges in building large databases | Cleansing, transformation, modeling techniques may be incompatible |
48. 48. Operational Data Store. Example request: "Can I see the credit report from Accounts, sales from Marketing and the open order report from Order Entry for this customer?"
    • A subject-oriented, integrated, volatile, current-valued data store containing only corporate detailed data.
    • Data from multiple sources is integrated for a subject.
    • Identical queries may give different results at different times.
    • Supports analysis requiring current data.
    • Data is stored only for the current period. Old data is either archived or moved to the data warehouse.
49. 49. OLTP vs. ODS vs. DWH

    | Characteristic | OLTP | ODS | Data Warehouse |
    |---|---|---|---|
    | Audience | Operating personnel | Analysts | Managers and analysts |
    | Data access | Individual records, transaction driven | Individual records, transaction or analysis driven | Set of records, analysis driven |
    | Data content | Current, real-time | Current and near-current | Historical |
    | Data granularity | Detailed | Detailed and lightly summarized | Summarized and derived |
    | Data organization | Functional | Subject-oriented | Subject-oriented |
    | Data quality | All application-specific detailed data needed to support a business activity | All integrated data needed to support a business activity | Data relevant to management information needs |
50. 50. OLTP vs. ODS vs. DWH

    | Characteristic | OLTP | ODS | Data Warehouse |
    |---|---|---|---|
    | Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy |
    | Data stability | Dynamic | Somewhat dynamic | Static |
    | Data update | Field by field | Field by field | Controlled batch |
    | Data usage | Highly structured, repetitive | Somewhat structured, some analytical | Highly unstructured, heuristic or analytical |
    | Database size | Moderate | Moderate | Large to very large |
    | Database structure stability | Stable | Somewhat stable | Dynamic |
51. 51. OLTP vs. ODS vs. DWH

    | Characteristic | OLTP | ODS | Data Warehouse |
    |---|---|---|---|
    | Development methodology | Requirements driven, structured | Data driven, somewhat evolutionary | Data driven, evolutionary |
    | Operational priorities | Performance and availability | Availability | Access flexibility and end user autonomy |
    | Philosophy | Support day-to-day operation | Support day-to-day decisions & operational activities | Support managing the enterprise |
    | Predictability | Stable | Mostly stable, some unpredictability | Unpredictable |
    | Response time | Sub-second | Seconds to minutes | Seconds to minutes |
    | Return set | Small amount of data | Small to medium amount of data | Small to large amount of data |
52. 52. SCD-1

    | Customer Key | Name | State |
    |---|---|---|
    | 1001 | Christina | Illinois |

    Advantages:
    • This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

    Disadvantages:
    • All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.

    Usage: about 50% of the time.

    When to use Type 1: a Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.
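
A minimal sketch of a Type 1 change, assuming the dimension is held as a dictionary keyed by customer key. The move to California is taken from the SCD-2 slide; the function name `apply_scd1` is hypothetical.

```python
def apply_scd1(dimension, customer_key, new_state):
    """Type 1: overwrite the attribute in place; the old value is not kept."""
    dimension[customer_key]["state"] = new_state
    return dimension


customers = {1001: {"name": "Christina", "state": "Illinois"}}
apply_scd1(customers, 1001, "California")
print(customers[1001])  # the Illinois value is gone after the overwrite
```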
53. 53. SCD-2

    | Customer Key | Name | State |
    |---|---|---|
    | 1001 | Christina | Illinois |
    | 1005 | Christina | California |

    Advantages:
    • This allows us to accurately keep all historical information.

    Disadvantages:
    • This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
    • This necessarily complicates the ETL process.

    Usage: about 50% of the time.

    When to use Type 2: a Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.
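
A Type 2 change can be sketched as appending a new row under a new key while leaving the old row in place. The keys 1001 and 1005 come from the slide; the `is_current` flag and matching on the name are simplifying assumptions made here (a real design would more likely match on a business key and keep effective dates).

```python
def apply_scd2(rows, name, new_state, new_key):
    """Type 2: keep the old row and append a new row for the changed attribute."""
    for row in rows:
        if row["name"] == name and row["is_current"]:
            row["is_current"] = False  # retire the old version, but keep it
    rows.append(
        {"customer_key": new_key, "name": name, "state": new_state, "is_current": True}
    )
    return rows


customer_dim = [
    {"customer_key": 1001, "name": "Christina", "state": "Illinois", "is_current": True}
]
apply_scd2(customer_dim, "Christina", "California", new_key=1005)
```

The table grows by one row per change, which is the storage cost the slide warns about.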
54. 54. SCD-3

    | Customer Key | Name | Original State | Present State | Date |
    |---|---|---|---|---|
    | 1001 | Christina | Illinois | California | 15-Jan-03 |

    Advantages:
    • This does not increase the size of the table, since new information is updated.
    • This allows us to keep some part of history.

    Disadvantages:
    • Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.

    Usage: Type 3 is rarely used in actual practice.

    When to use Type 3: a Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
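
The Type 3 limitation can be seen in a short sketch. The column names `original_state`, `present_state` and `effective_date` are assumed expansions of the slide's abbreviated headers, and `apply_scd3` is a hypothetical helper.

```python
from datetime import date


def apply_scd3(row, new_state, effective):
    """Type 3: keep the original value in one column and overwrite the present one."""
    row["present_state"] = new_state
    row["effective_date"] = effective
    return row


christina = {
    "customer_key": 1001,
    "name": "Christina",
    "original_state": "Illinois",
    "present_state": "Illinois",
    "effective_date": None,
}
apply_scd3(christina, "California", date(2003, 1, 15))
apply_scd3(christina, "Texas", date(2003, 12, 15))
# After the second move only Illinois and Texas remain; California is lost.
```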
55. 55. What Is Metadata?
    • Data about data.
    • It is important to know what data is available and where it lies for analysis.
    • It describes the WHAT, WHEN, WHO, WHERE and HOW of the data warehouse.
    • It covers the data being captured and loaded into the warehouse.
    • It is documented in IT tools that improve both business and technical understanding of data and data-related processes.
56. 56. Consumers of Metadata. [Diagram: metadata at the centre, surrounded by its consumers.]
    • Developer
    • User: what data and pre-built queries exist
    • DBA
    • Software tools: ETL, modelling, OLAP, etc.
    • Impact analysis: impact of changes in the operational system on the data warehouse and data marts
    • Support for development of the data warehouse and data marts
57. 57. Importance Of Metadata: Integrating Information
    • How do the various data perspectives connect together?
    • How much time is spent trying to figure that out?
    • How much do the inefficiency and lack of metadata affect decision making?
58. 58. Importance Of Metadata: Locating Information
    • How much time is spent looking for information?
    • How often is the information found?
    • What poor decisions were made based on incomplete information?
    • How much money was lost or earned as a result?
59. 59. What Is Metadata? Defining metadata: the simplest definition is data about data.
    • Database table columns: SALE_ID, CUST_ID, ITEM, DATE, QUANTITY, UNIT_PRICE, LOCATION, PROMOTION
    • Metadata for that table:
        • TABLE_OWNER: MIS_OWNER
        • CREATE DATE: 25-OCT-2002 21:54:00
        • LAST MODIFIED DATE: 03-MAR-2003 09:30:00
        • LAST MODIFIED BY: MIS OWNER
        • PURPOSE: This table tracks customer sales
        • LINKS TO RELATED REPORTS: TOTAL SALES, CUSTOMER PROFILES
60. 60. Granularity in Fact Table
    • Granularity is a measure of the level of detail addressed by an individual entry in the fact table.
    • Business needs, rather than physical implementation considerations, must determine the minimum granularity of the fact table.
    • It is better to keep the data as granular as possible, even if current business needs do not require it; the additional detail might be critical for tomorrow's business analysis.
    • Do not add summary records to the fact table that include detail facts already in the fact table.
    • Do not mix granularities in the fact table. If it's needed, create one table for each level of granularity.
61. 61. Granularity. Granularity is usually mentioned in the context of dimensional data structures (i.e., facts and dimensions) and refers to the level of detail in a given fact table. The more detail there is in the fact table, the higher its granularity and vice versa. Another way to look at it is that the higher the granularity of a fact table, the more rows it will have.
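
The link between granularity and row count can be illustrated with a small sketch that aggregates Day/Store/SKU facts to Month/District/SKU, the intermediate level mentioned in the speaker notes. All data values, the store-to-district mapping and the function name `to_month_district` are hypothetical.

```python
from collections import defaultdict

# Hypothetical hierarchy: both stores belong to the same district.
STORE_TO_DISTRICT = {"s1": "d1", "s2": "d1"}

# Hypothetical facts at the finest grain: (day, store, sku, units_sold)
DAILY_SALES = [
    ("2003-01-05", "s1", "sku1", 3),
    ("2003-01-06", "s1", "sku1", 2),
    ("2003-01-06", "s2", "sku1", 4),
    ("2003-02-01", "s2", "sku1", 1),
]


def to_month_district(rows):
    """Aggregate Day/Store/SKU facts up to Month/District/SKU."""
    totals = defaultdict(int)
    for day, store, sku, units in rows:
        month = day[:7]  # 'YYYY-MM' prefix of the ISO date
        totals[(month, STORE_TO_DISTRICT[store], sku)] += units
    return dict(totals)


monthly = to_month_district(DAILY_SALES)
print(len(DAILY_SALES), "rows at Day/Store/SKU ->", len(monthly), "rows at Month/District/SKU")
```

The aggregation cannot be reversed: once only the monthly table is stored, the daily detail is no longer recoverable, which is why the previous slide recommends keeping the finest grain.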
62. 62. Data Mining. The extraction of hidden predictive information from large databases. A class of database applications that look for hidden patterns in a group of data that can be used to predict future behavior. Example: data mining software can help retail companies find customers with common interests. The term is commonly misused to describe software that presents data in new ways. True data mining software doesn't just change the presentation, but actually discovers previously unknown relationships among the data.