MELJUN CORTES Fundamentals of Enterprise Data Management Week 03
Usage Rights

© All Rights Reserved
    Presentation Transcript

    • © 2013 IBM Corporation, IBM Confidential
      BAFEDM2: Fundamentals of Enterprise Data Management
      Week 03
    • Agenda
      Module 2: Data Warehouse Design Considerations
      • Data Models
      • The Dimensional Model
      • Facts and Dimensions
      • Four-Step Dimensional Design Process
      • Case Study: Retail
    • Module 2: Data Warehouse Design Considerations
    • Data Models: Definitions
      Data Model
      • Is the specification of data structures and business rules to represent business requirements
      Data Modeling
      • Is a structured approach used to identify the major components of an information system's specifications
      • Is the process used to analyze the data, identify the relationships, and, ultimately, create the data model
      [Figure: entity STUDENT (student last name, student first name, student major) "attends" entity COURSE (course title, course number of credits); COURSE "is taught to" STUDENT]
    • Data Models: Data Model Views
      • Business Users: Enterprise Conceptual Data Model
      • Designers: Transaction Logical Data Model, Analytical Logical Data Model
      • Implementers: OLTP Physical Data Model, ODS Physical Data Model, Data Warehouse Physical Data Model, Data Mart Physical Data Model
    • Data Models: Data Model Views
      Conceptual Data Model
      • A Conceptual Data Model (CDM) is a structured business view of the data required to support current business processes, business events, and related performance measures
      • It is a single integrated data structure which reflects the structure of business functions rather than the processing flow or the physical arrangement of data
      • Characteristics
        – Represents overall logical structure of data
        – Independent of software or data storage structure
        – Often contains objects not implemented in physical databases
        – Represents data needed to run an enterprise or a business activity
    • Data Models: Data Model Views
      Logical Data Model
      • A Logical Data Model (LDM) builds upon the business requirements and includes a further level of detail that supports both the business and system requirements
      • Business rules are incorporated into the LDM and it loses some of the “generalities” from the Enterprise CDM
      • Characteristics
        – Independent of specific software and data storage structure
        – Includes more specific entities and attributes
        – Includes business rules and relationships
        – Includes foreign keys, alternate keys
    • Data Models: Data Model Views
      Physical Data Model
      • A Physical Data Model (PDM) is specific to the software and performance constraints of the specific database management system to be used in the implementation
      • Both software and data storage structures are considered and the model is often modified to meet performance or physical constraints
      • Characteristics
        – Dependent on specific software and data storage structure
        – Includes tables and columns
        – Includes physical database objects (triggers, stored procedures, tablespaces)
        – Includes referential integrity rules that restrict relationships between tables
    • The Dimensional Model
      Dimensional modeling is a logical design technique for structuring data so that it's intuitive to business users and delivers fast query performance. Dimensional modeling is widely accepted as the preferred approach for data warehouse presentation.
      Normalized modeling is quite different from dimensional modeling. Normalized modeling is a design technique that seeks to eliminate data redundancies. Data is divided into many discrete entities, each of which becomes a table in the relational database. This normalization is immensely beneficial for transaction processing because it makes transaction loading and updating simple and fast.
      The purpose of dimensional modeling, however, is to improve query performance by matching data structures to queries; hence, the structure is flattened and some level of redundancy and aggregation is allowed.
    • Facts and Dimensions
      Dimensional modeling divides the world into measurements and context. Measurements are captured by the organization's business processes and their supporting operational source systems. Measurements are usually numeric values; we refer to them as facts.
      Facts are surrounded by largely textual context that is true at the moment the fact is recorded. This context is intuitively divided into independent logical clumps called dimensions. Dimensions describe the “who, what, when, where, why, and how” context of the measurement.
    • Facts and Dimensions (continued)
      The dimensions describe the facts. If the fact is the number of product sales, then the dimensions that support it may be time, location, product, customer, and promotions. This characteristic structure is called a star schema.
      [Figure: star schema with the SALES fact table at the center, surrounded by the TIME, LOCATION, PRODUCT, CUSTOMER, and PROMOTIONS dimensions]
      • Time describes when the sale was made.
      • Location describes where the sale was made.
      • Product describes what item was sold.
      • Customer describes who bought the item.
      • Promotions describe what triggered the sale.
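The star shape described on this slide can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the table and column names (`time_dim`, `sales_fact`, and so on) are assumptions made here, not names taken from the course material.

```python
import sqlite3

# Hypothetical table and column names; only the star shape comes from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim      (time_key      INTEGER PRIMARY KEY, calendar_date  TEXT);
CREATE TABLE location_dim  (location_key  INTEGER PRIMARY KEY, city           TEXT);
CREATE TABLE product_dim   (product_key   INTEGER PRIMARY KEY, product_name   TEXT);
CREATE TABLE customer_dim  (customer_key  INTEGER PRIMARY KEY, customer_name  TEXT);
CREATE TABLE promotion_dim (promotion_key INTEGER PRIMARY KEY, promotion_name TEXT);

-- The fact table sits in the middle and points at every dimension.
CREATE TABLE sales_fact (
    time_key       INTEGER NOT NULL REFERENCES time_dim(time_key),
    location_key   INTEGER NOT NULL REFERENCES location_dim(location_key),
    product_key    INTEGER NOT NULL REFERENCES product_dim(product_key),
    customer_key   INTEGER NOT NULL REFERENCES customer_dim(customer_key),
    promotion_key  INTEGER NOT NULL REFERENCES promotion_dim(promotion_key),
    sales_quantity INTEGER
);
""")

# Each dimension is exactly one join away from the fact table.
fk_rows = conn.execute("PRAGMA foreign_key_list(sales_fact)").fetchall()
referenced_tables = sorted(row[2] for row in fk_rows)
print(referenced_tables)
```

Listing the foreign keys of the fact table is a quick way to see the "points" of the star.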
    • Facts and Dimensions: More on Facts
      • The fact table's grain (granularity) is the business definition of the measurement event that produces the fact row. Declaring the grain means saying exactly what a fact table row represents by filling in the blank in the following phrase: “A fact row is created when ___ occurs.”
      • Facts serve as the Key Performance Indicators (KPI) of the organization.
      • The most useful facts are both numeric and additive.
      • Facts consist of multi-part foreign keys to the dimension tables and every relationship is mandatory, i.e., foreign keys should never be null.
      Example: SALES FACT with TIME_KEY (FK), LOCATION_KEY (FK), PRODUCT_KEY (FK), CUSTOMER_KEY (FK), PROMOTION_KEY (FK), SALES_QUANTITY, SALES_AMOUNT, SALES_PROFIT. “A fact row is created when a sale occurs.”
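A minimal sketch of two of the points above, assuming a cut-down version of the SALES FACT table with made-up rows: additive facts can simply be summed, and a mandatory foreign key can be enforced with a `NOT NULL` constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down SALES FACT; grain: one row is created when a sale occurs.
conn.execute("""
CREATE TABLE sales_fact (
    time_key       INTEGER NOT NULL,
    product_key    INTEGER NOT NULL,
    sales_quantity INTEGER,
    sales_amount   REAL
)""")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [(1, 10, 2, 5.0), (1, 11, 1, 3.5), (2, 10, 4, 10.0)],  # illustrative data
)

# Additive facts can be summed across any combination of dimensions.
total_amount = conn.execute("SELECT SUM(sales_amount) FROM sales_fact").fetchone()[0]

# A null foreign key violates the "every relationship is mandatory" rule.
try:
    conn.execute("INSERT INTO sales_fact VALUES (NULL, 10, 1, 2.5)")
    null_key_rejected = False
except sqlite3.IntegrityError:
    null_key_rejected = True

print(total_amount, null_key_rejected)
```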
    • Facts and Dimensions: More on Dimensions
      • Dimensions provide descriptive information about the fact. Dimensions are identified by filling in the blanks in the following phrase: “I need a report on (fact) by ___, by ___, by ___.”
      • Dimensions are composed of attributes which are used for filtering or labeling data within data warehouse queries.
      • Dimension tables represent hierarchical relationships, a natural consequence of denormalization.
      • Dimension rows are uniquely identified by a single key field.
      Example: PRODUCT DIMENSION with PRODUCT_ID (PK), PRODUCT_NAME, PRODUCT_DESCRIPTION, PRODUCT_COST, PRODUCT_CATEGORY, CATEGORY_DESCRIPTION, PRODUCT_BRAND, BRAND_DESCRIPTION. “I need a report on sales by month, by product.”
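The phrase “I need a report on sales by month, by product” maps naturally onto a join plus `GROUP BY`. The sketch below uses hypothetical tables and sample rows; dimension attributes supply both the grouping and the report labels.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim    (date_key INTEGER PRIMARY KEY, month_name TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sales_fact  (date_key INTEGER NOT NULL,
                          product_key INTEGER NOT NULL,
                          sales_amount REAL);
-- Illustrative rows only.
INSERT INTO date_dim    VALUES (1, 'January'), (2, 'January'), (3, 'February');
INSERT INTO product_dim VALUES (10, 'Paper Towels'), (11, 'Cereal');
INSERT INTO sales_fact  VALUES (1, 10, 4.0), (2, 10, 6.0), (2, 11, 3.0), (3, 11, 5.0);
""")

# "by month, by product": each "by" becomes a dimension attribute in GROUP BY.
report = conn.execute("""
    SELECT d.month_name, p.product_name, SUM(f.sales_amount)
    FROM sales_fact f
    JOIN date_dim d    ON d.date_key = f.date_key
    JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY d.month_name, p.product_name
    ORDER BY d.month_name, p.product_name
""").fetchall()

for row in report:
    print(row)
```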
    • Four-Step Dimensional Design Process
      Step 1: Choose the Business Process
      • The first step in the design process is to determine the business process or measurement event to be modeled. This task can be carried out by understanding the business objectives and goals, as well as the information requirements. This selection step likely occurred during the prioritization activity with senior business leadership.
      Step 2: Declare the Grain
      • Once the business process has been identified, the design team must declare the grain of the fact table. It is crucial to crisply define exactly what a fact table row is in the proposed business process dimensional model. Without agreement on the grain of the fact table, the design process cannot successfully move forward.
      • Be very precise when defining the fact table grain in business terms. Do not skip this step. The grain is the business answer to the question “What is a fact row, exactly?”
      • Generally, the fact table grain is chosen to be as atomic or finely grained as possible. Fact tables designed with the most granular data produce the most robust design. Atomic data is far better at responding to both unexpected new queries and unanticipated new data elements than higher levels of granularity.
    • Four-Step Dimensional Design Process (continued)
      Step 3: Identify the Dimensions
      • Once the grain of the fact table is firmly established, the choice of dimensions is fairly straightforward. It is at this point you can start thinking about foreign keys. The grain itself will often determine a primary or minimal set of dimensions. From there, the design is embellished with additional dimensions that take on a unique value at the declared grain of the fact table.
      Step 4: Identify the Facts
      • The final step in the four-step design process is to carefully select the facts or metrics that are applicable to the business process. The facts may be physically captured by the measurement event or derived from these measurements. Each fact must be true to the grain of the fact table; do not mix facts from other time periods or other levels of detail that do not match the crisply declared grain.
    • Case Study: Retail
      • Imagine that we work in the headquarters of a large grocery chain. Our business has 100 grocery stores spread over a five-state area. Each of the stores has a full complement of departments, including grocery, frozen foods, dairy, meat, produce, bakery, floral, and health/beauty aids. Each store has roughly 60,000 individual products on its shelves. The individual products are called stock keeping units (SKUs). About 55,000 of the SKUs come from outside manufacturers and have bar codes imprinted on the product package. These bar codes are called universal product codes (UPCs). UPCs are at the same grain as individual SKUs. Each different package variation of a product has a separate UPC and hence is a separate SKU.
      • The remaining 5,000 SKUs come from departments such as meat, produce, bakery, or floral. While these products don't have nationally recognized UPCs, the grocery chain assigns SKU numbers to them. Since our grocery chain is highly automated, we stick scanner labels on many of the items in these other departments. Although the bar codes are not UPCs, they are certainly SKU numbers.
      • Data is collected at several interesting places in a grocery store. Some of the most useful data is collected at the cash registers as customers purchase products. Our modern grocery store scans the bar codes directly into the point-of-sale (POS) system. The POS system is at the front door of the grocery store where consumer takeaway is measured. The back door, where vendors make deliveries, is another interesting data-collection point.
      • At the grocery store, management is concerned with the logistics of ordering, stocking, and selling products while maximizing profit. The profit ultimately comes from charging as much as possible for each product, lowering costs for product acquisition and overhead, and at the same time attracting as many customers as possible in a highly competitive pricing environment.
      • Some of the most significant management decisions have to do with pricing and promotions. Both store management and headquarters marketing spend a great deal of time tinkering with pricing and promotions. Promotions in a grocery store include temporary price reductions, ads in newspapers and newspaper inserts, displays in the grocery store (including end-aisle displays), and coupons. The most direct and effective way to create a surge in the volume of product sold is to lower the price dramatically. A 50-cent reduction in the price of paper towels, especially when coupled with an ad and display, can cause the sale of the paper towels to jump by a factor of 10. Unfortunately, such a big price reduction usually is not sustainable because the towels probably are being sold at a loss. As a result of these issues, the visibility of all forms of promotion is an important part of analyzing the operations of a grocery store.
    • Case Study: Retail (continued)
      Step 1: Choose the Business Process
      • The first step in the design is to decide what business process(es) to model by combining an understanding of the business requirements with an understanding of the available data.
      • Tip: The first dimensional model built should be the one with the most impact: it should answer the most pressing business questions and be readily accessible for data extraction.
      • In our retail case study, management wants to better understand customer purchases as captured by the POS system. Thus the business process we're going to model is POS retail sales. This data will allow us to analyze what products are selling in which stores on what days under what promotional conditions.
    • Case Study: Retail (continued)
      Step 2: Declare the Grain
      • Once the business process has been identified, the data warehouse team faces a serious decision about the granularity. What level of data detail should be made available in the dimensional model?
      • Tip: Preferably you should develop dimensional models for the most atomic information captured by a business process. Atomic data is the most detailed information collected; such data cannot be subdivided further.
      • In our case study, the most granular data is an individual line item on a POS transaction. To ensure maximum dimensionality and flexibility, we will proceed with this grain.
      • Providing access to the POS transaction information gives us a very detailed look at store sales. While users probably are not interested in analyzing single items associated with a specific POS transaction, we can't predict all the ways that they'll want to cull through that data. For example, they may want to understand the difference in sales on Monday versus Sunday. Or they may want to assess whether it's worthwhile to stock so many individual sizes of certain brands, such as cereal. Or they may want to understand how many shoppers took advantage of the 50-cents-off promotion on paper towels. Or they may want to determine the impact in terms of decreased sales when a competitive diet soda product was promoted heavily. While none of these queries calls for data from one specific transaction, they are broad questions that require detailed data sliced in very precise ways. None of them could have been answered if we elected only to provide access to summarized data.
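The Monday-versus-Sunday question gives a small way to see why atomic grain matters. In this sketch the line items are invented sample data; the point is that the weekday breakdown is recoverable from line items but not from a monthly summary.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Atomic grain: one entry per POS line item (date, product, units). Sample data.
line_items = [
    (date(2013, 1, 6), "Cereal 12 oz", 2),   # a Sunday
    (date(2013, 1, 7), "Cereal 12 oz", 1),   # a Monday
    (date(2013, 1, 7), "Paper Towels", 3),   # a Monday
    (date(2013, 1, 13), "Paper Towels", 4),  # a Sunday
]

# From atomic rows we can slice by weekday even though nobody planned for it.
units_by_weekday = defaultdict(int)
for sale_date, _product, units in line_items:
    units_by_weekday[WEEKDAYS[sale_date.weekday()]] += units

# A pre-summarized monthly table keeps the total but loses the weekday detail.
units_by_month = defaultdict(int)
for sale_date, _product, units in line_items:
    units_by_month[(sale_date.year, sale_date.month)] += units

print(dict(units_by_weekday))
print(dict(units_by_month))
```

Had only `units_by_month` been loaded into the warehouse, the Sunday and Monday figures could not be reconstructed from it.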
    • Case Study: Retail (continued)
      Step 3: Identify the Dimensions
      • Once the grain of the fact table is firmly established, the grain itself will often determine a primary or minimal set of dimensions. From there, we can ask whether other dimensions can be attributed to the data.
      • Tip: A careful grain statement determines the primary dimensionality of the fact table. It is then often possible to add more dimensions to the basic grain of the fact table where these additional dimensions naturally take on only one value under each combination of the primary dimensions. If the additional dimension violates the grain by causing additional fact rows to be generated, then the grain statement must be revised to accommodate this dimension.
      • Once the grain of the fact table has been chosen, the date, product, and store dimensions fall out immediately. Within the framework of the primary dimensions, we can ask whether other dimensions can be attributed to the data, such as the promotion under which the product is sold. In addition, we'll include the POS transaction ticket number as a special dimension. (To be discussed later in this case study.)
      • We begin to envision the preliminary schema as a retail sales fact table surrounded by the date, product, store, and promotion dimensions, plus the POS transaction number.
    • Case Study: Retail (continued)
      Step 4: Identify the Facts
      • The fourth and final step in the design is to make a careful determination of which facts will appear in the fact table. Again, the grain declaration helps anchor our thinking. Simply put, the facts must be true to the grain.
      • When considering potential facts, you again may discover that adjustments need to be made to either our earlier grain assumptions or our choice of dimensions.
      • The facts collected by the POS system include the sales quantity (e.g., the number of cans of chicken noodle soup), per unit sales price, and the sales dollar amount. The sales dollar amount equals the sales quantity multiplied by the unit price. Some POS systems also provide a standard dollar cost for the product as delivered to the store by the vendor; we'll include it in the fact table. Three of the facts (sales quantity, sales dollar amount, and cost dollar amount) are additive across all the dimensions.
      • We can compute the gross profit by subtracting the cost dollar amount from the sales dollar amount, or revenue. Should calculated facts be stored physically in the database? Yes. Storing gross profit ensures that all users and their reporting applications refer to it consistently.
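The arithmetic on this slide can be sketched as follows. The `SalesLine` class, its field names, and the sample values are assumptions for illustration; amounts are kept in integer cents to avoid floating-point rounding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SalesLine:
    """One POS line item (hypothetical structure, amounts in cents)."""
    sales_quantity: int
    unit_price_cents: int
    unit_cost_cents: int

    @property
    def sales_amount_cents(self) -> int:
        # Sales dollar amount = sales quantity x unit price.
        return self.sales_quantity * self.unit_price_cents

    @property
    def cost_amount_cents(self) -> int:
        return self.sales_quantity * self.unit_cost_cents

    @property
    def gross_profit_cents(self) -> int:
        # Gross profit = sales dollar amount - cost dollar amount.
        return self.sales_amount_cents - self.cost_amount_cents


lines = [SalesLine(3, 129, 90), SalesLine(2, 250, 200)]

# Additive facts: summing across rows gives a meaningful total.
total_quantity = sum(l.sales_quantity for l in lines)
total_sales = sum(l.sales_amount_cents for l in lines)
total_cost = sum(l.cost_amount_cents for l in lines)
total_profit = sum(l.gross_profit_cents for l in lines)

# Unit price is not additive; an average price is derived from additive facts.
average_unit_price = total_sales / total_quantity

print(total_sales, total_cost, total_profit, average_unit_price)
```

Because gross profit is built from two additive facts, it is additive as well, which is one reason it is safe to store it on each fact row.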
    • Case Study: Retail (continued)
      Date Dimension
      • The date dimension is the one dimension nearly guaranteed to be in every data warehouse because virtually every data warehouse is a time series.
      • Remember that the dimension table attributes serve as report labels. Simply populating indicators such as the holiday indicator with a “Y” or an “N” would be far less useful than populating them with meaningful values such as “Holiday” or “Nonholiday.”
      • Why do we need an explicit date dimension table? It anticipates the need to slice data by nonstandard date attributes, such as holidays, that cannot be computed from the calendar date alone.
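A date dimension is usually generated rather than extracted. The sketch below builds a few rows with label-style attributes; the attribute names, the `YYYYMMDD` key format, and the one-entry holiday list are assumptions for illustration.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical holiday calendar; a real one would come from the business.
HOLIDAYS = {date(2013, 1, 1): "New Year's Day"}


def build_date_dimension(start: date, days: int) -> list:
    """Return one dimension row (a dict) per calendar day."""
    rows = []
    for offset in range(days):
        d = start + timedelta(days=offset)
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),
            "full_date": d.isoformat(),
            "day_of_week": DAY_NAMES[d.weekday()],
            # Report-friendly labels instead of bare Y/N flags.
            "holiday_indicator": "Holiday" if d in HOLIDAYS else "Nonholiday",
            "weekday_indicator": "Weekday" if d.weekday() < 5 else "Weekend",
        })
    return rows


date_dim = build_date_dimension(date(2013, 1, 1), 7)
for row in date_dim:
    print(row)
```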
    • Case Study: Retail (continued)
      Product Dimension
      • The product dimension describes every SKU in the grocery store.
      • The product dimension is almost always sourced from the operational product master file.
      • For each SKU, all levels of the merchandise hierarchy are well defined. Some attributes, such as the SKU description, are unique.
      • Some of the attributes in the product dimension table are not part of the merchandise hierarchy. The Package Type attribute, for example, might have values such as Bottle, Bag, Box, or Other. It makes perfect sense to combine a constraint on this attribute with a constraint on a merchandise hierarchy attribute. For example, we could look at all the SKUs in the Cereal category packaged in Bags.
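Combining a hierarchy attribute with a non-hierarchy attribute is just two filters on the same dimension row. A small sketch with invented SKUs and attribute names:

```python
# Hypothetical product dimension rows; category belongs to the merchandise
# hierarchy, package_type does not.
product_dim = [
    {"sku": 1, "description": "Crunchy Flakes 12 oz", "category": "Cereal", "package_type": "Box"},
    {"sku": 2, "description": "Crunchy Flakes 32 oz", "category": "Cereal", "package_type": "Bag"},
    {"sku": 3, "description": "Potato Chips 8 oz", "category": "Snacks", "package_type": "Bag"},
]

# All SKUs in the Cereal category packaged in Bags.
cereal_in_bags = [
    p["description"]
    for p in product_dim
    if p["category"] == "Cereal" and p["package_type"] == "Bag"
]
print(cereal_in_bags)
```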
    • Case Study: Retail (continued)
      Store Dimension
      • The store dimension describes every store in our grocery chain.
      • The store dimension is the primary geographic dimension in our case study. Each store can be thought of as a location. Because of this, we can roll stores up to any geographic attribute, such as ZIP code, county, and state in the United States. Stores usually also roll up to store districts and regions. These two different hierarchies are both easily represented in the store dimension because both the geographic and store regional hierarchies are well defined for a single store row.
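Because each store row carries attributes from both hierarchies, the same facts can be rolled up either way. The stores, attribute names, and amounts below are invented for illustration.

```python
from collections import Counter

# Each store row holds geographic attributes and store-organization attributes.
store_dim = {
    1: {"store_name": "Store 1", "state": "NY", "sales_region": "East"},
    2: {"store_name": "Store 2", "state": "NJ", "sales_region": "East"},
    3: {"store_name": "Store 3", "state": "NY", "sales_region": "North"},
}

# (store_key, sales_amount) pairs standing in for fact rows.
sales = [(1, 100), (2, 50), (3, 70)]


def roll_up(attribute: str) -> dict:
    """Total sales by any attribute of the store dimension."""
    totals = Counter()
    for store_key, amount in sales:
        totals[store_dim[store_key][attribute]] += amount
    return dict(totals)


by_state = roll_up("state")          # geographic hierarchy
by_region = roll_up("sales_region")  # store regional hierarchy
print(by_state, by_region)
```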
    • Case Study: Retail (continued)
      Promotion Dimension
      • The promotion dimension describes the promotion conditions under which a product was sold. Promotion conditions include temporary price reductions, end-aisle displays, newspaper ads, and coupons. This dimension is often called a causal dimension (as opposed to a casual dimension) because it describes factors thought to cause a change in product sales.
      • The various possible causal conditions are highly correlated. A temporary price reduction usually is associated with an ad and perhaps an end-aisle display. Coupons often are associated with ads. For this reason, it makes sense to create one row in the promotion dimension for each combination of promotion conditions that occurs.
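One way to picture "one row per combination that occurs" is to assign a key to each distinct combination as it is first seen. The condition labels below are hypothetical sample values.

```python
# Observed promotion conditions per sale: (price reduction, ad, display, coupon).
observed = [
    ("Temporary Price Reduction", "Newspaper Ad", "End-Aisle Display", "No Coupon"),
    ("No Price Reduction", "No Ad", "No Display", "No Coupon"),
    ("Temporary Price Reduction", "Newspaper Ad", "End-Aisle Display", "No Coupon"),
    ("No Price Reduction", "Newspaper Ad", "No Display", "Coupon"),
]

# One promotion dimension row per combination that actually occurs,
# rather than one row per theoretically possible combination.
promotion_dim = {}
for combo in observed:
    if combo not in promotion_dim:
        promotion_dim[combo] = len(promotion_dim) + 1  # simple sequential key

# The key each fact row would carry.
fact_promotion_keys = [promotion_dim[combo] for combo in observed]
print(len(promotion_dim), fact_promotion_keys)
```

Note that "no promotion" gets its own row, which keeps the promotion foreign key non-null on every fact row.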
    • Case Study: Retail (continued)
      Promotion Dimension (continued)
      • There is one important question that cannot be answered by our retail sales schema: What products were on promotion but did not sell? The sales fact table only records the SKUs actually sold.
      • A second promotion coverage or event fact table is needed to help answer the question concerning what didn't happen. The promotion coverage fact table keys would be date, product, store, and promotion in our case study. This obviously looks similar to the sales fact table we just designed; however, the grain would be significantly different. In the case of the promotion coverage fact table, we'd load one row in the fact table for each product on promotion in a store each day (or week, since many retail promotions are a week in duration) regardless of whether the product sold or not. The coverage fact table allows us to see the relationship between the keys as defined by a promotion, independent of other events, such as actual product sales. We refer to it as a factless fact table because it has no measurement metrics; it merely captures the relationship between the involved keys.
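With a coverage table in hand, "on promotion but did not sell" becomes a set difference between the coverage keys and the sales keys. The key tuples below are invented sample data, using readable names in place of surrogate keys.

```python
# Factless coverage table: (date_key, product, store, promotion), no measures.
promotion_coverage = {
    (20130107, "Paper Towels", "Store 1", "50 Cents Off"),
    (20130107, "Diet Soda", "Store 1", "End-Aisle Display"),
    (20130107, "Cereal", "Store 2", "Coupon"),
}

# Key combinations that appear in the sales fact table (something was sold).
sales_keys = {
    (20130107, "Paper Towels", "Store 1", "50 Cents Off"),
    (20130107, "Cereal", "Store 2", "Coupon"),
}

# What was on promotion but did not sell.
promoted_but_unsold = sorted(promotion_coverage - sales_keys)
print(promoted_but_unsold)
```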
    • Case Study: Retail (continued)
      Degenerate Transaction Number Dimension
      • The retail sales fact table contains the POS transaction number on every line item row. In a traditional parent-child database, the POS transaction number would be the key to the transaction header record, containing all the information valid for the transaction as a whole, such as the transaction date and store identifier. However, in our dimensional model, we have already extracted this interesting header information into other dimensions. The POS transaction number is still useful because it serves as the grouping key for pulling together all the products purchased in a single transaction.
      • Although the POS transaction number looks like a dimension key in the fact table, we have stripped off all the descriptive items that might otherwise fall in a POS transaction dimension. Since the resulting dimension is empty, we refer to the POS transaction number as a degenerate dimension (DD). Order numbers, invoice numbers, and bill-of-lading numbers almost always appear as degenerate dimensions in a dimensional model.
      • Degenerate dimensions often play an integral role in the fact table's primary key. In our case study, the primary key of the retail sales fact table consists of the degenerate POS transaction number and product key.
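The two roles of the degenerate dimension, grouping key and part of the primary key, can be sketched with a few invented fact rows (field names are assumptions):

```python
from collections import defaultdict

# Fact rows carry the POS transaction number directly; there is no
# transaction dimension table to join to.
sales_fact = [
    {"pos_transaction_number": 1001, "product_key": 10, "sales_quantity": 2},
    {"pos_transaction_number": 1001, "product_key": 11, "sales_quantity": 1},
    {"pos_transaction_number": 1002, "product_key": 10, "sales_quantity": 1},
]

# Grouping key: pull together all products purchased in one transaction.
baskets = defaultdict(list)
for row in sales_fact:
    baskets[row["pos_transaction_number"]].append(row["product_key"])

# Primary key: (POS transaction number, product key) identifies each row.
primary_keys = {(r["pos_transaction_number"], r["product_key"]) for r in sales_fact}
keys_are_unique = len(primary_keys) == len(sales_fact)

print(dict(baskets), keys_are_unique)
```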
    • Case Study: Retail (continued)
      Resisting Comfort Zone Urges
      • Dimension Normalization (Snowflaking)
        – The flattened, denormalized dimension tables with repeating textual values, e.g., product dimension, may make a normalization modeler uncomfortable. Rather than redundantly storing the department description (and all descriptors) in the product dimension table, they may want to store a department code and then create a new department dimension for the department decodes. This is referred to as snowflaking.
        – While snowflaking is a legal extension of the dimensional model, in general, the urge to snowflake should be resisted given our two primary design drivers: ease of use and performance.
    • Case Study: Retail (continued)
      Resisting Comfort Zone Urges (continued)
      • Too Many Dimensions
        – The fact table in a dimensional schema is naturally highly normalized and compact.
        – Interestingly, while uncomfortable with denormalized dimension tables, some modelers are tempted to denormalize the fact table. Rather than having a single product foreign key on the fact table, they include foreign keys for the frequently analyzed elements on the product hierarchy (e.g., brand, category, department) or date dimension (e.g., week, month, year). The once-compact fact table then joins to literally dozens of dimension tables.
        – Tip: A very large number of dimensions typically is a sign that several dimensions are not completely independent and should be combined into a single dimension. It is a dimensional modeling mistake to represent elements of a hierarchy as separate dimensions in the fact table.
    • Case Study: Retail (continued)
    • For the Next Session
    • For the Next Sessions
      Agenda
      • Module 2: Data Warehouse Design Considerations (continued)
        – Case Study: Education
        – Case Study: Communications
        – Dimensional Modeling Best Practices
      Group Project
      • Project Identification and Sign-Off (Data Model)