View stunning SlideShares in full-screen with the new iOS app!Introducing SlideShare for AndroidExplore all your favorite topics in the SlideShare appGet the SlideShare app to Save for Later — even offline
View stunning SlideShares in full-screen with the new Android app!View stunning SlideShares in full-screen with the new iOS app!
A data warehouse is based on a multidimensional data model which views data in the form of a data cube
A data cube allows data to be modeled and viewed in multiple dimensions
Dimension tables , such as item (item_name, brand, type), or time(day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold ) and keys to each of the related dimension tables
In data warehousing literature, an n-D base cube is called a base cuboid . The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid . The lattice of cuboids forms a data cube.
A Sample Data Cube Total annual sales of TV in U.S.A. Time Item Location All, All, All sum sum TV VCR PC 1Qtr 2Qtr 3Qtr 4Qtr U.S.A Canada Mexico sum
4-D Data Cube Supplier 1 Supplier 2 Supplier 3
Cube: A Lattice of Cuboids all time item location supplier time,item time,location time,supplier item,location item,supplier location,supplier time,item,location time,item,supplier time,location,supplier item,location,supplier time, item, location, supplier 0-D(apex) cuboid 1-D cuboids 2-D cuboids 3-D cuboids 4-D(base) cuboid
Star schema : A fact table in the middle connected to a set of dimension tables
Snowflake schema : A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables , forming a shape similar to snowflake
Fact constellations : Multiple fact tables share dimension tables , viewed as a collection of stars, therefore called galaxy schema or fact constellation
Example of Star Schema Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures item_key item_name brand type supplier_type item branch_key branch_name branch_type branch time_key day day_of_the_week month quarter year time location_key street city province_or_street country location
Example of Snowflake Schema location_key street city_key location Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures item_key item_name brand type supplier_key item supplier_key supplier_type supplier time_key day day_of_the_week month quarter year time branch_key branch_name branch_type branch city_key city province country city
Example of Fact Constellation Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures Shipping Fact Table time_key item_key shipper_key from_location to_location dollars_cost units_shipped shipper_key shipper_name location_key shipper_type shipper time_key day day_of_the_week month quarter year time location_key street city province_or_street country location item_key item_name brand type supplier_type item branch_key branch_name branch_type branch
A Data Mining Query Language, DMQL: Language Primitives
by climbing up concept hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
from higher level summary to lower level summary or detailed data, or introducing new dimensions
Slice and dice:
project and select
reorient the cube, visualization, 3D to series of 2D planes.
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
Querying Using a Star-Net Model Shipping Method AIR-EXPRESS TRUCK ORDER Customer Orders CONTRACTS Customer Product PRODUCT GROUP PRODUCT LINE PRODUCT ITEM SALES PERSON DISTRICT DIVISION Organization Promotion CITY COUNTRY REGION Location DAILY QTRLY ANNUALY Time Each circle is called a footprint
Chapter 2: Data Warehousing and OLAP Technology for Data Mining
Top-down, bottom-up approaches or a combination of both
Top-down : Starts with overall design and planning (mature)
Bottom-up : Starts with experiments and prototypes (rapid)
From software engineering point of view
Waterfal l: structured and systematic analysis at each step before proceeding to the next
Spiral : rapid generation of increasingly functional systems, quick modifications, timely adaptation of new designs and technologies
Typical data warehouse design process
Choose a business process to model, e.g., orders, invoices, etc.
Choose the grain ( atomic level of data ) of the business process
Choose the dimensions that will apply to each fact table record
Choose the measure that will populate each fact table record
Multi-Tiered Architecture Data Warehouse OLAP Engine Analysis Query Reports Data mining Monitor & Integrator Metadata Data Sources Front-End Tools Serve Data Marts Data Storage OLAP Server Extract Transform Load Refresh Operational DBs other sources
a 0 b 0 is not computed (we will need to scan 1-49)
We need to keep 4 a-c chunks in memory We need to keep 16 a-b chunks in memory We need to keep a single b-c chunk in memory A B 29 30 31 32 1 2 3 4 5 9 13 14 15 16 64 63 62 61 48 47 46 45 a1 a0 c3 c2 c1 c 0 b3 b2 b1 b0 a2 a3 C 44 28 56 40 24 52 36 20 60 B
Can we provide the user with some information before the exact average is computed for all states?
Solution: show the current “running average” for each state as the computation proceeds.
Even better, if we use statistical techniques and sample tuples to aggregate instead of simply scanning the aggregated table, we can provide bounds such as “the average for Wisconsin is 2000±102 with 95% probability.