Data Warehouse Basics
- 2. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com
Data Warehouse Basics
•Data Usage Challenges
•OLAP vs. OLTP
•Understanding Normalization
•OLAP
•Star Schema Basics
•Snowflake Schema Basics
•Understanding Granularity
•Auditing
- 3.
Data Usage Challenges
•Databases are usually divided into two separate types: OLTP and OLAP
- 4.
OLAP vs. OLTP
OLTP System: Online Transaction Processing (Operational System)
OLAP System: Online Analytical Processing (Data Warehouse)
Source of data
•OLTP: Operational data; OLTPs are the original source of the data
•OLAP: Consolidated data; OLAP data comes from the various OLTP databases
Purpose of data
•OLTP: To control and run fundamental business tasks
•OLAP: To help with planning, problem solving, and decision support
What the data reveals
•OLTP: A snapshot of ongoing business processes
•OLAP: Multi-dimensional views of various kinds of business activities
Inserts and updates
•OLTP: Short and fast inserts and updates initiated by end users
•OLAP: Periodic long-running batch jobs refresh the data
Queries
•OLTP: Relatively standardized and simple queries returning relatively few records
•OLAP: Often complex queries involving aggregations
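The difference between the two query styles can be sketched with Python's built-in sqlite3 module. This is only an illustration: the orders table, its columns, and the sample rows are assumptions, not part of the original slides.

```python
import sqlite3

# Hypothetical orders table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, "
    "order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "2013-09-14", 120.0),
        (2, "bob", "2013-09-14", 80.0),
        (3, "alice", "2013-09-15", 50.0),
    ],
)

# OLTP-style query: a simple lookup that returns a single record.
oltp_row = conn.execute(
    "SELECT customer, amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# OLAP-style query: an aggregation over many records.
olap_rows = conn.execute(
    "SELECT order_date, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY order_date ORDER BY order_date"
).fetchall()
```

The first query touches one row by key, while the second has to read every row to build its summary.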
- 5.
OLAP vs. OLTP
OLTP System: Online Transaction Processing (Operational System)
OLAP System: Online Analytical Processing (Data Warehouse)
Processing speed
•OLTP: Typically very fast
•OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes
Space requirements
•OLTP: Can be relatively small if historical data is archived
•OLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP
Database design
•OLTP: Highly normalized with many tables
•OLAP: Typically de-normalized with fewer tables; use of star and/or snowflake schemas
Backup and recovery
•OLTP: Back up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability
•OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method
- 6.
Data Usage Challenges
•Databases start out as OLTP (99.99% of the time…)
•OLAP functionality becomes a need as data accumulates
•At some point two databases are required
•The OLTP captures and manages daily transactions
•The OLAP is periodically loaded with data from OLTP
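A periodic load of this kind might look like the following minimal sketch. Two in-memory SQLite databases stand in for the separate systems; the table names, the load_daily_sales function, and the full-rebuild strategy are all assumptions chosen for brevity.

```python
import sqlite3

# Two in-memory databases stand in for the separate OLTP and OLAP systems.
oltp = sqlite3.connect(":memory:")
olap = sqlite3.connect(":memory:")

oltp.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
)
oltp.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2013-09-14", 120.0), (2, "2013-09-14", 80.0), (3, "2013-09-15", 50.0)],
)
olap.execute(
    "CREATE TABLE daily_sales ("
    "order_date TEXT PRIMARY KEY, order_count INTEGER, total_amount REAL)"
)


def load_daily_sales(source, target):
    """Hypothetical periodic job: rebuild the daily summary from OLTP orders."""
    rows = source.execute(
        "SELECT order_date, COUNT(*), SUM(amount) FROM orders GROUP BY order_date"
    ).fetchall()
    with target:  # commits on success, rolls back on error
        target.execute("DELETE FROM daily_sales")
        target.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", rows)
    return len(rows)


loaded_days = load_daily_sales(oltp, olap)
summary = olap.execute(
    "SELECT order_date, order_count, total_amount FROM daily_sales ORDER BY order_date"
).fetchall()
```

A real job would more likely load only the rows added since the previous run; the full rebuild here just keeps the sketch short.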
- 7.
Understanding Normalization
•What is Normalization ?
•The process of organizing the tables in a relational Database
•Eliminates data redundancy
•Lowers record locking
•Increases efficiency in concurrency
•Accomplished by dividing large tables into smaller tables
•Tables have relationships defined
- 8.
Understanding Normalization
•Form zero
- 9.
Understanding Normalization
•First Form
•Break each field down to the smallest meaningful value
•Remove repeating groups of data and create a separate table for each set of related data
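The two first-form steps can be illustrated in plain Python. The employee records, field names, and phone numbers below are hypothetical; the point is only to show a combined field being split and a repeating group being moved into its own table.

```python
# Hypothetical un-normalized rows: a combined name field and a
# repeating group of phone numbers packed into one column.
raw_rows = [
    {"emp_id": 1, "name": "Dana Levi", "phones": "555-0101, 555-0102"},
    {"emp_id": 2, "name": "Omar Haddad", "phones": "555-0199"},
]

employees = []        # one row per employee, atomic fields only
employee_phones = []  # separate table for the repeating group

for row in raw_rows:
    # Break the name down to its smallest meaningful values.
    first_name, last_name = row["name"].split(" ", 1)
    employees.append(
        {"emp_id": row["emp_id"], "first_name": first_name, "last_name": last_name}
    )
    # Each phone number becomes its own row, linked back by emp_id.
    for phone in row["phones"].split(","):
        employee_phones.append({"emp_id": row["emp_id"], "phone": phone.strip()})
```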
- 10.
Understanding Normalization
•Second Form
•Create new tables for data that applies to more than one record in a table
•Add a related field (foreign key) to the table
- 11.
Understanding Normalization
•Third Form
•Remove fields that do not relate to, or provide a fact about, the primary key.
•Move the Manager, Dept, and Sector fields to another table. In addition, a field that establishes a relationship between the tables should be added
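As a rough sketch of the result, the Manager, Dept, and Sector fields can live in their own table, with a key column linking the two tables. The table names, the dept_id column, and the sample rows are assumptions; the slides do not show the actual table layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute(
    "CREATE TABLE departments ("
    "dept_id INTEGER PRIMARY KEY, dept TEXT, manager TEXT, sector TEXT)"
)
conn.execute(
    "CREATE TABLE employees ("
    "emp_id INTEGER PRIMARY KEY, name TEXT, "
    "dept_id INTEGER NOT NULL REFERENCES departments(dept_id))"
)
conn.execute("INSERT INTO departments VALUES (10, 'Billing', 'R. Cohen', 'Finance')")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Dana Levi", 10), (2, "Omar Haddad", 10)],
)

# Manager, Dept, and Sector are stored once and reached through the key.
joined = conn.execute(
    "SELECT e.name, d.dept, d.manager, d.sector "
    "FROM employees e JOIN departments d ON d.dept_id = e.dept_id "
    "ORDER BY e.emp_id"
).fetchall()

# An employee pointing at a missing department is rejected.
try:
    conn.execute("INSERT INTO employees VALUES (3, 'Ghost', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```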
- 12.
Normalized Structure Challenges
•It is usually very inefficient for data extraction
•Usually requires multiple table joins to reach all the data
•Join queries can be challenging to write
•Join queries can be challenging for the Database Engine
•It doesn’t store data in the form needed for data analysis
•Data is stored in the most detailed form, without aggregation
•Data may be stored in multiple, normalized Databases
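To make the join burden concrete, here is a small hypothetical example in which one business question ("revenue per city and category") has to touch four normalized tables and aggregate detailed rows at query time. The schema and data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers   (customer_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE products    (product_id INTEGER PRIMARY KEY, category TEXT, price REAL);
CREATE TABLE orders      (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE order_items (order_id INTEGER, product_id INTEGER, quantity INTEGER);

INSERT INTO customers VALUES (1, 'Haifa'), (2, 'Tel Aviv');
INSERT INTO products VALUES (1, 'Books', 10.0), (2, 'Games', 40.0);
INSERT INTO orders VALUES (100, 1), (101, 2), (102, 1);
INSERT INTO order_items VALUES (100, 1, 2), (101, 2, 1), (102, 2, 1);
""")

# The question spans all four tables, and the totals must be
# computed from the detailed rows every time the query runs.
revenue = conn.execute("""
    SELECT c.city, p.category, SUM(oi.quantity * p.price)
    FROM order_items oi
    JOIN orders o    ON o.order_id = oi.order_id
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN products p  ON p.product_id = oi.product_id
    GROUP BY c.city, p.category
    ORDER BY c.city, p.category
""").fetchall()
```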
- 13.
Star Schema Basics
•What is a Star Schema ?
•The simplest form of database structure used in a DWH
•Answers the basic question :
•What happened, who did it, when did they do it, etc.
•Focuses on one, single business area
•What advantages does a star schema offer ?
•Separates data into two main categories
•Fact
•Dimensions (descriptive information about the facts)
- 14.
Star Schema Basics
•Fact vs. Dimensions
•Fact (what happened)
•Product sold
•Customer who bought
•Etc.
•Dimensions (Attributes that describe what happened)
•When the product was sold
•Day / Date / year / quarter
•Where the product was sold
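A minimal star schema along these lines might be sketched as follows. The fact_sales, dim_date, and dim_store names, their columns, and the sample rows are assumptions made for the example; the slides' own diagrams are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes (when, where).
CREATE TABLE dim_date  (date_key INTEGER PRIMARY KEY, full_date TEXT,
                        year INTEGER, quarter INTEGER);
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, city TEXT);

-- Fact table: what happened, with one key per dimension.
CREATE TABLE fact_sales (
    date_key  INTEGER REFERENCES dim_date(date_key),
    store_key INTEGER REFERENCES dim_store(store_key),
    quantity  INTEGER,
    amount    REAL
);

INSERT INTO dim_date VALUES (20130915, '2013-09-15', 2013, 3),
                            (20131117, '2013-11-17', 2013, 4);
INSERT INTO dim_store VALUES (1, 'Haifa'), (2, 'Tel Aviv');
INSERT INTO fact_sales VALUES (20130915, 1, 2, 20.0),
                              (20130915, 2, 1, 40.0),
                              (20131117, 1, 3, 30.0);
""")

# Each dimension is exactly one join away from the fact table.
sales_by_quarter_city = conn.execute("""
    SELECT d.quarter, s.city, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d  ON d.date_key = f.date_key
    JOIN dim_store s ON s.store_key = f.store_key
    GROUP BY d.quarter, s.city
    ORDER BY d.quarter, s.city
""").fetchall()
```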
- 15.
Star Schema Basics
- 16.
Star Schema Basics
- 17.
Understanding Fact Tables
•A fact table is a collection of measurements
•Note the word “measurements”
•This is usually a number, something we can measure about a specific business process.
•A fact table contains one or more facts about a specific process (usually numeric)
•Sales amount
•Order quantity
•Tax amount
•Discount amount
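One way to picture a fact row is as a set of dimension keys plus the numeric measures listed above. The SalesFact class, its key fields, and the sample values below are hypothetical and serve only to show that measures can be summed across rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SalesFact:
    """One hypothetical fact row: dimension keys plus numeric measures."""
    date_key: int
    product_key: int
    sales_amount: float
    order_quantity: int
    tax_amount: float
    discount_amount: float


facts = [
    SalesFact(20130915, 1, 100.0, 2, 17.0, 5.0),
    SalesFact(20130915, 2, 50.0, 1, 8.5, 0.0),
]

# Measures are numeric, so they can be summed across rows.
total_sales = sum(f.sales_amount for f in facts)
total_quantity = sum(f.order_quantity for f in facts)
net_revenue = sum(f.sales_amount - f.discount_amount for f in facts)
```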
- 18.
Understanding Fact Tables
•Fact tables may contain multiple measurements only if they are closely related.
•A data warehouse will have many fact tables
•Each table stores data (measures) for a specific business area
•Products sold Fact Table / shipment details Fact Table
•Since fact table design depends on science and an understanding of the data, fact tables can be designed in many ways.
- 19.
Understanding Dimensions
•Dimensions give context to measures (facts)
•Dimensions give context, or specific meaning to facts.
•The term “Dimension” usually refers to a table of related dimensions.
- 20.
Understanding Dimensions
•Example :
•A fact table contains the number of products sold
•A date dimension table contains the following “dimensions” of dates pertaining to the number of products sold
•Date and time (15.09.2013 09:25:32)
•Quarter
•Day of year (321)
•Week (44)
•Weekday (Thursday)
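Attributes like these can be derived from a single timestamp. The sketch below uses Python's datetime module; the function name and the choice of ISO week numbering are assumptions, and the values it yields for a given date need not match the illustrative numbers shown in parentheses in the list above.

```python
from datetime import datetime

# Fixed English names avoid any dependence on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def date_dimension_row(ts: datetime) -> dict:
    """Build one hypothetical date-dimension row from a timestamp."""
    return {
        "date_time": ts.strftime("%d.%m.%Y %H:%M:%S"),
        "quarter": (ts.month - 1) // 3 + 1,
        "day_of_year": ts.timetuple().tm_yday,
        "week": ts.isocalendar()[1],  # ISO 8601 week number (an assumption)
        "weekday": WEEKDAYS[ts.weekday()],
    }


row = date_dimension_row(datetime(2013, 9, 15, 9, 25, 32))
```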
- 21.
Understanding Dimensions
•Each individual column in a dimension table is an attribute.
•Attributes usually compress or expand data detail
•Data can be “discretized” into smaller, summarized groups
•Days (365 values)
•Weeks (52 Values)
•Months (12 values)
•Quarters (4 values)
•Hour / Minute / Second, etc.
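The idea of discretizing detailed values into smaller groups can be sketched as a simple roll-up. The daily sales figures and the roll_up helper below are made up for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily values (up to 365 distinct keys per year).
daily_sales = {
    date(2013, 1, 15): 100,
    date(2013, 2, 3): 50,
    date(2013, 4, 20): 70,
    date(2013, 4, 21): 30,
}


def roll_up(values, key_fn):
    """Sum the values into the coarser groups produced by key_fn."""
    groups = defaultdict(int)
    for day, amount in values.items():
        groups[key_fn(day)] += amount
    return dict(groups)


by_month = roll_up(daily_sales, lambda d: d.month)                   # at most 12 groups
by_quarter = roll_up(daily_sales, lambda d: (d.month - 1) // 3 + 1)  # at most 4 groups
```

Each coarser attribute yields fewer, more summarized groups from the same underlying rows.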
- 22.
Understanding Dimensions
- 23.
Snowflake Schema Basics
•What is a Snowflake Schema ?
•A Star Schema with a little normalization added in
•Dimension tables are normalized somewhat
•Why use snowflake schema ?
•To satisfy data gathering functionality of more advanced data warehousing / mining tools
•To logically separate large dimensions tables
•To more naturally separate dimensional data
•Known customers vs. anonymous customers
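A snowflaked dimension could look like the following sketch, in which a product dimension has its category normalized out into a separate table. The table and column names are hypothetical. Note the extra join compared with a plain star schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- In a star schema, category_name would be a column of dim_product.
-- Here it is normalized out into its own table.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    amount      REAL
);

INSERT INTO dim_category VALUES (1, 'Books'), (2, 'Games');
INSERT INTO dim_product VALUES (10, 'Atlas', 1), (11, 'Novel', 1), (12, 'Chess set', 2);
INSERT INTO fact_sales VALUES (10, 30.0), (11, 15.0), (12, 40.0), (10, 30.0);
""")

# Reaching the category now takes two joins instead of one.
sales_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p  ON p.product_key = f.product_key
    JOIN dim_category c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
```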
- 24.
Snowflake Schema Basics
•One main rule concerning snowflake schema
•Don’t use it unless you want to or need to.
- 25.
Snowflake Schema Basics
- 26.
Understanding Granularity
•What is meant by the term granularity in a DWH ?
•The level of detail available
•What determines granularity ?
•The level of data loaded into the fact table
•For example, per-order numbers vs. daily numbers vs. weekly numbers, etc.
•The number and detail level of dimensions
•If we want to look into customer details but we don’t have a customer dimension, this data won’t be available
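The effect of grain can be shown with a few hypothetical fact rows. Data loaded at order level can be rolled up to daily totals or viewed per customer, whereas a table loaded only at daily level cannot be broken back down.

```python
from collections import defaultdict

# Hypothetical fact rows at order grain: (order_id, order_date, customer, amount).
order_grain = [
    (1, "2013-09-14", "alice", 120.0),
    (2, "2013-09-14", "bob", 80.0),
    (3, "2013-09-15", "alice", 50.0),
]

# Rolling up from order grain to daily grain is always possible.
daily_totals = defaultdict(float)
for _order_id, order_date, _customer, amount in order_grain:
    daily_totals[order_date] += amount
daily_grain = dict(daily_totals)

# Because the customer is kept at this grain, a per-customer view is available too.
customer_totals = defaultdict(float)
for _order_id, _order_date, customer, amount in order_grain:
    customer_totals[customer] += amount
per_customer = dict(customer_totals)

# A table loaded at daily grain would hold only daily_grain;
# the order and customer detail could not be recovered from it.
```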
- 27.
Understanding Granularity
•Granularity should be determined during database design.
•Granularity can also be changed after the database has been created, but doing so requires much more effort.
•This change may involve
•Changing fact table structure
•Possible changes in dimension tables
•Changes in data loading
- 28.
Auditing
•Data warehouses do not store data as it is created.
•OLTP databases are populated as business occurs
•Source and purpose of data are generally self-explanatory
•Data is added when a transaction occurs
•DWHs are populated from OLTP data
•Based on various conditions
•At various times
•From various sources
- 29.
Auditing
•Data can be informative based on different aspects
•The data itself
•The source of the data
•The volume of the data
•These characteristics usually change over time
•Auditing identifies these aspects
•Audit data is usually stored in tables
•It describes the source, duration of load, who performed the load, etc.
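An audit table of this kind might be sketched as below. The load_audit table, its columns, and the audited_load function are assumptions for illustration and are not tied to any particular ETL tool.

```python
import sqlite3
import time
from datetime import datetime, timezone

dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE fact_sales (order_date TEXT, amount REAL)")
dwh.execute(
    "CREATE TABLE load_audit ("
    "load_id INTEGER PRIMARY KEY AUTOINCREMENT, source TEXT, loaded_by TEXT, "
    "started_at TEXT, duration_seconds REAL, row_count INTEGER)"
)


def audited_load(conn, rows, source, loaded_by):
    """Load rows into fact_sales and record one audit row describing the load."""
    started_at = datetime.now(timezone.utc).isoformat()
    start = time.perf_counter()
    with conn:  # the data and its audit record commit together
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
        duration = time.perf_counter() - start
        conn.execute(
            "INSERT INTO load_audit "
            "(source, loaded_by, started_at, duration_seconds, row_count) "
            "VALUES (?, ?, ?, ?, ?)",
            (source, loaded_by, started_at, duration, len(rows)),
        )
    return len(rows)


audited_load(dwh, [("2013-09-14", 200.0), ("2013-09-15", 50.0)], "orders_oltp", "etl_user")
audit_row = dwh.execute(
    "SELECT source, loaded_by, row_count, duration_seconds FROM load_audit"
).fetchone()
```

Writing the audit row in the same transaction as the data means a failed load leaves neither partial data nor a misleading audit entry.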
- 30.
Auditing
•SQL Server Integration Services
•Provides SSIS logging