Data Warehouse Basics 
Ram Kedem
Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com 
Data Warehouse Basics 
•Data Usage Challenges 
•OLAP vs. OLTP 
•Understanding Normalization 
•OLAP 
•Star Schema Basics 
•Snowflake Schema Basics 
•Understanding Granularity 
•Auditing
Data Usage Challenges 
•Databases are usually divided into two types: OLTP and OLAP
OLAP vs. OLTP 
•OLTP system: Online Transaction Processing (operational system) 
•OLAP system: Online Analytical Processing (data warehouse) 
Source of data 
•OLTP: Operational data; OLTP systems are the original source of the data 
•OLAP: Consolidated data; OLAP data comes from the various OLTP databases 
Purpose of data 
•OLTP: To control and run fundamental business tasks 
•OLAP: To help with planning, problem solving, and decision support 
What the data reveals 
•OLTP: A snapshot of ongoing business processes 
•OLAP: Multi-dimensional views of various kinds of business activities 
Inserts and updates 
•OLTP: Short and fast inserts and updates initiated by end users 
•OLAP: Periodic long-running batch jobs refresh the data 
Queries 
•OLTP: Relatively standardized and simple queries that return relatively few records 
•OLAP: Often complex queries involving aggregations
OLAP vs. OLTP (continued) 
Processing speed 
•OLTP: Typically very fast 
•OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes 
Space requirements 
•OLTP: Can be relatively small if historical data is archived 
•OLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP 
Database design 
•OLTP: Highly normalized with many tables 
•OLAP: Typically de-normalized with fewer tables; uses star and/or snowflake schemas 
Backup and recovery 
•OLTP: Back up religiously; operational data is critical to running the business, and data loss is likely to entail significant monetary loss and legal liability 
•OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method
Data Usage Challenges 
•Databases start out as OLTP (99.99% of the time…) 
•OLAP functionality becomes a need as data accumulates 
•At some point two databases are required 
•The OLTP captures and manages daily transactions 
•The OLAP is periodically loaded with data from OLTP
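The last two bullets can be sketched as a small batch job. This is only an illustration under stated assumptions: the `orders` and `daily_sales` tables, their columns, and the `load_daily_sales` function are hypothetical, and two in-memory SQLite databases stand in for the OLTP and OLAP systems.

```python
import sqlite3

# Hypothetical OLTP database: one row per order, written as business happens.
oltp = sqlite3.connect(":memory:")
oltp.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
)
oltp.executemany(
    "INSERT INTO orders (order_date, amount) VALUES (?, ?)",
    [("2014-01-01", 10.0), ("2014-01-01", 15.0), ("2014-01-02", 20.0)],
)

# Hypothetical OLAP database: refreshed periodically by a batch job.
olap = sqlite3.connect(":memory:")
olap.execute(
    "CREATE TABLE daily_sales "
    "(order_date TEXT PRIMARY KEY, total_amount REAL, order_count INTEGER)"
)


def load_daily_sales(source, target):
    """Batch job: summarize the OLTP orders and reload the OLAP table."""
    rows = source.execute(
        "SELECT order_date, SUM(amount), COUNT(*) "
        "FROM orders GROUP BY order_date ORDER BY order_date"
    ).fetchall()
    with target:  # commit the whole refresh as one transaction
        target.execute("DELETE FROM daily_sales")
        target.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", rows)
    return len(rows)


loaded = load_daily_sales(oltp, olap)
daily = olap.execute("SELECT * FROM daily_sales ORDER BY order_date").fetchall()
```

A full reload is the simplest possible refresh strategy; real loads are often incremental, which this sketch does not attempt.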
Understanding Normalization 
•What is normalization? 
•The process of organizing the tables in a relational database 
•Eliminates data redundancy 
•Reduces record locking 
•Improves concurrency 
•Accomplished by dividing large tables into smaller tables 
•Relationships are defined between the tables
Understanding Normalization 
•Form zero
Understanding Normalization 
•First Normal Form (1NF) 
•Break each field down to the smallest meaningful value 
•Remove repeating groups of data and create a separate table for each set of related data
Understanding Normalization 
•Second Normal Form (2NF) 
•Create new tables for data that applies to more than one record in a table 
•Add a related field (foreign key) to the table
Understanding Normalization 
•Third Normal Form (3NF) 
•Remove fields that do not relate to, or provide a fact about, the primary key. 
•Move the Manager, Dept, and Sector fields to another table. In addition, add a field that establishes a relationship between the tables.
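The third-form step can be illustrated with a minimal sketch. The slide's example table is not reproduced in this text, so the rows, the `employees` and `departments` table names, and the assumption that Manager and Sector depend on Dept are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical flat rows: (name, dept, manager, sector). Manager and Sector
# are assumed to depend on Dept, so they repeat for every employee in it.
flat_rows = [
    ("Alice", "Sales", "Dana", "Retail"),
    ("Bob", "Sales", "Dana", "Retail"),
    ("Carol", "IT", "Eli", "Technology"),
]

conn.executescript("""
    CREATE TABLE departments (
        dept_id INTEGER PRIMARY KEY,
        dept    TEXT UNIQUE,
        manager TEXT,
        sector  TEXT
    );
    CREATE TABLE employees (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT,
        dept_id INTEGER REFERENCES departments(dept_id)  -- relationship field
    );
""")

for name, dept, manager, sector in flat_rows:
    # Store each department's facts once ...
    conn.execute(
        "INSERT OR IGNORE INTO departments (dept, manager, sector) VALUES (?, ?, ?)",
        (dept, manager, sector),
    )
    dept_id = conn.execute(
        "SELECT dept_id FROM departments WHERE dept = ?", (dept,)
    ).fetchone()[0]
    # ... and keep only the foreign key on the employee row.
    conn.execute(
        "INSERT INTO employees (name, dept_id) VALUES (?, ?)", (name, dept_id)
    )

department_count = conn.execute("SELECT COUNT(*) FROM departments").fetchone()[0]
employee_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
```

After the split, a change of manager touches one `departments` row instead of every employee row in that department.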
Normalized Structure Challenges 
•It is usually very inefficient for data extraction 
•Usually requires multiple table joins to reach all the data 
•Join queries can be challenging to write 
•Join queries can be challenging for the database engine 
•It doesn’t store data in the form needed for data analysis 
•Data is stored in the most detailed form, without aggregation 
•Data may be stored in multiple, normalized Databases
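To make the join burden concrete, here is a hedged sketch of a simple revenue report over a normalized design. The four tables, their columns, and the sample rows are hypothetical; the point is only that one analytical question already needs three joins plus an aggregation over the most detailed rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers   (customer_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE products    (product_id INTEGER PRIMARY KEY, category TEXT, price REAL);
    CREATE TABLE orders      (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE order_items (order_id INTEGER, product_id INTEGER, quantity INTEGER);

    INSERT INTO customers   VALUES (1, 'Haifa'), (2, 'Tel Aviv');
    INSERT INTO products    VALUES (10, 'Books', 20.0), (11, 'Games', 50.0);
    INSERT INTO orders      VALUES (100, 1), (101, 2), (102, 1);
    INSERT INTO order_items VALUES (100, 10, 2), (101, 11, 1), (102, 10, 1);
""")

# Revenue per city and product category: every table must be joined in,
# and the detail rows must be aggregated at query time.
report = conn.execute("""
    SELECT c.city, p.category, SUM(oi.quantity * p.price) AS revenue
    FROM order_items AS oi
    JOIN orders    AS o ON o.order_id    = oi.order_id
    JOIN customers AS c ON c.customer_id = o.customer_id
    JOIN products  AS p ON p.product_id  = oi.product_id
    GROUP BY c.city, p.category
    ORDER BY c.city, p.category
""").fetchall()
```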
Star Schema Basics 
•What is a star schema? 
•The simplest form of database structure used in a data warehouse (DWH) 
•Answers the basic questions: 
•What happened, who did it, when did they do it, etc. 
•Focuses on a single business area 
•What advantages does a star schema offer? 
•Separates data into two main categories 
•Facts 
•Dimensions (descriptive information about the facts)
Star Schema Basics 
•Fact vs. Dimensions 
•Fact (what happened) 
•Product sold 
•Customer who bought 
•Etc. 
•Dimensions (Attributes that describe what happened) 
•When the product was sold 
•Day / Date / year / quarter 
•Where the product was sold
Understanding Fact Tables 
•A fact table is a collection of measurements 
•Note the word “measurements” 
•A measurement is usually a number, something we can measure about a specific business process. 
•A fact table contains one or more facts (usually numeric) about a specific process 
•Sales amount 
•Order quantity 
•Tax amount 
•Discount amount
Understanding Fact Tables 
•Fact tables may contain multiple measurements only if they are closely related. 
•A data warehouse will have many fact tables 
•Each table stores data (measures) for a specific business area 
•Products sold fact table / shipment details fact table 
•Because fact table design depends on science and on understanding the data, there are many ways to design a fact table.
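A minimal star schema sketch, assuming hypothetical `fact_sales`, `dim_date`, and `dim_product` tables. The measure columns reuse the examples listed for fact tables (sales amount, order quantity, tax amount, discount amount); the key layout and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive context for the facts.
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, quarter INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);

    -- Fact table: numeric measurements plus one key per dimension.
    CREATE TABLE fact_sales (
        date_key        INTEGER REFERENCES dim_date(date_key),
        product_key     INTEGER REFERENCES dim_product(product_key),
        order_quantity  INTEGER,
        sales_amount    REAL,
        tax_amount      REAL,
        discount_amount REAL
    );

    INSERT INTO dim_date    VALUES (20130915, '2013-09-15', 3), (20131117, '2013-11-17', 4);
    INSERT INTO dim_product VALUES (1, 'Keyboard'), (2, 'Mouse');
    INSERT INTO fact_sales  VALUES
        (20130915, 1, 2, 100.0, 17.0,  0.0),
        (20130915, 2, 1,  25.0,  4.25, 5.0),
        (20131117, 1, 1,  50.0,  8.5,  0.0);
""")

# One join per dimension that the question needs, then aggregate the measures.
sales_by_quarter = conn.execute("""
    SELECT d.quarter, SUM(f.order_quantity), SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY d.quarter
    ORDER BY d.quarter
""").fetchall()
```

The fact table sits in the middle and each dimension is a single join away, which is what gives the schema its star shape.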
Understanding Dimensions 
•Dimensions give context, or specific meaning, to measures (facts) 
•The term “Dimension” usually refers to a table of related dimensions.
Understanding Dimensions 
•Example: 
•A fact table contains the number of products sold 
•A date dimension table contains the following “dimensions” of the dates on which the products were sold (sample values in parentheses) 
•Date and time (15.09.2013 09:25:32) 
•Quarter 
•Day of year (321) 
•Week (44) 
•Weekday (Thursday)
Understanding Dimensions 
•Each individual column in a dimension table is an attribute. 
•Attributes usually compress or expand the level of data detail 
•Data can be “discretized” into smaller, summarized groups 
•Days (365 values) 
•Weeks (52 Values) 
•Months (12 values) 
•Quarters (4 values) 
•Hour / Minute / Second ..
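As a sketch of how such attributes can be derived, the hypothetical function below builds one date-dimension row from a timestamp. The function name, the dictionary keys, and the choice of ISO 8601 week numbering are assumptions; other week-numbering conventions would give different week values.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def date_dimension_row(ts: datetime) -> dict:
    """Derive date-dimension attributes from one timestamp."""
    _iso_year, iso_week, _iso_weekday = ts.isocalendar()
    return {
        "date_time": ts.strftime("%d.%m.%Y %H:%M:%S"),
        "year": ts.year,
        "quarter": (ts.month - 1) // 3 + 1,   # months 1-3 -> 1, 4-6 -> 2, ...
        "month": ts.month,
        "week": iso_week,                      # ISO 8601 week (an assumption)
        "day_of_year": ts.timetuple().tm_yday,
        "weekday": WEEKDAYS[ts.weekday()],     # weekday() returns Monday as 0
        "hour": ts.hour,
    }


row = date_dimension_row(datetime(2013, 9, 15, 9, 25, 32))
```

For the timestamp 15.09.2013 09:25:32 this yields quarter 3, day of year 258, ISO week 37, and Sunday. Each attribute groups the same timestamps at a different level of detail.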
Snowflake Schema Basics 
•What is a snowflake schema? 
•A star schema with a little normalization added in 
•Dimension tables are normalized somewhat 
•Why use a snowflake schema? 
•To satisfy the data-gathering functionality of more advanced data warehousing / mining tools 
•To logically separate large dimension tables 
•To more naturally separate dimensional data 
•Known customers vs. anonymous customers
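A hedged sketch of what "normalized somewhat" can look like: the category attributes of a hypothetical `dim_product` dimension move into their own `dim_category` table. All table names, columns, and rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflaked dimension: category attributes live in their own table.
    CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    );
    CREATE TABLE fact_sales (product_key INTEGER, sales_amount REAL);

    INSERT INTO dim_category VALUES (1, 'Peripherals'), (2, 'Displays');
    INSERT INTO dim_product  VALUES (1, 'Keyboard', 1), (2, 'Mouse', 1), (3, 'Monitor', 2);
    INSERT INTO fact_sales   VALUES (1, 100.0), (2, 25.0), (3, 300.0), (1, 50.0);
""")

# Reaching the category now takes two joins instead of one.
sales_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_key  = f.product_key
    JOIN dim_category AS c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
```

The extra join is the cost of the added normalization, which is one reason to snowflake a dimension only when there is a concrete need.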
Snowflake Schema Basics 
•One main rule concerning snowflake schema 
•Don’t use it unless you want to or need to.
Understanding Granularity 
•What is meant by the term granularity in a DWH? 
•The level of detail available 
•What determines granularity? 
•The level of data loaded into the fact table 
•For example, per-order numbers vs. daily numbers vs. weekly numbers, etc. 
•The number and detail level of dimensions 
•If we want to look into customer details but don’t have a customer dimension, this data won’t be available
Understanding Granularity 
•Granularity should be determined during database design. 
•Granularity can also be changed after the database has been created, but doing so requires much more effort. 
•This change may involve 
•Changing fact table structure 
•Possible changes in dimension tables 
•Changes in data loading
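A small sketch of why the grain is hard to change later, assuming hypothetical order-level facts held in plain Python tuples. Detailed rows can always be rolled up to a coarser grain, but rows loaded at the coarse grain cannot be split back into the detail.

```python
from collections import defaultdict

# Hypothetical order-grain facts: (order_id, order_date, amount).
order_facts = [
    (1, "2014-01-01", 10.0),
    (2, "2014-01-01", 15.0),
    (3, "2014-01-02", 20.0),
]


def to_daily_grain(facts):
    """Roll order-grain facts up to one total per day."""
    totals = defaultdict(float)
    for _order_id, order_date, amount in facts:
        totals[order_date] += amount
    return dict(totals)


daily_facts = to_daily_grain(order_facts)

# The order grain can answer order-level questions ...
largest_order = max(amount for _, _, amount in order_facts)
# ... but daily_facts no longer carries order_id, so "what was the largest
# single order?" cannot be answered from the daily rows alone.
```

If only the daily totals had been loaded, moving to order-level detail would mean changing the fact table structure and reloading from the source, which matches the list of changes above.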
Auditing 
•Data warehouses do not store data as it is created. 
•OLTP databases are populated as business occurs 
•Source and purpose of the data are generally self-explanatory 
•Data is added when a transaction occurs 
•DWHs are populated from OLTP data 
•Based on various conditions 
•At various times 
•From various sources
Auditing 
•Data can be informative based on different aspects 
•The data itself 
•The source of the data 
•The volume of the data 
•These characteristics usually change over time 
•Auditing identifies these aspects 
•Audit data is usually stored in tables 
•Audit tables describe the source, the duration of the load, who performed the load, etc.
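An audit table of this kind might look like the sketch below. The `load_audit` table, its columns, the `audited_load` function, and the `etl_user` name are all hypothetical; this is not how any particular ETL tool records its loads.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (sales_amount REAL);
    -- Hypothetical audit table: one row per load.
    CREATE TABLE load_audit (
        audit_id         INTEGER PRIMARY KEY,
        source_name      TEXT,
        loaded_by        TEXT,
        row_count        INTEGER,
        duration_seconds REAL
    );
""")


def audited_load(conn, source_name, loaded_by, amounts):
    """Load rows into fact_sales and record who loaded what, and how long it took."""
    started = time.perf_counter()
    with conn:  # data rows and audit row are committed together
        conn.executemany(
            "INSERT INTO fact_sales VALUES (?)", [(a,) for a in amounts]
        )
        conn.execute(
            "INSERT INTO load_audit "
            "(source_name, loaded_by, row_count, duration_seconds) "
            "VALUES (?, ?, ?, ?)",
            (source_name, loaded_by, len(amounts), time.perf_counter() - started),
        )


audited_load(conn, "orders_oltp", "etl_user", [100.0, 25.0, 50.0])
audit_rows = conn.execute(
    "SELECT source_name, loaded_by, row_count FROM load_audit"
).fetchall()
```

Writing the audit row in the same transaction as the data keeps the two consistent: a failed load leaves neither partial data nor a misleading audit entry.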
Auditing 
•SQL Server Integration Services 
•Provides SSIS logging

Data Warehouse Basics

  • 2. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com Data Warehouse Basics •Data Usage Challenges •OLAP vs. OLTP •Understanding Normalization •OLAP •Star Schema Basics •Snowflake Schema Basics •Understanding Granularity •Auditing
  • 3. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com Data Usage Challenges •Databases are usually divided into two separate types –OLTP / OLAP
  • 4. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com OLAP vs. OLTP OLTP SystemOnline Transaction Processing(Operational System) OLAP SystemOnline Analytical Processing(Data Warehouse) Source of data Operational data; OLTPs are the original source of the data. Consolidation data; OLAP data comes from the various OLTP Databases Purpose of data To control and run fundamental business tasks To help with planning, problem solving, and decision support What the data Reveals a snapshot of ongoing business processes Multi-dimensional views of various kinds of business activities Inserts and Updates Short and fast inserts and updates initiated by end users Periodic long-running batch jobs refresh the data Queries Relatively standardized and simple queries Returning relatively few records Often complex queries involving aggregations
  • 5. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com OLAP vs. OLTP OLTP SystemOnline Transaction Processing(Operational System) OLAP SystemOnline Analytical Processing(Data Warehouse) Processing Speed Typically very fast Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes Space Requirements Can be relatively small if historical data is archived Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP Database Design Highly normalized with many tables Typically de-normalized with fewer tables; use of star and/or snowflake schemas Backup and Recovery Backup religiously; operational data is critical to run the business, data loss is likely to entail significant monetary loss and legal liability Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method
  • 6. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com Data Usage Challenges •Databases start out as OLTP (99.99 of times…) •OLAP functionality becomes a need as data accumulates •At some point two databases are required •The OLTP captures and manages daily transactions •The OLAP is periodically loaded with data from OLTP
  • 7. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com Understanding Normalization •What is Normalization ? •The process of organizing the tables in a relational Database •Eliminates data redundancy •Lowers record locking •Increases efficiency in concurrency •Accomplished by dividing large tables into smaller tables •Tables have relationships defined
  • 8. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com Understanding Normalization •Form zero
  • 9. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com Understanding Normalization •First Form •Break each field down to the smallest meaningful value •Remove repeating groups of data and Create a separate table for each set of related data
  • 10. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com Understanding Normalization •Second Form •Create new tables for data that applies to more than one record in a table •Add a related field (foreign key) to the table
  • 11. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com Understanding Normalization •Third Form •Remove fields that do not relate to, or provide a fact about, the primary key. •Take the Manager, Dept, and Sector fields and moved to another table. In addistiona field to establish a relationship between the tables should be added
  • 12. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com Normalized Structure Challenges •It is usually very inefficient for data extraction •Usually requires multiple table joins to reach all the data •Join queries can be a challenging to write •Join queries can be challenging for the Database Engine •It doesn’t store data in the form needed for data analysis •data is stored in the most detailed form, without aggregation •Data may be stored in multiple, normalized Databases
  • 13. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com Star Schema Basics •What is a Star Schema ? •The simplest form of database structure used in a DWH •Answers the basic question : •What happened, who did it, when did they do it.. Etc. •Focuses on one, single business area •What advantaged does a start schema offer ? •Separates data into two main categories •Fact •Dimensions ( Descriptive information about the facts)
  • 14. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com Star Schema Basics •Fact vs. Dimensions •Fact (what happened) •Product sold •Customer who bought •Etc. •Dimensions (Attributes that describe what happened) •When the product was sold •Day / Date / year / quarter •Where the product was sold
  • 15. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com Star Schema Basics
  • 16. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com Star Schema Basics
  • 17. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent. ramkedem.com Understanding Fact Tables •A fact table is a collection of measurements •Note the word Measurements •This is usually a number, something we can measure about a specificbusiness process. •Fact table contains a single / multiple facts about a specific process (usually numeric) •Sales amount •Order quantity •Tax amount •Discount amount
  • 18. Understanding Fact Tables •A fact table may contain multiple measurements only if they are closely related •A data warehouse will have many fact tables •Each table stores data (measures) for a specific business area •Products sold fact table / shipment details fact table •Since fact table design depends on science and an understanding of the data, there are many ways in which fact tables can be designed
  • 19. Understanding Dimensions •Dimensions give context, or specific meaning, to measures (facts) •The term “Dimension” usually refers to a table of related dimensions
  • 20. Understanding Dimensions •Example: •A fact table contains numbers of products sold •A date dimension table contains the following “dimensions” of dates pertaining to the number of products sold •Date and time (15.09.2013 09:25:32) •Quarter •Day of year (321) •Week (44) •Weekday (Thursday)
  • 21. Understanding Dimensions •Each individual column in a dimension table is an attribute •Attributes usually compress or expand data detail •Data can be “discretized” into smaller, summarized groups •Days (365 values) •Weeks (52 values) •Months (12 values) •Quarters (4 values) •Hour / minute / second ...
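As a rough illustration of how one timestamp expands into several attributes, the sketch below derives a date dimension row with Python's datetime module. The function name date_dimension_row and the chosen attribute names are hypothetical, and ISO 8601 week numbering is an assumption; a real warehouse may number weeks differently.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def date_dimension_row(ts):
    """Derive coarser attributes from a single timestamp (hypothetical helper)."""
    _, iso_week, _ = ts.isocalendar()          # ISO 8601 week number (an assumption)
    return {
        "date_time": ts.isoformat(sep=" "),    # most detailed level
        "year": ts.year,
        "quarter": (ts.month - 1) // 3 + 1,    # 4 possible values
        "month": ts.month,                     # 12 possible values
        "iso_week": iso_week,
        "day_of_year": ts.timetuple().tm_yday,
        "weekday": WEEKDAYS[ts.weekday()],     # weekday() returns 0 for Monday
    }

row = date_dimension_row(datetime(2013, 9, 15, 9, 25, 32))
for name, value in row.items():
    print(name, value)
```

Each attribute groups many timestamps into one value, which is what lets a query aggregate facts by quarter or by weekday without recomputing anything.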
  • 22. Understanding Dimensions
  • 23. Snowflake Schema Basics •What is a Snowflake Schema? •A Star Schema with a little normalization added in •Dimension tables are somewhat normalized •Why use a snowflake schema? •To satisfy the data gathering functionality of more advanced data warehousing / mining tools •To logically separate large dimension tables •To more naturally separate dimensional data •Known customers vs. anonymous customers
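One possible way to picture the added normalization is to split a product dimension so that category details live in their own table. This is only a sketch with hypothetical names (dim_category, dim_product, fact_sales) and made-up rows, using Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked dimension: category attributes are moved out of dim_product.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    sales_amount REAL
);

INSERT INTO dim_category VALUES (1, 'Input devices'), (2, 'Displays');
INSERT INTO dim_product VALUES (1, 'Keyboard', 1), (2, 'Mouse', 1), (3, 'Monitor', 2);
INSERT INTO fact_sales VALUES (1, 100.0), (2, 20.0), (3, 300.0), (1, 50.0);
""")

# Reaching the category now takes two joins instead of one.
sales_by_category = conn.execute("""
SELECT c.category_name, SUM(f.sales_amount)
FROM fact_sales f
JOIN dim_product p  ON p.product_key = f.product_key
JOIN dim_category c ON c.category_key = p.category_key
GROUP BY c.category_name
ORDER BY c.category_name
""").fetchall()

print(sales_by_category)
```

The extra join is the price of the normalization, which is one reason to snowflake only when there is a concrete need.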
  • 24. Snowflake Schema Basics •One main rule concerning the snowflake schema •Don’t use it unless you want to or need to
  • 25. Snowflake Schema Basics
  • 26. Understanding Granularity •What is meant by the term granularity in a DWH? •The level of detail available •What determines granularity? •The level of data loaded into the fact table •For example, per-order numbers vs. daily numbers vs. weekly numbers, etc. •The number and detail level of dimensions •If we want to look into customer details but we don’t have a customer dimension, this data won’t be available
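The effect of the chosen grain can be sketched by loading the same sales at two levels of detail. The table names fact_sales_by_order and fact_sales_daily and the sample rows are hypothetical; the example uses Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Fine grain: one row per order (hypothetical data).
CREATE TABLE fact_sales_by_order (order_id INTEGER, order_date TEXT, sales_amount REAL);
INSERT INTO fact_sales_by_order VALUES
    (1, '2013-09-15', 120.0),
    (2, '2013-09-15',  60.0),
    (3, '2013-09-16',  50.0);

-- Coarse grain: one row per day, derived from the order-level rows.
CREATE TABLE fact_sales_daily AS
    SELECT order_date,
           SUM(sales_amount) AS sales_amount,
           COUNT(*)          AS order_count
    FROM fact_sales_by_order
    GROUP BY order_date;
""")

daily = conn.execute(
    "SELECT order_date, sales_amount, order_count "
    "FROM fact_sales_daily ORDER BY order_date"
).fetchall()

# This question can only be answered from the order-level table;
# the daily table no longer knows about individual orders.
largest_order = conn.execute(
    "SELECT MAX(sales_amount) FROM fact_sales_by_order"
).fetchone()[0]

print(daily)
print(largest_order)
```

The daily table is smaller and faster to query, but any detail below the day level is gone once only the aggregated rows are loaded.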
  • 27. Understanding Granularity •Granularity should be determined during database design •Granularity can also be changed after the database has been created, but this requires much more effort •Such a change may involve •Changing the fact table structure •Possible changes in dimension tables •Changes in data loading
  • 28. Auditing •Data warehouses do not store data as it is created •OLTP databases are populated as business occurs •The source and purpose of the data are generally self-explanatory •Data is added when a transaction occurs •DWHs are populated from OLTP data •Based on various conditions •At various times •From various sources
  • 29. Auditing •Data can be informative based on different aspects •The data itself •The source of the data •The volume of the data •These characteristics usually change over time •Auditing identifies these aspects •Audit data is usually stored in tables •These tables describe the source, the duration of the load, who performed the load, etc.
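An audit table of this kind could be sketched as follows with Python's sqlite3 module. The load_audit table, its columns, the load_with_audit helper, and the source and user names are all hypothetical; they only show one way to record the source, duration, and performer of a load.

```python
import sqlite3
import time
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (order_id INTEGER, sales_amount REAL);

-- Hypothetical audit table: one row per load.
CREATE TABLE load_audit (
    load_id          INTEGER PRIMARY KEY AUTOINCREMENT,
    source_name      TEXT,
    loaded_by        TEXT,
    started_at       TEXT,
    duration_seconds REAL,
    row_count        INTEGER
);
""")

def load_with_audit(conn, source_name, loaded_by, rows):
    """Insert rows into fact_sales and record who loaded what, from where, and how long it took."""
    started_at = datetime.now(timezone.utc).isoformat()
    t0 = time.perf_counter()
    with conn:  # commit the data and its audit row together, or roll both back
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
        duration = time.perf_counter() - t0
        conn.execute(
            "INSERT INTO load_audit "
            "(source_name, loaded_by, started_at, duration_seconds, row_count) "
            "VALUES (?, ?, ?, ?, ?)",
            (source_name, loaded_by, started_at, duration, len(rows)),
        )

load_with_audit(conn, "orders_oltp", "etl_user", [(1, 120.0), (2, 60.0)])

audit_rows = conn.execute(
    "SELECT source_name, loaded_by, row_count FROM load_audit"
).fetchall()
print(audit_rows)
```

Writing the audit row in the same transaction as the data keeps the two consistent: a failed load leaves neither partial facts nor a misleading audit entry.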
  • 30. Auditing •SQL Server Integration Services •Provides SSIS logging