DATA WAREHOUSINGDATA WAREHOUSING SUSHIL KULKARNI
A producer wants to know….                              Which are our                               Which are our         ...
Lot of data everywhereyet ...             • I can’t find the data I need                – data is scattered over the netwo...
What is a Data Warehouse?A single, complete andconsistent store of dataobtained from a variety ofdifferent sources madeava...
What users says...• Data should be integrated  across the enterprise• Summary data has a real  value to the organization• ...
What is Data Warehousing?  A process of transforming  data into information and  making it available to users  in a timely...
Evolution• 60’s: Batch reports   – hard to find and analyze information   – inflexible and expensive, reprogram every new ...
Warehouses are Very Large                           Databases              35%              30%              25%Respondent...
Very Large Data Bases• Terabytes -- 10^12 bytes:    Walmart -- 24 Terabytes• Petabytes -- 10^15 bytes:    Geographic Infor...
Data Warehousing --             It is a process• Technique for assembling and  managing data from various  sources for the...
Data Warehouse• A data warehouse is a  – subject-oriented  – integrated  – time-varying  – non-volatile collection of data...
Data Warehouse Subject-oriented                  Customers: Get                   information of                   differe...
Data Warehouse Subject-oriented                  Students: Get                  information about                  various...
Data Warehouse Subject-oriented• Focusing on the modelling and analysis of data for decision makers, not on daily operatio...
Data Warehouse Subject-oriented                      Enterprise                      “Database”                  Customers...
Data Warehouse :Time - variantUse to study trendsand changes
Data Warehouse :Time - variant• The time horizon for the data warehouse is  significantly longer than that of operational ...
Data Warehouse : Non-volatilecannotupdated byend users
Data Warehouse ArchitectureRelationalDatabases                          Optimized Loader             Extraction   ERP Syst...
Data Mart• A Data Mart is a smaller, more focused Data Warehouse – a mini-warehouse.• A Data Mart typically reflects the b...
Data Warehouse to Data Mart                           Decision                           Support             Data Mart   I...
DATA MARTS•   Create many DM’s•   Limited scopeExamples:1. Financial DM2. Marketing DM3. Supply chain DM
Generic Architecture of Data  (synonym) Transaction data
Transaction (Operational)                Data• Operational (production) systems create  (massive number of) transactions, ...
Operational Summary DataSummaries are for aspecific time period    Other Examples???and utilize thetransaction data fortha...
Decision Support Summary Data• The data that are used to help make  decisions about the business   – Financial Data, such ...
Data Warehouse for Decision          Support• Putting Information technology to help  the knowledge worker make faster and...
Decision Support• Used to manage and control business• Data is historical or point-in-time• Optimized for inquiry rather t...
Database Schema• Database schema defines the structure of data,  not the values of the data (e.g., first name, last  name ...
Logical Database Schema• Describes data in a way that is familiar to  business users
Physical Database Schema• Describes the data the way it will be stored in an  RDBMS which might be different than the way ...
Metadata• General definition: Data about data !!!   – Examples:       • A library’s card catalog (metadata)         descri...
Business Rules• Highest level of abstraction from  operational (transaction) data• Describes why relationships exist and  ...
General Architecture for Data                        Warehousing•   Source systems•   Extraction, (Clean),    Transformati...
DATA WAREHOUSE SCOPEBroad : Required for companies, Very costly, May be divided according to Depts.Narrow: Required for Pe...
Design of a Data Warehouse: A   Business Analysis Framework• Four views regarding the design of a data  warehouse  – Top-d...
Data Warehouse Design Process•   Top-down, bottom-up approaches or a combination of both    – Top-down: Starts with overal...
Multi-Tiered Architecture                               Monitor                                  &          OLAP Server  o...
Three Data Warehouse Models• Enterprise warehouse  – collects all of the information about subjects spanning    the entire...
Data Mining works with          Warehouse Data                       • Data Warehousing provides                         t...
We want to know ...•   Given a database of 100,000 names, which persons are the    least likely to default on their credit...
Application AreasIndustry                 ApplicationFinance                  Credit Card AnalysisInsurance               ...
Data Mining in Use• Data Mining can be used to track fraud• A Supermarket becomes an information  broker• Basketball teams...
Two Systems• Operational System• Information System
Operational Systems• Run the business in real time• Based on up-to-the-second  data• Optimized to handle large  numbers of...
On Line Transaction Process                (OLTP)It refers to a class ofsystems that facilitateand managetransaction-orien...
On Line Transaction Process               (OLTP)OLTP technology is used in anumber of industries, includingbanking, airlin...
What are Operational Systems?• They are OLTP systems• Run mission critical  applications• Need to work with  stringent per...
RDBMS used for OLTP• Database Systems have been  used traditionally for OLTP  –   clerical data processing tasks  –   deta...
Operational Summary DataSummaries are for aspecific time period   Other Examples???and utilize thetransaction data forthat...
Examples of Operational DataData        Industry Usage             Technology             VolumesCustomer    All        Tr...
So, what’s different?
Application-Orientation vs.      Subject-OrientationApplication-Orientation        Subject-Orientation               Opera...
OLTP vs. Data Warehouse• OLTP systems are tuned for known transactions  and workloads while workload is not known a  prior...
OLTP vs Data Warehouse• OLTP                       • Warehouse (DSS)  –   Application Oriented     – Subject Oriented  –  ...
OLTP vs Data Warehouse• OLTP                        • Data Warehouse  – Performance Sensitive       – Performance relaxed ...
OLTP vs Data Warehouse• OLTP                    • Data Warehouse  – Transaction             – Query throughput is    throu...
To summarize ...• OLTP Systems are  used to “run” a  business                     • The Data Warehouse                    ...
Why Separate Data                Warehouse?• Performance  – Op dbs designed & tuned for known txs & workloads.  – Complex ...
INFORMATION SYSTEMS• Designed to support  decision-making based on   1. Historical data   2. Prediction data.• Designed fo...
INFORMATION SYSTEMS
DIFFERENCECharacteristics   Operational Systems       Informational SystemsPurpose           Real time data entry      Rea...
THANKS!
Upcoming SlideShare
Loading in …5
×

Ch 1 intro_dw

1,163 views

Published on

This gives an idea about Data Warehouse

Published in: Technology, Business
1 Comment
0 Likes
Statistics
Notes
  • Be the first to like this

No Downloads
Views
Total views
1,163
On SlideShare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
49
Comments
1
Likes
0
Embeds 0
No embeds

No notes for slide

Ch 1 intro_dw

  1. 1. DATA WAREHOUSINGDATA WAREHOUSING SUSHIL KULKARNI
  2. 2. A producer wants to know…. Which are our Which are our lowest/highest margin lowest/highest margin Who are my customers Who are my customers customers ? customers ? and what products and what products What is the most What is the most are they buying? are they buying?effective distribution effective distribution channel? channel? What impact will What impact will Which customers Which customers new products/services new products/services are most likely to go are most likely to go have on revenue have on revenue to the competition ? to the competition ? and margins? and margins? What product prom- What product prom- -otions have the biggest -otions have the biggest impact on revenue? impact on revenue?
  3. 3. Lot of data everywhereyet ... • I can’t find the data I need – data is scattered over the network – many versions, subtle differences • I can’t get the data I need – need an expert to get the data • I can’t understand the data I found – available data poorly documented • I can’t use the data I found – results are unexpected – data needs to be transformed from one form to other
  4. 4. What is a Data Warehouse?A single, complete andconsistent store of dataobtained from a variety ofdifferent sources madeavailable to end users ina what they canunderstand and use in abusiness context.[Barry Devlin]
  5. 5. What users says...• Data should be integrated across the enterprise• Summary data has a real value to the organization• Historical data holds the key to understanding data over time• What-if capabilities are required
  6. 6. What is Data Warehousing? A process of transforming data into information and making it available to users in a timely enough manner to make a difference[Forrester Research, April 1996] Data
  7. 7. Evolution• 60’s: Batch reports – hard to find and analyze information – inflexible and expensive, reprogram every new request• 70’s: Terminal-based DSS and EIS (executive information systems) – still inflexible, not integrated with desktop tools• 80’s: Desktop data access and analysis tools – query tools, spreadsheets, GUIs – easier to use, but only access operational databases• 90’s: Data warehousing with integrated OLAP engines and tools
  8. 8. Warehouses are Very Large Databases 35% 30% 25%Respondents 20% 15% 10% Initial 5% Projected 2Q96 0% Source: META Group, Inc. 5GB 10-19GB 50-99GB 250-499GB 5-9GB 20-49GB 100-249GB 500GB-1TB
  9. 9. Very Large Data Bases• Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes• Petabytes -- 10^15 bytes: Geographic Information Systems• Exabytes -- 10^18 bytes: National Medical Records• Zettabytes -- 10^21 bytes: Weather images• Zottabytes -- 10^24 bytes: Intelligence Agency Videos
  10. 10. Data Warehousing -- It is a process• Technique for assembling and managing data from various sources for the purpose of answering business questions. Thus making decisions that were not previous possible• A decision support database maintained separately from the organization’s operational database
  11. 11. Data Warehouse• A data warehouse is a – subject-oriented – integrated – time-varying – non-volatile collection of data that is used primarily in organizational decision making. -- Bill Inmon, Building the Data Warehouse 1996
  12. 12. Data Warehouse Subject-oriented Customers: Get information of different prices of a beer Farmers: Harvest information from known access paths
  13. 13. Data Warehouse Subject-oriented Students: Get information about various universities in U.K. Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data
  14. 14. Data Warehouse Subject-oriented• Focusing on the modelling and analysis of data for decision makers, not on daily operations or transaction processing• Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
  15. 15. Data Warehouse Subject-oriented Enterprise “Database” Customers Orders Transactions Vendors Etc… Data Miners: Etc… • “Farmers” – they know • “Explorers” - unpredictable Copied, organized summarized Data Data Mining Warehouse
  16. 16. Data Warehouse :Time - variantUse to study trendsand changes
  17. 17. Data Warehouse :Time - variant• The time horizon for the data warehouse is significantly longer than that of operational systems – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) – Operational database: current value data• Every key structure in the data warehouse – Contains an element of time explicitly or implicitly, while the key of operational data may or may not contain “time element”
  18. 18. Data Warehouse : Non-volatilecannotupdated byend users
  19. 19. Data Warehouse ArchitectureRelationalDatabases Optimized Loader Extraction ERP Systems Cleansing Data Warehouse Engine AnalyzePurchased Query Data Legacy Data Metadata Repository
  20. 20. Data Mart• A Data Mart is a smaller, more focused Data Warehouse – a mini-warehouse.• A Data Mart typically reflects the business rules of a specific business unit within an enterprise.
  21. 21. Data Warehouse to Data Mart Decision Support Data Mart Information Decision Data Support Data Mart Information Warehouse Decision Support Data Mart Information
  22. 22. DATA MARTS• Create many DM’s• Limited scopeExamples:1. Financial DM2. Marketing DM3. Supply chain DM
  23. 23. Generic Architecture of Data (synonym) Transaction data
  24. 24. Transaction (Operational) Data• Operational (production) systems create (massive number of) transactions, such as sales, purchases, deposits, withdrawals, returns, refunds, phone calls, toll roads, web site “hits”, etc…• Transactions are the base level of data – the raw material for understanding customer behavior• Unfortunately, operational systems change due to changing business needs• Fortunately, operational systems can usually be changed to support changing business needs• Data warehousing strategies need to be aware of operational system changes
  25. 25. Operational Summary DataSummaries are for aspecific time period Other Examples???and utilize thetransaction data forthat time period
  26. 26. Decision Support Summary Data• The data that are used to help make decisions about the business – Financial Data, such as: • Income Statements (Profit & Loss) • Balance Sheets (Assets – Liabilities = Net Worth) – Sales summaries – Other examples???• Data warehouses maintain this type of data, however financial data “of record” (for audit purposes) usually comes from databases and not the data warehouse (confusing???)• Generally, it is a bad idea to use the same system for analytic and operational purposes
  27. 27. Data Warehouse for Decision Support• Putting Information technology to help the knowledge worker make faster and better decisions – Which of my customers are most likely to go to the competition? – What product promotions have the biggest impact on revenue? – How did the share price of software companies correlate with profits over last 10 years?
  28. 28. Decision Support• Used to manage and control business• Data is historical or point-in-time• Optimized for inquiry rather than update• Use of the system is loosely defined and can be ad-hoc• Used by managers and end-users to understand the business and make judgements
  29. 29. Database Schema• Database schema defines the structure of data, not the values of the data (e.g., first name, last name = structure; Ron Norman = values of the data)• In RDBMS: – Columns = fields = attributes (A,B,C) – Rows = records = tuples (1-7)
  30. 30. Logical Database Schema• Describes data in a way that is familiar to business users
  31. 31. Physical Database Schema• Describes the data the way it will be stored in an RDBMS which might be different than the way the logical shows it
  32. 32. Metadata• General definition: Data about data !!! – Examples: • A library’s card catalog (metadata) describes publications (data) • A file system maintains permissions (metadata) about files (data)• A form of system documentation including: – Values legally allowed in a field (e.g., AZ, CA, OR, UT, WA, etc.) – Description of the contents of each field (e.g., start date) – Date when data were loaded – Indication of currency of the data (last updated) – Mappings between systems (e.g., A.this = B.that)• Invaluable, otherwise have to research to find it
  33. 33. Business Rules• Highest level of abstraction from operational (transaction) data• Describes why relationships exist and how they are applied• Examples: – Need to have 3 forms of ID for credit – Only allow a maximum daily withdrawal of $200 – After the 3rd log-in attempt, lock the log-in screen – Accept no bills larger than $20 – Others???
  34. 34. General Architecture for Data Warehousing• Source systems• Extraction, (Clean), Transformation, & Load (ETL)• Central repository• Metadata repository• Data marts• Operational feedback• End users (business)
  35. 35. DATA WAREHOUSE SCOPEBroad : Required for companies, Very costly, May be divided according to Depts.Narrow: Required for Personal information
  36. 36. Design of a Data Warehouse: A Business Analysis Framework• Four views regarding the design of a data warehouse – Top-down view • allows selection of the relevant information necessary for the data warehouse – Data source view • exposes the information being captured, stored, and managed by operational systems – Data warehouse view • consists of fact tables and dimension tables – Business query view • sees the perspectives of data in the warehouse from the view of end-user
  37. 37. Data Warehouse Design Process• Top-down, bottom-up approaches or a combination of both – Top-down: Starts with overall design and planning – Bottom-up: Starts with experiments and prototypes (rapid)• From software engineering point of view – Waterfall: structured and systematic analysis at each step before proceeding to the next – Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around• Typical data warehouse design process – Choose a business process to model, e.g., orders, invoices, etc. – Choose the grain (atomic level of data) of the business process – Choose the dimensions that will apply to each fact table record – Choose the measure that will populate each fact table record
  38. 38. Multi-Tiered Architecture Monitor & OLAP Server other Metadata sources Integrator Analysis Operational Extract Query DBs Transform Data Serve Reports Load Refresh Warehouse Data mining Data MartsData Sources Data Storage OLAP Engine Front-End Tools
  39. 39. Three Data Warehouse Models• Enterprise warehouse – collects all of the information about subjects spanning the entire organization• Data Mart – a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart • Independent vs. dependent (directly from warehouse) data mart• Virtual warehouse – A set of views over operational databases – Only some of the possible summary views may be materialized
  40. 40. Data Mining works with Warehouse Data • Data Warehousing provides the Enterprise with a memory• Data Mining provides the Enterprise with intelligence
  41. 41. We want to know ...• Given a database of 100,000 names, which persons are the least likely to default on their credit cards?• Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?• If I raise the price of my product by Rs. 2, what is the effect on my ROI?• If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?• If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?• Which of my customers are likely to be the most loyal? Data Mining helps extract such information
  42. 42. Application AreasIndustry ApplicationFinance Credit Card AnalysisInsurance Claims, Fraud AnalysisTelecommunication Call record analysisTransport Logistics managementConsumer goods promotion analysisData Service providers Value added dataUtilities Power usage analysis
  43. 43. Data Mining in Use• Data Mining can be used to track fraud• A Supermarket becomes an information broker• Basketball teams use it to track game strategy• Cross Selling• Warranty Claims Routing• Holding on to Good Customers• Weeding out Bad Customers
  44. 44. Two Systems• Operational System• Information System
  45. 45. Operational Systems• Run the business in real time• Based on up-to-the-second data• Optimized to handle large numbers of simple read/write transactions• Optimized for fast response to predefined transactions• Used by people who deal with customers, products -- clerks, salespeople etc.• They are increasingly used by customers
  46. 46. On Line Transaction Process (OLTP)It refers to a class ofsystems that facilitateand managetransaction-orientedapplications, typicallyfor data entry andretrieval transactionprocessing
  47. 47. On Line Transaction Process (OLTP)OLTP technology is used in anumber of industries, includingbanking, airlines, mail order,supermarkets, and manufacturing.Applications include electronicbanking, order processing,employee time clock systems, e-commerce, and eTrading. Themost widely used OLTP system isprobably IBMs CICS.
  48. 48. What are Operational Systems?• They are OLTP systems• Run mission critical applications• Need to work with stringent performance requirements for routine tasks• Used to run a business!
  49. 49. RDBMS used for OLTP• Database Systems have been used traditionally for OLTP – clerical data processing tasks – detailed, up to date data – structured repetitive tasks – read/update a few records – isolation, recovery and integrity are critical
  50. 50. Operational Summary DataSummaries are for aspecific time period Other Examples???and utilize thetransaction data forthat time period
  51. 51. Examples of Operational DataData Industry Usage Technology VolumesCustomer All Track Legacy application, flat Small-mediumFile Customer files, main frames DetailsAccount Finance Control Legacy applications, LargeBalance account hierarchical databases, activities mainframePoint-of- Retail Generate ERP, Client/Server, Very LargeSale data bills, manage relational databases stockCall Telecomm- Billing Legacy application, Very LargeRecord unications hierarchical database, mainframeProduction Manufact- Control ERP, MediumRecord uring Production relational databases, AS/400
  52. 52. So, what’s different?
  53. 53. Application-Orientation vs. Subject-OrientationApplication-Orientation Subject-Orientation Operational Data Database Warehouse Credit Loans Customer Card Vendor Trust Product Savings Activity
  54. 54. OLTP vs. Data Warehouse• OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse• Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries) – e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
  55. 55. OLTP vs Data Warehouse• OLTP • Warehouse (DSS) – Application Oriented – Subject Oriented – Used to run business – Used to analyze – Detailed data business – Current up to date – Summarized and refined – Isolated Data – Snapshot data – Repetitive access – Integrated Data – Clerical User – Ad-hoc access – Knowledge User (Manager)
  56. 56. OLTP vs Data Warehouse• OLTP • Data Warehouse – Performance Sensitive – Performance relaxed – Few Records accessed at – Large volumes accessed a time (tens) at a time(millions) – Mostly Read (Batch – Read/Update Access Update) – Redundancy present – No data redundancy – Database Size 100 – Database Size 100MB - GB - few terabytes 100 GB
  57. 57. OLTP vs Data Warehouse• OLTP • Data Warehouse – Transaction – Query throughput is throughput is the the performance performance metric metric – Thousands of users – Hundreds of users – Managed in entirety – Managed by subsets
  58. 58. To summarize ...• OLTP Systems are used to “run” a business • The Data Warehouse helps to “optimize” the business
  59. 59. Why Separate Data Warehouse?• Performance – Op dbs designed & tuned for known txs & workloads. – Complex OLAP queries would degrade perf. for op txs. – Special data organization, access & implementation methods needed for multidimensional views & queries.• Function – Missing data: Decision support requires historical data, which op dbs do not typically maintain. – Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources. – Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.
  60. 60. INFORMATION SYSTEMS• Designed to support decision-making based on 1. Historical data 2. Prediction data.• Designed for complex queries or data-mining applications. Examples: 1. Sales trend analysis, 2. Customer segmentation 3. Human resources planning
  61. 61. INFORMATION SYSTEMS
  62. 62. DIFFERENCECharacteristics Operational Systems Informational SystemsPurpose Real time data entry Real and analyze historical data.Primary users Clerks, sales-persons, Managers, business administrations analysts, customersScope of usage Narrow, planned, and Broad, ad hoc, complex simple updates and queries and analysis queriesDesign goal Performance throughput, Ease of flexible access availability and useVolume Many, constant updates Periodical batch updates and queries on one or a and queries requiring few table rows many or all rows
  63. 63. THANKS!

×