Ch 1 intro_dw
This gives an idea about Data Warehouse

Published in: Technology, Business

Transcript

  • 1. DATA WAREHOUSING SUSHIL KULKARNI
  • 2. A producer wants to know… • Which are our lowest/highest margin customers? • Who are my customers and what products are they buying? • What is the most effective distribution channel? • What impact will new products/services have on revenue and margins? • Which customers are most likely to go to the competition? • What product promotions have the biggest impact on revenue?
  • 3. Lots of data everywhere, yet ... • I can’t find the data I need – data is scattered over the network – many versions, subtle differences • I can’t get the data I need – need an expert to get the data • I can’t understand the data I found – available data poorly documented • I can’t use the data I found – results are unexpected – data needs to be transformed from one form to another
  • 4. What is a Data Warehouse? A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context. [Barry Devlin]
  • 5. What users say... • Data should be integrated across the enterprise • Summary data has a real value to the organization • Historical data holds the key to understanding data over time • What-if capabilities are required
  • 6. What is Data Warehousing? A process of transforming data into information and making it available to users in a timely enough manner to make a difference. [Forrester Research, April 1996]
  • 7. Evolution • 60’s: Batch reports – hard to find and analyze information – inflexible and expensive, reprogram every new request • 70’s: Terminal-based DSS and EIS (executive information systems) – still inflexible, not integrated with desktop tools • 80’s: Desktop data access and analysis tools – query tools, spreadsheets, GUIs – easier to use, but only access operational databases • 90’s: Data warehousing with integrated OLAP engines and tools
  • 8. Warehouses are Very Large Databases [Bar chart: percentage of respondents (0% to 35%) by warehouse size, initial vs. projected 2Q96. Size bands: 5GB, 5-9GB, 10-19GB, 20-49GB, 50-99GB, 100-249GB, 250-499GB, 500GB-1TB. Source: META Group, Inc.]
  • 9. Very Large Data Bases • Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes • Petabytes -- 10^15 bytes: Geographic Information Systems • Exabytes -- 10^18 bytes: National Medical Records • Zettabytes -- 10^21 bytes: Weather images • Yottabytes -- 10^24 bytes: Intelligence Agency Videos
  • 10. Data Warehousing -- It is a process • Technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible • A decision support database maintained separately from the organization’s operational database
  • 11. Data Warehouse • A data warehouse is a – subject-oriented – integrated – time-varying – non-volatile collection of data that is used primarily in organizational decision making. -- Bill Inmon, Building the Data Warehouse 1996
  • 12. Data Warehouse: Subject-oriented • Customers: Get information on the different prices of a beer • Farmers: Harvest information from known access paths
  • 13. Data Warehouse: Subject-oriented • Students: Get information about various universities in the U.K. • Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data
  • 14. Data Warehouse: Subject-oriented • Focusing on the modelling and analysis of data for decision makers, not on daily operations or transaction processing • Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
  • 15. Data Warehouse: Subject-oriented [Diagram: the enterprise “database” (customers, orders, transactions, vendors, etc.) is copied, organized, and summarized into the data warehouse, which feeds data mining] • Data miners: – “Farmers”: they know – “Explorers”: unpredictable
  • 16. Data Warehouse: Time-variant • Used to study trends and changes
  • 17. Data Warehouse: Time-variant • The time horizon for the data warehouse is significantly longer than that of operational systems – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) – Operational database: current value data • Every key structure in the data warehouse – Contains an element of time explicitly or implicitly, while the key of operational data may or may not contain a “time element”
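The contrast on this slide, current-value operational data versus warehouse keys that carry a time element, can be sketched with Python's built-in `sqlite3` module. The table names, column names, and balances below are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Operational table (hypothetical): one row per account, current value only.
CREATE TABLE account_current (account_id INTEGER PRIMARY KEY, balance REAL);

-- Warehouse table (hypothetical): the key includes a time element,
-- so every periodic snapshot is kept.
CREATE TABLE account_history (
    account_id    INTEGER,
    snapshot_date TEXT,
    balance       REAL,
    PRIMARY KEY (account_id, snapshot_date)
);

INSERT INTO account_current VALUES (1, 120.0);
INSERT INTO account_history VALUES
    (1, '2023-01-31', 100.0),
    (1, '2023-02-28', 90.0),
    (1, '2023-03-31', 120.0);
""")

# The operational system can only answer "what is the balance now?".
current = conn.execute(
    "SELECT balance FROM account_current WHERE account_id = 1"
).fetchone()[0]

# The warehouse can answer "how did the balance change over time?".
trend = conn.execute(
    "SELECT snapshot_date, balance FROM account_history "
    "WHERE account_id = 1 ORDER BY snapshot_date"
).fetchall()

print(current)
print(trend)
```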
  • 18. Data Warehouse: Non-volatile • Cannot be updated by end users
  • 19. Data Warehouse Architecture [Diagram: sources (relational databases, ERP systems, purchased data, legacy data) pass through extraction and cleansing and an optimized loader into the data warehouse engine, which is backed by a metadata repository and serves analysis and query]
  • 20. Data Mart • A Data Mart is a smaller, more focused Data Warehouse – a mini-warehouse. • A Data Mart typically reflects the business rules of a specific business unit within an enterprise.
  • 21. Data Warehouse to Data Mart [Diagram: the data warehouse feeds several data marts, each of which provides information for decision support]
  • 22. DATA MARTS • Create many DMs • Limited scope • Examples: 1. Financial DM 2. Marketing DM 3. Supply chain DM
  • 23. Generic Architecture of Data (synonym) Transaction data
  • 24. Transaction (Operational) Data • Operational (production) systems create (massive number of) transactions, such as sales, purchases, deposits, withdrawals, returns, refunds, phone calls, toll roads, web site “hits”, etc… • Transactions are the base level of data – the raw material for understanding customer behavior • Unfortunately, operational systems change due to changing business needs • Fortunately, operational systems can usually be changed to support changing business needs • Data warehousing strategies need to be aware of operational system changes
  • 25. Operational Summary Data • Summaries are for a specific time period and utilize the transaction data for that time period • Other examples???
  • 26. Decision Support Summary Data • The data that are used to help make decisions about the business – Financial Data, such as: • Income Statements (Profit & Loss) • Balance Sheets (Assets – Liabilities = Net Worth) – Sales summaries – Other examples??? • Data warehouses maintain this type of data, however financial data “of record” (for audit purposes) usually comes from databases and not the data warehouse (confusing???) • Generally, it is a bad idea to use the same system for analytic and operational purposes
  • 27. Data Warehouse for Decision Support • Putting information technology to work to help the knowledge worker make faster and better decisions – Which of my customers are most likely to go to the competition? – What product promotions have the biggest impact on revenue? – How did the share price of software companies correlate with profits over the last 10 years?
  • 28. Decision Support • Used to manage and control business • Data is historical or point-in-time • Optimized for inquiry rather than update • Use of the system is loosely defined and can be ad-hoc • Used by managers and end-users to understand the business and make judgements
  • 29. Database Schema • Database schema defines the structure of data, not the values of the data (e.g., first name, last name = structure; Ron Norman = values of the data) • In RDBMS: – Columns = fields = attributes (A,B,C) – Rows = records = tuples (1-7)
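The distinction between structure and values can be seen directly in a small relational table. This is a minimal sketch using Python's built-in `sqlite3` module; the table name `person` and its column names are assumptions made for illustration, while "Ron Norman" is the slide's own example value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema: the structure of the data (hypothetical table and column names).
conn.execute("CREATE TABLE person (first_name TEXT, last_name TEXT)")

# Values: the data stored under that structure (one row = record = tuple).
conn.execute("INSERT INTO person VALUES (?, ?)", ("Ron", "Norman"))

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(person)")]
rows = conn.execute("SELECT first_name, last_name FROM person").fetchall()

print(columns)  # the structure: columns = fields = attributes
print(rows)     # the values: rows = records = tuples
```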
  • 30. Logical Database Schema • Describes data in a way that is familiar to business users
  • 31. Physical Database Schema • Describes the data the way it will be stored in an RDBMS, which might be different from the way the logical schema shows it
  • 32. Metadata • General definition: Data about data! – Examples: • A library’s card catalog (metadata) describes publications (data) • A file system maintains permissions (metadata) about files (data) • A form of system documentation including: – Values legally allowed in a field (e.g., AZ, CA, OR, UT, WA, etc.) – Description of the contents of each field (e.g., start date) – Date when data were loaded – Indication of currency of the data (last updated) – Mappings between systems (e.g., A.this = B.that) • Invaluable, otherwise have to research to find it
  • 33. Business Rules • Highest level of abstraction from operational (transaction) data • Describes why relationships exist and how they are applied • Examples: – Need to have 3 forms of ID for credit – Only allow a maximum daily withdrawal of $200 – After the 3rd log-in attempt, lock the log-in screen – Accept no bills larger than $20 – Others???
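Two of the example rules on this slide can be written as small checks, which shows how a business rule sits above the transaction data it governs. This is only a sketch: the function names are hypothetical, and it assumes that "after the 3rd log-in attempt" means three failed attempts.

```python
# Limits taken from the slide's examples.
MAX_DAILY_WITHDRAWAL = 200  # dollars
MAX_LOGIN_ATTEMPTS = 3


def can_withdraw(withdrawn_today, amount):
    """Rule: only allow a maximum daily withdrawal of $200."""
    return withdrawn_today + amount <= MAX_DAILY_WITHDRAWAL


def is_locked(failed_attempts):
    """Rule: after the 3rd log-in attempt, lock the log-in screen.

    Assumption: the rule counts failed attempts.
    """
    return failed_attempts >= MAX_LOGIN_ATTEMPTS


print(can_withdraw(150, 50))  # exactly at the limit, so allowed
print(can_withdraw(150, 60))  # would exceed the limit, so refused
print(is_locked(2))
print(is_locked(3))
```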
  • 34. General Architecture for Data Warehousing • Source systems • Extraction, (Clean), Transformation, & Load (ETL) • Central repository • Metadata repository • Data marts • Operational feedback • End users (business)
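The extract, clean, transform, and load steps named on this slide can be sketched as four small functions feeding an in-memory SQLite table that stands in for the central repository. The source rows, field names, and the `sales` table are hypothetical, and a real pipeline would also log rejected rows and record metadata about each load.

```python
import sqlite3


def extract():
    # Hypothetical rows as they might arrive from a source system.
    return [
        {"id": 1, "state": " ca ", "amount": "10.50"},
        {"id": 2, "state": "WA", "amount": "n/a"},  # bad amount
        {"id": 3, "state": "or", "amount": "4"},
    ]


def clean(rows):
    # Drop rows whose amount cannot be parsed as a number.
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        cleaned.append({**row, "amount": amount})
    return cleaned


def transform(rows):
    # Put state codes into one consistent form (trimmed, upper case).
    return [(r["id"], r["state"].strip().upper(), r["amount"]) for r in rows]


def load(conn, rows):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, state TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract())))
loaded = conn.execute("SELECT id, state, amount FROM sales ORDER BY id").fetchall()
print(loaded)
```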
  • 35. DATA WAREHOUSE SCOPE • Broad: Required for companies, very costly, may be divided according to departments • Narrow: Required for personal information
  • 36. Design of a Data Warehouse: A Business Analysis Framework • Four views regarding the design of a data warehouse – Top-down view • allows selection of the relevant information necessary for the data warehouse – Data source view • exposes the information being captured, stored, and managed by operational systems – Data warehouse view • consists of fact tables and dimension tables – Business query view • sees the perspectives of data in the warehouse from the view of end-user
  • 37. Data Warehouse Design Process • Top-down, bottom-up approaches or a combination of both – Top-down: Starts with overall design and planning – Bottom-up: Starts with experiments and prototypes (rapid) • From software engineering point of view – Waterfall: structured and systematic analysis at each step before proceeding to the next – Spiral: rapid generation of increasingly functional systems, short turnaround time • Typical data warehouse design process – Choose a business process to model, e.g., orders, invoices, etc. – Choose the grain (atomic level of data) of the business process – Choose the dimensions that will apply to each fact table record – Choose the measure that will populate each fact table record
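The four design steps at the end of this slide (business process, grain, dimensions, measures) can be illustrated with a tiny star schema in SQLite. Everything here is an assumption made for the example: the process is orders, the grain is one row per product per day, the dimensions are date and product, and the measures are quantity and amount.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions (hypothetical): the context for each fact record.
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);

-- Fact table for the "orders" business process.
-- Grain: one row per product per day.
-- Measures: quantity and amount.
CREATE TABLE fact_orders (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    amount      REAL
);

INSERT INTO dim_date VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO fact_orders VALUES (1, 1, 2, 20.0), (1, 2, 1, 15.0), (2, 1, 3, 30.0);
""")

# A typical query joins the fact table to a dimension and aggregates a measure.
totals = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_orders AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(totals)
```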
  • 38. Multi-Tiered Architecture [Diagram, four tiers: Data Sources (operational DBs, other sources) are extracted, transformed, loaded, and refreshed, with a monitor & integrator and metadata, into Data Storage (data warehouse, data marts), which serves the OLAP Engine (OLAP server), which in turn serves the Front-End Tools (analysis, query, reports, data mining)]
  • 39. Three Data Warehouse Models • Enterprise warehouse – collects all of the information about subjects spanning the entire organization • Data Mart – a subset of corporate-wide data that is of value to specific groups of users. Its scope is confined to specific, selected groups, such as a marketing data mart • Independent vs. dependent (directly from warehouse) data mart • Virtual warehouse – A set of views over operational databases – Only some of the possible summary views may be materialized
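The virtual warehouse model, a set of views over operational databases with only some summaries materialized, can be sketched in SQLite. The `orders` table and view name are hypothetical. SQLite has no built-in materialized views, so materialization is imitated here by copying the view's result into an ordinary table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical operational table.
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'north', 10.0), (2, 'north', 5.0), (3, 'south', 7.5);

-- Virtual warehouse: a summary view computed from operational data at query time.
CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total FROM orders GROUP BY region;
""")

summary = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()

# "Materializing" the view: store its result so later reads skip the aggregation.
conn.execute(
    "CREATE TABLE sales_by_region_mat AS SELECT region, total FROM sales_by_region"
)
materialized = conn.execute(
    "SELECT region, total FROM sales_by_region_mat ORDER BY region"
).fetchall()

print(summary)
print(materialized)
```

The stored copy goes stale when `orders` changes, which is one reason only some of the possible summary views are worth materializing.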
  • 40. Data Mining works with Warehouse Data • Data Warehousing provides the Enterprise with a memory • Data Mining provides the Enterprise with intelligence
  • 41. We want to know ... • Given a database of 100,000 names, which persons are the least likely to default on their credit cards? • Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer? • If I raise the price of my product by Rs. 2, what is the effect on my ROI? • If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result? • If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues? • Which of my customers are likely to be the most loyal? Data Mining helps extract such information
  • 42. Application Areas
      Industry | Application
      Finance | Credit card analysis
      Insurance | Claims, fraud analysis
      Telecommunication | Call record analysis
      Transport | Logistics management
      Consumer goods | Promotion analysis
      Data service providers | Value added data
      Utilities | Power usage analysis
  • 43. Data Mining in Use • Data Mining can be used to track fraud • A Supermarket becomes an information broker • Basketball teams use it to track game strategy • Cross Selling • Warranty Claims Routing • Holding on to Good Customers • Weeding out Bad Customers
  • 44. Two Systems • Operational System • Information System
  • 45. Operational Systems • Run the business in real time • Based on up-to-the-second data • Optimized to handle large numbers of simple read/write transactions • Optimized for fast response to predefined transactions • Used by people who deal with customers, products -- clerks, salespeople etc. • They are increasingly used by customers
  • 46. On Line Transaction Processing (OLTP) • It refers to a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing
  • 47. On Line Transaction Processing (OLTP) • OLTP technology is used in a number of industries, including banking, airlines, mail order, supermarkets, and manufacturing. Applications include electronic banking, order processing, employee time clock systems, e-commerce, and eTrading. The most widely used OLTP system is probably IBM's CICS.
  • 48. What are Operational Systems? • They are OLTP systems • Run mission critical applications • Need to work with stringent performance requirements for routine tasks • Used to run a business!
  • 49. RDBMS used for OLTP • Database Systems have been used traditionally for OLTP – clerical data processing tasks – detailed, up to date data – structured repetitive tasks – read/update a few records – isolation, recovery and integrity are critical
  • 50. Operational Summary Data • Summaries are for a specific time period and utilize the transaction data for that time period • Other examples???
  • 51. Examples of Operational Data
      Data | Industry | Usage | Technology | Volumes
      Customer file | All | Track customer details | Legacy application, flat files, mainframes | Small-medium
      Account balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
      Point-of-sale data | Retail | Generate bills, manage stock | ERP, client/server, relational databases | Very large
      Call record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very large
      Production record | Manufacturing | Control production | ERP, relational databases, AS/400 | Medium
  • 52. So, what’s different?
  • 53. Application-Orientation vs. Subject-Orientation [Diagram: Application-orientation (operational database): Loans, Credit Card, Trust, Savings. Subject-orientation (data warehouse): Customer, Vendor, Product, Activity]
  • 54. OLTP vs. Data Warehouse • OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse • Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries) – e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
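The slide's example question combines three dimensions (time of day, city, month) with one aggregated measure. A minimal sketch of that query in SQLite follows; the `calls` table, its columns, and the sample rows are hypothetical, and "9AM-5PM" is interpreted as calls starting from 09:00 up to but not including 17:00.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical call records: city, start time, and amount charged.
CREATE TABLE calls (city TEXT, started_at TEXT, amount REAL);
INSERT INTO calls VALUES
    ('Pune',   '2023-12-04 10:15:00', 12.0),
    ('Pune',   '2023-12-11 16:30:00',  8.0),
    ('Pune',   '2023-12-11 20:00:00', 50.0),  -- outside 9AM-5PM
    ('Pune',   '2023-11-20 11:00:00', 30.0),  -- not December
    ('Mumbai', '2023-12-05 12:00:00', 40.0);  -- not Pune
""")

# Filter on place, month, and hour of day, then average the measure.
avg_amount = conn.execute("""
    SELECT AVG(amount)
    FROM calls
    WHERE city = 'Pune'
      AND strftime('%m', started_at) = '12'
      AND CAST(strftime('%H', started_at) AS INTEGER) BETWEEN 9 AND 16
""").fetchone()[0]

print(avg_amount)  # (12.0 + 8.0) / 2
```

On a table of this size a full scan is fine. The slide's point is that at warehouse scale such ad-hoc filters across several dimensions need data organization and access methods different from those tuned for predefined OLTP transactions.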
  • 55. OLTP vs Data Warehouse
      OLTP | Warehouse (DSS)
      Application oriented | Subject oriented
      Used to run business | Used to analyze business
      Detailed data | Summarized and refined
      Current, up to date | Snapshot data
      Isolated data | Integrated data
      Repetitive access | Ad-hoc access
      Clerical user | Knowledge user (manager)
  • 56. OLTP vs Data Warehouse
      OLTP | Data Warehouse
      Performance sensitive | Performance relaxed
      Few records accessed at a time (tens) | Large volumes accessed at a time (millions)
      Read/update access | Mostly read (batch update)
      No data redundancy | Redundancy present
      Database size 100 MB - 100 GB | Database size 100 GB - few terabytes
  • 57. OLTP vs Data Warehouse
      OLTP | Data Warehouse
      Transaction throughput is the performance metric | Query throughput is the performance metric
      Thousands of users | Hundreds of users
      Managed in entirety | Managed by subsets
  • 58. To summarize ... • OLTP Systems are used to “run” a business • The Data Warehouse helps to “optimize” the business
  • 59. Why Separate Data Warehouse? • Performance – Op dbs designed & tuned for known txs & workloads. – Complex OLAP queries would degrade perf. for op txs. – Special data organization, access & implementation methods needed for multidimensional views & queries. • Function – Missing data: Decision support requires historical data, which op dbs do not typically maintain. – Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources. – Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.
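The data quality point, that sources use inconsistent codes which must be reconciled before consolidation, can be sketched with a simple lookup table. The two source lists and the spellings in the mapping are hypothetical examples, not taken from any real system.

```python
# Hypothetical spellings of state values found across source systems,
# mapped to one agreed warehouse code.
STATE_CODES = {
    "ca": "CA", "calif.": "CA", "california": "CA",
    "wa": "WA", "washington": "WA",
}


def reconcile_state(raw):
    """Map a source-specific state value to the warehouse code."""
    return STATE_CODES.get(raw.strip().lower(), "UNKNOWN")


source_a = ["CA", "wa"]                                    # already uses codes
source_b = ["California", " Washington ", "Calif.", "??"]  # free-text names

# Consolidation: values from both sources end up in one representation.
consolidated = [reconcile_state(value) for value in source_a + source_b]
print(consolidated)
```

Values that cannot be mapped are flagged as `UNKNOWN` rather than guessed, so they can be reviewed instead of silently loaded.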
  • 60. INFORMATION SYSTEMS • Designed to support decision-making based on 1. Historical data 2. Prediction data. • Designed for complex queries or data-mining applications. • Examples: 1. Sales trend analysis 2. Customer segmentation 3. Human resources planning
  • 61. INFORMATION SYSTEMS
  • 62. DIFFERENCE
      Characteristics | Operational Systems | Informational Systems
      Purpose | Real-time data entry | Read and analyze historical data
      Primary users | Clerks, salespersons, administrators | Managers, business analysts, customers
      Scope of usage | Narrow, planned, and simple updates and queries | Broad, ad hoc, complex queries and analysis
      Design goal | Performance: throughput, availability | Ease of flexible access and use
      Volume | Many, constant updates and queries on one or a few table rows | Periodic batch updates and queries requiring many or all rows
  • 63. THANKS!