02. Data Warehouse and OLAP


Course materials from Mr. Yudho Giri Sucahyo (MTI UI). Uploaded by Achmad Solichin (http://hotnewsarchive.info)

Published in: Education, Technology, Business

  1. Data Warehousing and OLAP
     Lecture 2/DMBI/IKI83403T/MTI/UI
     Yudho Giri Sucahyo, Ph.D, CISA (yudho@cs.ui.ac.id)
     Faculty of Computer Science, University of Indonesia

     Objectives
     - Motivation: Why data warehouse?
     - What is a data warehouse?
     - Why separate DW?
     - Conceptual modeling of DW
     - Data Mart
     - Data Warehousing Architectures
     - Data Warehouse Development
     - Data Warehouse Vendors
     - Real-time DW

     Motivation: Why data warehouse?
     - Construction of data warehouses (DW) involves data cleaning and data integration, an important preprocessing step for data mining (DM).
     - DWs provide OLAP for the interactive analysis of multidimensional data, which facilitates effective DM.
     - Data mining functions can be integrated with OLAP operations to enhance interactive mining of knowledge.
     - A DW provides an effective platform for DM.
     - While DWs are not required for DM, they store massive amounts of data that can be used for DM. [DO]

     What is a data warehouse? [JH]
     - Defined in many different ways, but not rigorously.
     - A decision support database that is maintained separately from the organization's operational database (ODB).
     - Supports information processing by providing a solid platform of consolidated, historical data for analysis.
     - "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
     - Case Study 2: Continental Airlines flies high with its real-time data warehouse
  2. What is a data warehouse? [ET]
     Data warehouse
     - A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format.
     Characteristics
     - Subject oriented, Integrated, Time Variant, Non-volatile
     - Web-based, Relational/multidimensional, Client/server, Real-time
     - Include metadata
     Data warehousing
     - Process of constructing and using data warehouses.
     - Requires data integration, data cleaning, and data consolidation.

     Subject Oriented
     - Organized around major subjects, such as customer, product, sales.
     - Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
     - Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.

     Integrated
     - Integrates multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records.
     - Data cleaning and data integration techniques are applied.
     - Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources. E.g., hotel price: currency, tax, breakfast covered, etc.
     - When data is moved to the warehouse, it is converted.

     Time Variant
     - The time horizon for the data warehouse is significantly longer than that of operational systems.
     - Operational database: current value data.
     - Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years).
     - Every key structure in the data warehouse contains an element of time, explicitly or implicitly.
     - But the key of operational data may or may not contain a "time element".
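The hotel-price example under "Integrated" can be sketched in Python. This is only an illustration of converting source records to one warehouse convention on load: the field names, the exchange rates, and the choice of "USD, tax included" as the target convention are all assumptions, not part of the lecture.

```python
from dataclasses import dataclass

# Hypothetical exchange rates to the warehouse currency (USD); illustrative values only.
RATES_TO_USD = {"USD": 1.0, "IDR": 0.0001}


@dataclass
class HotelPrice:
    hotel: str
    price_usd: float          # warehouse convention: USD, tax included
    breakfast_included: bool


def to_warehouse(record: dict) -> HotelPrice:
    """Convert one source record into the single warehouse convention."""
    amount = record["amount"] * RATES_TO_USD[record["currency"]]
    if not record["tax_included"]:
        amount *= 1 + record["tax_rate"]
    return HotelPrice(record["hotel"], round(amount, 2), record["breakfast"])


# Two hypothetical sources that encode the same attribute differently.
source_a = {"hotel": "A", "amount": 100.0, "currency": "USD",
            "tax_included": True, "tax_rate": 0.0, "breakfast": True}
source_b = {"hotel": "B", "amount": 800000, "currency": "IDR",
            "tax_included": False, "tax_rate": 0.10, "breakfast": False}

rows = [to_warehouse(r) for r in (source_a, source_b)]
```

After conversion, prices from both sources are directly comparable, which is the point of converting data when it is moved into the warehouse.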
  3. Non-volatile
     - A physically separate store of data transformed from the operational environment.
     - Operational update of data does not occur in the data warehouse environment.
     - Does not require transaction processing, recovery, and concurrency control mechanisms.
     - Requires only two operations in data accessing: initial loading of data and access of data.

     Data Warehouse vs. Heterogeneous DBMS
     - Traditional heterogeneous DB integration: build wrappers/mediators on top of multiple, heterogeneous databases. Ex: IBM Data Joiner, Informix DataBlade.
     - Query-driven approach: when a query is posed to a client site, a metadata dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved. These queries are then mapped and sent to local query processors. The results returned from the different sites are integrated into a global answer set.
     - Complex information filtering and integration processes compete for resources.
     - Inefficient and potentially expensive for frequent queries, especially for queries requiring aggregations.

     Data Warehouse vs. Heterogeneous DBMS (2)
     - A DW uses an update-driven approach: information from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and analysis.
     - Unlike OLTP, DWs do not contain the most current information.
     - A DW brings high performance to the integrated heterogeneous DB system since data are copied, preprocessed, integrated, annotated, summarized, and restructured into one data store.
     - Query processing in a DW does not interfere with the processing at local sources.
     - A DW can store and integrate historical information and support complex multidimensional queries.

     DW vs. ODB
     - Major task of ODB: OLTP. Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
     - DWs serve data analysis and decision making: OLAP.
     - Distinct features (OLTP vs. OLAP):
       - User and system orientation: customer vs. market
       - Data contents: current, detailed vs. historical, consolidated
       - Database design: ER + application vs. star + subject
       - View: current, local vs. evolutionary, integrated
       - Access patterns: update vs. read-only but complex queries
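The difference between the query-driven and update-driven approaches can be shown with a toy sketch. Everything here is hypothetical: the two in-memory "sources", their field names, and the wrapper functions stand in for real heterogeneous databases and mediators.

```python
# Two hypothetical operational sources with different field names.
SOURCE_A = [{"cust": "ann", "amt": 10}, {"cust": "bob", "amt": 5}]
SOURCE_B = [{"customer": "ann", "total": 7}]


def wrapper_a():
    """Wrapper: expose source A in the common (customer, amount) form."""
    return [(r["cust"], r["amt"]) for r in SOURCE_A]


def wrapper_b():
    """Wrapper: expose source B in the common (customer, amount) form."""
    return [(r["customer"], r["total"]) for r in SOURCE_B]


def query_driven_total(customer):
    """Query-driven: every query is sent to every source at query time."""
    return sum(amt for w in (wrapper_a, wrapper_b) for c, amt in w() if c == customer)


def build_warehouse():
    """Update-driven: integrate and summarize once, in advance."""
    totals = {}
    for w in (wrapper_a, wrapper_b):
        for c, amt in w():
            totals[c] = totals.get(c, 0) + amt
    return totals


warehouse = build_warehouse()          # done at load time
ann_total = warehouse["ann"]           # later queries never touch the sources
```

Both paths give the same answer, but the query-driven path repeats the integration work on every query and loads the operational sources each time, while the warehouse lookup does not.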
  4. OLTP vs OLAP

                            OLTP                                OLAP
     users                  clerk, IT professional              knowledge worker
     function               day-to-day operations               decision support
     DB design              application-oriented                subject-oriented
     data                   current, up-to-date, detailed,      historical, summarized, multidimensional,
                            flat relational, isolated           integrated, consolidated
     usage                  repetitive                          ad-hoc
     access                 read/write,                         lots of scans
                            index/hash on prim. key
     unit of work           short, simple transaction           complex query
     # records accessed     tens                                millions
     # users                thousands                           hundreds
     DB size                100MB-GB                            100GB-TB
     metric                 transaction throughput              query throughput, response

     Why Separate DW?
     - High performance for both systems:
       - DBMS, tuned for OLTP: access methods, indexing, concurrency control, recovery
       - Warehouse, tuned for OLAP: complex OLAP queries, computation of large groups of data at summarized levels, multidimensional view, consolidation
     - Processing OLAP queries in operational databases would degrade the performance of operational tasks.
     - In an ODB, concurrency control and recovery mechanisms (locking, logging) are required to ensure the consistency and robustness of transactions.
     - OLAP: read-only access. No need for concurrency control and recovery.

     Why Separate DW? (2)
     - Different functions and different data:
       - Missing data: decision support requires historical data which operational DBs do not typically maintain. So, data in an ODB is usually far from complete for decision making.
       - Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources. ODBs contain detailed raw data (transactions) which need to be consolidated before analysis.
       - Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled.

     Conceptual Modeling of DW
     - Data cube: see TSBD Lecture Notes on Visualization of Data Cubes
     - Modeling data warehouses: dimensions & measurements
       - Star schema: a single object (fact table) in the middle connected to a number of objects (dimension tables, one for each dimension).
       - Snowflake schema: a refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables.
       - Fact constellations: multiple fact tables share dimension tables. Also known as galaxy schema.
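A star schema can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not the lecture's own code: the table and column names are lowercased adaptations of the Product/Store/Sales example used in this deck, only two dimensions are included, and the data rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one per dimension, denormalized (state/country repeat per store).
CREATE TABLE product (product_no INTEGER PRIMARY KEY, prod_name TEXT, category TEXT);
CREATE TABLE store   (store_id INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);

-- Fact table in the middle: foreign keys to each dimension plus the measurements.
CREATE TABLE sales_fact (
    product_no   INTEGER REFERENCES product(product_no),
    store_id     INTEGER REFERENCES store(store_id),
    unit_sales   INTEGER,
    dollar_sales REAL
);

-- Hypothetical sample rows.
INSERT INTO product VALUES (1, 'TV', 'Electronics'), (2, 'PC', 'Electronics');
INSERT INTO store VALUES (10, 'Bandung', 'Jawa Barat', 'Indonesia'),
                         (11, 'Bogor',   'Jawa Barat', 'Indonesia');
INSERT INTO sales_fact VALUES (1, 10, 3, 900.0), (2, 10, 1, 700.0), (1, 11, 2, 600.0);
""")

# A typical star join: aggregate a measurement by attributes of two dimensions.
rows = conn.execute("""
    SELECT s.state, p.prod_name, SUM(f.unit_sales)
    FROM sales_fact f
    JOIN product p ON p.product_no = f.product_no
    JOIN store   s ON s.store_id   = f.store_id
    GROUP BY s.state, p.prod_name
    ORDER BY p.prod_name
""").fetchall()
```

In a snowflake variant, `state` and `country` would move out of `store` into their own tables, trading the repeated 'Jawa Barat' values for extra joins.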
  5. Example of Star Schema
     - Sales Fact Table: Date, Product, Store, Customer; measurements: unit_sales, dollar_sales, Yen_sales
     - Date: Day, Month, Year
     - Product: ProductNo, ProdName, ProdDesc, Category, QOH
     - Store: StoreID, City, State, Country, Region
     - Cust: CustId, CustName, CustCity, CustCountry
     - Potential redundancy: Bandung and Bogor are both in Jawa Barat (West Java).

     Snowflake Schema
     - Sales Fact Table: Date, Product, Store, Customer; measurements: unit_sales, dollar_sales, Yen_sales
     - Date: Day, Month
     - Month: Month, Year
     - Year: Year
     - Product: ProductNo, ProdName, ProdDesc, Category, QOH
     - Store: StoreID, City
     - City: City, State
     - State: State, Country
     - Country: Country, Region
     - Cust: CustId, CustName, CustCity, CustCountry

     View of Warehouses and Hierarchies
     - Importing data
     - Table browsing
     - Dimension creation
     - Dimension browsing
     - Cube building
     - Cube browsing

     Data Cube
     [Figure: data cube with a Date dimension (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), a product dimension (TV, PC, VCR, sum), and a Country dimension (U.S.A, Canada, Mexico, sum). The highlighted cell is the total annual sales of TV in U.S.A.]
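The "sum" cells of a data cube can be computed with a short Python sketch. The sales figures are invented, and the dictionary-based cube is an illustrative assumption rather than how an OLAP server stores data.

```python
from collections import defaultdict
from itertools import product

ALL = "sum"  # marker for a dimension that has been aggregated away

# Hypothetical base facts: (quarter, product, country, sales)
facts = [
    ("1Qtr", "TV", "U.S.A", 10), ("2Qtr", "TV", "U.S.A", 20),
    ("3Qtr", "TV", "U.S.A", 30), ("4Qtr", "TV", "U.S.A", 40),
    ("1Qtr", "PC", "Canada", 5), ("1Qtr", "VCR", "Mexico", 7),
]

cube = defaultdict(int)
for quarter, prod, country, sales in facts:
    # Each fact feeds 2^3 cells: every dimension is either kept or rolled up to ALL.
    for q, p, c in product((quarter, ALL), (prod, ALL), (country, ALL)):
        cube[(q, p, c)] += sales

# The cell "total annual sales of TV in U.S.A." sums over all quarters.
tv_usa_annual = cube[(ALL, "TV", "U.S.A")]
```

Precomputing every combination makes any summary a single lookup, at the cost of storing up to 2^n aggregate cells per fact for n dimensions.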
  6. Data Cube
     - Visualization
     - OLAP capabilities
     - Interactive manipulation

     Typical OLAP Operations
     - Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
     - Drill down (roll down): reverse of roll-up; from higher level summary to lower level summary or detailed data, or introducing new dimensions.
     - Slice and dice: project and select.
     - Pivot (rotate): reorient the cube; visualization; 3D to a series of 2D planes.
     - Other operations:
       - drill across: involving (across) more than one fact table.
       - drill through: through the bottom level of the cube to its back-end relational tables.
     - More info: www.knowledgecenters.org, www.olapreport.com, www.olapcouncil.org

     Data Mart
     - A DW collects information about subjects that span the entire organization, such as customers, products, sales, assets, and personnel. Its scope is enterprise-wide.
     - For a DW, the fact constellation schema is commonly used since it can model multiple, interrelated subjects.
     - A data mart is a subset of a DW that focuses on a particular subject. Its scope is department-wide. Typically, a data mart consists of a single subject area (e.g. marketing, operations).
     - For a data mart, the star or snowflake schema is commonly used since both are geared towards modeling single subjects, although the star schema is more popular.

     Data Mart (2)
     - A data mart can be either dependent or independent.
     - A dependent data mart is a subset that is created directly from the DW.
       - Consistent data model
       - Provides quality data
       - The DW must be constructed first
       - Ensures that the user is viewing the same version of the data that is accessed by all other data warehouse users
     - An independent data mart is a small warehouse designed for a department, and its source is not an EDW.
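The typical OLAP operations can be sketched over a small in-memory fact list. The data, the `DIMS` tuple, and the helper names `aggregate` and `slice_` are hypothetical; a real OLAP engine exposes these operations through its own query language.

```python
from collections import defaultdict

# Hypothetical facts: (year, quarter, product, country, sales)
facts = [
    (2001, "1Qtr", "TV", "U.S.A", 10), (2001, "2Qtr", "TV", "U.S.A", 20),
    (2001, "1Qtr", "PC", "U.S.A", 5), (2001, "1Qtr", "TV", "Canada", 7),
]
DIMS = ("year", "quarter", "product", "country")


def aggregate(rows, keep):
    """Sum sales grouped by the dimensions named in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for row in rows:
        out[tuple(row[i] for i in idx)] += row[-1]
    return dict(out)


def slice_(rows, dim, value):
    """Select the rows where one dimension equals a fixed value."""
    i = DIMS.index(dim)
    return [r for r in rows if r[i] == value]


by_quarter = aggregate(facts, ("year", "quarter", "product"))  # drill-down level
by_year = aggregate(facts, ("year", "product"))                # roll-up: quarter -> year
tv_only = slice_(facts, "product", "TV")                       # slice: one dimension fixed
tv_usa = slice_(tv_only, "country", "U.S.A")                   # dice: two dimensions fixed
pivoted = {(p, y): v for (y, p), v in by_year.items()}         # pivot: swap the two axes
```

Going from `by_year` back to `by_quarter` is the drill-down direction: the same measure, grouped at a finer level of the time hierarchy.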
  7. Data Warehousing Process Overview
     The major components of a data warehousing process:
     - Data sources: legacy systems, external data providers (e.g. BPS), OLTP, ERP systems
     - Data extraction
     - Data loading
     - Comprehensive database
     - Metadata
     - Middleware tools

     Data Warehousing Architectures
  8. Data Warehousing Architectures

     Data Integration and the ETL Process
     Various integration technologies:
     - Enterprise Application Integration (EAI)
       - A technology that provides a vehicle for pushing data from source systems into a data warehouse.
       - Integrates application functionality and is focused on sharing functionality across systems.
       - Traditionally, API. Nowadays, SOA (web services).
     - Enterprise Information Integration (EII)
       - An evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases.
       - A mechanism for pulling data from source systems to satisfy a request for information.

     Data Integration and the ETL Process (2)
     ETL
     - Takes 60-70% of the time in a data-centric project.
     - Extraction: reading data from one or more databases.
     - Transformation: converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a DW.
     - Load: putting the data into the DW.
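The three ETL steps can be sketched end to end with the Python standard library. This is an illustration under stated assumptions: the CSV text stands in for a source system, the cleansing rules (drop rows with no amount, normalize customer names) are invented, and an in-memory SQLite database plays the role of the DW.

```python
import csv
import io
import sqlite3

# Hypothetical source extract with inconsistent names and a missing amount.
SOURCE_CSV = "order_id,customer,amount\n1, Ann ,10.50\n2,bob,\n3,BOB,4.25\n"


def extract(text):
    """Extraction: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: cleanse and convert rows into the warehouse form."""
    clean = []
    for r in rows:
        if not r["amount"].strip():
            continue  # drop rows that fail a basic cleansing rule
        clean.append((int(r["order_id"]), r["customer"].strip().lower(), float(r["amount"])))
    return clean


def load(rows, conn):
    """Load: put the transformed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
```

Keeping the three steps as separate functions mirrors the ETL split: each can be tested or replaced independently, and most of the effort sits in `transform`.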
  9. Data Warehouse Development
     Direct benefits
     - Allowing end users to perform extensive analysis in numerous ways
     - A consolidated view of corporate data (i.e. a single version of the truth)
     - Better and more timely information
     - Enhanced system performance. A DW frees production processing because some operational system reporting requirements are moved to DSS
     - Simplification of data access

     Data Warehouse Development (2)
     Some best practices for implementing a DW (Weir, 2002):
     - Project must fit with corporate strategy and business objectives
     - There must be complete buy-in to the project by executives, managers, and users
     - It is important to manage user expectations about the completed project
     - The data warehouse must be built incrementally
     - Build in adaptability
     - Managed by both IT and business professionals
     - Develop a business/supplier relationship
     - Only load data that have been cleansed and are of a quality understood by the organization
     - Do not overlook training requirements
     - Be politically aware

     Data Warehouse Vendors
     - Computer Associates
     - DataMirror
     - Data Advantage Group
     - Dell Computer
     - Embarcadero Technologies
     - Business Objects
     - HP
     - Hummingbird
     - Hyperion
     - IBM
     - Informatica
     - Microsoft
     - Oracle
     - SAS
     - Siemens
     - Sybase
     - Teradata

     Data Warehouse Vendors (2)
     Six guidelines to consider when developing a vendor list:
     1. Financial strength
     2. ERP linkages
     3. Qualified consultants
     4. Market share
     5. Industry experience
     6. Established partnerships
     Please visit:
     - Data Warehousing Institute (tdwi.com)
     - DM Review (dmreview.com)
  10. Real-time DW
      - Traditionally, updated on a weekly basis. Unsuitable for some businesses.
      - Real-time (active) data warehousing: the process of loading and providing data via a data warehouse as they become available.
      - Levels of data warehouses:
        1. Reports what happened
        2. Some analysis occurs
        3. Provides prediction capabilities
        4. Operationalization
        5. Becomes capable of making events happen

      From DW to DM [JH]
      Three kinds of data warehouse applications:
      - Information processing: supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs.
      - Analytical processing: multidimensional analysis of data warehouse data; supports basic OLAP operations, slice-dice, drilling, pivoting.
      - Data mining: knowledge discovery from hidden patterns; supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
  11. References
      [JH] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
      [ET] Efraim Turban et al., Decision Support and Business Intelligence Systems, Pearson, 2007.
      [DO] David Olson and Yong Shi, Introduction to Business Data Mining, McGraw-Hill, 2007.