Data modifications
• The end users of a data warehouse do not directly update the data warehouse.

Data warehouse introduction
Data Warehousing
• Aims of information technology:
  • To help workers in their everyday business activity and improve their productivity (clerical data processing tasks)
  • To help knowledge workers (executives, managers, analysts) make faster and better decisions (decision support systems)
• Two types of applications:
  • Operational applications
  • Analytical applications

Data Warehousing (Contd..)
• In most organizations, data about specific parts of the business is there: lots and lots of data, somewhere, in some form.
• Data is available, but not information, and not the right information at the right time.
• There is a need to
  • bring together information
  • off-load decision support applications from the on-line transaction system

Data Warehouse
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
• Collection of data that is used primarily in organizational decision making
• A decision support database that is maintained separately from the organization’s operational database

Data Warehouse - Subject Oriented
• Data that gives information about a particular subject.
• Data is organized for modeling and analysis.
• Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Data Warehouse - Integrated
• Constructed by integrating multiple, heterogeneous data sources.
• Data cleaning and data integration techniques are applied.
• When data is moved to the warehouse, it is converted.

Data Warehouse - Time Variant
• Data is stable in a data warehouse.
• It holds historical as well as current data.
• Every key structure in the data warehouse contains an element of time, explicitly or implicitly.
• But the key of operational data may or may not contain a “time element”.

Data Warehouse - Non-Volatile
• A physically separate store of data transformed from the operational environment.
• No update or delete on historical data.
• Operational update of data does not occur in the data warehouse; new data is appended.
• Requires only two operations: initial loading of data and access of data.

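The time-variant and non-volatile properties can be illustrated together with a small sketch. It assumes a hypothetical `customer_history` table whose key includes a time element (`valid_from`); the table, column names, and sample values are illustrative only and do not come from the slides. A change is recorded by appending a new row, never by updating an old one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the primary key carries an explicit time element.
conn.execute("""
    CREATE TABLE customer_history (
        customer_id INTEGER,
        valid_from  TEXT,      -- ISO date, so string order equals date order
        city        TEXT,
        PRIMARY KEY (customer_id, valid_from)
    )
""")

def record_change(conn, customer_id, valid_from, city):
    """Append a new version; earlier versions are never updated or deleted."""
    with conn:
        conn.execute(
            "INSERT INTO customer_history VALUES (?, ?, ?)",
            (customer_id, valid_from, city),
        )

def city_as_of(conn, customer_id, date):
    """Return the city that was current on the given ISO date, or None."""
    row = conn.execute(
        """SELECT city FROM customer_history
           WHERE customer_id = ? AND valid_from <= ?
           ORDER BY valid_from DESC LIMIT 1""",
        (customer_id, date),
    ).fetchone()
    return row[0] if row else None

record_change(conn, 7, "2020-01-01", "Pune")
record_change(conn, 7, "2021-06-15", "Delhi")
```

Because old rows stay in place, a query such as `city_as_of(conn, 7, "2020-12-31")` can still answer questions about the past.
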
Data modifications & schema design
• A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques.
• Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance.

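A minimal sketch of a bulk load into a star schema, assuming a hypothetical fact table `fact_sales` with two dimension tables (`dim_item`, `dim_time`). The names, the rows, and the `nightly_load` helper are assumptions made for illustration; a real ETL tool would add extraction and cleaning steps before this point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical star schema: one fact table that references two dimensions.
conn.executescript("""
    CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER);
    CREATE TABLE fact_sales (
        time_key     INTEGER REFERENCES dim_time(time_key),
        item_key     INTEGER REFERENCES dim_item(item_key),
        units_sold   INTEGER,
        dollars_sold REAL
    );
""")

def nightly_load(conn, extracted_rows):
    """Append one batch of already-cleaned rows in a single transaction."""
    with conn:  # commits on success, rolls back if any insert fails
        conn.executemany(
            "INSERT INTO fact_sales VALUES (?, ?, ?, ?)", extracted_rows
        )

conn.executemany("INSERT INTO dim_item VALUES (?, ?)", [(1, "pen"), (2, "ink")])
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)", [(1, 1, 1), (2, 2, 1)])

nightly_load(conn, [(1, 1, 10, 15.0), (1, 2, 3, 12.0)])  # first batch
nightly_load(conn, [(2, 1, 4, 6.0)])                     # next batch

row_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
total_dollars = conn.execute("SELECT SUM(dollars_sold) FROM fact_sales").fetchone()[0]
```

End users only read from `fact_sales`; the only writer is the batch loader.
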
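Why a Separate Data Warehouse?
• Separate, historical data are needed for decision support.
• Complex decision-support queries.
• Missing data: operational databases typically do not keep the historical data that decision support needs.
• Data consolidation: decision support needs data consolidated from heterogeneous sources.
• Data quality: different sources often use inconsistent representations.
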
Advantages of Data Warehousing
• High query performance
• Queries not visible outside warehouse
• Local processing at sources unaffected
• Can operate when sources unavailable
• Can query data not stored in a DBMS
• Extra information at warehouse
  • Modify, summarize (store aggregates)
  • Add historical information

Decision Support System
• Information technology to help knowledge workers (executives, managers, analysts) make faster and better decisions
• OLAP is an element of a decision support system
• Data mining is a powerful, high-performance data analysis tool for decision support.

Three-Tier Decision Support Systems
• Warehouse database server
  • Almost always a relational DBMS, rarely flat files
• OLAP servers
  • Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operators
  • Multidimensional OLAP (MOLAP): special-purpose server that directly implements multidimensional data and operations
• Clients
  • Query and reporting tools
  • Analysis tools
  • Data mining tools

The Complete Decision Support System
[Figure: information sources (operational DBs, semistructured sources) are extracted, transformed, loaded and refreshed into the data warehouse server (Tier 1: data warehouse and data marts), which serves the OLAP servers (Tier 2: e.g., MOLAP, ROLAP), which in turn serve the clients (Tier 3: OLAP, query/reporting, data mining).]

Data Sources
• Data sources are often the operational systems, providing the lowest level of data.
• Data sources are designed for operational use, not for decision support, and the data reflect this fact.
• Multiple data sources are often from different systems, run on a wide range of hardware, and much of the software is built in-house or highly customized.
• Multiple data sources introduce a large number of issues, such as semantic conflicts.

Creating and Maintaining a Warehouse
• A data warehouse needs several tools that automate or support tasks such as:
  • Data extraction from different external data sources, operational databases, files of standard applications
  • Data cleaning (finding and resolving inconsistency in the source data)
    • Inconsistent field lengths, inconsistent descriptions, inconsistent value assignments, missing entries and violation of integrity constraints.
    • Optional fields in data entry are significant sources of inconsistent data.

Creating and Maintaining a Warehouse (Contd..)
• Integration and transformation of data (between different data formats, languages, etc.)
• Data loading (loading the data into the data warehouse)
  • checking integrity constraints, sorting, summarizing, etc.
• Data replication (replicating source database into the data warehouse)
  • used to incrementally refresh a warehouse when sources change
• Data refreshment
  • propagating updates on source data to the data stored in the warehouse
  • periodically or immediately
• Data archiving

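The data cleaning task listed among the tool duties can be sketched as a simple rule check. This is an illustration under assumptions, not a description of any particular tool: the `find_problems` helper, the field names, and the sample records are all hypothetical. It flags two of the problems named in the list, missing entries and inconsistent value assignments.

```python
def find_problems(records, required, allowed_values):
    """Return (row_index, field, reason) for every violation found."""
    problems = []
    for i, rec in enumerate(records):
        # Missing entries: required fields that are absent or empty.
        for field in required:
            if rec.get(field) in (None, ""):
                problems.append((i, field, "missing entry"))
        # Inconsistent value assignments: values outside the agreed coding.
        for field, allowed in allowed_values.items():
            value = rec.get(field)
            if value not in (None, "") and value not in allowed:
                problems.append((i, field, "inconsistent value"))
    return problems

# Hypothetical extracted rows from two differently coded sources.
source_rows = [
    {"cust_id": 1, "gender": "F", "city": "Pune"},
    {"cust_id": 2, "gender": "female", "city": ""},
    {"cust_id": 3, "gender": "M", "city": None},
]

problems = find_problems(
    source_rows,
    required=["cust_id", "city"],
    allowed_values={"gender": {"M", "F"}},
)
```

Rows that are flagged would be corrected or rejected before the loading step.
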
The Data Warehousing Models
• Enterprise warehouse
  • collects all the information about subjects spanning the entire organization
• Data mart
  • a subset of corporate-wide data that is of value to a specific group of users
  • its scope is confined to specific, selected groups, such as a marketing data mart
  • independent vs. dependent (sourced directly from the warehouse) data mart
• Virtual warehouse
  • a set of views over operational databases
  • only some summary views are materialized

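A rough sketch of the virtual warehouse idea, assuming a hypothetical operational table `orders`. SQLite has no built-in materialized views, so the materialized summary is imitated here with a plain table created from the view; all names and rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical operational table.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'north', 10.0), (2, 'north', 5.0), (3, 'south', 7.5);

    -- Virtual: recomputed from the operational table on every query.
    CREATE VIEW v_sales_by_region AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

    -- "Materialized" by hand: a stored snapshot of the same summary.
    CREATE TABLE m_sales_by_region AS SELECT * FROM v_sales_by_region;
""")

# A later operational insert is seen by the view but not by the snapshot.
conn.execute("INSERT INTO orders VALUES (4, 'south', 2.5)")

view_rows = dict(conn.execute("SELECT region, total FROM v_sales_by_region"))
snapshot_rows = dict(conn.execute("SELECT region, total FROM m_sales_by_region"))
```

The view is always current but puts query load on the operational database; the snapshot is cheap to read but must be refreshed.
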
Physical Structure of Data Warehouse
• There are three basic architectures for constructing a data warehouse:
  • Centralized
  • Federated (distributed)
  • Tiered (distributed)
• The data warehouse is distributed for load balancing, scalability and higher availability.

Physical Structure of Data Warehouse (Contd..)
• Federated: the logical data warehouse is only virtual.
• Tiered: the central data warehouse is physical.
  • There exist local data marts on different tiers which store copies or summarizations of the previous tier.

Data Processing Models
• There are two basic data processing models:
  • OLTP (On-Line Transaction Processing)
    • Describes processing at operational sites
    • Aim is reliable and efficient processing of a large number of transactions and ensuring data consistency.
  • OLAP (On-Line Analytical Processing)
    • Describes processing at the warehouse
    • Aim is efficient multidimensional processing of large data volumes.

OLTP vs. OLAP
                     OLTP                          OLAP
users                clerk, IT professional        knowledge worker
function             day-to-day operations         decision support
DB design            application-oriented          subject-oriented
data                 current, up-to-date;          historical, summarized;
                     detailed, flat relational;    multidimensional;
                     isolated                      integrated, consolidated
usage                repetitive                    ad hoc
access               read/write;                   lots of scans
                     index/hash on primary key
unit of work         short, simple transaction     complex query
# records accessed   tens                          millions
# users              thousands                     hundreds
DB size              100MB-GB                      100GB-TB
metric               transaction throughput        query throughput, response

OLAP
• Main goal: support ad hoc but complex querying performed by business analysts
• Interactive process of creating, managing, analyzing and reporting on data
• Extends spreadsheet-like analysis to work with huge amounts of data in a data warehouse
• Data exploration and aggregation in various ways
• Typical applications include assessing the effectiveness of a marketing campaign, product sales forecasting, and spotting trends

OLAP (Contd..)
• Allows a sophisticated user to analyse data using complex, multi-dimensional views
• Places key performance indicators (measures) into context (dimensions)
  • Measures are pre-aggregated
  • Data retrieval is significantly faster
• The resulting cube is made available to business analysts, who can browse the data using a variety of tools for ad hoc, interactive analytical processing

OLAP Server Architectures
• Relational OLAP (ROLAP):
  • Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware to support missing pieces
  • Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  • Greater scalability
  • Schema design: Star, Snowflake, Fact Constellation
• Multidimensional OLAP (MOLAP):
  • Array-based multidimensional storage engine (sparse matrix techniques)
  • Fast indexing to pre-computed summarized data
  • Schema design: Cube
• Hybrid OLAP (HOLAP):
  • User flexibility: low level is relational, high level is array

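The MOLAP ideas of sparse multidimensional storage and pre-computed summaries can be sketched with a dictionary keyed by cell coordinates. This is a toy illustration under assumptions, not how any specific MOLAP server stores data: the three dimensions (product, region, month), the cell values, and the `ALL` marker are made up for the example.

```python
from collections import defaultdict

# Sparse cube: only non-empty cells are stored, keyed by (product, region, month).
cells = {
    ("pen", "north", 1): 10,
    ("pen", "south", 1): 4,
    ("ink", "north", 2): 7,
}

ALL = "*"  # marker meaning "aggregated over this dimension"

def precompute(cells):
    """Store the sum for every combination of kept and aggregated dimensions."""
    cube = defaultdict(int)
    for (product, region, month), value in cells.items():
        for p in (product, ALL):
            for r in (region, ALL):
                for m in (month, ALL):
                    cube[(p, r, m)] += value
    return dict(cube)

cube = precompute(cells)

# A summary is now a single lookup instead of a scan over the base cells.
pen_total = cube[("pen", ALL, ALL)]
north_total = cube[(ALL, "north", ALL)]
grand_total = cube[(ALL, ALL, ALL)]
```

The trade-off is storage: every base cell contributes to eight summary entries here, which is why real engines choose which aggregates to keep.
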
ROLAP
• Special schema design: snowflake
• Special indexes: bitmap, multi-table join
• Proven technology (relational models, DBMS)
  • Tends to outperform specialized MDDB, especially on large data sets
• Products
  • IBM DB2, Oracle, Sybase IQ, RedBrick, Informix

Measures and Dimensions
• Measures: key performance indicators that you want to evaluate
  • Typically numerical, including volume, sales and cost
  • A rule of thumb: if a number makes business sense when aggregated, then it is a measure
  • Examples
    • Aggregate daily volume to month, quarter and year
    • Aggregating telephone numbers would not make sense, so they are not measures
  • Affects what should be stored in the data warehouse

Measures and Dimensions (Contd..)
• Dimensions: categories of data analysis
  • Typical dimensions include product, time, region
  • A rule of thumb: when a report is requested “by” something, that something is usually a dimension
  • Example
    • Sales report: view sales by month, by region
    • Two dimensions needed are time and region

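The two rules of thumb can be tried out on a tiny table. The sketch below assumes a hypothetical `sales` table in which `amount` is the measure, `month` and `region` are dimensions, and `phone` is a number that is deliberately not treated as a measure; the `sales_by` helper and all rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (month TEXT, region TEXT, phone TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('Jan', 'east', '555-0101', 100.0),
        ('Jan', 'west', '555-0102',  40.0),
        ('Feb', 'east', '555-0101',  60.0);
""")

def sales_by(conn, dimension):
    """Aggregate the measure (amount) "by" one dimension."""
    if dimension not in {"month", "region"}:
        # phone is numeric-looking, but summing or grouping totals by it
        # has no business meaning, so it is not offered here.
        raise ValueError(f"{dimension!r} is not a dimension of this report")
    query = f"SELECT {dimension}, SUM(amount) FROM sales GROUP BY {dimension}"
    return dict(conn.execute(query))

by_month = sales_by(conn, "month")
by_region = sales_by(conn, "region")
```

The column name is checked against a fixed set before it is placed in the query text, since SQL parameters cannot stand in for identifiers.
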
Conceptual Modeling
• Star schema
• Snowflake schema
• Fact constellation, or galaxy schema

Example of Snowflake Schema
[Figure: snowflake schema for sales. Tables shown:]
• Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_key
• supplier: supplier_key, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city_key
• city: city_key, city, province_or_street, country
Notes:
• Represents the dimensional hierarchy directly by normalizing tables.
• Easy to maintain and saves storage.

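A cut-down sketch of the snowflake idea, using only the item and supplier branch of the schema. The table and column names follow the figure, but the fact table is reduced to one key and two measures, and every row (including the brand name) is invented for illustration. It shows the extra join that normalizing a dimension introduces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The supplier attributes are normalized out of the item dimension.
    CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
    CREATE TABLE item (
        item_key     INTEGER PRIMARY KEY,
        item_name    TEXT,
        brand        TEXT,
        type         TEXT,
        supplier_key INTEGER REFERENCES supplier(supplier_key)
    );
    -- Simplified fact table (the figure's version has four dimension keys).
    CREATE TABLE sales_fact (
        item_key     INTEGER REFERENCES item(item_key),
        units_sold   INTEGER,
        dollars_sold REAL
    );
    -- Hypothetical sample rows.
    INSERT INTO supplier VALUES (1, 'wholesale'), (2, 'retail');
    INSERT INTO item VALUES (10, 'pen', 'Acme', 'stationery', 1),
                            (11, 'ink', 'Acme', 'stationery', 2);
    INSERT INTO sales_fact VALUES (10, 3, 6.0), (11, 1, 5.0), (10, 2, 4.0);
""")

# Reaching supplier_type takes two joins: fact -> item -> supplier.
dollars_by_supplier_type = dict(conn.execute("""
    SELECT s.supplier_type, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN item     AS i ON i.item_key = f.item_key
    JOIN supplier AS s ON s.supplier_key = i.supplier_key
    GROUP BY s.supplier_type
"""))
```

In a star schema `supplier_type` would sit directly in `item`, saving one join at the cost of repeating the value in every item row.
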
Example of Fact Constellation
[Figure: fact constellation for sales and shipping. Tables shown:]
• Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
• Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, province_or_street, country
• shipper: shipper_key, shipper_name, location_key, shipper_type
Notes:
• Multiple fact tables that share many dimension tables.

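A small sketch of querying two fact tables through a shared dimension. The tables are cut down to the `item` dimension and one measure each, and the rows are hypothetical. Each fact table is aggregated on its own before the results are joined, since joining the raw fact rows to each other would multiply rows and inflate the sums.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE sales_fact    (item_key INTEGER, units_sold INTEGER);
    CREATE TABLE shipping_fact (item_key INTEGER, units_shipped INTEGER);
    -- Hypothetical sample rows.
    INSERT INTO item VALUES (1, 'pen'), (2, 'ink');
    INSERT INTO sales_fact VALUES (1, 5), (1, 3), (2, 4);
    INSERT INTO shipping_fact VALUES (1, 6), (2, 4), (2, 1);
""")

# Aggregate each fact table separately, then meet at the shared dimension.
sold_vs_shipped = conn.execute("""
    SELECT i.item_name, s.sold, h.shipped
    FROM item AS i
    JOIN (SELECT item_key, SUM(units_sold) AS sold
          FROM sales_fact GROUP BY item_key) AS s ON s.item_key = i.item_key
    JOIN (SELECT item_key, SUM(units_shipped) AS shipped
          FROM shipping_fact GROUP BY item_key) AS h ON h.item_key = i.item_key
    ORDER BY i.item_name
""").fetchall()
```
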
Aggregates
• Add up amounts for day 1
  In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
  Result: 81

Aggregates (Contd..)
• Add up amounts by day
  In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

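Aggregates (Contd..)
• Add up amounts by day, product
  In SQL: SELECT date, prodId, sum(amt) FROM SALE GROUP BY date, prodId
• Going from the by-day result to the by-day, product result is a drill-down; the reverse direction is a rollup.
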
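The three aggregate queries can be run end to end with a short sketch. The slide's own SALE table did not survive, so the rows below are hypothetical, chosen so that day 1 adds up to 81; the `storeId` column is likewise an assumption. The last step performs the rollup by hand, collapsing the product dimension of the finer result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SALE (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
# Hypothetical rows; the amounts for day 1 sum to 81.
conn.executemany("INSERT INTO SALE VALUES (?, ?, ?, ?)", [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
])

# Add up amounts for day 1.
day1_total = conn.execute(
    "SELECT sum(amt) FROM SALE WHERE date = 1"
).fetchone()[0]

# Add up amounts by day.
by_day = dict(conn.execute("SELECT date, sum(amt) FROM SALE GROUP BY date"))

# Drill-down: add up amounts by day, product.
by_day_product = {
    (date, prod): total
    for date, prod, total in conn.execute(
        "SELECT date, prodId, sum(amt) FROM SALE GROUP BY date, prodId"
    )
}

# Rollup: collapse the product dimension of the finer result.
rolled_up = {}
for (date, _prod), total in by_day_product.items():
    rolled_up[date] = rolled_up.get(date, 0) + total
```

Rolling up the by-day, product result reproduces the by-day result, which is why a warehouse can answer coarser questions from stored finer aggregates.
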
Points to be noticed about ROLAP
• Defines complex, multi-dimensional data with a simple model
• Reduces the number of joins a query has to process
• Allows the data warehouse to evolve with relatively low maintenance
• Can contain both detailed and summarized data
• ROLAP is based on familiar, proven, and already selected technologies
• But: multi-dimensional manipulation and calculations have to be expressed in SQL.