Basics of Datawarehousing
Transcript

  • 1. Datawarehouse

    Bill Inmon defined the data warehouse in 1990 as follows: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process". He defined the terms in the sentence as follows:

    Subject-oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
    Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
    Time-variant: All data in the data warehouse is identified with a particular time period.
    Non-volatile: Data is stable in a data warehouse. More data is added, but data is never removed.

    Bill Inmon's paradigm: The data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in third normal form (3NF).

    In practice, however, a single-subject data warehouse is typically referred to as a data mart, while data warehouses are generally enterprise in scope.

    Data warehouses can also be volatile. Because of the large amount of storage a data warehouse requires (multi-terabyte data warehouses are not uncommon), only a certain number of periods of history are kept. For instance, if three years of data are decided on and loaded into the warehouse, then every month the oldest month is "rolled off" the database and the newest month is added.

    ===============================================================

    Ralph Kimball provided a much simpler definition: a data warehouse is "a copy of transaction data specifically structured for query and analysis".

    Ralph Kimball's paradigm: The data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.

    ===============================================================
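The monthly "roll off" described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module rather than a real warehouse platform, and the table `sales_history`, its columns, and the function `roll_window` are hypothetical names that do not come from the slides.

```python
import sqlite3

def roll_window(conn, new_month, new_amounts, keep_months=36):
    """Load one new month, then delete months that fall outside the retention window.

    Hypothetical helper: months are stored as 'YYYY-MM' strings, so sorting
    them as text also sorts them chronologically.
    """
    conn.executemany(
        "INSERT INTO sales_history (month, amount) VALUES (?, ?)",
        [(new_month, amount) for amount in new_amounts],
    )
    months = [row[0] for row in conn.execute(
        "SELECT DISTINCT month FROM sales_history ORDER BY month DESC")]
    # Everything after the newest `keep_months` months is rolled off.
    for old_month in months[keep_months:]:
        conn.execute("DELETE FROM sales_history WHERE month = ?", (old_month,))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_history (month TEXT, amount REAL)")

# Use a 3-month window here so the effect is visible with little data.
for month in ("2023-01", "2023-02", "2023-03"):
    roll_window(conn, month, [100.0, 50.0], keep_months=3)
roll_window(conn, "2023-04", [75.0], keep_months=3)

remaining = [row[0] for row in conn.execute(
    "SELECT DISTINCT month FROM sales_history ORDER BY month")]
print(remaining)
```

With three years of history, `keep_months` would be 36. Large warehouses often get the same effect by dropping a whole partition instead of deleting rows, which this sketch does not attempt to show.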
  • 2. Steps:
    • Requirement Gathering
    • Physical Environment Setup
    • Data Modeling
    • ETL
    • OLAP Cube Design
    • Front End Development
    • Performance Tuning
    • Quality Assurance
    • Rolling out to Production
    • Production Maintenance
    • Incremental Enhancements

    Components of the Dimensional Data Model:

    Dimension: A category of information. For example, the time dimension.
    Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
    Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.
    Fact Table: A table that contains the measures of interest. For example, sales amount would be such a measure.

    A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another.

    In designing data models for data warehouses / data marts, the most commonly used schema types are the Star Schema and the Snowflake Schema.

    Star Schema: In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table. Fact tables in a star schema are mostly in third normal form (3NF), but dimension tables are de-normalized, in second normal form (2NF).

    Snowflake Schema: The snowflake schema (sometimes called snowflake join schema) is a more complex schema than the star schema because the tables which describe the dimensions are normalized.

    The main advantage of the snowflake schema is reduced disk storage, and joining smaller lookup tables can improve query performance in some cases. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.
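A small star schema makes the components above concrete. The sketch below is an assumption-laden illustration, not part of the original deck: it uses Python's built-in sqlite3 module, and the tables `dim_date`, `dim_product`, `fact_sales` and their columns are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension (lookup) tables: de-normalized, one row per member.
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT, month TEXT);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);

-- Fact table in the middle: foreign keys to each dimension plus the measure.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    sales_amount REAL);

INSERT INTO dim_date VALUES
    (1, 2024, 'Q1', 'Jan'), (2, 2024, 'Q1', 'Feb'), (3, 2024, 'Q2', 'Apr');
INSERT INTO dim_product VALUES
    (10, 'Widget', 'Hardware'), (11, 'Manual', 'Books');
INSERT INTO fact_sales VALUES
    (1, 10, 100.0), (2, 10, 150.0), (3, 10, 200.0), (1, 11, 20.0), (3, 11, 30.0);
""")

def sales_by_quarter(conn):
    """Roll the measure up the Time hierarchy (Year -> Quarter)."""
    return conn.execute("""
        SELECT d.year, d.quarter, SUM(f.sales_amount)
        FROM fact_sales f
        JOIN dim_date d ON d.date_key = f.date_key
        GROUP BY d.year, d.quarter
        ORDER BY d.year, d.quarter
    """).fetchall()

def sales_by_category(conn):
    """Slice the same facts by an attribute of the product dimension."""
    return conn.execute("""
        SELECT p.category, SUM(f.sales_amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()

print(sales_by_quarter(conn))
print(sales_by_category(conn))
```

In a snowflake variant, `category` would move out of `dim_product` into its own lookup table, which adds one more join to `sales_by_category`.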
  • 3. Dimensions

    What are the types of dimension tables?

    There are three types of dimensions: Conformed Dimensions, Junk Dimensions and Degenerate Dimensions.

    Conformed Dimension: A dimension that has exactly the same meaning and content when referred to from different fact tables. A conformed dimension can be shared by multiple fact tables or multiple data marts. Some examples are the time dimension, the customer dimension and the product dimension.

    Junk Dimension: Occasionally, there are miscellaneous attributes, such as yes/no attributes or comment attributes, that don't fit into tight star schemas. Rather than discarding flag fields and yes/no attributes, place them in a junk dimension. In addition, you can handle comment and open-ended text attributes by creating a text-based junk dimension. A junk dimension is a convenient grouping of flags and indicators. It's helpful, but not absolutely required, if there's a positive correlation among the values.

    What is a degenerate dimension?

    I have a fact table that stores insurance contracts, and one important dimension is the year signed. The fact table has many columns, like CUSTOMER_ID, CONTRACT_ID, etc., and one column YEAR_SIGNED as varchar(4). CUSTOMER_ID is the foreign key column to DIM_CUSTOMER, with all the customer data: name, address, and so on. CONTRACT_ID relates to DIM_CONTRACT, with all the contract-specific information. And YEAR_SIGNED? Should I really have a DIM_YEAR_SIGNED that will have one column only? What other attributes should a year have? Therefore, we do not create an explicit dimension table, and we call the YEAR_SIGNED column a degenerate dimension.

    A degenerate dimension is a dimension key in the fact table that is not connected to any dimension table, i.e. it corresponds to a dimension table that would have no attributes.
  • 4. Types of Facts

    There are three types of facts:
    • Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
    • Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
    • Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

    Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns:

    Date
    Store
    Product
    Sales_Amount

    The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table: date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.

    Say we are a bank with the following fact table:

    Date
    Account
    Current_Balance
    Profit_Margin

    The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add balances up across all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add it up at the account level or the day level.

    Types of Fact Tables

    Based on the above classifications, there are two types of fact tables:
  • 5. • Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive facts. The first example presented here is a cumulative fact table.
    • Snapshot: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

    ==================================================================

    What are factless facts, and in which scenario would you use such fact tables?

    Factless Fact: Some very useful fact tables don't have any facts at all.

    FIGURE 1: A factless fact table for recording student attendance on a daily basis at a college. The five dimension tables contain rich descriptions of dates, students, courses, teachers, and facilities. There are no additive, numeric facts.

    Such a table can answer questions like: Which classes were the most heavily attended? Which classes were the most consistently attended? Which teachers taught the most students?

    Tools:
    Scalability: How can the system grow as your data storage needs grow?
    Parallel Processing Support:
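The difference between additive and semi-additive facts in the bank example can be checked with a few rows of data. The following is a minimal sketch using Python's built-in sqlite3 module; the table name `fact_balance`, the sample values, and the variable names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snapshot fact table: one balance per account per day (hypothetical data).
CREATE TABLE fact_balance (day TEXT, account TEXT, current_balance REAL);
INSERT INTO fact_balance VALUES
    ('2024-01-01', 'A', 100.0), ('2024-01-01', 'B', 200.0),
    ('2024-01-02', 'A', 120.0), ('2024-01-02', 'B', 180.0);
""")

# Meaningful: summing across the account dimension for a single day
# gives the bank's total balance on that day.
total_per_day = dict(conn.execute(
    "SELECT day, SUM(current_balance) FROM fact_balance GROUP BY day"))

# Not meaningful: summing one account's balances across days.
# The 220.0 computed for account A is not an amount of money anyone holds.
naive_sum_over_time = dict(conn.execute(
    "SELECT account, SUM(current_balance) FROM fact_balance GROUP BY account"))

# Across time, a semi-additive fact is usually summarised differently,
# for example by taking the closing (latest) balance per account.
closing_balance = dict(conn.execute("""
    SELECT account, current_balance
    FROM fact_balance
    WHERE day = (SELECT MAX(day) FROM fact_balance)
"""))

print(total_per_day)
print(naive_sum_over_time)
print(closing_balance)
```

Averages over time are another common choice for semi-additive facts; which summary is right depends on the business question.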
  • 6. Popular Relational Databases
    • Oracle, Microsoft SQL Server, IBM DB2, Teradata, Sybase, MySQL

    Popular OS Platforms
    • Linux
    • FreeBSD
    • Microsoft

    ETL Tools:
    • IBM WebSphere Information Integration (Ascential DataStage)
    • Ab Initio
    • Informatica

    OLAP Tool Functionalities
    1. MOLAP: In this type of OLAP, a cube is aggregated from the relational data source (data warehouse). When a user generates a report request, the MOLAP tool can generate the report quickly because all data is already pre-aggregated within the cube.
    2. ROLAP: In this type of OLAP, instead of pre-aggregating everything into a cube, the ROLAP engine essentially acts as a smart SQL generator. The ROLAP tool typically comes with a Designer piece, where the data warehouse administrator can specify the relationship between the relational tables, as well as how dimensions, attributes, and hierarchies map to the underlying database tables.

    Popular Tools
    • Business Objects
    • Cognos
    • Hyperion
    • Microsoft Analysis Services
    • MicroStrategy

    Reporting Tools
    • Business Objects (Crystal Reports)
    • Cognos
    • Actuate

    ==================================================================

    Questions?

    What are MOLAP and ROLAP? What is the difference between them?
  • 7. MOLAP stands for multidimensional online analytical processing and ROLAP for relational online analytical processing.

    In MOLAP, data is stored in the form of multidimensional cubes. The advantage of this mode is that it provides excellent query performance, since the cubes are built for fast data retrieval. All calculations are pre-generated when the cube is created and can be easily applied while querying data.

    In ROLAP, the data is stored in relational databases, and the model gives the appearance of traditional OLAP's slicing and dicing functionality. The advantage of this model is that it can handle a large amount of data and can leverage all the functionalities of the relational database.

    MOLAP has aggregated values stored in the cube. Since the data is aggregated, query performance is fast.
    ROLAP has data stored in relational databases. Here a query has to access the database to retrieve the data every time, so performance is slow compared to MOLAP. The data stored is also larger than in MOLAP.

    ===============================================================

    What is BCP?

    Bulk Copy Program. Two plugins are automatically installed with DataStage:
    1. BCPLoad plugin: used to bulk load data into a single table in MS SQL Server.
    2. OraBulk plugin.

    What is Data Mining?

    Data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Analysts look for patterns hidden in the data.

    How can one connect two fact tables? Is it possible? How?

    Fact tables cannot be connected directly; they are connected by means of conformed dimensions. Example: We_site_id.

    When should you use a STAR and when a SNOW-FLAKE schema?

    STAR SCHEMA:
    1. If PERFORMANCE is the priority, then go for the star schema, since here the dimension tables are DE-NORMALIZED.
  • 8. 2. Usually the star schema is the best option for end users due to its simple design and navigation.

    The snowflake schema (sometimes called snowflake join schema) is a more complex schema than the star schema because the tables which describe the dimensions are normalized. In a snowflake schema, one dimension table is connected to another dimension table, and so on.

    If a dimension is very sparse (i.e. most of the possible values for the dimension have no data) and/or a dimension has a very long list of attributes which may be used in a query, the dimension table may occupy a significant proportion of the database, and snowflaking may be appropriate.

    SNOW-FLAKE SCHEMA:
    If STORAGE SPACE is the priority, then go for the snowflake schema, since here the dimension tables are NORMALIZED.

    What is the difference between OLAP, ROLAP, MOLAP and HOLAP?

    MOLAP: MOLAP (Multidimensional OLAP) provides analysis of data stored in a multidimensional data cube.
    ROLAP: ROLAP (Relational OLAP) provides multidimensional analysis of data stored in a relational database (RDBMS).
    HOLAP: HOLAP (Hybrid OLAP), a combination of both ROLAP and MOLAP, can provide multidimensional analysis simultaneously of data stored in a multidimensional database and in a relational database (RDBMS).
    DOLAP: DOLAP (Desktop OLAP or Database OLAP) provides multidimensional analysis locally on the client machine, on data collected from relational or multidimensional database servers.

    What is the difference between an aggregate table and a fact table? How do you load these two tables?

    Fact tables contain millions of records, and retrieving records from a fact table takes time, whereas an aggregate table contains limited, summarized data from the required tables, so retrieving data from it takes less time.
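As a rough illustration of the last answer, an aggregate table can be loaded from the fact table with a single GROUP BY. This sketch uses Python's built-in sqlite3 module; the names `fact_sales`, `agg_sales_by_store` and `load_aggregate` are hypothetical, and a real warehouse load would normally run inside the ETL tool instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_sales (sale_date TEXT, store TEXT, product TEXT, sales_amount REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01", "S1", "P1", 10.0),
        ("2024-01-01", "S1", "P2", 5.0),
        ("2024-01-02", "S1", "P1", 7.0),
        ("2024-01-02", "S2", "P1", 3.0),
    ],
)

def load_aggregate(conn):
    """Rebuild the (hypothetical) store-level aggregate from the detailed facts."""
    conn.execute("DROP TABLE IF EXISTS agg_sales_by_store")
    conn.execute("""
        CREATE TABLE agg_sales_by_store AS
        SELECT store, SUM(sales_amount) AS total_sales, COUNT(*) AS row_count
        FROM fact_sales
        GROUP BY store
    """)
    conn.commit()

load_aggregate(conn)

# The same question answered from the detailed fact table ...
from_fact = conn.execute(
    "SELECT store, SUM(sales_amount) FROM fact_sales GROUP BY store ORDER BY store"
).fetchall()
# ... and from the much smaller aggregate table.
from_aggregate = conn.execute(
    "SELECT store, total_sales FROM agg_sales_by_store ORDER BY store"
).fetchall()

print(from_fact)
print(from_aggregate)
```

Both queries return the same totals; the aggregate table simply has one row per store to scan instead of one row per sale.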
  • 9. Which kind of index is preferred in a DWH?

    A bitmap index is usually the best choice, because a B-tree index is suited to columns with unique or nearly unique values (e.g. empid), whereas a bitmap index is best for columns with repeated values (e.g. gender M/F).

    What are CUBES?

    Cubes divide the data into subsets that are defined by dimensions.

    Cube                 Dimensions                                                  Measures
    mscsCampaign         Advertiser, DateHour, Events, Page Group, Site, UserType    Count Events, Distinct Users, OrdImpLeaf
    mscsCampaignEvents   Advertiser, DateHour, Events, Page Group, Site, UserType    Count Events, Distinct Users

    ===============================================================

    What are materialized views? How can they be used in a data warehouse to increase performance?

    MVs are segments similar to tables, in which the output of queries is stored in the database.

    The following is a common query at Acme Bank:

        SELECT acc_type, SUM(cleared_bal) totbal
        FROM accounts
        GROUP BY acc_type;

    And the following is an MV, mv_bal, for this query:

        CREATE MATERIALIZED VIEW mv_bal
        REFRESH ON DEMAND AS
        SELECT acc_type, SUM(cleared_bal) totbal
        FROM accounts
        GROUP BY acc_type;
  • 10. Now suppose a user wants to get the total of all account balances for the account type C and issues the following query:

        SELECT SUM(cleared_bal)
        FROM accounts
        WHERE acc_type = 'C';

    Because the mv_bal MV already contains the totals by account type, the user could have gotten this information directly from the MV, by issuing the following:

        SELECT totbal
        FROM mv_bal
        WHERE acc_type = 'C';

    This query against the mv_bal MV would have returned results much more quickly than the query against the accounts table, because querying the MV does not query the source tables.

    To keep the data in sync, the MV is refreshed from time to time, either manually or automatically. There are two ways to refresh data in MVs. In a complete refresh, the MV is completely wiped clean and then repopulated with data from the source tables. In some cases, however, when the source tables have changed very little, it is possible to refresh the MV only for changed records on the source tables, a process known as fast refresh. To use fast refresh, you must have created the MV as fast-refreshable. Because it updates only changed records, fast refresh is faster than complete refresh. (See the Oracle Database Data Warehousing Guide for more information on refreshing MVs.)

    A materialized view can be read-only, updatable, or writeable. Users cannot perform data manipulation language (DML) statements on read-only materialized views, but they can perform DML on updatable and writeable materialized views.

    ===============================================================

    What is SQL*Loader and what is it used for?

    SQL*Loader is a bulk loader utility used for moving data from external files into the Oracle database.
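The complete-refresh behaviour described on this slide can be imitated outside Oracle. The sketch below is an approximation under clear assumptions: SQLite has no materialized views, so `mv_bal` is an ordinary summary table maintained by hand, and the helper functions `complete_refresh` and `total_for` as well as the `acc_no` column are hypothetical. It reuses the `accounts`, `acc_type`, `cleared_bal` and `totbal` names from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (acc_no INTEGER, acc_type TEXT, cleared_bal REAL)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [(1, "C", 100.0), (2, "C", 250.0), (3, "S", 40.0)],
)

def complete_refresh(conn):
    """Imitate a complete refresh: wipe the summary table and repopulate it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mv_bal (acc_type TEXT PRIMARY KEY, totbal REAL)")
    conn.execute("DELETE FROM mv_bal")
    conn.execute("""
        INSERT INTO mv_bal (acc_type, totbal)
        SELECT acc_type, SUM(cleared_bal) FROM accounts GROUP BY acc_type
    """)
    conn.commit()

def total_for(conn, acc_type):
    """Read a precomputed total from the summary table, not from accounts."""
    row = conn.execute(
        "SELECT totbal FROM mv_bal WHERE acc_type = ?", (acc_type,)).fetchone()
    return row[0] if row else None

complete_refresh(conn)
before = total_for(conn, "C")

# A new account arrives in the source table; the summary is now stale.
conn.execute("INSERT INTO accounts VALUES (4, 'C', 50.0)")
stale = total_for(conn, "C")

# Refreshing on demand brings the summary back in sync.
complete_refresh(conn)
after = total_for(conn, "C")

print(before, stale, after)
```

The stale read in the middle is the trade-off of refreshing on demand: queries are fast, but they only see changes made before the last refresh.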
  • 11. Is there a SQL*Unloader to download data to a flat file?

    Oracle does not supply any data unload utilities. Here is a workaround:
    Using SQL*Plus: You can use SQL*Plus to select and format your data and then spool it to a file.

    Skipping unwanted data?

    One can skip unwanted header records or continue an interrupted load (for example, if you run out of space) by specifying the "SKIP=n" keyword, where "n" specifies the number of logical rows to skip.

        sqlldr userid=ora_id/ora_passwd control=control_file_name.ctl skip=4

    Further questions:
    What is data purging?
    Explain Control-M jobs in detail. How are they executed?
    What is the difference between a W/H and an OLTP application?
    What is the difference between DSS and OLTP?
    What is an operational data store (ODS)?
    What is Snowflake Schema design in a database?
    What is the ETL process in data warehousing?
    What are the advantages of de-normalized data?

    What is the difference between choosing a multidimensional database and a relational database?

    Multidimensional database: OLAP (Online Analytical Processing)
    Relational database: OLTP (Online Transaction Processing)
  • 12. What is the difference between E-R modelling and dimensional modelling? And what are semi-additive facts?

    ER modelling:
    - Focused on how efficiently data can be processed (insert, update, delete)
    - Minimizes data redundancy (ideally to zero)

    Dimensional modelling:
    - Focused on how efficiently data can be retrieved (for example, by reporting and analysis tools)
    - Allows many data redundancies
    - Consists of fact and dimension tables

    What is the difference between an aggregate table and a materialized view?

    Aggregate tables are pre-computed totals in the form of a hierarchical multidimensional structure. A materialized view is a database object which caches the query result in a concrete table and updates it from the original database table from time to time. Aggregate tables are used to speed up query computation, whereas materialized views speed up data retrieval.

    How many clustered indexes can you create for a table in a DWH?

    You can have only one clustered index per table.

    ==========================================================
  • 13. Views

    A view takes the output of a query and makes it appear like a virtual table.
    All operations performed on a view affect data in the base table and so are subject to the integrity constraints and triggers of the base table.
    A view can also be used to improve security by restricting access to a predetermined set of rows or columns.
    One view can be based on another, and a view can also join a view with a table (or use GROUP BY or UNION).

    Read-Only vs Updatable Views

    The data dictionary views ALL_UPDATABLE_COLUMNS, DBA_UPDATABLE_COLUMNS, and USER_UPDATABLE_COLUMNS indicate which view columns are updatable.
    An updatable view lets you insert, update, and delete rows in the view and propagate the changes to the target master table.
    In order to be updatable, a view cannot contain any of the following constructs: set operators or DISTINCT, an aggregate or analytic function, a GROUP BY, ORDER BY, CONNECT BY, or START WITH clause, a subquery (or collection expression) in a SELECT list, or finally (with some exceptions) a JOIN.
    Views that are not updatable can be modified using an INSTEAD OF trigger.

    Materialized Views

    Materialized views are schema objects that can be used to summarize, precompute, replicate, and distribute data.
    The existence of a materialized view is transparent to SQL, but when used for query rewrite it improves the performance of SQL execution.
    MVs are used mostly for performance improvement, and they help query rewrite. In short, if you have an MV defined as "SELECT region_id, COUNT(*) FROM sales GROUP BY region_id" and the same query is fired on the Oracle database, Oracle will automatically rewrite the query to read from the MV instead of the sales table. In a DW environment this is a big performance improvement. Some parameters (such as QUERY_REWRITE_ENABLED) need to be set for this to happen.
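The ordinary (non-materialized) view behaviour described at the start of this slide can be tried out with Python's built-in sqlite3 module. This is only a sketch: the `employees` table and the `v_sales_staff` view are hypothetical, and SQLite differs from Oracle in that its views are always read-only unless INSTEAD OF triggers are defined on them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL);
INSERT INTO employees VALUES
    (1, 'Ann', 'SALES', 5000), (2, 'Bob', 'SALES', 4000), (3, 'Cy', 'HR', 4500);

-- The view exposes only some rows (SALES) and some columns (no salary),
-- which is how a view can be used to restrict access.
CREATE VIEW v_sales_staff AS
    SELECT emp_id, name FROM employees WHERE dept = 'SALES';
""")

# Column names visible through the view.
cursor = conn.execute("SELECT * FROM v_sales_staff")
view_columns = [col[0] for col in cursor.description]
rows_before = cursor.fetchall()

# A view stores no data of its own: a change to the base table
# is visible the next time the view is queried.
conn.execute("UPDATE employees SET dept = 'SALES' WHERE emp_id = 3")
rows_after = conn.execute(
    "SELECT emp_id, name FROM v_sales_staff ORDER BY emp_id").fetchall()

print(view_columns)
print(rows_before)
print(rows_after)
```

Contrast this with a materialized view, which keeps its own stored copy of the result and therefore needs refreshing.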
  • 14. An MV can undergo fast refresh. In short, if I have 10 million rows in the fact table and I add 500 rows, then by making use of MV logs Oracle will do a fast refresh on the MV with the extra 500 rows only.

    A materialized view provides indirect access to table data by storing the results of a query in a separate schema object, unlike an ordinary view, which does not take up any storage space or contain any data.
    An updatable materialized view lets you insert, update, and delete.
    You can define a materialized view on a base table, partitioned table or view, and you can define indexes on a materialized view.
    A materialized view can be stored in the same database as its base table(s) or in a different database.
    A materialized view log is a schema object that records changes to a master table's data so that a materialized view defined on the master table can be refreshed incrementally.

    ===================================================

    Synonyms

    A synonym is an alias for any table, view, materialized view, sequence, procedure, function, or package.
    A public synonym is owned by the user group PUBLIC, and every user in a database can access it.
    A private synonym is in the schema of a specific user, who has control over its availability to others.

    Synonyms are used to:
    - Mask the real name and owner of a schema object
    - Provide global (public) access to a schema object
    - Provide location transparency for tables, views, or program units of a remote database
    - Simplify SQL statements for database users

    e.g. to query the table PATIENT_REFERRALS with SQL:
  • 15.     SELECT * FROM MySchema.PATIENT_REFERRALS;

    A public synonym can be created with:

        CREATE PUBLIC SYNONYM referrals FOR MySchema.PATIENT_REFERRALS;

    After the public synonym is created, you can query with a simple SQL statement:

        SELECT * FROM referrals;
