# Basics of Data Warehousing


Steps:
• Requirement Gathering
• Physical Environment Setup
• Data Modeling
• ETL
• OLAP Cube Design
• Front End Development
• Performance Tuning
• Quality Assurance
• Rolling out to Production
• Production Maintenance
• Incremental Enhancements

Components of Dimensional Data Model:

Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.

Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.

Fact Table: A table that contains the measures of interest. For example, sales amount would be such a measure.

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another.

In designing data models for data warehouses and data marts, the most commonly used schema types are the Star Schema and the Snowflake Schema.

Star Schema: In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table. Fact tables in a star schema are mostly in third normal form (3NF), but dimension tables are de-normalized, typically in second normal form (2NF).

Snowflake Schema: The snowflake schema (sometimes called snowflake join schema) is a more complex schema than the star schema because the tables which describe the dimensions are normalized.

The main advantage of the snowflake schema is reduced disk storage, and each join touches a smaller lookup table; queries, however, generally need more joins than in a star schema. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.
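The star schema described above can be sketched with a small runnable example. This is a minimal illustration, not a production design: an in-memory SQLite database stands in for the warehouse, and the table names, column names, and sample rows (`dim_time`, `dim_product`, `fact_sales`) are assumptions made up for the example. The query rolls the sales amount measure up the Time hierarchy to the Quarter level.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the warehouse database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension (lookup) tables: descriptive attributes, one row per member.
    CREATE TABLE dim_time (
        time_id INTEGER PRIMARY KEY,
        year INTEGER, quarter TEXT, month TEXT, day TEXT
    );
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT
    );
    -- Fact table: foreign keys to the dimensions plus the measure.
    CREATE TABLE fact_sales (
        time_id INTEGER REFERENCES dim_time(time_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        sales_amount REAL
    );
""")
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?, ?, ?)", [
    (1, 2023, "Q1", "January", "2023-01-15"),
    (2, 2023, "Q1", "February", "2023-02-10"),
    (3, 2023, "Q2", "April", "2023-04-05"),
])
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "Widget"), (2, "Gadget")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", [
    (1, 1, 100.0), (2, 1, 50.0), (2, 2, 25.0), (3, 2, 200.0),
])

# Roll the measure up the Time hierarchy to the Quarter level.
sales_by_quarter = conn.execute("""
    SELECT t.year, t.quarter, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_time AS t ON t.time_id = f.time_id
    GROUP BY t.year, t.quarter
    ORDER BY t.year, t.quarter
""").fetchall()
print(sales_by_quarter)
```

Note that the fact table joins only to its dimension tables; in a snowflake variant, `dim_time` would itself be split into further lookup tables (for example one per hierarchy level), which adds joins to the same query.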
Dimensions:

What are the types of dimension tables?

There are three types of dimensions: Conformed Dimensions, Junk Dimensions, and Degenerate Dimensions.

Conformed Dimension: A dimension that has exactly the same meaning and content when referred to from different fact tables. A conformed dimension can be shared by multiple fact tables or multiple data marts. Examples are the time dimension, the customer dimension, and the product dimension.

Junk Dimensions: Occasionally, there are miscellaneous attributes, such as yes/no attributes or comment attributes, that don't fit into tight star schemas. Rather than discarding flag fields and yes/no attributes, place them in a junk dimension. In addition, you can handle comment and open-ended text attributes by creating a text-based junk dimension. A junk dimension is a convenient grouping of flags and indicators. It's helpful, but not absolutely required, if there's a positive correlation among the values.

What is a degenerate dimension?

I have a fact table that stores insurance contracts, and one important dimension is the year signed. The fact table has many columns, such as CUSTOMER_ID and CONTRACT_ID, and one column YEAR_SIGNED as varchar(4). CUSTOMER_ID is the foreign key column to DIM_CUSTOMER, which holds all the customer data (name, address, ...). CONTRACT_ID relates to DIM_CONTRACT, which holds all the contract-specific information. And YEAR_SIGNED? Should I really have a DIM_YEAR_SIGNED that will have one column only? What other attributes should a year have? Therefore, we do not create an explicit dimension table, and we call the YEAR_SIGNED column a degenerate dimension.

A degenerate dimension is a dimension key stored in the fact table that is not connected to any dimension table; that is, it corresponds to a dimension table that would have no attributes.
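The insurance-contract example above can be sketched as follows. This is only an illustration under assumptions: SQLite stands in for the warehouse, `fact_contract` is a hypothetical name for the fact table, `premium` is a hypothetical measure added so there is something to sum, and DIM_CONTRACT is omitted for brevity. The point is that YEAR_SIGNED is grouped on directly in the fact table, with no join to a year dimension.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the warehouse database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    -- year_signed lives only in the fact table: there is no DIM_YEAR_SIGNED.
    CREATE TABLE fact_contract (
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        contract_id INTEGER,
        year_signed VARCHAR(4),
        premium REAL              -- hypothetical measure for the example
    );
""")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO fact_contract VALUES (?, ?, ?, ?)", [
    (1, 10, "2021", 500.0),
    (1, 11, "2022", 300.0),
    (2, 12, "2022", 700.0),
])

# The degenerate dimension is used like any other dimension in GROUP BY,
# but without joining to a lookup table.
contracts_by_year = conn.execute("""
    SELECT year_signed, COUNT(*), SUM(premium)
    FROM fact_contract
    GROUP BY year_signed
    ORDER BY year_signed
""").fetchall()
print(contracts_by_year)
```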
• Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive facts. The first example presented here is a cumulative fact table.
• Snapshot: This type of fact table describes the state of things at a particular instant in time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

==================================================================

What are factless facts, and in which scenario would you use such fact tables?

Factless Fact: Some very useful fact tables don't have any facts at all.

FIGURE 1: A factless fact table for recording student attendance on a daily basis at a college. The five dimension tables contain rich descriptions of dates, students, courses, teachers, and facilities. There are no additive, numeric facts.

Such a table can answer questions like: Which classes were the most heavily attended? Which classes were the most consistently attended? Which teachers taught the most students?

Tools:
Scalability: How can the system grow as your data storage needs grow?
Parallel Processing Support:
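Returning to the factless fact table for student attendance described above, the idea can be sketched in a few lines. This is a simplified illustration: SQLite stands in for the warehouse, only two of the five dimensions are modeled, and all table names, column names, and rows are made up. The fact table holds nothing but dimension keys, and questions such as "which classes were the most heavily attended?" are answered by counting rows.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the warehouse database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_student (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_course (course_id INTEGER PRIMARY KEY, course_name TEXT);
    -- Factless fact table: only keys, no numeric measure columns.
    CREATE TABLE fact_attendance (
        date_key TEXT,
        student_id INTEGER REFERENCES dim_student(student_id),
        course_id INTEGER REFERENCES dim_course(course_id)
    );
""")
conn.executemany("INSERT INTO dim_student VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO dim_course VALUES (?, ?)",
                 [(1, "Algebra"), (2, "History")])
conn.executemany("INSERT INTO fact_attendance VALUES (?, ?, ?)", [
    ("2024-03-01", 1, 1),
    ("2024-03-01", 2, 1),
    ("2024-03-01", 1, 2),
    ("2024-03-02", 1, 1),
])

# Each row records one attendance event, so COUNT(*) is the "measure".
attendance_by_course = conn.execute("""
    SELECT c.course_name, COUNT(*) AS attendances
    FROM fact_attendance AS a
    JOIN dim_course AS c ON c.course_id = a.course_id
    GROUP BY c.course_name
    ORDER BY attendances DESC, c.course_name
""").fetchall()
print(attendance_by_course)
```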
Popular Relational Databases
• Oracle, Microsoft SQL Server, IBM DB2, Teradata, Sybase, MySQL

Popular OS Platforms
• Linux
• FreeBSD
• Microsoft Windows

ETL Tools:
• IBM WebSphere Information Integration (Ascential DataStage)
• Ab Initio
• Informatica

OLAP Tool Functionalities
1. MOLAP: In this type of OLAP, a cube is aggregated from the relational data source (data warehouse). When a user generates a report request, the MOLAP tool can generate the report quickly because all data is already pre-aggregated within the cube.
2. ROLAP: In this type of OLAP, instead of pre-aggregating everything into a cube, the ROLAP engine essentially acts as a smart SQL generator. The ROLAP tool typically comes with a Designer piece, where the data warehouse administrator can specify the relationship between the relational tables, as well as how dimensions, attributes, and hierarchies map to the underlying database tables.

Popular Tools
• Business Objects
• Cognos
• Hyperion
• Microsoft Analysis Services
• MicroStrategy

Reporting Tool
• Business Objects (Crystal Reports)
• Cognos
• Actuate

==================================================================

Questions?

What is MOLAP and ROLAP? What is the difference between them?
MOLAP and ROLAP stand for multidimensional online analytical processing and relational online analytical processing.

In MOLAP, data is stored in the form of multidimensional cubes. The advantage of this mode is that it provides excellent query performance, because the cubes are built for fast data retrieval. All calculations are pre-generated when the cube is created and can be easily applied while querying data.

In ROLAP, the data is stored in relational databases; this model gives the appearance of traditional OLAP's slicing and dicing functionality. The advantage of this model is that it can handle a large amount of data and can leverage all the functionalities of the relational database.

MOLAP has aggregated values stored in the cube. Since the data is aggregated, query performance is fast.

ROLAP has data stored in relational databases. Here a query has to access the database to retrieve the data every time, so performance is slow compared to MOLAP. The data size is larger than in MOLAP.

===============================================================

What is BCP?

Bulk Copy Program. Two plugins are automatically installed with DataStage:
1. BCPLoad plugin: used to bulk load data into a single table in MS SQL Server.
2. OraBulk plugin

What is Data Mining?

Data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Analysts look for patterns hidden in the data.

How can one connect two fact tables? Is it possible? How?

Fact tables cannot be connected directly; they are connected by means of conformed dimensions. Example: We_site_id.

When should you use a STAR and when a SNOW-FLAKE schema?

STAR SCHEMA:
1. If PERFORMANCE is the priority, then go for the star schema, since here the dimension tables are DE-NORMALIZED.
2. Usually the star schema is the best option for end users due to its simple design and navigation.

The snowflake schema (sometimes called snowflake join schema) is a more complex schema than the star schema because the tables which describe the dimensions are normalized. In a snowflake schema, one dimension table is connected to another dimension table, and so on.

If a dimension is very sparse (i.e. most of the possible values for the dimension have no data) and/or a dimension has a very long list of attributes which may be used in a query, the dimension table may occupy a significant proportion of the database, and snowflaking may be appropriate.

SNOW-FLAKE SCHEMA: If MEMORY SPACE is the priority, then go for the snowflake schema, since here the dimension tables are NORMALIZED.

What is the difference between OLAP, ROLAP, MOLAP and HOLAP?

MOLAP (Multidimensional OLAP) provides analysis of data stored in a multidimensional data cube.

ROLAP (Relational OLAP) provides multidimensional analysis of data stored in a relational database (RDBMS).

HOLAP (Hybrid OLAP), a combination of ROLAP and MOLAP, can provide multidimensional analysis of data stored in both a multidimensional database and a relational database (RDBMS).

DOLAP (Desktop OLAP or Database OLAP) provides multidimensional analysis locally on the client machine, on data collected from relational or multidimensional database servers.

What is the difference between an aggregate table and a fact table? How do you load these two tables?

Fact tables contain millions of records, and retrieving records from a fact table takes time, whereas an aggregate table contains limited, summarized data from the required tables, so retrieving data from it takes less time.
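The relationship between a fact table and an aggregate table can be sketched as below. This is a minimal illustration with assumed names and data (`fact_sales`, `agg_sales_by_store`), using SQLite in place of the warehouse: the aggregate table is loaded once from the fact table, and later queries read the few summary rows instead of scanning the detail rows.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the warehouse database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sale_date TEXT, store TEXT, amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", [
    ("2024-01-01", "North", 10.0),
    ("2024-01-02", "North", 15.0),
    ("2024-01-01", "South", 7.5),
])

# Load the aggregate table from the fact table (one row per store).
conn.execute("""
    CREATE TABLE agg_sales_by_store AS
    SELECT store, SUM(amount) AS total_amount
    FROM fact_sales
    GROUP BY store
""")

# Same answer, but the second query reads far fewer rows on real data volumes.
from_fact = conn.execute("""
    SELECT store, SUM(amount) FROM fact_sales GROUP BY store ORDER BY store
""").fetchall()
from_agg = conn.execute("""
    SELECT store, total_amount FROM agg_sales_by_store ORDER BY store
""").fetchall()
print(from_fact)
print(from_agg)
```

The aggregate table is only as current as its last load, so in practice it would be reloaded or incrementally updated after each load of the fact table.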
Which kind of index is preferred in DWH?

A bitmap index is usually the best choice, because a B-tree index suits columns with unique or nearly unique values (e.g. empid), while a bitmap index suits columns with few distinct, often-repeated values (e.g. gender m/f).

What are CUBES?

Cubes divide the data into subsets that are defined by dimensions.

Cube: mscsCampaign
Dimensions: Advertiser, DateHour, Events, Page Group, Site, UserType
Measures: Count Events, Distinct Users, OrdImpLeaf

Cube: mscsCampaignEvents
Dimensions: Advertiser, DateHour, Events, Page Group, Site, UserType
Measures: Count Events, Distinct Users

===============================================================

What are materialized views? How can they be used in a data warehouse to increase performance?

MVs are segments similar to tables, in which the output of queries is stored in the database.

The following is a common query at Acme Bank:

    SELECT acc_type, SUM(cleared_bal) totbal
    FROM accounts
    GROUP BY acc_type;

And the following is an MV, mv_bal, for this query:

    CREATE MATERIALIZED VIEW mv_bal
    REFRESH ON DEMAND AS
    SELECT acc_type, SUM(cleared_bal) totbal
    FROM accounts
    GROUP BY acc_type;
Now suppose a user wants to get the total of all account balances for the account type C and issues the following query:

    SELECT SUM(cleared_bal)
    FROM accounts
    WHERE acc_type = 'C';

Because the mv_bal MV already contains the totals by account type, the user could have gotten this information directly from the MV, by issuing the following:

    SELECT totbal
    FROM mv_bal
    WHERE acc_type = 'C';

This query against the mv_bal MV would have returned results much more quickly than the query against the accounts table, because querying the MV does not query the source tables.

To keep the data in sync, the MV is refreshed from time to time, either manually or automatically. There are two ways to refresh data in MVs. In one of them, the MV is completely wiped clean and then repopulated with data from the source tables, a process known as complete refresh. In some cases, however, when the source tables may have changed very little, it is possible to refresh the MV only for changed records on the source tables, a process known as fast refresh. To use fast refresh, however, you must have created the MV as fast-refreshable. Because it updates only changed records, fast refresh is faster than complete refresh. (See the Oracle Database Data Warehousing Guide for more information on refreshing MVs.)

A materialized view can be either read-only, updatable, or writeable. Users cannot perform data manipulation language (DML) statements on read-only materialized views, but they can perform DML on updatable and writeable materialized views.

===============================================================

What is SQL*Loader and what is it used for?

SQL*Loader is a bulk loader utility used for moving data from external files into the Oracle database.
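The complete-refresh behaviour of the mv_bal materialized view described above can be imitated in a short sketch. This is an emulation under stated assumptions, not Oracle's mechanism: SQLite has no materialized views, so a plain table named `mv_bal` stands in for the MV, and `complete_refresh` and `total_for` are hypothetical helper functions written for this example. It shows that the summary is fast to read but goes stale until the next refresh.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the Oracle database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (acc_type TEXT, cleared_bal REAL);
    -- Plain table standing in for the materialized view mv_bal.
    CREATE TABLE mv_bal (acc_type TEXT PRIMARY KEY, totbal REAL);
""")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("C", 100.0), ("C", 50.0), ("S", 20.0)])


def complete_refresh(db):
    """Wipe mv_bal and repopulate it from the source table."""
    with db:  # commit the delete and the reload together
        db.execute("DELETE FROM mv_bal")
        db.execute("""
            INSERT INTO mv_bal
            SELECT acc_type, SUM(cleared_bal) FROM accounts GROUP BY acc_type
        """)


def total_for(db, acc_type):
    """Read a precomputed total from the summary table."""
    row = db.execute("SELECT totbal FROM mv_bal WHERE acc_type = ?",
                     (acc_type,)).fetchone()
    return row[0] if row else None


complete_refresh(conn)
initial_total = total_for(conn, "C")

# A new source row is not visible in the summary until the next refresh.
conn.execute("INSERT INTO accounts VALUES ('C', 25.0)")
stale_total = total_for(conn, "C")

complete_refresh(conn)
fresh_total = total_for(conn, "C")
print(initial_total, stale_total, fresh_total)
```

A fast refresh would instead apply only the changed source rows to the summary, which is why it needs a record of those changes.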
What is the difference between E-R modelling and dimensional modelling? And what are semi-additive facts?

ER modeling:
- Focused on making data efficient to process (insert, update, delete)
- Minimizes data redundancy (ideally to zero)

Dimensional modeling:
- Focused on making data efficient to retrieve (for example, by reporting and analysis tools)
- Contains many data redundancies
- Consists of fact and dimension tables

Semi-additive facts are measures that can be summed across some dimensions but not others; an account balance, for example, can be added up across accounts but not across time.

What is the difference between an aggregate table and a materialized view?

Aggregate tables hold pre-computed totals in a hierarchical multidimensional structure. A materialized view is a database object which caches a query result in a concrete table and updates it from the original database tables from time to time. Aggregate tables are used to speed up query computation, whereas materialized views speed up data retrieval.

How many clustered indexes can you create for a table in DWH?

You can have only one clustered index per table.

==========================================================
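The semi-additive facts asked about above can be illustrated with a small sketch. The snapshot table `fact_balance` and its rows are assumptions made up for the example, and SQLite stands in for the warehouse. A balance can be summed across accounts on one date, but summing one account's balance across dates gives a number with no business meaning; the latest snapshot (or an average) is used instead.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the warehouse database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_balance (snapshot_date TEXT, account_id INTEGER, balance REAL)
""")
conn.executemany("INSERT INTO fact_balance VALUES (?, ?, ?)", [
    ("2024-01-31", 1, 100.0),
    ("2024-01-31", 2, 200.0),
    ("2024-02-29", 1, 150.0),
    ("2024-02-29", 2, 250.0),
])

# Valid: add balances across the account dimension for each snapshot date.
total_by_date = conn.execute("""
    SELECT snapshot_date, SUM(balance)
    FROM fact_balance
    GROUP BY snapshot_date
    ORDER BY snapshot_date
""").fetchall()

# Misleading: adding one account's balance across time double counts money.
summed_over_time = conn.execute(
    "SELECT SUM(balance) FROM fact_balance WHERE account_id = 1"
).fetchone()[0]

# Meaningful alternative along the time dimension: take the latest snapshot.
latest_balance = conn.execute("""
    SELECT balance FROM fact_balance
    WHERE account_id = 1
    ORDER BY snapshot_date DESC
    LIMIT 1
""").fetchone()[0]
print(total_by_date, summed_over_time, latest_balance)
```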
Views

A view takes the output of a query and makes it appear like a virtual table. All operations performed on a view will affect data in the base table and so are subject to the integrity constraints and triggers of the base table.

A view can also be used to improve security by restricting access to a predetermined set of rows or columns.

One view can be based on another; a view can also JOIN a view with a table (GROUP BY or UNION).

Read-Only vs Updatable Views

The data dictionary views ALL_UPDATABLE_COLUMNS, DBA_UPDATABLE_COLUMNS, and USER_UPDATABLE_COLUMNS indicate which view columns are updatable.

An updatable view lets you insert, update, and delete rows in the view and propagate the changes to the target master table.

In order to be updatable, a view cannot contain any of the following constructs: set or DISTINCT operators, an aggregate or analytic function, a GROUP BY, ORDER BY, CONNECT BY, or START WITH clause, a subquery (or collection expression) in a SELECT list, or finally (with some exceptions) a JOIN.

Views that are not updatable can be modified using an INSTEAD OF trigger.

Materialized Views

Materialized views are schema objects that can be used to summarize, precompute, replicate, and distribute data.

The existence of a materialized view is transparent to SQL, but when used for query rewrites it will improve the performance of SQL execution.

MVs are used mainly for performance improvement, and they enable query rewrite. In short, if you have an MV defined as "select region_id, count(*) from sales group by region_id" and the same query is fired on the Oracle database, Oracle will automatically rewrite the query to read from the MV instead of the sales table. In a DW environment this is a big performance improvement. There are some parameters which need to be set for this to happen.
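The row and column restriction that an ordinary view provides, described under Views above, can be sketched as follows. The `employees` table, the `sales_staff` view, and the sample rows are hypothetical, and SQLite stands in for the database; note that SQLite views are read-only, so this sketch does not demonstrate the updatable views discussed above. It does show that a view stores no data of its own: a new base-table row appears in the view immediately.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL
    );
    -- The view exposes only two columns and only the Sales rows.
    CREATE VIEW sales_staff AS
    SELECT emp_id, name FROM employees WHERE dept = 'Sales';
""")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", [
    (1, "Ann", "Sales", 5000.0),
    (2, "Bob", "HR", 4000.0),
])

cursor = conn.execute("SELECT * FROM sales_staff ORDER BY emp_id")
visible_columns = [col[0] for col in cursor.description]
rows_before = cursor.fetchall()

# The view is evaluated at query time, so new base rows show up at once.
conn.execute("INSERT INTO employees VALUES (3, 'Cy', 'Sales', 4500.0)")
rows_after = conn.execute(
    "SELECT * FROM sales_staff ORDER BY emp_id"
).fetchall()
print(visible_columns, rows_before, rows_after)
```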
An MV can undergo fast refresh. In short, if I have 10 million rows in the fact table and I add 500 rows, then by making use of MV logs Oracle will do a fast refresh on the MV with the extra 500 rows only.

A materialized view provides indirect access to table data by storing the results of a query in a separate schema object. An ordinary view, by contrast, does not take up any storage space or contain any data.

An updatable materialized view lets you insert, update, and delete rows.

You can define a materialized view on a base table, partitioned table, or view, and you can define indexes on a materialized view.

A materialized view can be stored in the same database as its base table(s) or in a different database.

A materialized view log is a schema object that records changes to a master table's data so that a materialized view defined on the master table can be refreshed incrementally.

===================================================

Synonyms

A synonym is an alias for any table, view, materialized view, sequence, procedure, function, or package.

A public synonym is owned by the user group PUBLIC, and every user in a database can access it.

A private synonym is in the schema of a specific user who has control over its availability to others.

Synonyms are used to:
- Mask the real name and owner of a schema object
- Provide global (public) access to a schema object
- Provide location transparency for tables, views, or program units of a remote database
- Simplify SQL statements for database users

e.g. to query the table PATIENT_REFERRALS with SQL:
    SELECT * FROM MySchema.PATIENT_REFERRALS;

    CREATE PUBLIC SYNONYM referrals FOR MySchema.PATIENT_REFERRALS;

After the public synonym is created, you can query with a simple SQL statement:

    SELECT * FROM referrals;