1. Data Warehouse
Data Warehouse Architecture
Data Warehouse definition
A data warehouse is a:
1. subject-oriented
2. integrated
3. time-varying
4. non-volatile
collection of data in support of the management's decision-making process.
A data warehouse is a centralized repository that stores data from multiple information sources and
transforms them into a common, multidimensional data model for efficient querying and analysis.
2. OLTP vs. OLAP
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general, we can assume that
OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.
- OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions
(INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is on very fast query processing,
maintaining data integrity in multi-access environments, and effectiveness measured by the number of
transactions per second. An OLTP database holds detailed and current data, and the schema used to store
transactional databases is the entity model (usually 3NF).
- OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries
are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness
measure. OLAP applications are widely used by Data Mining techniques. An OLAP database holds
aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).
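The contrast between the two workloads can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `orders` table, its columns, and the sample rows are hypothetical, and a real OLAP system would normally query a separate warehouse rather than the operational table.

```python
import sqlite3

# In-memory database standing in for an operational system (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT, order_month TEXT, amount REAL)"
)

# OLTP-style work: many short transactions, each touching a few rows.
# "with conn" commits on success and rolls back on error.
for customer, month, amount in [
    ("alice", "2024-01", 120.0),
    ("bob", "2024-01", 80.0),
    ("alice", "2024-02", 50.0),
]:
    with conn:
        conn.execute(
            "INSERT INTO orders (customer, order_month, amount) VALUES (?, ?, ?)",
            (customer, month, amount),
        )
with conn:
    conn.execute("UPDATE orders SET amount = 90.0 WHERE customer = 'bob'")

# OLAP-style work: one read-only query that aggregates over many rows.
monthly_totals = conn.execute(
    "SELECT order_month, COUNT(*), SUM(amount) "
    "FROM orders GROUP BY order_month ORDER BY order_month"
).fetchall()
print(monthly_totals)
```

The writes are small and frequent, while the analytical query scans the whole table once and returns a handful of aggregated rows.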
The following table summarizes the major differences between OLTP and OLAP system design.

| | OLTP System: Online Transaction Processing (Operational System) | OLAP System: Online Analytical Processing (Data Warehouse) |
| --- | --- | --- |
| Source of data | Operational data; OLTPs are the original source of the data. | Consolidation data; OLAP data comes from the various OLTP databases. |
| Purpose of data | To control and run fundamental business tasks | To help with planning, problem solving, and decision support |
| What the data reveals | A snapshot of ongoing business processes | Multi-dimensional views of various kinds of business activities |
| Inserts and Updates | Short and fast inserts and updates initiated by end users | Periodic long-running batch jobs refresh the data |
| Queries | Relatively standardized and simple queries returning relatively few records | Often complex queries involving aggregations |
| Processing Speed | Typically very fast | Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes |
| Space Requirements | Can be relatively small if historical data is archived | Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP |
| Database Design | Highly normalized with many tables | Typically de-normalized with fewer tables; use of star and/or snowflake schemas |
| Backup and Recovery | Backup religiously; operational data is critical to run the business, data loss is likely to entail significant monetary loss and legal liability | Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method |
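To make the star schema mentioned under database design more concrete, here is a minimal sketch using Python's sqlite3 module. The table names (`fact_sales`, `dim_product`, `dim_date`), the columns, and the figures are all assumptions chosen for illustration, not part of any particular product.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
-- Dimension tables describe the "who/what/when" of each fact.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- The fact table holds the measures and points at the dimensions.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    quantity INTEGER,
    revenue REAL
);
-- Indexes on the fact table's foreign keys may speed up the joins below.
CREATE INDEX idx_sales_product ON fact_sales(product_id);
CREATE INDEX idx_sales_date ON fact_sales(date_id);
""")
dw.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "Laptop", "Hardware"), (2, "Mouse", "Hardware"), (3, "Editor", "Software")],
)
dw.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [(1, 2023, 12), (2, 2024, 1)])
dw.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [(1, 1, 2, 2000.0), (2, 1, 10, 250.0), (3, 2, 5, 500.0), (1, 2, 1, 1000.0)],
)

# A typical analytical query: join the fact table to its dimensions and aggregate.
revenue_by_category_year = dw.execute("""
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
print(revenue_by_category_year)
```

One central fact table surrounded by a few dimension tables keeps analytical joins shallow, which is the usual motivation for de-normalizing compared with a 3NF operational schema.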
3. What is Business Intelligence?
Business Intelligence (BI) is a technology infrastructure for gaining maximum information from available
data for the purpose of improving business processes. Typical BI infrastructure components are
software solutions for gathering, cleansing, integrating, analyzing, and sharing data. Business
Intelligence produces analysis and provides credible information to help make effective, high-quality
business decisions.
The most common kinds of Business Intelligence systems are:
EIS - Executive Information Systems
DSS - Decision Support Systems
MIS - Management Information Systems
GIS - Geographic Information Systems
OLAP - Online Analytical Processing and multidimensional analysis
CRM - Customer Relationship Management
Business Intelligence systems are based on Data Warehouse technology. A Data Warehouse (DW) gathers
information from a wide range of a company's operational systems, and Business Intelligence systems are
built on top of it. Data loaded into a DW is usually well integrated and cleaned, which makes it possible
to produce credible information that reflects the so-called "one version of the truth".
4. Business Intelligence tools
The most popular BI tools on the market are:
Oracle - Siebel Business Analytics Applications
SAS - Business Intelligence
SAP - BusinessObjects XI
IBM - Cognos 8 BI
Oracle - Hyperion System 9 BI+
Microsoft - Analysis Services
MicroStrategy - Dynamic Enterprise Dashboards
Pentaho - Open BI Suite
Information Builders - WebFOCUS Business Intelligence
QlikTech - QlikView
TIBCO Spotfire - Enterprise Analytics
Sybase - InfoMaker
KXEN - IOLAP
SPSS - ShowCase
5. ETL tools
List of the most popular ETL tools:
Informatica - PowerCenter
IBM - WebSphere DataStage (formerly known as Ascential DataStage)
SAP - BusinessObjects Data Integrator
IBM - Cognos Data Manager (formerly known as Cognos DecisionStream)
Microsoft - SQL Server Integration Services
Oracle - Data Integrator (formerly known as Sunopsis Data Conductor)
SAS - Data Integration Studio
Oracle - Warehouse Builder
Ab Initio
Information Builders - Data Migrator
Pentaho - Pentaho Data Integration
Embarcadero Technologies - DT/Studio
IKAN - ETL4ALL
IBM - DB2 Warehouse Edition
Pervasive - Data Integrator
ETL Solutions Ltd. - Transformation Manager
Group 1 Software (Sagent) - DataFlow
Sybase - Data Integrated Suite ETL
Talend - Talend Open Studio
Expressor Software - Expressor Semantic Data Integration System
Elixir - Elixir Repertoire
OpenSys - CloverETL
6. ETL process
ETL (Extract, Transform and Load) is a process in data warehousing responsible for pulling data out of
the source systems and placing it into a data warehouse. ETL involves the following tasks:
- Extracting the data from source systems (SAP, ERP, other operational systems). Data from different
source systems is converted into one consolidated data warehouse format that is ready for
transformation processing.
- Transforming the data, which may involve the following tasks:
applying business rules (so-called derivations, e.g., calculating new measures and dimensions),
cleaning (e.g., mapping NULL to 0 or "Male" to "M" and "Female" to "F", etc.),
filtering (e.g., selecting only certain columns to load),
splitting a column into multiple columns and vice versa,
joining together data from multiple sources (e.g., lookup, merge),
transposing rows and columns,
applying any kind of simple or complex data validation (e.g., if the first 3 columns in a row are
empty, then reject the row from processing).
- Loading the data into a data warehouse, data repository, or other reporting applications.
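The three steps above can be sketched in a few lines of Python using only the standard library. Everything specific here is an assumption made for illustration: the inline CSV stands in for a source system, the `customers_dw` table stands in for the warehouse, and the column names are hypothetical. Real ETL tools add scheduling, logging, and error handling on top of this.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this comes from an operational system.
SOURCE_CSV = """id,name,gender,amount,notes
1,Ann,Female,100,first order
2,Bob,Male,,no amount recorded
,,,55,row with empty leading columns
3,Eve,Female,40,repeat customer
"""

GENDER_CODES = {"Male": "M", "Female": "F"}


def extract(text):
    """Extract: read raw rows from the source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: validate, clean, and filter the extracted rows."""
    cleaned = []
    for row in rows:
        # Validation: reject the row if its first 3 columns are all empty.
        if not any(row[col] for col in ("id", "name", "gender")):
            continue
        cleaned.append({
            "id": int(row["id"]) if row["id"] else None,
            "name": row["name"],
            # Cleaning: map "Male"/"Female" to "M"/"F".
            "gender": GENDER_CODES.get(row["gender"], row["gender"]),
            # Cleaning: map a missing amount to 0.
            "amount": float(row["amount"]) if row["amount"] else 0.0,
            # Filtering: the "notes" column is deliberately not carried over.
        })
    return cleaned


def load(rows, conn):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers_dw "
        "(id INTEGER, name TEXT, gender TEXT, amount REAL)"
    )
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT INTO customers_dw VALUES (:id, :name, :gender, :amount)", rows
        )


warehouse = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), warehouse)
loaded_rows = warehouse.execute(
    "SELECT id, name, gender, amount FROM customers_dw ORDER BY id"
).fetchall()
print(loaded_rows)
```

Keeping extract, transform, and load as separate functions mirrors the task list: each stage can be tested or replaced on its own, for example by swapping the CSV reader for a database query.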