A data dictionary is an integral part of a database; it holds the metadata, i.e., data about data
Advantages of a Data Dictionary
Creating an informative and well-designed database
Identifying table structures and types
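As a minimal sketch of reading metadata from a data dictionary, the example below uses SQLite's built-in catalog (the `sqlite_master` table and `PRAGMA table_info`). The `customer` table and the `describe_table` helper are hypothetical, chosen only for illustration; other database products expose their dictionaries through different views.

```python
import sqlite3

# Hypothetical schema, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)

def describe_table(conn, table):
    """Return (column name, declared type, NOT NULL flag, primary-key flag) per column.

    The table name is interpolated directly, so pass trusted names only.
    """
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # PRAGMA table_info yields: cid, name, type, notnull, dflt_value, pk
    return [(r[1], r[2], bool(r[3]), bool(r[5])) for r in rows]

# List the tables recorded in the catalog, then describe one of them.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
)]
columns = describe_table(conn, "customer")
```

Here the table structure and column types come from the database's own metadata rather than from separate documentation.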
What is Normalization?
Normalization is a method for analyzing and reducing a relational database to its most streamlined form
Maximum data integrity
Best processing performance
Normalizing the Database
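A small sketch of what normalizing can look like in practice: customer details that are repeated on every order row are moved into their own structure, keyed by customer. The data, field names and the `normalize` helper are hypothetical and only illustrate the idea of removing redundancy.

```python
# Denormalized rows: customer details repeated on every order (hypothetical data).
orders_flat = [
    {"order_id": 1, "customer_id": 10, "customer_name": "Asha", "city": "Pune", "amount": 250.0},
    {"order_id": 2, "customer_id": 10, "customer_name": "Asha", "city": "Pune", "amount": 400.0},
    {"order_id": 3, "customer_id": 11, "customer_name": "Ravi", "city": "Delhi", "amount": 150.0},
]

def normalize(rows):
    """Split flat rows into a customers table and an orders table."""
    customers = {}
    orders = []
    for row in rows:
        # Each customer is stored once, keyed by customer_id.
        customers[row["customer_id"]] = {
            "customer_name": row["customer_name"],
            "city": row["city"],
        }
        # Orders keep only a reference (foreign key) to the customer.
        orders.append({
            "order_id": row["order_id"],
            "customer_id": row["customer_id"],
            "amount": row["amount"],
        })
    return customers, orders

customers, orders = normalize(orders_flat)
```

After the split, a change to a customer's city is made in one place instead of on every order row.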
What is Indexing?
An index is a database object used to improve the speed of data retrieval operations
Indexes can be created using one or more columns of a database table which are frequently used together
Providing the basis for rapid random lookups and efficient access of ordered records
Function-based indexes allow case-insensitive searches, i.e., matching regardless of upper or lower case.
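The points above can be sketched with SQLite, assuming a hypothetical `customer` table. The first index covers two columns that are queried together; the second is an index on an expression, which is SQLite's counterpart to a function-based index (available in SQLite 3.9.0 and later). Table, column and index names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT)"
)
conn.executemany(
    "INSERT INTO customer (last_name, first_name) VALUES (?, ?)",
    [("Sharma", "Anita"), ("SHARMA", "Vikram"), ("Iyer", "Meena")],
)

# Composite index on columns that are frequently used together.
conn.execute("CREATE INDEX idx_customer_name ON customer (last_name, first_name)")

# Index on an expression, so lookups on lower(last_name) can use the index.
conn.execute("CREATE INDEX idx_customer_last_lower ON customer (lower(last_name))")

def find_by_last_name(conn, last_name):
    """Case-insensitive lookup by last name; returns first names in order."""
    rows = conn.execute(
        "SELECT first_name FROM customer "
        "WHERE lower(last_name) = lower(?) ORDER BY first_name",
        (last_name,),
    )
    return [r[0] for r in rows]

matches = find_by_last_name(conn, "sharma")
index_names = sorted(
    r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
)
```

The query compares `lower(last_name)` on both sides, so "Sharma" and "SHARMA" are treated as the same value.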
Why a data warehouse?
Data - scattered, different versions, subtle differences
Poor data documentation
Requires Data transformation
Traditional data management approach is query driven, i.e., lazy and on-demand
Why a data warehouse? (cont’d)
Query driven approach has its problems
Delay in query processing
Unavailability of a data source
Need to filter and integrate results
Frequent queries are usually inefficient and expensive
Difficult to implement caching
Lack of standards
Need to compete with local processing resources
Data Warehouse Definition
Non-volatile collection of data
The data warehouse is organized around the key subjects (or high-level entities) of the enterprise. Major subjects include
The data housed in the data warehouse are defined using consistent
The data in the warehouse contain a time dimension so that they may be used as a historical record of the business
Data in the data warehouse are loaded and refreshed from operational systems, but cannot be updated by end-users
The Data Warehouse advantage
Data sources are distributed in many businesses
Different encoding of the same entities
A warehouse encompasses the full volume of data in a single unified schema
Managers need different views of the same data
Efficiently supports OLAP operations
The data warehouse advantage (cont’d...)
Improves data quality
Data from a source usually needs “cleaning”
The warehouse acts as a “cleaning buffer”
Thus, minimizes data error
There is clear ROI (Return on Investment) for organizations implementing a data warehouse
Quick and easy access to data
Extensive analysis of data for Decision making
Consolidated view of organizational data
Evolution of Data warehouse
What is a Data Warehouse? A Practitioner's Viewpoint
A single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use in a business context
[Figure: layered data warehouse architecture. Layers: Source Systems; Extract, Transformation, and Load (ETL) Layer; Data and Metadata Repository Layer; Presentation Layer. Other labels: Purchased Market Data, Apply Business Rules, Ad Hoc Query Tools, Data Mining Tools. Sample technologies: Oracle Warehouse Builder.]
Data Warehouse Architecture
Typical Data Warehouse Architecture
Generic Two-Level Architecture
Independent Data Mart
Dependent Data Mart and Operational Data Store
Logical Data Mart and active Warehouse
Tools used in Data Warehousing
Component | Product used | Purpose
Reporting | Crystal Reports | Create presentation-style reports with charts and graphs
Querying | Access 2000 | Create complex ad-hoc queries against a variety of data sources
OLAP | Crystal Analysis Professional | Access data cubes for designing views to pivot, filter and aggregate facts on pre-defined dimensions for specific subject areas
Data Mining/Statistical Analysis | SAS | Statistical analysis and churn analysis
Components of Data warehouse
Operational Source System
Data Staging Area
-- Services: Clean, combine and standardize
-- Data Store: Flat files and Relational tables,
-- Processing: Sorting and sequential processing.
Data Presentation Area
-- Data Marts: data divided into separate blocks according to requirement or application area
Data Access Tools
ETL – Extract, Transform and Load
Extract Transform and Load (ETL) is a process that involves extracting data from multiple sources in various formats, transforming it to fit business needs, and ultimately, loading it into a target system.
The target system will generally be configured as a data warehouse or data mart, though ETL can refer to a process that loads to any type of data storage structure.
The structure itself will typically be a database, but may also be an application, file or other storage facility.
The purpose of ETL is to reformat, cleanse and standardize data so that it can be analyzed or exchanged to address business needs and/or promote interoperability.
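Note that ETT (extraction, transformation, transportation) and ETM (extraction, transformation, move) may be used synonymously with ETL. ELT (extraction, load, transform) is a closely related variant in which the data is transformed after it has been loaded into the target.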
ETL Data Flow…
Stands for Extract, Transform and Load
It is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse.
Involves the following tasks:
1. Extracting the data from source systems (SAP, ERP, other operational systems); data from the different source systems is converted into one consolidated data warehouse format that is ready for transformation processing.
2. Transforming the data:
-- applying business rules (e.g., derivations, calculating new measures and dimensions),
-- cleaning (e.g., mapping NULL to 0 or "Male" to "M" and "Female" to "F"),
-- filtering (e.g., selecting only certain columns to load),
-- splitting a column into multiple columns and vice versa,
-- joining together data from multiple sources (e.g., lookup, merge), transposing rows and columns,
-- applying any kind of simple or complex data validation (e.g., if the first 3 columns in a row are empty then reject the row from processing).
3. Loading the data into a data warehouse, data repository or other reporting applications.
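The three tasks above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real ETL tool: the CSV text stands in for a source system, a plain list stands in for the warehouse, and all names (`SOURCE_CSV`, `GENDER_CODES`, `extract`, `transform`, `load`) are hypothetical.

```python
import csv
import io

# Hypothetical source extract; a real job would read from operational systems.
SOURCE_CSV = """id,name,gender,amount,notes
1,Anita,Female,120.50,vip
2,Vikram,Male,,
,,,75.00,orphan row
3,Meena,Female,30.00,
"""

GENDER_CODES = {"Male": "M", "Female": "F"}

def extract(text):
    """Extract: parse the source into one consolidated row format (dicts)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: validate, clean and filter the extracted rows."""
    cleaned = []
    for row in rows:
        # Validation: reject the row if the first three columns are all empty.
        if not (row["id"] or row["name"] or row["gender"]):
            continue
        cleaned.append({
            "id": int(row["id"]),
            "name": row["name"],
            # Cleaning: map "Male"/"Female" to "M"/"F".
            "gender": GENDER_CODES.get(row["gender"], row["gender"]),
            # Cleaning: map a missing amount to 0.
            "amount": float(row["amount"]) if row["amount"] else 0.0,
            # Filtering: the "notes" column is deliberately not loaded.
        })
    return cleaned

def load(rows, target):
    """Load: append the transformed rows to the target store."""
    target.extend(rows)
    return len(rows)

warehouse = []  # stand-in for the data warehouse table
loaded = load(transform(extract(SOURCE_CSV)), warehouse)
```

Keeping the three steps as separate functions mirrors the task list: each step can be tested or replaced independently.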
ETL Tools
Informatica PowerCenter
IBM WebSphere DataStage (formerly known as Ascential DataStage)
SAP BusinessObjects Data Integrator
IBM Cognos Data Manager (formerly known as Cognos DecisionStream)
Microsoft SQL Server Integration Services
Oracle Data Integrator (formerly known as Sunopsis Data Conductor)
SAS Data Integration Studio
Oracle Warehouse Builder
Ab Initio
Information Builders: Data Migrator
Pentaho: Pentaho Data Integration
Embarcadero Technologies: DT/Studio
IKAN: ETL4ALL
IBM DB2 Warehouse Edition
Pervasive Data Integrator
ETL Solutions Ltd.: Transformation Manager
Group 1 Software (Sagent): DataFlow
Sybase Data Integrated Suite ETL
Talend: Talend Open Studio
Expressor Software: Expressor Semantic Data Integration System
Elixir: Elixir Repertoire
OpenSys: CloverETL
OLTP – Online Transaction Processing
Facilitates and manages transaction-oriented applications in a business or commercial context
E.g., ATMs, electronic banking, order processing, employee time clock systems, e-commerce and many more
Advantages – simplicity, efficiency and speed
Disadvantages – security and reliability concerns, and susceptibility to direct attack
OLAP – Online Analytical Processing
Generally synonymous with terms such as Decisions Support, Business Intelligence, Executive Information System
A powerful visualization paradigm
OLTP vs. OLAP
Example: Finding the invoice / bill amount for a specific customer, based on CAF Number or MDN, is a job for a transactional (OLTP) system, which is ADC. Finding the number of customers whose invoice / bill has been greater than Rs. 1000.00 for the past three months calls for an OLAP system, which is DSS.
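The contrast in this example can be sketched as two queries against a hypothetical `invoice` table in SQLite. The table layout, the sample MDNs and amounts, and both helper functions are assumptions made for illustration. The analytical query also assumes that "greater than Rs. 1000.00 for the past three months" means the bill exceeded the threshold in every one of the three months.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice (mdn TEXT, bill_month TEXT, amount REAL)")
conn.executemany("INSERT INTO invoice VALUES (?, ?, ?)", [
    ("9000000001", "2024-01", 1200.0),
    ("9000000001", "2024-02", 1500.0),
    ("9000000001", "2024-03", 1100.0),
    ("9000000002", "2024-01", 900.0),
    ("9000000002", "2024-02", 1300.0),
    ("9000000002", "2024-03", 1250.0),
    ("9000000003", "2024-01", 2000.0),
    ("9000000003", "2024-02", 1800.0),
    ("9000000003", "2024-03", 1700.0),
])

def bill_amount(conn, mdn, month):
    """OLTP-style lookup: one customer, one bill."""
    row = conn.execute(
        "SELECT amount FROM invoice WHERE mdn = ? AND bill_month = ?",
        (mdn, month),
    ).fetchone()
    return row[0] if row else None

def high_bill_customer_count(conn, months, threshold=1000.0):
    """OLAP-style aggregate: customers above the threshold in every given month."""
    placeholders = ", ".join("?" for _ in months)
    query = f"""
        SELECT COUNT(*) FROM (
            SELECT mdn FROM invoice
            WHERE bill_month IN ({placeholders}) AND amount > ?
            GROUP BY mdn
            HAVING COUNT(DISTINCT bill_month) = ?
        )
    """
    return conn.execute(query, (*months, threshold, len(months))).fetchone()[0]

single = bill_amount(conn, "9000000002", "2024-02")
count = high_bill_customer_count(conn, ["2024-01", "2024-02", "2024-03"])
```

The first query touches a single row identified by its key; the second scans and aggregates across all customers, which is the kind of workload a warehouse is organized for.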
Data Warehouse for Decision Support
Putting information technology to work to help the organization make faster and better decisions
Which of my customers are most likely to go to the competition?
What product promotions have the biggest impact on revenue?
How did the share price of software companies correlate with profits over last 10 years?
DSS – Decision Support System
An interactive computer based system
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can be ad-hoc
Used to understand the business and make judgments
DSS Development Process
Key Result Areas to be analysed in the report
Source system on which the report is to be built
Agree upon the business logic and time line for implementation of reports in a phased manner
Logical & Physical data model
Database to suit the business need
Multiple programs are required to develop the database. This involves integration of programs in an optimized manner
Data validation with reference to source system and business rules agreed upon with users
This could be an iterative process till final acceptance by the user
Application development is in accordance with the development process defined at DSS
Delivery of reports in a consistent manner
Release indicates the report is productionised
Necessary user guides and training are given to the users to facilitate the use of reports
Creation of user IDs and assignment of access rights for reports
Process stages: Requirement Analysis, Application Development, Exhaustive Testing, Quality Assurance, Release Report
Application Areas
Industry | Application
Finance | Credit card analysis
Insurance | Claims, fraud analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Data service providers | Value added data
Utilities | Power usage analysis
Benefits of DSS
Improving Personal Efficiency
Expediting Problem Solving
Facilitating Interpersonal Communications
Promoting Learning or Training
Increasing Organizational Control
Need of DSS ... at Different Level in An Organization
Case Study Telecom Industry
DSS Data warehouse Architecture
Source systems which DSS accesses or gets feeds from.
Repository databases, i.e., the PRODDSS and PRODBILL databases
-- All Business Objects reports are taken from both of these servers.
For SAP BIW applications there are 2 boxes.
-- One box is the server for SAP BIW, and
-- the other is the application box for SAP BIW.
All BIW reports are taken from these boxes.
The data is segregated from the servers using a SAN box.
Customer Support Executive
Revenue Assurance Manager
Service Assurance Manager, etc.
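Sample Reports Delivery
Circle Refund Pendency Report
Circle | JAN | FEB | MAR | APR | MAY | JUNE | Total refund pendency
AP | 1 | - | - | - | 209 | 90 | 300
DL | 4 | - | 2 | - | 112 | 23 | 141
GJ | - | - | - | - | 411 | 123 | 534
KA | 1 | 1 | 6 | - | 84 | 27 | 119
KL | - | - | - | - | 31 | 10 | 41
MH | 1 | - | 6 | 10 | 53 | 28 | 98
MP | 12 | 13 | 53 | - | 82 | - | 160
MU | - | - | - | - | 150 | 61 | 211
PB | - | 20 | 8 | 16 | 52 | 2 | 98
RJ | - | - | - | 2 | 9 | 4 | 15
TN | 1 | 2 | 2 | 13 | 153 | 75 | 246
UP | - | 5 | 5 | 1 | 58 | 9 | 78
WB | - | - | 1 | - | 97 | 64 | 162
Grand Total | 20 | 41 | 83 | 42 | 1501 | 516 | 2203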
Generic two-level architecture: periodic extraction, so data is not completely current in the warehouse
Independent data mart: separate ETL for each independent data mart; data access complexity due to multiple data marts
Dependent data mart with operational data store: single ETL for the enterprise data warehouse (EDW); dependent data marts loaded from the EDW
Logical data mart and @active data warehouse: near real-time ETL for the @active data warehouse; data marts are NOT separate databases, but logical views of the data warehouse; easier to create new data marts; the ODS and the data warehouse are one and the same
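The idea that a logical data mart is a view over the warehouse, not a separate database, can be sketched with a SQL view in SQLite. The `warehouse_sales` table, the `mart_north_sales` view and the sample rows are hypothetical and serve only to illustrate the concept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO warehouse_sales VALUES (?, ?, ?)", [
    ("North", "Prepaid", 100.0),
    ("North", "Postpaid", 200.0),
    ("South", "Prepaid", 300.0),
])

# The logical data mart is only a view definition; no data is copied.
conn.execute("""
    CREATE VIEW mart_north_sales AS
    SELECT product, SUM(amount) AS total
    FROM warehouse_sales
    WHERE region = 'North'
    GROUP BY product
""")

def mart_rows(conn):
    """Read the data mart; the view is evaluated against current warehouse data."""
    return conn.execute(
        "SELECT product, total FROM mart_north_sales ORDER BY product"
    ).fetchall()

before = mart_rows(conn)
# A new row loaded into the warehouse is visible through the mart immediately.
conn.execute("INSERT INTO warehouse_sales VALUES ('North', 'Prepaid', 50.0)")
after = mart_rows(conn)
```

Because the mart is just a query definition, creating another mart means adding another view, and no separate ETL run is needed to keep it current.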