Presented By: Deepali Raut- 54
Database Approach and Design
DBMSs minimize the following problems:
-- Data redundancy
-- Data isolation
-- Data inconsistency
Designing a database by means of:
-- Tables and constraints
-- Data dictionary
-- Normalization
-- Indexing
Database Table
Definition: Data Dictionary
It is an integral part of a database that holds the metadata, i.e., data about data.
Advantages of a data dictionary:
-- Creating an informative and well-designed database
-- Identifying table structures and types
Data Dictionary
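As a minimal sketch of what a data dictionary holds, the snippet below reads SQLite's built-in catalog (the `sqlite_master` table and `PRAGMA table_info`) to list each table with its column names and types. The `customer` table and its columns are hypothetical and exist only for illustration; other database systems expose their dictionaries through different catalog views.

```python
import sqlite3


def describe_database(conn):
    """Return {table_name: [(column_name, declared_type, is_primary_key), ...]}."""
    dictionary = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table_name,) in tables:
        # PRAGMA table_info yields: cid, name, type, notnull, dflt_value, pk
        columns = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
        dictionary[table_name] = [(c[1], c[2], bool(c[5])) for c in columns]
    return dictionary


conn = sqlite3.connect(":memory:")
# Hypothetical table, used only for illustration.
conn.execute(
    "CREATE TABLE customer ("
    "customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)

data_dictionary = describe_database(conn)
for table, columns in data_dictionary.items():
    print(table, columns)
```

The point of the sketch is that the table structure is itself stored as queryable data, which is what makes the dictionary useful for identifying table structures and types.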
What is Normalization?
Normalization is a method for analyzing and reducing a relational database to its most streamlined form, aiming for:
-- Minimum redundancy
-- Maximum data integrity
-- Best processing performance
Non-Normalized Relation
Normalizing the Database
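The idea of moving from a non-normalized relation to normalized tables can be sketched as follows. The order data, table names and column names are all hypothetical; the sketch only shows repeated customer details being stored once and then recovered with a join.

```python
import sqlite3

# Hypothetical non-normalized rows: customer details repeat on every order.
flat_orders = [
    (1, "Asha", "Pune", "Router", 2),
    (2, "Asha", "Pune", "Modem", 1),
    (3, "Ravi", "Delhi", "Router", 5),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        city TEXT NOT NULL,
        UNIQUE (name, city)
    );
    CREATE TABLE customer_order (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        product TEXT NOT NULL,
        quantity INTEGER NOT NULL
    );
""")

for order_id, name, city, product, quantity in flat_orders:
    # Store each customer once; a repeated (name, city) pair is ignored.
    conn.execute(
        "INSERT OR IGNORE INTO customer (name, city) VALUES (?, ?)", (name, city)
    )
    (customer_id,) = conn.execute(
        "SELECT customer_id FROM customer WHERE name = ? AND city = ?", (name, city)
    ).fetchone()
    conn.execute(
        "INSERT INTO customer_order VALUES (?, ?, ?, ?)",
        (order_id, customer_id, product, quantity),
    )

customer_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]

# A join reproduces the original flat rows, so no information was lost.
rebuilt = conn.execute("""
    SELECT o.order_id, c.name, c.city, o.product, o.quantity
    FROM customer_order AS o JOIN customer AS c USING (customer_id)
    ORDER BY o.order_id
""").fetchall()
```

Because each customer is stored once, a change of city is a single-row update instead of an update to every order row, which is where the reduced redundancy and improved integrity come from.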
What is Indexing?
-- An index is a database object used to improve the speed of data retrieval operations.
-- Indexes can be created on one or more columns of a database table that are frequently used together.
-- They provide the basis for rapid random lookups and efficient access to ordered records.
-- Function-based indexes allow case-insensitive searches, i.e., searches that ignore upper/lower case.
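A minimal sketch of both kinds of index, using SQLite through Python, is shown here. The table, data and index names are hypothetical. The expression index relies on SQLite's support for indexes on expressions (available in SQLite 3.9.0 and later); other systems use their own syntax for function-based indexes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customer (name, city) VALUES (?, ?)",
    [("Asha", "Pune"), ("RAVI", "Delhi"), ("ravi", "Pune")],
)

# Composite index on columns that are frequently queried together.
conn.execute("CREATE INDEX idx_customer_city_name ON customer (city, name)")
# Function-based (expression) index that supports case-insensitive lookups.
conn.execute("CREATE INDEX idx_customer_name_lower ON customer (lower(name))")


def find_by_name(conn, name):
    """Case-insensitive lookup; the WHERE clause uses the indexed expression."""
    rows = conn.execute(
        "SELECT customer_id, name FROM customer "
        "WHERE lower(name) = lower(?) ORDER BY customer_id",
        (name,),
    )
    return rows.fetchall()


matches = find_by_name(conn, "Ravi")
```

Whether the database actually uses an index for a given query can be checked with `EXPLAIN QUERY PLAN` in SQLite; the query has to use the same expression as the index definition for the expression index to be considered.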
 
Why a data warehouse?
-- Data is scattered, exists in different versions and has subtle differences
-- Poor data documentation
-- Requires data transformation
-- The traditional data management approach is query driven, i.e., lazy and on-demand
Why a data warehouse? (cont'd)
The query-driven approach has its problems:
-- Delay in query processing
-- Unavailability of a data source
-- Need to filter and integrate results
-- Frequent queries are usually inefficient and expensive
-- Difficult to implement caching
-- Lack of standards
-- Need to compete with local processing resources
Data Warehouse Definition
A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data.
Data Warehouse Definition: Subject-oriented
The data warehouse is organized around the key subjects (or high-level entities) of the enterprise. Major subjects include:
-- Customers
-- Suppliers
-- Revenues
-- Products, etc.
Data Warehouse Definition: Integrated
The data housed in the data warehouse are defined using consistent:
-- Naming conventions
-- Formats
-- Encoding structures
-- Related characteristics
Data Warehouse Definition: Time-variant and non-volatile
-- Time-variant: the data in the warehouse contain a time dimension so that they may be used as a historical record of the business.
-- Non-volatile: data in the data warehouse are loaded and refreshed from operational systems, but cannot be updated by end users.
The data warehouse advantage
-- Semantic reconciliation
   -- Data sources are distributed in many businesses
   -- Different encodings of the same entities
   -- A warehouse encompasses the full volume of data in a single unified schema
-- Performance
   -- Managers need different views of the same data
   -- Efficiently supports OLAP operations
The data warehouse advantage (cont'd)
-- Improves data quality
   -- Data from a source usually needs "cleaning"
   -- The warehouse acts as a "cleaning buffer" and thus minimizes data errors
-- There is a clear ROI (return on investment) for organizations implementing a data warehouse
   -- Quick and easy access to data
   -- Extensive analysis of data for decision making
   -- Consolidated view of organizational data
Evolution of Data warehouse
What is a Data Warehouse? A Practitioner's Viewpoint
A single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use in a business context.
Data Warehouse Architecture
-- Source Systems: execution systems (CRM, ERP, legacy, e-commerce) and external data (purchased market data, spreadsheets)
   -- Sample technologies: PeopleSoft, SAP, Siebel, Oracle Applications, Manugistics, custom systems
-- Extract, Transformation, and Load (ETL) Layer: cleanse data, filter records, standardize values, decode values, apply business rules, householding, dedupe records, merge records
   -- ETL tools: Informatica PowerMart, Ab Initio, DataStage, Oracle Warehouse Builder, custom programs, SQL scripts
-- Data and Metadata Repository Layer: enterprise data warehouse and metadata repository
   -- Sample technologies: Oracle, SQL Server, Teradata, DB2, custom tools
-- Presentation Layer: reporting tools, OLAP tools, ad hoc query tools, data mining tools
   -- Sample technologies: HTML reports, Cognos, Business Objects, MicroStrategy, Oracle Discoverer, Brio, data mining tools, portals
Typical  Data Warehouse Architecture
DW Architecture
-- Generic two-level architecture
-- Independent data mart
-- Dependent data mart and operational data store
-- Logical data mart and active warehouse
Tools used in Data Warehousing
Component | Product used | Purpose
Reporting | Crystal Reports | Create presentation-style reports with charts and graphs
Querying | Access 2000 | Create complex ad-hoc queries against a variety of data sources
OLAP | Crystal Analysis Professional | Access data cubes for designing views to pivot, filter and aggregate facts on pre-defined dimensions for specific subject areas
Data Mining/Statistical Analysis | SAS | Statistical analysis and churn analysis
Components of a Data Warehouse
-- Operational source system
-- Data staging area
   -- Services: clean, combine and standardize
   -- Data store: flat files and relational tables
   -- Processing: sorting and sequential processing
-- Data presentation area
   -- Data marts: data divided into different blocks as per requirement or application area
-- Data access tools
ETL: Extract, Transform and Load
Extract, Transform and Load (ETL) is a process that involves extracting data from multiple sources in various formats, transforming it to fit business needs, and ultimately loading it into a target system. The target system will generally be configured as a data warehouse or data mart, though ETL can refer to a process that loads to any type of data storage structure. The structure itself will typically be a database, but may also be an application, file or other storage facility. The purpose of ETL is to reformat, cleanse and standardize data so that it can be analyzed or exchanged to address business needs and/or promote interoperability. Note that ETT (extraction, transformation, transportation) and ETM (extraction, transformation, move) may be used synonymously with ETL; ELT (extraction, load, transform) is often used loosely in the same way, although strictly it means the data is loaded into the target before it is transformed.
ETL Data Flow…
ETL process
ETL stands for Extract, Transform and Load. It is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse. It involves the following tasks:
1. Extracting the data from source systems (SAP, ERP, other operational systems). Data from different source systems is converted into one consolidated data warehouse format which is ready for transformation processing.
2. Transforming the data, which may involve:
   -- applying business rules (e.g., derivations, calculating new measures and dimensions),
   -- cleaning (e.g., mapping NULL to 0, or "Male" to "M" and "Female" to "F"),
   -- filtering (e.g., selecting only certain columns to load),
   -- splitting a column into multiple columns and vice versa,
   -- joining together data from multiple sources (e.g., lookup, merge),
   -- transposing rows and columns,
   -- applying any kind of simple or complex data validation (e.g., if the first 3 columns in a row are empty, reject the row from processing).
3. Loading the data into a data warehouse, data repository or other reporting application.
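The three tasks above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than how any particular ETL tool works: the source rows, the `dw_customer` target table and the function names are hypothetical, while the transformation rules (NULL to 0, gender codes, dropping a column, rejecting rows whose first three columns are empty) are the ones listed above.

```python
import sqlite3

# Extract: hypothetical source rows (customer_id, name, gender, amount, notes).
source_rows = [
    (101, "Asha", "Female", 250.0, "priority"),
    (102, "Ravi", "Male", None, ""),
    (None, None, None, 99.0, "orphan row"),
]

GENDER_CODES = {"Male": "M", "Female": "F"}


def transform(row):
    """Apply cleaning, filtering and validation; return None to reject a row."""
    customer_id, name, gender, amount, _notes = row
    # Validation: reject the row if the first three columns are all empty.
    if all(value in (None, "") for value in (customer_id, name, gender)):
        return None
    # Cleaning: map NULL amounts to 0 and gender labels to one-letter codes.
    amount = 0.0 if amount is None else amount
    gender = GENDER_CODES.get(gender, gender)
    # Filtering: only selected columns are loaded; the notes column is dropped.
    return (customer_id, name, gender, amount)


def load(conn, rows):
    """Load transformed rows into the (hypothetical) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dw_customer "
        "(customer_id INTEGER, name TEXT, gender TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO dw_customer VALUES (?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
clean_rows = [r for r in map(transform, source_rows) if r is not None]
load(conn, clean_rows)
loaded = conn.execute("SELECT * FROM dw_customer ORDER BY customer_id").fetchall()
```

Keeping `transform` as a pure function of one row makes each rule easy to test in isolation before it is run against a real source system.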
ETL Tools (vendor: product)
-- Informatica: PowerCenter
-- IBM: WebSphere DataStage (formerly known as Ascential DataStage)
-- SAP: BusinessObjects Data Integrator
-- IBM: Cognos Data Manager (formerly known as Cognos DecisionStream)
-- Microsoft: SQL Server Integration Services
-- Oracle: Data Integrator (formerly known as Sunopsis Data Conductor)
-- SAS: Data Integration Studio
-- Oracle: Warehouse Builder
-- Ab Initio
-- Information Builders: Data Migrator
-- Pentaho: Pentaho Data Integration
-- Embarcadero Technologies: DT/Studio
-- IKAN: ETL4ALL
-- IBM: DB2 Warehouse Edition
-- Pervasive: Data Integrator
-- ETL Solutions Ltd.: Transformation Manager
-- Group 1 Software (Sagent): DataFlow
-- Sybase: Data Integrated Suite ETL
-- Talend: Talend Open Studio
-- Expressor Software: Expressor Semantic Data Integration System
-- Elixir: Elixir Repertoire
-- OpenSys: CloverETL
OLTP: Online Transaction Processing
-- Facilitates and manages transaction-oriented applications in a business or commercial context
-- Examples: ATMs, electronic banking, order processing, employee time clock systems, e-commerce and many more
-- Advantages: simplicity, efficiency and speed
-- Disadvantages: security and reliability concerns, and susceptibility to direct attack
OLAP: Online Analytical Processing
Generally synonymous with terms such as Decision Support, Business Intelligence and Executive Information System.
OLAP is:
-- Fast
-- Analysis
-- Shared
-- Multidimensional
-- A powerful visualization paradigm
OLTP  vs.  OLAP
Example:
-- OLTP: the invoice/bill amount for a specific customer, looked up by CAF number or MDN, is found in the transactional system, which is ADC.
-- OLAP: the number of customers whose invoice/bill is greater than Rs. 1000.00 for the past three months requires an OLAP system, which is DSS.
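The difference between the two questions in this example can be sketched as two queries. The `invoice` table, its columns and the sample figures are hypothetical, and "greater than Rs. 1000 for the past three months" is read here as "greater than Rs. 1000 in each of the three months", which is only one possible interpretation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice (caf_number TEXT, bill_month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO invoice VALUES (?, ?, ?)",
    [
        ("CAF001", "2024-01", 1200.0),
        ("CAF001", "2024-02", 1500.0),
        ("CAF001", "2024-03", 1100.0),
        ("CAF002", "2024-01", 400.0),
        ("CAF002", "2024-02", 2000.0),
        ("CAF002", "2024-03", 900.0),
        ("CAF003", "2024-01", 1800.0),
        ("CAF003", "2024-02", 1300.0),
        ("CAF003", "2024-03", 1050.0),
    ],
)

# OLTP-style query: one customer, one bill, a point lookup by key.
bill = conn.execute(
    "SELECT amount FROM invoice WHERE caf_number = ? AND bill_month = ?",
    ("CAF002", "2024-03"),
).fetchone()[0]

# OLAP-style query: scan and aggregate history across all customers.
high_value_customers = conn.execute(
    """
    SELECT COUNT(*) FROM (
        SELECT caf_number
        FROM invoice
        WHERE bill_month IN ('2024-01', '2024-02', '2024-03')
        GROUP BY caf_number
        HAVING MIN(amount) > 1000 AND COUNT(*) = 3
    )
    """
).fetchone()[0]
```

The first query touches a single row and suits the transactional system; the second reads every customer's history, which is the kind of workload a DSS or warehouse is meant to absorb so that it does not compete with transaction processing.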
Data Warehouse for Decision Support
Putting information technology to work to help the organization make faster and better decisions:
-- Which of my customers are most likely to go to the competition?
-- What product promotions have the biggest impact on revenue?
-- How did the share price of software companies correlate with profits over the last 10 years?
DSS: Decision Support System
-- An interactive computer-based system
-- Used to manage and control the business
-- Data is historical or point-in-time
-- Optimized for inquiry rather than update
-- Use of the system is loosely defined and can be ad hoc
-- Used to understand the business and make judgments
 
DSS Development Process
-- Requirement Analysis
   -- Understand the user requirement: business process, key result areas to be analysed in the report, and the source system on which the report is to be built
   -- Agree upon the business logic and timeline for implementation of reports in a phased manner
-- Application Development
   -- Develop the logical and physical data model, programs and database to suit the business need
   -- Multiple programs are required to develop the database; this involves integrating the programs in an optimized manner
-- Exhaustive Testing
   -- Testing involves data validation with reference to the source system and the business rules agreed upon with users
   -- User acceptance: this can be an iterative process until final acceptance by the user
-- Quality Assurance
   -- QA ensures that application development is in accordance with the development process defined at DSS
   -- Delivery of reports in a consistent manner
-- Release Report
   -- Release indicates that the report is productionised
   -- The necessary user guide and training are given to users to facilitate the use of reports
   -- Creation of user IDs and assignment of access rights for reports
Application Areas
Industry | Application
Finance | Credit card analysis
Insurance | Claims, fraud analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Data service providers | Value-added data
Utilities | Power usage analysis
Benefits of DSS
-- Improving personal efficiency
-- Expediting problem solving
-- Facilitating interpersonal communications
-- Promoting learning or training
-- Increasing organizational control
Need for DSS at Different Levels in an Organization
Case Study Telecom Industry
DSS Data warehouse Architecture
Component Details
-- Source systems which DSS accesses or gets feeds from: ADC (billing), Clarify (customer master data), Interconnect (for CDRs).
-- ETL box on which DataStage is installed; it stores the in-process temporary files.
-- Repository databases, i.e., the PRODDSS and PRODBILL databases. All Business Objects reports are taken from both of these servers.
-- For SAP BIW applications there are two boxes: one is the server for SAP BIW and the other is the application box for SAP BIW. All BIW reports are taken from these boxes.
-- The data is segregated from the servers using a SAN box.
Users Involved
-- COO/CIO/CEO
-- Customer Support Executive
-- Revenue Assurance Manager
-- Sales Manager
-- Account Manager
-- Circle Head
-- Service Assurance Manager, etc.
Sample Reports Delivery
Circle Refund Pendency Report
Circle        JAN   FEB   MAR   APR    MAY   JUNE   Total refund pendency
AP              1     -     -     -    209     90     300
DL              4     -     2     -    112     23     141
GJ              -     -     -     -    411    123     534
KA              1     1     6     -     84     27     119
KL              -     -     -     -     31     10      41
MH              1     -     6    10     53     28      98
MP             12    13    53     -     82      -     160
MU              -     -     -     -    150     61     211
PB              -    20     8    16     52      2      98
RJ              -     -     -     2      9      4      15
TN              1     2     2    13    153     75     246
UP              -     5     5     1     58      9      78
WB              -     -     1     -     97     64     162
Grand Total    20    41    83    42   1501    516    2203
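As a small worked check of how the totals in this report are derived, the sketch below recomputes the row, column and grand totals from the monthly figures. The numbers are copied from the report; representing "-" as 0 and the variable names are assumptions made for the illustration.

```python
# Monthly refund pendency per circle (JAN..JUNE), copied from the report above.
# A "-" in the report is represented as 0 here (an assumption).
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUNE"]
pendency = {
    "AP": [1, 0, 0, 0, 209, 90],
    "DL": [4, 0, 2, 0, 112, 23],
    "GJ": [0, 0, 0, 0, 411, 123],
    "KA": [1, 1, 6, 0, 84, 27],
    "KL": [0, 0, 0, 0, 31, 10],
    "MH": [1, 0, 6, 10, 53, 28],
    "MP": [12, 13, 53, 0, 82, 0],
    "MU": [0, 0, 0, 0, 150, 61],
    "PB": [0, 20, 8, 16, 52, 2],
    "RJ": [0, 0, 0, 2, 9, 4],
    "TN": [1, 2, 2, 13, 153, 75],
    "UP": [0, 5, 5, 1, 58, 9],
    "WB": [0, 0, 1, 0, 97, 64],
}

# Row totals: total refund pendency per circle.
row_totals = {circle: sum(values) for circle, values in pendency.items()}
# Column totals: pendency per month across all circles.
column_totals = [sum(column) for column in zip(*pendency.values())]
grand_total = sum(row_totals.values())

for month, total in zip(MONTHS, column_totals):
    print(month, total)
print("Grand Total", grand_total)
```

In a DSS the same roll-up would normally be produced by a grouped query over the warehouse rather than by hand-entered figures.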
Thank You
Generic two-level architecture
-- Periodic extraction, so data is not completely current in the warehouse
Independent data mart
-- Separate ETL for each independent data mart
-- Data access complexity due to multiple data marts
Dependent data mart with operational data store
-- Single ETL for the enterprise data warehouse (EDW)
-- Dependent data marts loaded from the EDW
Logical data mart and @active data warehouse
-- Near real-time ETL for the @active data warehouse
-- Data marts are NOT separate databases, but logical views of the data warehouse, so it is easier to create new data marts
-- ODS and data warehouse are one and the same
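The idea that a logical data mart is a view rather than a separate database can be sketched with SQLite. The `dw_sales` table, the `mart_west_sales` view and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical warehouse fact table.
conn.execute(
    "CREATE TABLE dw_sales (region TEXT, product TEXT, sale_month TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO dw_sales VALUES (?, ?, ?, ?)",
    [
        ("West", "Router", "2024-01", 500.0),
        ("West", "Modem", "2024-01", 200.0),
        ("North", "Router", "2024-01", 300.0),
    ],
)

# The "data mart" is a view: no data is copied out of the warehouse.
conn.execute("""
    CREATE VIEW mart_west_sales AS
    SELECT product, sale_month, SUM(revenue) AS revenue
    FROM dw_sales
    WHERE region = 'West'
    GROUP BY product, sale_month
""")

query = "SELECT product, revenue FROM mart_west_sales ORDER BY product"
before = conn.execute(query).fetchall()

# A new warehouse row is visible through the view immediately.
conn.execute("INSERT INTO dw_sales VALUES ('West', 'Router', '2024-01', 100.0)")
after = conn.execute(query).fetchall()
```

Creating another mart under this design is just another `CREATE VIEW` statement, with no separate ETL job or copy of the data to maintain.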

Datawarehousing & DSS


Editor's Notes

  • #3 The data model is a diagram that represents the entities in the database and their relationships. An entity is a person, place, thing, or event about which information is maintained. A record generally describes an entity. An attribute is a particular characteristic or quality of a particular entity. The primary key is a field that uniquely identifies a record. Secondary keys are other fields that have some identifying information but typically do not identify the file with complete accuracy.