AGENDA
• Data warehouse and BI overview
• Data warehouse
• Data flow
• Staging area
• Transformation
• Loading
• ETL tools
• Data marts
• Business intelligence (BI)
• OLAP
• Big data
DATA WAREHOUSE AND BI OVERVIEW
• A data warehouse is a database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
• In addition to a relational database, a data warehouse environment includes an extraction, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
• Business intelligence (BI) is defined as the ability of an organization to take all its capabilities and convert them into knowledge. This produces large amounts of information, which can lead to the development of new opportunities for the organization.
A FEW KEY TERMS
• Business Operation
• Business Intelligence
• Business Management
• Operational System
• Data Warehouse
• Operational Data Store
• Data Mart
• Metadata Management
STEPS TO CREATE A DATA WAREHOUSE
• Understand the business problem to be solved
• Gather requirements
• Determine appropriate end-user technology to support the solution
• Build a prototype
• Develop the data warehouse data model
• Map the DW requirements based on the user’s requirement definitions
• Generate ETL code
• Test the DW
• Once validated, move the data and code to Production
SUBJECT
• A data warehouse is referred to as subject-oriented.
• A subject is a data subject, or major category of data, relevant to the business.
• It is a subset of enterprise data and consists of related entities and relationships.
• Examples: Customers, Products, Sales, Geography
ENTITY
• Defined as a person, place, thing, concept, or event in which an enterprise has both the interest and the capability to capture and store information.
• Primary entity: an entity that does not depend on any other entity for its existence.
• Subtype entity: a logical division or category of a parent (supertype) entity. Example: Customers can be Wholesale customers and Retail customers. Both inherit the attributes of the parent entity.
• Attribute: a property or characteristic of an entity whose value is recorded for each occurrence of that entity.
• Associative entity: depends upon two or more entities for its existence. For example, Orders consists of Customer and Items purchased.
• Primary key: serves as the unique identifier for an entity and is used in the physical database to locate a record for storage or access.
CHARACTERISTICS OF A PRIMARY KEY (PK)
• The key is never NULL
• The key is unique, and unique by design rather than by circumstance
• The key is persistent over time
• The key is manageable: it consists of integers and character strings, with no embedded symbols or odd characters
• The key should not contain any embedded intelligence
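The first two characteristics (never NULL, always unique) can be checked mechanically against sample data. The following is a minimal sketch, assuming rows are held as Python dictionaries; the function name `check_primary_key` and the sample `customers` rows are hypothetical and not part of the original deck. Persistence over time and the absence of embedded intelligence are design decisions that cannot be verified from a data snapshot.

```python
def check_primary_key(rows, key):
    """Return a list of problems found for candidate key `key` in `rows`."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        value = row.get(key)
        if value is None:
            # Violates "the key is never NULL".
            problems.append(f"row {i}: key is NULL")
        elif value in seen:
            # Violates "the key is unique".
            problems.append(f"row {i}: duplicate key {value!r}")
        else:
            seen.add(value)
    return problems


# Hypothetical sample data, for illustration only.
customers = [
    {"customer_id": 1, "name": "Lennon"},
    {"customer_id": 2, "name": "Johnson"},
    {"customer_id": 2, "name": "Jones"},
    {"customer_id": None, "name": "Smith"},
]

problems = check_primary_key(customers, "customer_id")
for problem in problems:
    print(problem)
```

An empty result means only that the sample data does not contradict the candidate key; it does not prove the key is unique by design.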
RELATIONSHIP
• A relationship documents the business rules that associate two entities. It describes how the two entities are naturally linked to each other.
• Example: Customers can place orders.
• Cardinality: denotes the maximum number of occurrences of one entity that can relate to a single occurrence of another entity. It is usually expressed as “ONE” or “MANY”.
• Identifying relationship: the child table cannot be uniquely identified without the parent.
• Example:
  Account (AccountID, AccountNum, AccountTypeID)
  PersonAccount (AccountID, PersonID, Balance)
  Person (PersonID, Name)
• The Account to PersonAccount relationship and the Person to PersonAccount relationship are identifying because the child row (PersonAccount) cannot exist without having been defined in the parent (Account or Person). In other words, there is no PersonAccount when there is no Person or when there is no Account.
• Non-identifying relationship: the child can be identified independently of the parent.
• Example:
  Account (AccountID, AccountNum, AccountTypeID)
  AccountType (AccountTypeID, Code, Name, Description)
• The relationship between Account and AccountType is non-identifying because each Account is identified by its own AccountID; AccountTypeID is only a reference to the parent AccountType and is not part of the Account key.
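The two kinds of relationship can be sketched as table definitions using Python's built-in `sqlite3` module. This is an illustrative sketch only: the column types, the inserted values, and the choice of SQLite are assumptions, not something the deck specifies. In the identifying case the parent keys form the child's primary key; in the non-identifying case the parent key is an ordinary foreign-key column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE AccountType (
    AccountTypeID INTEGER PRIMARY KEY,
    Code TEXT, Name TEXT, Description TEXT
);
-- Non-identifying: AccountTypeID is a plain foreign key, not part of the key.
CREATE TABLE Account (
    AccountID     INTEGER PRIMARY KEY,
    AccountNum    TEXT,
    AccountTypeID INTEGER REFERENCES AccountType(AccountTypeID)
);
CREATE TABLE Person (
    PersonID INTEGER PRIMARY KEY,
    Name     TEXT
);
-- Identifying: the parents' keys together form the child's primary key.
CREATE TABLE PersonAccount (
    AccountID INTEGER NOT NULL REFERENCES Account(AccountID),
    PersonID  INTEGER NOT NULL REFERENCES Person(PersonID),
    Balance   REAL,
    PRIMARY KEY (AccountID, PersonID)
);
""")

# Hypothetical sample rows, for illustration only.
conn.execute("INSERT INTO AccountType VALUES (1, 'CHK', 'Checking', 'Checking account')")
conn.execute("INSERT INTO Account VALUES (10, 'A-10', 1)")
conn.execute("INSERT INTO Person VALUES (7, 'Mary')")
conn.execute("INSERT INTO PersonAccount VALUES (10, 7, 250.0)")

# A PersonAccount row cannot exist without its parent Person.
try:
    conn.execute("INSERT INTO PersonAccount VALUES (10, 99, 0.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print("orphan row rejected:", orphan_rejected)
```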
NORMALIZATION
• Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals, as they reduce the amount of space a database consumes and ensure that data is logically stored.
• The database community has developed a series of guidelines for ensuring that databases are normalized. These are referred to as normal forms and are numbered from one (the lowest form of normalization, referred to as first normal form or 1NF) upward. This overview covers the first three, through third normal form (3NF).
FIRST NORMAL FORM (1NF)
• Eliminate duplicative columns from the same table.
• Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).
• The first rule dictates that we must not duplicate data within the same row of a table. Within the database community, this concept is referred to as the atomicity of a table. Tables that comply with this rule are said to be atomic.
• Let’s explore this principle with a classic example: a table within a human resources database that stores the manager-subordinate relationship. For the purposes of our example, we’ll impose the business rule that each manager may have one or more subordinates while each subordinate may have only one manager.
Option 1: Make a determinant of the repeating group (or the multivalued attribute) a part of the primary key.
STUDENT (composite primary key: Stud_ID, Course_ID)
Stud_ID  Name     Course_ID  Units
101      Lennon   MSI 250    3.00
101      Lennon   MSI 415    3.00
125      Johnson  MSI 331    3.00
Option 2: Remove the entire repeating group from the relation. Create another relation which would contain all the attributes of the repeating group, plus the primary key from the first relation. In this new relation, the primary key from the original relation and the determinant of the repeating group will comprise a primary key.
STUDENT
Stud_ID  Name     Course_ID  Units
101      Lennon   MSI 250    3.00
101      Lennon   MSI 415    3.00
125      Johnson  MSI 331    3.00
STUDENT
Stud_ID  Name
101      Lennon
125      Johnson
STUDENT_COURSE
Stud_ID  Course_ID  Units
101      MSI 250    3.00
101      MSI 415    3.00
125      MSI 331    3.00
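Option 2 can be sketched in a few lines of Python. The unnormalized input shape (one record per student carrying a list of courses) is an assumption made for illustration, since the deck does not show the table before it is brought to 1NF; the function name `split_repeating_group` is likewise hypothetical.

```python
# Assumed unnormalized form: each student record carries a repeating group.
students_unf = [
    {"Stud_ID": 101, "Name": "Lennon",
     "Courses": [("MSI 250", 3.0), ("MSI 415", 3.0)]},
    {"Stud_ID": 125, "Name": "Johnson",
     "Courses": [("MSI 331", 3.0)]},
]


def split_repeating_group(rows):
    """Move the repeating group into its own relation, keyed by
    (Stud_ID, Course_ID), leaving one row per student behind."""
    student = []
    student_course = []
    for row in rows:
        student.append((row["Stud_ID"], row["Name"]))
        for course_id, units in row["Courses"]:
            student_course.append((row["Stud_ID"], course_id, units))
    return student, student_course


student, student_course = split_repeating_group(students_unf)
print(student)
print(student_course)
```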
SECOND NORMAL FORM (2NF)
• Goal: Remove partial dependencies
STUDENT (composite primary key: Stud_ID, Course_ID)
Stud_ID  Name     Course_ID  Units
101      Lennon   MSI 250    3.00
101      Lennon   MSI 415    3.00
125      Johnson  MSI 331    3.00
Partial dependencies: Name depends only on Stud_ID; Units depends only on Course_ID.
STUDENT (before)
Stud_ID  Name     Course_ID  Units
101      Lennon   MSI 250    3.00
101      Lennon   MSI 415    3.00
125      Johnson  MSI 331    3.00
STUDENT_COURSE
Stud_ID  Course_ID
101      MSI 250
101      MSI 415
125      MSI 331
STUDENT
Stud_ID  Name
101      Lennon
125      Johnson
COURSE
Course_ID  Units
MSI 250    3.00
MSI 415    3.00
MSI 331    3.00
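The 2NF split amounts to projecting the 1NF table onto each determinant together with the attributes that depend on it, and discarding duplicate rows. A minimal sketch, assuming rows are stored as tuples in the column order Stud_ID, Name, Course_ID, Units; the helper `project` is hypothetical.

```python
# The 1NF STUDENT table with composite key (Stud_ID, Course_ID).
student_1nf = [
    (101, "Lennon", "MSI 250", 3.0),
    (101, "Lennon", "MSI 415", 3.0),
    (125, "Johnson", "MSI 331", 3.0),
]


def project(rows, *indexes):
    """Project rows onto the given column positions, dropping
    duplicates while keeping first-seen order."""
    result = []
    for row in rows:
        projected = tuple(row[i] for i in indexes)
        if projected not in result:
            result.append(projected)
    return result


student = project(student_1nf, 0, 1)         # Name depends on Stud_ID only
course = project(student_1nf, 2, 3)          # Units depends on Course_ID only
student_course = project(student_1nf, 0, 2)  # the composite key itself

print(student)
print(course)
print(student_course)
```

After the split, the student's name is stored once per student rather than once per enrollment.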
THIRD NORMAL FORM (3NF)
• Goal: Get rid of transitive dependencies.
EMPLOYEE (transitive dependency: Emp_ID → Dept_ID → Dept_Name)
Emp_ID  F_Name  L_Name  Dept_ID  Dept_Name
111     Mary    Jones   1        Acct
122     Sarah   Smith   2        Mktg
THIRD NORMAL FORM (3NF)
• Remove the attributes which are dependent on a non-key attribute from the original relation. For each transitive dependency, create a new relation with the non-key attribute which is a determinant in the transitive dependency as a primary key, and the dependent non-key attribute as a dependent.
EMPLOYEE
Emp_ID  F_Name  L_Name  Dept_ID  Dept_Name
111     Mary    Jones   1        Acct
122     Sarah   Smith   2        Mktg
THIRD NORMAL FORM (3NF)
EMPLOYEE (before)
Emp_ID  F_Name  L_Name  Dept_ID  Dept_Name
111     Mary    Jones   1        Acct
122     Sarah   Smith   2        Mktg
EMPLOYEE (after)
Emp_ID  F_Name  L_Name  Dept_ID
111     Mary    Jones   1
122     Sarah   Smith   2
DEPARTMENT
Dept_ID  Dept_Name
1        Acct
2        Mktg
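The 3NF result can be checked by loading the two relations into a database and joining them back together. This sketch uses Python's built-in `sqlite3` module; the lowercase table and column names and the column types are assumptions chosen for illustration. If the join reproduces the original EMPLOYEE rows, no information was lost by the decomposition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (
    dept_id   INTEGER PRIMARY KEY,
    dept_name TEXT NOT NULL
);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    f_name  TEXT NOT NULL,
    l_name  TEXT NOT NULL,
    dept_id INTEGER NOT NULL REFERENCES department(dept_id)
);
INSERT INTO department VALUES (1, 'Acct'), (2, 'Mktg');
INSERT INTO employee VALUES (111, 'Mary', 'Jones', 1), (122, 'Sarah', 'Smith', 2);
""")

# Joining on dept_id rebuilds the original, pre-3NF EMPLOYEE rows.
rows = conn.execute("""
    SELECT e.emp_id, e.f_name, e.l_name, e.dept_id, d.dept_name
    FROM employee AS e
    JOIN department AS d ON d.dept_id = e.dept_id
    ORDER BY e.emp_id
""").fetchall()

for row in rows:
    print(row)
```

Each department name is now stored once, so renaming a department is a single-row update.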
ZACHMAN FRAMEWORK FOR ENTERPRISE ARCHITECTURES
ZACHMAN FRAMEWORK FOR ENTERPRISE ARCHITECTURES
• As you can see from Figure 4, there are 36 intersecting cells in a Zachman grid, one for each meeting point between a player’s perspective (for example, business owner) and a descriptive focus (for example, data). As we move horizontally (for example, left to right) in the grid, we see different descriptions of the system, all from the same player’s perspective. As we move vertically in the grid (for example, top to bottom), we see a single focus, but change the player from whose perspective we are viewing that focus.
• The first suggestion of the Zachman taxonomy is that every architectural artifact should live in one and only one cell. There should be no ambiguity about where a particular artifact lives. If it is not clear in which cell a particular artifact lives, there is most likely a problem with the artifact itself.
• The second suggestion of the Zachman taxonomy is that an architecture can be considered complete only when every cell in that architecture is complete. A cell is complete when it contains sufficient artifacts to fully define the system for one specific player looking at one specific descriptive focus.
• The third suggestion of the Zachman grid is that cells in columns should be related to each other. Consider, for example, the data column (the first column) of the Zachman grid. From the business owner’s perspective, data is information about the business. From the database administrator’s perspective, data is rows and columns in the database.
ZACHMAN GRID
5 ways in which the Zachman grid can help in the development of an enterprise architecture:
• Ensure that every stakeholder’s perspective has been considered for every descriptive focal point.
• Improve the client’s artifacts themselves by sharpening each of their focus points to one particular concern for one particular audience.
• Ensure that all of the client’s business requirements can be traced down to some technical implementation.
• Convince the client’s business side that the technical team isn’t planning on building a bunch of useless functionality.
• Convince the client’s IT side that the business folks are including the IT folks in their planning.
THE OPEN GROUP ARCHITECTURE FRAMEWORK (TOGAF)
• The core of TOGAF is the Architecture Development Method (ADM)
• TOGAF divides an enterprise architecture into four categories, as follows:
• Business architecture: describes the processes the business uses to meet its goals
• Application architecture: describes how specific applications are designed and how they interact with each other
• Data architecture: describes how the enterprise datastores are organized and accessed
• Technical architecture: describes the hardware and software infrastructure that supports applications and their interactions
• Zachman tells you how to categorize your artifacts. TOGAF gives you a process for creating them.
DAY-TO-DAY EXPERIENCE OF CREATING AN ENTERPRISE ARCHITECTURE WILL BE DRIVEN BY THE ADM
A high-level view
PHASE A & PHASE B
• Phase A creates an architectural vision for the first pass through the ADM cycle. The Architect guides Client in choosing the project, validating the project against the architectural principles established in the Preliminary Phase, and ensuring that the appropriate stakeholders have been identified and their issues addressed. The phase culminates in a Statement of Architecture Work, which must be approved by the various stakeholders before the next phase of the ADM begins.
• The Architectural Vision created in Phase A is the main input into Phase B. Client’s goal in Phase B is to create a detailed baseline and target business architecture and perform a full analysis of the gaps between them.
• Phase B is quite involved: it covers business modeling, highly detailed business analysis, and technical-requirements documentation. A successful Phase B requires input from many stakeholders. The major outputs will be a detailed description of the baseline and target business objectives, and gap descriptions of the business architecture.
PHASE C
• Develop baseline data-architecture description
• Review and validate principles, reference models, viewpoints, and tools
• Create architecture models, including logical data models, data-management process models, and relationship models that map business functions to CRUD (Create, Read, Update, Delete) data operations
• Select data-architecture building blocks
• Conduct formal checkpoint reviews of the architecture model and building blocks with stakeholders
• Review qualitative criteria (for example, performance, reliability, security, integrity)
• Complete data architecture
• Conduct checkpoint/impact analysis
• Perform gap analysis
• The most important deliverable from this phase will be the Target Information and Applications Architecture.
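A relationship model that maps business functions to CRUD operations is often drawn as a matrix. The sketch below shows one way such a matrix might be held and queried in Python. The business functions, entities, and the helper `entities_without` are all hypothetical examples, not taken from TOGAF or from this deck.

```python
# Hypothetical CRUD matrix: business function -> entity -> operations.
crud_matrix = {
    "Take order":   {"Customer": "R", "Order": "C", "Product": "R"},
    "Amend order":  {"Customer": "R", "Order": "RU", "Product": "R"},
    "Cancel order": {"Order": "RD"},
}


def entities_without(matrix, operation):
    """Return the entities that no business function applies
    `operation` (one of C, R, U, D) to."""
    entities = {entity for ops in matrix.values() for entity in ops}
    covered = {
        entity
        for ops in matrix.values()
        for entity, letters in ops.items()
        if operation in letters
    }
    return sorted(entities - covered)


# Entities that are read or updated but never created by any function
# point to a possible gap in the model.
never_created = entities_without(crud_matrix, "C")
print(never_created)
```

A query like this is one simple input to the gap analysis listed above: an entity that nothing creates is either owned by another system or missing a business function.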
PHASE D & PHASE E
• Phase D completes the technical architecture: the infrastructure necessary to support the proposed new architecture. This phase is completed mostly by engaging with Client’s infrastructure and technical team.
• Phase E evaluates the various implementation possibilities, identifies the major implementation projects that might be undertaken, and evaluates the business opportunity associated with each. The TOGAF standard recommends that Client’s first pass at Phase E "focus on projects that will deliver short-term payoffs and so create an impetus for proceeding with longer-term projects."
• A good starting place to look for such projects is the organizational pain points that initially convinced Client’s CEO to adopt an enterprise architecture-based strategy.
PHASE F, PHASE G & PHASE H
• Phase F is closely related to Phase E. In this phase, the Architect works with Client’s governance body to sort the projects identified in Phase E into a priority order that reflects not only the costs and benefits (identified in Phase E), but also the risk factors.
• In Phase G, Client takes the prioritized list of projects and creates architectural specifications for the implementation projects. These specifications will include acceptance criteria and lists of risks and issues.
• The final phase is H. In this phase, Client modifies the architectural change-management process with any new artifacts created in this last iteration and with new information that becomes available.