Venugopal krishnan flexible dw models 2014 jul_ieg



Flexible Data Warehouse Data Models

Published in: Data & Analytics

  1. 1. Information Excellence Harvesting Information Excellence Information Excellence 2014 Jul Knowledge Share Session Venugopal Krishnan, Sr. Consultant, TEG, TCS Flexible (Data warehousing) data models Hosted by
  2. 2. Flexible (Data warehousing) data models Introduction to Temporal data Basic Concepts & Definitions Temporal & Bi-Temporal Representations Temporal Databases Temporal representations in DW Modeling Temporal data Anchor Modeling Table Elimination Data Vault Modeling Other flexible models NOSQL Data Models Venugopal Krishnan: IEG Session 2014 Jul
  3. 3. Venugopal Krishnan Venugopal Krishnan is a Senior Consultant with the Technology Excellence Group of the Insurance & Healthcare Services Division, Tata Consultancy Services Ltd., Bangalore. Venu has a Master of Science in Mathematics from Mahatma Gandhi University, Kerala, followed by a Post Graduate Certification in Computer & Software Engineering from S.E.R.C (Supercomputer Education & Research Center), Indian Institute of Science, Bangalore. During his 24+ years of overall industry experience, Venu worked earlier with Flytxt Pvt. Ltd, Cognizant (USA & India), Oracle (USA), Emirates Airlines (Dubai, UAE), Tata Unisys Ltd., and Patni Computers in various capacities as Group Manager/Director, Project/Delivery/Program Manager, Senior Principal Consultant, Lead Analyst etc. for software services & implementation, product development and technology management related activities. Venu’s primary focus areas are Database, Data management, Data architecture, and Data warehousing. For the past 16+ years, Venu has been architecting, designing and developing data warehouse & BI platforms for customers across the world. He has a lot of experience and expertise in Oracle & related tools, data warehousing and data architecture. Apart from work, Venu is interested in participating in technical forums and conducting training sessions in Oracle, Data warehousing and Data management related areas. He is associated with Oracle Users India Group, Information Excellence Group etc., and has been an active volunteer for many technical events/symposiums organized by the Information Excellence Group. On a personal note, Venu likes travelling and listening to classical music. Venu has been a core member of the Information Excellence Volunteer Team for the past three years, with significant commitment and contributions to the growth of the IEG community. Venugopal Krishnan Senior Consultant, Technology Excellence Group Insurance & Healthcare Services Division, TCS
  4. 4. FLEXIBLE DATA MODELS FOR DW (Data Architecture) By VENUGOPAL KRISHNAN Senior Consultant, TCS
  5. 5. AGENDA • Introduction to Temporal data • Basic Concepts & Definitions • Temporal & Bi-Temporal Representations • Temporal Databases • Temporal representations in Data warehousing • Temporal vs SCD • Modeling Temporal data • Anchor Modeling • Table Elimination • Data Vault Modeling • Other flexible models • NOSQL Data Models
  6. 6. Introduction to Temporal Data History data is important for Analysis. How to manage History data? How do we answer the following questions? When were things like our data says they were? When did our data say that things were like that!?
  7. 7. Introduction contd.. Temporal data Data that represents a state in time, such as the land-use patterns of Hong Kong in 1990, or total rainfall in Honolulu on July 1, 2009. Temporal data has a time period associated with it. Temporal Data Collection Examples : 1.0 Data regarding the change of cropland worldwide from 1700 To 1992. The percentage changes over time. 2.0 Sea surface temperature changes with each successive month from 1997 to 2000. 3.0 Oil and Gas production rates changing over 1994.
  8. 8. 1994 time stamp of the oil and gas production of a production field in Wyoming in ArcMap. When visualized over time, the pie charts on the map indicate the changing oil and gas production rates from each producing well (red is gas in barrels of oil equivalent, and green is oil in barrels). The graph shows production through time for the entire field: gas (red), oil (green), and water (blue). Temporal Data Collection Example: Oil and Gas production rate changes
  9. 9. Healthcare: Patient histories need to be maintained Insurance: Claims and accident histories are required Finance: Stock price histories need to be maintained. Personnel management: Salary and position history need to be maintained Banking: Credit histories Examples of Temporal data in different industries
  10. 10. How does temporal implementation differ from SCD? The SCDs (Types 1, 2 and 3) proposed by Kimball can be described as a poor man’s solution to the historization of dimensions. • While SCDs are simple to understand and provide good response times, a change in a dimensional attribute effectively changes the context for all facts captured prior to the change. • This can only be tracked by using temporal structures. • The actual time of change is not captured in dimensions. • Checking when it is OK to refer to which DWH IDs is not possible. • Only temporal structures can efficiently handle early and late arriving facts.
  11. 11. Basic Concepts • Temporal data changes over time. When data changes over time in the real world, it is referred to as changing from a real-world perspective, business perspective or valid perspective. • Changes can be independent of the real-world and business perspective, e.g. data changes in paper files, computer files or databases. These are called changes from a transactional perspective. • Data could change in the real world and not be changed in the database. • Data may be changed in the database when it has not changed in the real world. So the two perspectives are orthogonal. • The data in the database may be changed at the same time that it changes in the real world, but there are no guarantees!!
  12. 12. Basic Concepts Contd.. Questions: Who were current clients on last May 1st? (Valid Time) On last May 1st, who were listed as current clients? (Transaction Time) The above are two different questions. Valid Time – When were things like our data says they were! Transaction Time – When did our data say that things were like that!
  13. 13. Temporal Data = Data which changes over time Temporal Data Structure = Data structure which stores a history of how data changed over time. Valid Temporal Data = Data which changes over time from a real-world or business perspective. Valid Temporal Data Structure = Data structure which stores a history of how data changed from a real-world or business perspective. Transaction Temporal Data = Data which changes over time from a data storage device (or database for convenience) perspective Transaction Temporal Data Structure = Data structure which stores a history of how data changed from a data storage device perspective. Non-temporal Data = data which does not change over time. Non-temporal Data Structure = data structure which does not store a history of how data changed from any perspective. Definitions
  14. 14. Definitions Contd..... Bitemporal data is: • Data which changes both from a real world or business perspective and from a database(transactional) perspective. • Bitemporal Data = data which changes over two dimensions of time independently. • Bitemporal Data Structure = data structure which stores a history of how data changed from two independent perspectives. • The real world or business time is termed as VALID TIME. • The database time is termed as TRANSACTION TIME. Bitemporal data: • Is the only way to have a complete audit trail of what you knew and when you knew it. • Gives you a reproducible history of data from a business perspective. • Provides very accurate data with full support for different types of corrections. • Can alleviate the need for complex, convoluted, and subjective database design techniques as well as eliminate the need for redundant “snapshot” data stores.
  15. 15. Example: Consider the biography of John Laker (the addresses where John stayed from 1975 till 2001): -Born on April 3, 1975 in the Kids Hospital, Medicine County. -Son of Jack Laker and Jane Laker. -Born in Smallville. -Birth registration done on April 4, 1975. -After graduation, started to live in Bigtown from August 26, 1994. -Registered the address change on December 27, 1994. -Passed away on April 1, 2001. -Reported and registered on the same day. In a non-temporal model, we store the name and address in a table T(name, address) with name as the primary key. This model cannot store or handle the address changes.
  16. 16. Non-Temporal Example contd...
      Date | Real world status | Database information
      April 3, 1975 | John is born | Nothing
      April 4, 1975 | John's father officially reports the birth | John's information is inserted into the database. (John lives in Smallville)
      August 26, 1994 | After graduation, John moves to Bigtown, forgets to register his house address. | John lives in Smallville
      December 27, 1994 | John registers his new address. | John's address is updated. (John lives in Bigtown)
      April 1, 2001 | John dies | Information is deleted. (There is no person called John Laker)
  17. 17. Temporal representation The record has two fields, valid_from and valid_to. Valid_from is John's date of birth; valid_to is not yet known and may change in the future: Person(John Laker, Smallville, 3-Apr-1975, ∞). After John registers his move to Bigtown, a new entry valid from 27-Aug-1994 is made in the database: Person(John Laker, Bigtown, 27-Aug-1994, ∞). The earlier record is updated with valid_to set to 26-Aug-1994: Person(John Laker, Smallville, 3-Apr-1975, 26-Aug-1994). When John dies, the database is updated again: Person(John Laker, Bigtown, 27-Aug-1994, 1-Apr-2001).
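The valid-time bookkeeping on this slide can be sketched in Python. This is only an illustration under assumptions that are not in the slides: the names `PersonVersion`, `record_move`, `record_death` and `address_on` are hypothetical, `date.max` stands in for ∞, and intervals are treated as closed (the old row ends the day before the new one starts, as in the slide's 26-Aug / 27-Aug pair).

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta

OPEN_END = date.max  # assumption: stands in for the slide's "∞"


@dataclass(frozen=True)
class PersonVersion:
    name: str
    address: str
    valid_from: date
    valid_to: date


def record_move(history, name, new_address, valid_from):
    """Close the currently open version for `name`, then append the new address."""
    updated = []
    for row in history:
        if row.name == name and row.valid_to == OPEN_END:
            # The old address stops being valid the day before the new one starts.
            row = replace(row, valid_to=valid_from - timedelta(days=1))
        updated.append(row)
    updated.append(PersonVersion(name, new_address, valid_from, OPEN_END))
    return updated


def record_death(history, name, died_on):
    """Close the open version on the date of death without adding a new one."""
    return [
        replace(row, valid_to=died_on)
        if row.name == name and row.valid_to == OPEN_END
        else row
        for row in history
    ]


def address_on(history, name, day):
    """Return the address valid on `day`, or None if no version covers it."""
    for row in history:
        if row.name == name and row.valid_from <= day <= row.valid_to:
            return row.address
    return None


history = [PersonVersion("John Laker", "Smallville", date(1975, 4, 3), OPEN_END)]
history = record_move(history, "John Laker", "Bigtown", date(1994, 8, 27))
history = record_death(history, "John Laker", date(2001, 4, 1))
```

With this history, `address_on(history, "John Laker", date(1990, 1, 1))` finds the Smallville version and a date in 1995 finds the Bigtown version, while a date after 1-Apr-2001 matches nothing.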
  18. 18. Bitemporal Representation The temporal representation only depicted the business valid time, not the time the information was recorded in the database. Bi-temporal representations also provide the transaction recorded time through 2 additional fields: Transaction_From and Transaction_To. The following records explain the bitemporal representation:
      Person(John Laker, Smallville, 3-Apr-1975, ∞, 4-Apr-1975, 27-Dec-1994).
      Person(John Laker, Smallville, 3-Apr-1975, 26-Aug-1994, 27-Dec-1994, ∞).
      Person(John Laker, Bigtown, 27-Aug-1994, ∞, 27-Dec-1994, 2-Feb-2001).
      Person(John Laker, Bigtown, 27-Aug-1994, 1-Jun-1995, 27-Dec-1994, 2-Feb-2001).
      Person(John Laker, Beachy, 1-Jun-1995, 3-Sep-2000, 2-Feb-2001, ∞).
      Person(John Laker, Bigtown, 3-Sep-2000, ∞, 2-Feb-2001, 1-Apr-2001).
      Person(John Laker, Bigtown, 3-Sep-2000, 1-Apr-2001, 1-Apr-2001, ∞).
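A bitemporal lookup answers "what did the database say at transaction time T about valid time V?". The sketch below uses only the first three records from this slide and is an illustration, not the author's implementation. Its assumptions: the name `address_as_known` is hypothetical, `date.max` stands in for ∞, valid-time intervals are closed, and transaction-time intervals are half-open (a row stops being "what the database said" on the day its successor is recorded).

```python
from datetime import date

FOREVER = date.max  # assumption: stands in for "∞"

# (address, valid_from, valid_to, transaction_from, transaction_to)
RECORDS = [
    ("Smallville", date(1975, 4, 3), FOREVER, date(1975, 4, 4), date(1994, 12, 27)),
    ("Smallville", date(1975, 4, 3), date(1994, 8, 26), date(1994, 12, 27), FOREVER),
    ("Bigtown", date(1994, 8, 27), FOREVER, date(1994, 12, 27), date(2001, 2, 2)),
]


def address_as_known(records, valid_on, known_on):
    """Address valid on `valid_on` according to what was recorded as of `known_on`."""
    for address, v_from, v_to, t_from, t_to in records:
        in_valid_time = v_from <= valid_on <= v_to        # closed interval
        in_transaction_time = t_from <= known_on < t_to   # half-open interval
        if in_valid_time and in_transaction_time:
            return address
    return None


# Where did John live on 1-Sep-1994?
# Asked before the address change was registered (27-Dec-1994):
before_registration = address_as_known(RECORDS, date(1994, 9, 1), date(1994, 10, 1))
# Asked after the registration:
after_registration = address_as_known(RECORDS, date(1994, 9, 1), date(1995, 1, 1))
```

The same valid-time question gives Smallville before the registration and Bigtown after it, which is the distinction between valid time and transaction time drawn on the earlier slides.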
  19. 19. Bitemporal implementation in Databases (SQL:2011 std) 1.0 Oracle 12c – Has a new feature called Temporal Validity. Uses a new PERIOD FOR clause. e.g.
      ALTER TABLE dept ADD (v_start DATE, v_end DATE, PERIOD FOR vt (v_start, v_end)); (or PERIOD FOR <column>)
      • vt is the period and is a hidden column. • The details of the period are stored in the dictionary table SYS_FBA_PERIOD. • Supports only conventional DMLs. • Supports TEMPORAL FLASHBACK QUERY, using either AS OF PERIOD FOR or VERSIONS PERIOD FOR ... BETWEEN ... AND ..., e.g:
      SELECT * FROM dept AS OF PERIOD FOR vt TO_DATE('2015-01-01', 'YYYY-MM-DD') ORDER BY deptno;
      • Temporal flashback queries are not enabled in a multitenant configuration. • Oracle 12c does not support temporal joins and temporal aggregations. Oracle 11g Workspace Manager - Version enabled tables, valid time support
      EXECUTE DBMS_WM.EnableVersioning ('employees', 'VIEW_WO_OVERWRITE', FALSE, TRUE); -- Version-enables the table
      CREATE TYPE WM_PERIOD AS OBJECT (validFrom TIMESTAMP WITH TIME ZONE, validTill TIMESTAMP WITH TIME ZONE); -- WM_PERIOD can be used to specify a valid time range for a version enabled table.
  20. 20. Bitemporal implementation in Databases 2.0 DB2 10 CREATE TABLE policy_info (policy_id CHAR(4) NOT NULL, coverage INT NOT NULL, bus_start DATE NOT NULL, bus_end DATE NOT NULL, sys_start TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN, sys_end TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END, create_id TIMESTAMP(12) GENERATED ALWAYS AS TRANSACTION START ID, PERIOD BUSINESS_TIME(bus_start, bus_end), PERIOD SYSTEM_TIME(sys_start, sys_end)); 3.0 Teradata 13.0 CREATE MULTISET TABLE Prop_Owner ( customer_number INTEGER, property_number INTEGER, property_VT PERIOD(DATE) NOT NULL AS VALIDTIME, property_TT PERIOD (TIMESTAMP(6) WITH TIME ZONE) NOT NULL AS TRANSACTIONTIME); 4.0 TimesDB (Oracle) Supports Valid Time and Transaction Time fields in ANSI SQL
  21. 21. Multi Temporal models Tri-Temporal Data • Adds a decision time to the valid and transaction times. • Decision time describes the date and time a decision was made. E.g: Scott becomes Manager. The decision to change the job description from “Analyst” to “Manager” is made on June 24, 2014. For the decision time, it is irrelevant when this change is entered into the system and also irrelevant when Scott officially becomes a Manager.
  22. 22. Tri-Temporal Data Model VT - Valid Time (With Temporal Validity) DT - Decision Time (With Temporal Validity) TT - Transaction Time (With FDA) (versions_startscn, versions_endscn) Oracle 12c supports the tri-temporal feature.
  23. 23. Temporal representations in Data warehousing Rows in a dimension table are not associated with time. New rows are simply added. Changes in values of dimension rows with known source identifiers are either simply overwritten or a new row with new surrogate key (with old source system Id) is added based on the slowly changing dimensions concept. For some kind of analysis, dimensions should also be historized, particularly for comparison of measures across different time periods. Example: How did buying habits of customers change over the last 5 years based on where they live? (History of addresses of customers will need to be kept!).
  24. 24. Temporal representations in Data warehouses Typical Star Schema
      Policy_Fact: <foreign keys>, PREMIUM_AMT, LOSS_AMT, EXPENSE_AMT, PROFIT_AMT
      Time
      Prof_Center: PC_ID, PC_NAME, DIV_ID, DIV_NAME, ...
      Product: PROD_ID, ...
      CUSTOMER: CLIENT_ID, CLIENT_NAME, CLIENT_RATING, ...
      Compare profits over the years: - Grouped by business divisions - Grouped by client ratings
  25. 25. Temporal representations in Data warehouses What happens over time? • Business divisions change (e.g. profit centers are shifted)? • Ratings of clients change? • Two clients merge (e.g., primary insurers in the reinsurance business)? • Geography changes (merges, splits, inactivations etc...) Let us suppose that the dimension hierarchies are: - Product (LOB hierarchy) - Profit Center -> Division -> Group - Customer -> Country -> Continent -> etc… Let us see how temporal representation handles the changes and historization efficiently, using COUNTRY as an example:
  26. 26. Temporal representations in Data warehouses Possible changes to COUNTRY dimension: • New value addition • Old value replaced by new value • Invalidation (value no longer to be used) • Merge (n values merged into a new value) • Split (old value divided into n values) • Move (position change in hierarchy) Principle 1.0 Add valid begin and end times in dimensions using an object table (Country) and single-property tables (CountryNames). 2.0 Make foreign keys in fact tables refer to the unchanging IDs in object tables. 3.0 Use the 6th normal form basics to arrive at an efficient model for temporal data representation.
  27. 27. Temporal Representations in Data warehouses Modified Star Schema design (Sample for Customer (Country))
      Country: CountryID, VTimeBeg, VTimeEnd
      CountryNames: CountryID, VTimeBeg, VTimeEnd, CountryName
      CountrySuccession: ID – Original ID, SuccID – Direct successor, CurrID – Ultimate Successor
      Population: CountryID, Year
      Time
  28. 28. Modeling Temporal data Anchor Modeling – An agile modeling technique using sixth normal form for structurally and temporally evolving data. Flexibility of Anchor Models: • Historization • Null handling – Eliminates NULL • Orphans – Early arriving facts • Separation of Concerns (Start with a small common base and gradually develop into an EDW). • Prototyping Components of Anchor Model: • Anchors • Knots • Attributes • Ties Anchor Model is based on Sixth Normal Form.
  29. 29. 6NF means that every relation consists of a candidate key plus no more than one other (non-key) attribute. Examples: Item {ProductCode, Eff_start_date, Eff_end_date} ItemName {ProductCode*, Name} ItemDesc {ProductCode*, Description} ItemPrice {ProductCode*, Price} Sixth Normal Form
  30. 30. Anchors - Entities • Primarily the surrogate key of the entity. • Has metacolumns that contain: • Batch information • File information • Metacolumns should answer the questions WHEN? WHERE? HOW? e.g: Customer (Customer_ID) <#42>
  31. 31. KNOTS – Shared Properties • Shared attributes of Anchors that are more or less static. • Contains a surrogate key for the knotted entity. • Contains an attribute value representing the type of the knot. • Contains metacolumns. e.g: The gender of a person <#1, ‘Male’> <#42, #1> - KNOTted attribute for Customer. Representation of KNOTS KNOTS are represented as follows in an Anchor diagram.
  32. 32. Attributes – Properties Contains: • The foreign key of the belonging Anchor • An attribute value • Historization columns • Metacolumns E.g: Surname of a person <#42, ‘Unknown’, 2004-06-19> Representing Attributes
  33. 33. TIES – Relationships Contains: • Foreign Keys of the related Anchors (which may be an n-tuple) • Historization columns • Metacolumns e.g: Children of a Person <#42, #4711> Representation of TIES.
  34. 34. Anchor Modeling Example 1 The source system supplies the demanded information in two separate source files according to the structure presented below. Analysis has determined each attribute’s ability to change and categorised each attribute into business keys, slowly changing attributes, rapidly changing attributes and meta data. » File 1: * Business Key - Customer Number * Slowly Changing Attribute – Name * Slowly Changing Attribute – Birth Date * Slowly Changing Attribute – Marital Status * Rapidly Changing Attribute – Income * Meta Data – Changed Date * Meta Data - From Date » File 2: * Business Key - Customer Number * Slowly Changing Attribute – Tax Zone * Rapidly Changing Attribute – Loyalty Value * Meta Data - From Date
  35. 35. Anchor Modeling Example 1 Contd…. An anchor model is created as follows (without defining views) » Business keys are loaded into anchors » Each attribute is divided into their own attribute-tables together with technical Meta data. » Attribute with more constant content (such as codes) are created as knots with historic ties holding the status for a specific anchor. Thus reducing a lot of overlapping information minimizing data volumes and providing multiple purpose tables. » Temporal views hold the complete entity provided to subscribers. » New additional data (regardless if the information comes in a new or extended file) is added to new attribute table and completed by extending all views with that attribute. » New historical data can be added without affecting any current information in an instant!
  36. 36. Anchor Modeling Example 1 contd.... The model will look like the following (key columns | other columns):
      Marital status knot: Marital_Status_ID | Marital_Status, Created_Dt
      Customer anchor: Customer_ID | Customer_No, Valid_From_Dt, Valid_To_Dt, Created_Dt
      Customer marital status tie: Customer_ID, Marital_Status_ID, Valid_From_Dt, Valid_To_Dt, Created_Dt
      Tax zone knot: Tax_Zone_ID | Tax_Zone, Created_Dt
      Customer tax zone tie: Customer_ID, Tax_Zone_ID, Valid_From_Dt, Valid_To_Dt, Created_Dt
      Birth date attribute: Customer_ID | Birth_Date, Valid_From_Dt, Valid_To_Dt, Created_Dt
      Name attribute: Customer_ID | Name, Valid_From_Dt, Valid_To_Dt, Created_Dt
      Income attribute: Customer_ID, Valid_From_Dt, Valid_To_Dt | Income, Created_Dt
  37. 37. Example 2 Anchor - CU_Customer (CU_ID) Knot - Gen_Gender (Gen_ID, Gen_Gender_Name) Attributes - CUDOB_Customerdateofbirth (CU_ID, Customerdateofbirth) Ties - CUHH_Customer_Household (CU_ID, HH_ID, HOW_ID, CUHH_Fromdate) Sample values in each object: CU_Customer = (#42, #43, #44) Gen_Gender = (#1, ’Male’, #2, ’Female’) CUDOB_Customerdateofbirth = (#42, 1963-08-13, #43, 1970-09-24, #44, 1958-12-10) CUGEN_Gender (Knotted attribute) = (#42, #1, #43, #1, #44, #2) CUHH_Customer_Household = (#42, #43, #11, 1984-11-20, #42, #44, #18, 1990-04-12)
  38. 38. Customer Store Purchase Item PriceList Inventory Gender CustomerClass HouseholdOwner VisitingFrequencyInterval CustomerDateOfBirth CustomerNumber CustomerName CustomerGender Customer_Address Customer_Household Card_Customer Anchor Modeling Complete Example
  39. 39. Anchor Modeling Complete Example Contd..
      Select top 5 * from CU_Customer;
      CU_ID
      1
      2
      3
      4
      5
      Select top 5 * from GEN_gender;
      GEN_ID | GEN_Gender
      1 | Male
      2 | Female
      Select top 5 * from CUDOB_CustomerDateofBirth;
      CU_ID | CUDOB_CustomerDateOfBirth
      1 | 1905-03-02
      2 | 1905-07-02
      3 | 1908-09-14
      4 | 1910-02-03
      5 | 1912-04-01
      Select top 5 * from CUHH_Customer_Household;
      CU_ID | HH_ID | HOW_ID | CUHH_FromDate
      1 | 1 | 1 | 2009-02-13
      1 | 895 | 0 | 2009-09-21
      2 | 2 | 1 | 2006-10-17
      3 | 3 | 1 | 2002-08-20
      4 | 4 | 1 | 1993-08-29
  40. 40. Model Evolution CU CUGEN CUNAM CU DOB GEN CUSAL
  41. 41. Typical Anchor Model Example ANCHOR KNOT Historized Attribute Static Attribute Static TIE Historized TIE
  42. 42. Physical Implementation Abstraction layer through views and functions created to reduce complexity due to large number of tables. • Complete View: Denormalization of an anchor table along with its attributes. Constructed Using outer join of anchor table with all its attributes. • Latest View: A view based on the complete view, where only the latest values for historized attributes are included.(Uses a sub- select) • Point-in-Time Function: A function for an anchor with a time point as an argument returning a data set. It is based on the complete view where the latest value of each attribute before or at the time point is included. (A sub-select with a condition that historization time is latest one that is earlier than the time point). • Interval Function: Function using 2 time points to return a data set from the anchor.
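The latest view and the point-in-time function described on this slide can be sketched with Python's built-in `sqlite3`. This is a hedged illustration rather than the author's implementation: SQLite stands in for the database used in the slides, and the view name `lCU_Customer`, the column `CUNAM_FromDate`, the function `customers_at` and the sample name 'Smith' are assumptions made for the example. The anchor is the left table and attributes are left outer joined, so customers with no recorded name still appear.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CU_Customer (CU_ID INTEGER PRIMARY KEY);            -- anchor
CREATE TABLE CUDOB_CustomerDateOfBirth (                         -- static attribute
    CU_ID INTEGER PRIMARY KEY REFERENCES CU_Customer(CU_ID),
    CUDOB_CustomerDateOfBirth TEXT NOT NULL);
CREATE TABLE CUNAM_CustomerName (                                -- historized attribute
    CU_ID INTEGER REFERENCES CU_Customer(CU_ID),
    CUNAM_CustomerName TEXT NOT NULL,
    CUNAM_FromDate TEXT NOT NULL,
    PRIMARY KEY (CU_ID, CUNAM_FromDate));

INSERT INTO CU_Customer VALUES (42), (43);
INSERT INTO CUDOB_CustomerDateOfBirth VALUES (42, '1963-08-13'), (43, '1970-09-24');
INSERT INTO CUNAM_CustomerName VALUES
    (42, 'Unknown', '2004-06-19'),
    (42, 'Smith',   '2010-01-01');   -- hypothetical later value

-- Latest view: a sub-select keeps only the most recent value of the historized attribute.
CREATE VIEW lCU_Customer AS
SELECT cu.CU_ID, dob.CUDOB_CustomerDateOfBirth, nam.CUNAM_CustomerName
FROM CU_Customer cu
LEFT JOIN CUDOB_CustomerDateOfBirth dob ON dob.CU_ID = cu.CU_ID
LEFT JOIN CUNAM_CustomerName nam ON nam.CU_ID = cu.CU_ID
     AND nam.CUNAM_FromDate = (SELECT MAX(s.CUNAM_FromDate)
                               FROM CUNAM_CustomerName s
                               WHERE s.CU_ID = cu.CU_ID);
""")


def customers_at(connection, time_point):
    """Point-in-time function: latest name recorded at or before `time_point` (ISO date string)."""
    return connection.execute(
        """
        SELECT cu.CU_ID, nam.CUNAM_CustomerName
        FROM CU_Customer cu
        LEFT JOIN CUNAM_CustomerName nam ON nam.CU_ID = cu.CU_ID
             AND nam.CUNAM_FromDate = (SELECT MAX(s.CUNAM_FromDate)
                                       FROM CUNAM_CustomerName s
                                       WHERE s.CU_ID = cu.CU_ID
                                         AND s.CUNAM_FromDate <= ?)
        ORDER BY cu.CU_ID
        """,
        (time_point,),
    ).fetchall()


latest = conn.execute("SELECT * FROM lCU_Customer ORDER BY CU_ID").fetchall()
in_2005 = customers_at(conn, "2005-01-01")
```

Dates are stored as ISO strings here so that plain string comparison orders them correctly; a production model would use the database's native date type.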
  43. 43. Table Elimination Utilized by modern query optimizers to improve query performance. • Tables that do not contain selected attributes are automatically eliminated from the execution plan. This can happen if: • No column from a table T is selected, AND • The number of rows returned is not affected by the join with T. • The views and functions defined earlier are created to take advantage of table elimination. • Use the anchor table as the left table in the join for the view. • Attributes must be left outer joined. The left join ensures that the number of rows returned is at least as many as in the anchor table.
  44. 44. Table Elimination Example The Oracle optimizer, starting with 10gR2, provides the table elimination feature. There are 2 cases when Oracle will eliminate a redundant table: 1.0 The optimizer eliminates tables that are redundant due to primary-foreign key constraints. e.g. create table jobs ( job_id NUMBER PRIMARY KEY, job_title VARCHAR2(35) NOT NULL, min_salary NUMBER, max_salary NUMBER ); create table departments ( department_id NUMBER PRIMARY KEY, department_name VARCHAR2(50) ); create table employees ( employee_id NUMBER PRIMARY KEY, employee_name VARCHAR2(50), department_id NUMBER REFERENCES departments(department_id), job_id NUMBER REFERENCES jobs(job_id) ); select e.employee_name from employees e, departments d where e.department_id = d.department_id; In the above query the join to departments is redundant. The optimizer rewrites the query as follows: select e.employee_name from employees e where e.department_id is not null; The Oracle 11g optimizer also eliminates tables that are anti-joined or semi-joined.
  45. 45. Table Elimination Contd….. 2.0 Outer Join Table Elimination e.g: create table projects ( project_id NUMBER UNIQUE, deadline DATE, priority NUMBER ); alter table employees add project_id number; select e.employee_name, e.project_id from employees e, projects p where e.project_id = p.project_id (+); Since the outer join guarantees that every row in employees occurs at least once, and the unique constraint on projects.project_id guarantees that every row in employees matches at most one row in projects, the projects table is redundant and the optimizer eliminates it from the outer join. The optimizer rewrites the query as follows: select e.employee_name, e.project_id from employees e;
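Why the outer-join rewrite is safe can be checked on a small data set. The sketch below uses Python's `sqlite3` with ANSI `LEFT JOIN` in place of Oracle's `(+)` notation; it compares query results only and does not inspect any optimizer's plan. The employee names and project rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (project_id INTEGER UNIQUE, deadline TEXT, priority INTEGER);
CREATE TABLE employees (employee_id INTEGER PRIMARY KEY,
                        employee_name TEXT,
                        project_id INTEGER);
INSERT INTO projects VALUES (10, '2014-12-31', 1), (20, '2015-06-30', 2);
-- Sample rows: one employee has no project, one points at a project that is not listed.
INSERT INTO employees VALUES (1, 'Asha', 10), (2, 'Ravi', 20),
                             (3, 'Meena', NULL), (4, 'Kiran', 99);
""")

# Original form: outer join to projects, but no projects column is selected.
with_join = conn.execute("""
    SELECT e.employee_name, e.project_id
    FROM employees e LEFT JOIN projects p ON e.project_id = p.project_id
    ORDER BY e.employee_id
""").fetchall()

# Rewritten form: the projects table is eliminated.
without_join = conn.execute("""
    SELECT e.employee_name, e.project_id
    FROM employees e
    ORDER BY e.employee_id
""").fetchall()

# Contrast: an inner join filters rows, so it cannot be dropped on this basis alone.
inner_join = conn.execute("""
    SELECT e.employee_name, e.project_id
    FROM employees e JOIN projects p ON e.project_id = p.project_id
    ORDER BY e.employee_id
""").fetchall()
```

The outer join keeps every employee (at least one row each) and the unique constraint prevents duplicates (at most one match each), so `with_join` and `without_join` hold the same rows, while the inner join returns only the two employees whose project exists.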
  46. 46. Advantages of Anchor Modeling 1.0 Ease of modeling • Expressive concepts and notation – Constructed using a small number of expressive concepts. • Historization by Design – Managing different versions is simpler. • Agile Development – Facilitates iterative and flexible modeling. • Reusability and Automation. 2.0 Simplified Database Maintenance. • Ease of attribute changes • Absence of NULL values • Simple Index design – clustered/B-tree indexes. • No Updates, Only Inserts! 3.0 High performance databases • High run-time performance – few columns per table, table elimination. • Efficient storage – Smaller size than normalized databases. • Less Index space needed. • Reduced deadlock issues – Only Inserts!!!
  47. 47. Data Vault Modeling • A flexible data modeling technique built for data warehousing especially when implemented on MPP-environments. • Removes any need for multiple data storages as it stores information as it is delivered to the data warehouse, thereby automatically supporting compliance issues (Basically we divide the information into chunks of information regarding a specific business entity or more precise a business key). Created by Dan Linstedt, the definition is as follows: “A detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3NF and Star Schemas. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.”
  48. 48. Data Vault Modeling The Data Vault model comprises three basic types of tables: HUB - Contains a list of unique business keys, each having its own surrogate key. LINK - Establishes relationships between business keys (typically hubs, but links can link to other links). SATELLITE - Holds descriptive attributes that can change over time (similar to a Kimball Type II slowly changing dimension). Data Vault Steps A simplified process can be described in a few steps: take a source, define the business key, and separate your target model into 4 types per source. * Business keys - Hub * Slowly Changing Attributes - Satellite * Rapidly Changing Attributes – Satellite * Business key relationships – Links
  49. 49. Data Vault Modeling Hubs and Satellites Customer Hub Customer_Name satellite
  50. 50. Data Vault Benefits • Scalable and Flexible architecture • Iterative/Agile/Adaptive Data warehousing • Near-Real-Time Loads • History data re- loads • In-Database data mining • Terabytes to Petabytes of information (Big Data) • Incremental build out • Seamless integration of unstructured data • Dynamic Model Adaptation – self healing • Business rule changes (with Ease) Data Vault Modeling
  51. 51. Data Vault Modeling Data Vault Issues: Business Perspective • Data in the DV is not “cleansed or quality checked”. • Using a DV forces examination of source data processes, and source business processes. • Businesses believe their existing operational reports are “right”, the DV architecture proves this is not always the case. • Business Users from different units MUST agree on the elements (scope) they need in the Data Vault before parts of it can be built. • Currently there is only one source of information exchange, there are no books on the Data Vault (yet). • Some businesses fight the idea of implementing a new architecture, they claim it is yet unproven.
  52. 52. Data Vault Modeling Data Vault Issues: Technical Perspective: • Data Vault model introduces many many joins • Data Vault model is based on MPP computing, not SMP computing, and is not necessarily a clustered architecture. • Data Vault contains all deltas, only houses deletes and updates as status flags on the data itself. • Data must be made into information BEFORE delivering to the business. • Stand-alone tables for calendar, geography, and sometimes codes and descriptions are acceptable. • 60% to 80% of source data typically is not tracked by change, forcing a re-load and delta comparison on the way into the DV. • Businesses must define the metadata on a column based level in order to make sense of the Data Vault storage paradigm.
  53. 53. Data Vault Modeling Steps for Data Vault modeling Step1: Establish the Business Keys, Hubs Step 2: Establish the relationships between the Business Keys, Links Step 3: Establish description around the Business Keys, Satellites Step 4: Add Standalone components like Calendars and code or descriptions for decoding in Data Marts Step 5: Tune for query optimization, add performance tables such as Bridge tables and Point-In-Time structures
  54. 54. Data Vault Modeling Example 1 Let’s assume we have a case to integrate customer data into our data warehouse. The source system supplies the demanded information in two separate source files according to the schema presented below. Analysis has determined each attribute’s ability to change and categorised each attribute into business keys, slowly changing attributes, rapidly changing attributes and meta data. » File 1: * Business Key - Customer Number * Slowly Changing Attribute – Name * Slowly Changing Attribute – Birth Date * Slowly Changing Attribute – Marital Status * Rapidly Changing Attribute – Income * Meta Data – Changed Date * Meta Data - From Date » File 2: * Business Key - Customer Number * Slowly Changing Attribute – Tax Zone * Rapidly Changing Attribute – Loyalty Value * Meta Data - From Date
  55. 55. Data Vault Example Contd….. A data vault model is created as follows: » Business keys are inserted into the Customer Hub, including load date and source information from both files. » Slowly changing attributes from file 1 are inserted into a specific satellite for that information only. » Rapidly changing attributes are divided one by one to specific satellites for each file separately. » Slowly changing attributes from file 2 are inserted into a specific satellite for that information only. » No synchronisation is needed at load time since relationships are created on the fly using valid from dates. » New additional data (regardless if the information comes in a new or extended file) is added to new satellites according to the principle of rapid or slowly changing attributes.
  56. 56. Data Vault Example Contd….. The data vault model for the example would be as follows:
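The load steps above can be sketched with plain Python structures. This is an illustration under stated assumptions, not the deck's implementation: in-memory dicts and lists stand in for hub and satellite tables, every function and variable name is hypothetical, the customer values are invented, and satellites add a row only when the delivered attributes differ from the latest stored row.

```python
from datetime import date

hub_customer = {}        # business key -> hub row (surrogate key, load date, source)
sat_file1_slow = []      # Name, Birth Date, Marital Status
sat_file1_income = []    # Income (rapidly changing)
sat_file2_slow = []      # Tax Zone
sat_file2_loyalty = []   # Loyalty Value (rapidly changing)


def hub_key(customer_number, load_date, source):
    """Return the surrogate key for a business key, inserting the hub row only once."""
    if customer_number not in hub_customer:
        hub_customer[customer_number] = {
            "sk": len(hub_customer) + 1,
            "load_date": load_date,
            "source": source,
        }
    return hub_customer[customer_number]["sk"]


def append_if_changed(satellite, sk, from_date, attrs):
    """Insert-only load: add a satellite row when attrs differ from the latest row."""
    rows = [r for r in satellite if r["sk"] == sk]
    latest = max(rows, key=lambda r: r["from_date"], default=None)
    if latest is None or latest["attrs"] != attrs:
        satellite.append({"sk": sk, "from_date": from_date, "attrs": attrs})


def load_file1(row, load_date):
    sk = hub_key(row["customer_number"], load_date, "file1")
    slow = {k: row[k] for k in ("name", "birth_date", "marital_status")}
    append_if_changed(sat_file1_slow, sk, row["from_date"], slow)
    append_if_changed(sat_file1_income, sk, row["from_date"], {"income": row["income"]})


def load_file2(row, load_date):
    sk = hub_key(row["customer_number"], load_date, "file2")
    append_if_changed(sat_file2_slow, sk, row["from_date"], {"tax_zone": row["tax_zone"]})
    append_if_changed(sat_file2_loyalty, sk, row["from_date"],
                      {"loyalty_value": row["loyalty_value"]})


# Invented sample deliveries for one customer.
base = {"customer_number": "C-1001", "name": "A. Kumar",
        "birth_date": date(1980, 5, 17), "marital_status": "Single"}
load_file1({**base, "income": 50000, "from_date": date(2014, 1, 1)}, date(2014, 1, 2))
load_file2({"customer_number": "C-1001", "tax_zone": "Zone-A",
            "loyalty_value": 12, "from_date": date(2014, 1, 1)}, date(2014, 1, 2))
load_file1({**base, "income": 55000, "from_date": date(2014, 6, 1)}, date(2014, 6, 2))
```

After these three loads the hub holds a single row for C-1001, the slowly changing satellite for file 1 still has one row, and the income satellite has two, which mirrors the split between slowly and rapidly changing attributes.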
  57. 57. Data Vault Example 2 3NF to Data Vault conversion: Consider the following 3NF: SK fields are Surrogate Keys, BK fields are Business Keys
  58. 58. Data Vault Example -2 3NF to Data Vault conversion: The data vault model looks like the following: 1) Instead of each master table in 3NF, we add a hub and a satellite. 2) Instead of the transactional table, we add Link table and Satellite. 3) Instead of the joins between master tables, we add Link tables.
  59. 59. Data Vault Example -2 3NF to Data Vault conversion: Adding attributes/entities into the data vault model is very easy: Attributes like customer demographics, and new table named Delivery can be added without any changes to existing tables.
  60. 60. Data Vault Modeling example - 3 3NF model
  61. 61. Data Vault Modeling example - 3 Data Vault Hub Design
  62. 62. Data Vault Modeling example - 3 Data Vault Hubs, Links and Satellites Design
  63. 63. Data Vault Modeling example - 3 Completed Design
  64. 64. Other Flexible Data models The other flexible models are based on the Decomposition Storage Model (DSM). One popular DSM variant is the Index Table Model, which is used primarily in SaaS environments. DSM Structure: • Records are stored as a set of binary relations. • Each relation corresponds to a single attribute and holds <key, value> pairs. • Each relation is stored twice: one copy clustered (indexed) by key and the other clustered by value. Example in NSM (N-ary Storage Model, the conventional row layout), with columns ACCT, TYPE, OVERDRAWN?, MIN BAL: (335, null, null, null); (690, Checking, N, null); (122, Saving, null, 100).
  65. 65. Other Flexible Data models DSM structure for the same example, one binary relation per attribute: ACCT: 335, 690, 122. ACCT-TYPE: (690, Checking), (122, Saving). ACCT-OVERDRAWN: (690, N). ACCT-MIN BAL: (122, 100).
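A minimal Python sketch of the decomposition may help here. It assumes the NSM rows implied by the DSM relations on the slide (null cells shown as `None`); the function names `decompose`, `stored_copies` and `recompose` are hypothetical and not part of any DSM implementation.

```python
def decompose(records, key):
    """Split NSM-style rows into one binary relation per attribute.

    Null values are simply not stored, which is how DSM eliminates them.
    """
    relations = {}
    for rec in records:
        for attr, value in rec.items():
            if attr != key and value is not None:
                relations.setdefault(attr, []).append((rec[key], value))
    return relations


def stored_copies(relation):
    """DSM keeps each relation twice: once ordered by key, once by value."""
    by_key = sorted(relation, key=lambda kv: kv[0])
    by_value = sorted(relation, key=lambda kv: kv[1])
    return by_key, by_value


def recompose(acct, relations):
    """Rebuild one logical record by looking the key up in every relation."""
    row = {"ACCT": acct}
    for attr, pairs in relations.items():
        row.update({attr: v for k, v in pairs if k == acct})
    return row


# Rows inferred from the slide's DSM relations.
nsm = [
    {"ACCT": 335, "TYPE": None, "OVERDRAWN": None, "MIN_BAL": None},
    {"ACCT": 690, "TYPE": "Checking", "OVERDRAWN": "N", "MIN_BAL": None},
    {"ACCT": 122, "TYPE": "Saving", "OVERDRAWN": None, "MIN_BAL": 100},
]
dsm = decompose(nsm, "ACCT")
```

Recomposing a record touches every attribute relation, which is the join cost behind the retrieval drawbacks listed for DSM.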
  66. 66. Other Flexible Data models Example: Distributed relations. NSM: R1 (SS#, NAME, DOB): (123-45-6789, Lara, 6/11/76), (987-56-3488, Nicole, 3/30/79). R2 (SS#, NAME, DOB): (987-56-3488, Nicole, 3/30/79), (346-09-0227, Amber, 9/17/80). DSM: R1.SS#: 123-45-6789, 987-56-3488. R2.SS#: 987-56-3488, 346-09-0227. SS#-NAME: (123-45-6789, Lara), (987-56-3488, Nicole), (346-09-0227, Amber). SS#-DOB: (123-45-6789, 6/11/76), (987-56-3488, 3/30/79), (346-09-0227, 9/17/80). Note: R1 and R2 are in different distributed databases.
  67. 67. Other Flexible Data models Advantages of DSM: • Eliminates null values. • Supports distributed relations (very useful in cloud environments). • Managing deltas is easier. • Simple storage structure. • Uniform access method (key-based and attribute-based access only). • Basis for columnar and NOSQL data models. Drawbacks of DSM: • DSM uses more storage (between 1 and 4 times that of NSM). • Modifying an attribute requires 3 disk writes (2 for the record, 1 for the index); an insert requires 2 disk writes. • Retrieval query performance depends on: the number of projected attributes, the size of intermediate results (due to joins), and the number of records to be retrieved.
  68. 68. Other Flexible Data models Index Table Model • Primarily used in SaaS environments. • Comprises a base table and a number of supporting tables. The base table contains all columns common to all individual tenant tables, plus an additional column called Index. • Each supporting table has 2 columns: one for the index and one for a column that is not common to all tenants. • If there are "n" non-common columns among the private tables, this model will have "n" supporting tables apart from the base table. • Reduces sparsity in the tables. • The index provides better access to the required information than other methods. • This model is based on the Decomposition Storage Model (DSM).
  69. 69. Other Flexible Data models Index Table Model contd. Example: Original table ACCOUNT (AID, NAME, HOSPITAL, NO.OF BEDS): (1, ACME, ST.MARY, 135), (2, GUMP, STATE, 1042). Index Table model: Base Table (Index, Tenant_ID, AID, Name): (1, 17, 1, ACME), (2, 17, 2, GUMM), (3, 42, 1, BANNER), (4, 35, 1, BALE). Supporting table (Index, Hospital): (1, St.Mary), (2, Manipal). Supporting table (Index, No_Beds): (1, 135), (2, 1045).
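As a rough sketch of how a tenant's private table could be reassembled from this layout, the Python below joins the base table to the supporting tables through the Index column. The values are copied from the index-table side of the slide; the dict-based storage and the function name `tenant_rows` are assumptions made for illustration.

```python
# Base table: columns common to every tenant, plus the Index column.
base_table = [
    {"index": 1, "tenant_id": 17, "aid": 1, "name": "ACME"},
    {"index": 2, "tenant_id": 17, "aid": 2, "name": "GUMM"},
    {"index": 3, "tenant_id": 42, "aid": 1, "name": "BANNER"},
    {"index": 4, "tenant_id": 35, "aid": 1, "name": "BALE"},
]

# One supporting table per non-common column, keyed by Index.
supporting = {
    "hospital": {1: "St.Mary", 2: "Manipal"},
    "no_beds": {1: 135, 2: 1045},
}


def tenant_rows(tenant_id):
    """Rebuild a tenant's private table from the base and supporting tables."""
    rows = []
    for base in base_table:
        if base["tenant_id"] != tenant_id:
            continue
        row = {"aid": base["aid"], "name": base["name"]}
        for column, table in supporting.items():
            # Only tenants that own this column have an entry: no null padding.
            if base["index"] in table:
                row[column] = table[base["index"]]
        rows.append(row)
    return rows
```

Tenant 17 gets its extra hospital columns back, while tenants 42 and 35 see only the common columns, so no sparse columns are stored for them.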
  70. 70. NOSQL Data Models
  71. 71. NOSQL data modeling! Do we need it? • Schema less, yet need data structure based on application data access path (one data access path per data structure). • Data modeling ends up in the code of the application (No change required for physical data structure). • Data architect involvement is crucial in NOSQL implementation. NOSQL Data Models
  72. 72. Typical Scenario Webinar Recording information. Structure: Device IP Address, Program and Date (Primary key), other related information like duration,content etc... Sample data :,Hadoop,20140521230000 Query types: • SELECT * FROM recording where device_ip = ‘’; • SELECT COUNT(*) FROM recording group by program; • SELECT COUNT(*) FROM recording group by date; The above are possible in an RDBMS, how about in a NOSQL database? NOSQL Data Models
  73. 73. NOSQL Data Models NOSQL System Families • The Key-Value model is the simplest, yet a powerful, model. One of its drawbacks is the inability to support key range processing. • The Ordered Key-Value model overcomes this limitation and improves aggregation capabilities, but it does not provide value modeling. • The BigTable (Column Family) model supports value modeling as a map-of-maps-of-maps, namely column families, columns, and timestamped versions. • Document databases handle values of arbitrary complexity and support database-managed indexes by field name. • The Graph model has evolved from Ordered Key-Value models with additional support for hierarchical modeling. (A graph is an abstract representation of a set of objects (nodes), some of which are connected by links (relationships).)
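The "map-of-maps-of-maps" idea behind the column family model can be illustrated with nested Python dicts. This is a toy sketch, not the API of any real store: `put`, `get`, the row keys and the sample values are all hypothetical.

```python
# row key -> column family -> column -> {timestamp: value}
table = {}


def put(row_key, family, column, value, ts):
    """Write one timestamped version of a cell."""
    versions = (table.setdefault(row_key, {})
                     .setdefault(family, {})
                     .setdefault(column, {}))
    versions[ts] = value


def get(row_key, family, column, as_of=None):
    """Return the newest version, or the newest one not later than `as_of`."""
    versions = table.get(row_key, {}).get(family, {}).get(column, {})
    visible = [ts for ts in versions if as_of is None or ts <= as_of]
    return versions[max(visible)] if visible else None


put("user:1", "profile", "name", "Lara", ts=1)
put("user:1", "profile", "name", "Lara K.", ts=5)
put("user:1", "stats", "logins", 3, ts=2)
```

A plain key-value store would hold the whole row as one opaque value; here each level of the map (family, column, version) can be addressed on its own.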
  74. 74. NOSQL Data Models Examples of NOSQL Databases. Key-Value stores: Oracle NoSQL, Redis, Kyoto. BigTable (Column Family): Apache HBase, Apache Cassandra, Google Spanner/F1. Document: MongoDB, CouchDB. Graph: Neo4j, FlockDB.
  75. 75. NOSQL Data Models 1.0 Data Denormalization (Applicable to Key-Value stores, Document databases, BigTable databases) 2.0 Aggregation (Applicable to Key-Value stores, Document databases, BigTable databases) 3.0 Application Side Joins (Applicable to Key-Value stores, Document databases, BigTable databases and Graph databases) 4.0 Enumerable Keys (Applicable to Key-Value stores) 5.0 Dimensionality reduction 6.0 Index Table (Applicable to BigTable Databases) Data Modeling Techniques
  76. 76. NOSQL Data Models Normalization & Aggregation
  77. 77. Application side Joins NOSQL Data Models
  78. 78. NOSQL Data Models Enumerable Keys • Use ordered keys to traverse data. Example: by generating messageID from a sequence, the composite key userID_messageID enables traversal to the previous and succeeding messages for any given messageID. • Group data into buckets based on the ordered attribute. Example: create buckets based on time (day). Using this, a mailbox can be traversed forward or backward starting from any date. Dimensionality Reduction • Map multidimensional data to a Key-Value model or to another non-multidimensional model using dimensionality reduction methods. Example: Geohash.
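The enumerable-keys idea can be sketched with a sorted list standing in for an ordered key-value store. The sketch assumes composite keys are kept as `(userID, messageID)` tuples rather than the `userID_messageID` string, so that numeric ids sort correctly; the sample keys and the function name `neighbours` are hypothetical.

```python
import bisect

# Keys of an ordered key-value store, kept in sorted order.
keys = sorted([("alice", 1), ("alice", 2), ("alice", 5), ("bob", 1)])


def neighbours(user, message_id):
    """Return the previous and next message keys of the same user."""
    i = bisect.bisect_left(keys, (user, message_id))
    if i == len(keys) or keys[i] != (user, message_id):
        raise KeyError((user, message_id))
    prev_key = keys[i - 1] if i > 0 and keys[i - 1][0] == user else None
    next_key = (keys[i + 1]
                if i + 1 < len(keys) and keys[i + 1][0] == user else None)
    return prev_key, next_key
```

Stepping to a neighbour is a position lookup in the ordered key space, which is what an unordered key-value store cannot offer.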
  79. 79. NOSQL Data Models Index Table
  80. 80. NOSQL Data Models Hierarchy Modeling Techniques 1.0 Tree Aggregation (Key-Value stores, Document databases) 2.0 Adjacency Lists 3.0 Materialized Paths 4.0 Nested Sets 5.0 Batch Graph Processing
  81. 81. NOSQL Data Modeling Techniques Tree Aggregation Efficient when the entire tree is accessed at once (as a single record). Search, direct access to individual nodes, and updates could be inefficient.
  82. 82. NOSQL Data Models Adjacency lists • A simple way of modeling a graph. • Each node is modeled as an independent record that contains arrays of its direct ancestors and descendants. • Enables traversing the graph by parents or children. • Inefficient for deep or wide traversals. • Inefficient for retrieving the entire subtree of a given node.
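A small Python sketch of an adjacency-list layout follows; the node names and the `descendants` helper are hypothetical. Each dict entry plays the role of one independently stored record, and the lookup counter shows why fetching a whole subtree is costly.

```python
from collections import deque

# One record per node, each holding its direct parents and children.
nodes = {
    "A": {"parents": [], "children": ["B", "C"]},
    "B": {"parents": ["A"], "children": ["D"]},
    "C": {"parents": ["A"], "children": []},
    "D": {"parents": ["B"], "children": []},
}


def descendants(start):
    """Breadth-first walk; returns the descendants and the record lookups used."""
    found, lookups = [], 0
    seen = {start}
    queue = deque([start])
    while queue:
        record = nodes[queue.popleft()]   # one store lookup per visited node
        lookups += 1
        for child in record["children"]:
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found, lookups
```

Reading the subtree under "A" takes one lookup per node visited, so the cost grows with the depth and width of the tree.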
  83. 83. NOSQL Data Models Materialized paths • Attribute each node with the identifiers of all its parents or children, so that its ancestors or descendants can be determined without traversal. • Avoids recursive traversals of tree-like structures.
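One common way to realise materialized paths is to store, with each node, the path of identifiers from the root down to that node. The sketch below assumes that variant (parents only); the "/" separator, the node names and the helper names are illustrative choices.

```python
# Each node stores the full path of identifiers from the root to itself.
paths = {"A": "A", "B": "A/B", "C": "A/C", "D": "A/B/D"}


def ancestors(node):
    """All ancestors, read directly from the stored path."""
    return paths[node].split("/")[:-1]


def subtree(node):
    """All descendants, found by a prefix match instead of recursion."""
    prefix = paths[node] + "/"
    return sorted(n for n, p in paths.items() if p.startswith(prefix))
```

Both questions are answered from the stored paths alone; the trade-off is that moving a node means rewriting the path of every node beneath it.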
  84. 84. NOSQL Data Models Nested Sets Store the leaves of the tree in an array, and map each non-leaf node to a range of leaves.
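A minimal Python sketch of this layout, with a hypothetical tree in which A is the root, B and C are its children, and D to G are leaves. Ranges are stored as half-open `(start, end)` index pairs, which is an assumption of this sketch.

```python
# Leaves in tree order; each non-leaf node owns a half-open index range.
leaves = ["D", "E", "F", "G"]
ranges = {"A": (0, 4), "B": (0, 2), "C": (2, 4)}


def leaves_under(node):
    """All leaves below a non-leaf node: a single array slice."""
    start, end = ranges[node]
    return leaves[start:end]


def nodes_containing(leaf):
    """Non-leaf nodes whose range covers the given leaf (its ancestors)."""
    position = leaves.index(leaf)
    return sorted(n for n, (start, end) in ranges.items()
                  if start <= position < end)
```

Reads are a slice or a range comparison, but inserting a leaf shifts the array and forces the affected ranges to be updated.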
  85. 85. NOSQL Data Models Nested Documents Flattening Example document: Name: John; Math: Excellent; Poetry: Poor; ... Approach 1: Name: John; Skill: Math, Poetry, ...; Level: Excellent, Poor, ... Query: Skill:Poetry AND Level:Excellent. Approach 2: Name: John; Skill_1: Math; Level_1: Excellent; Skill_2: Poetry; Level_2: Poor; ... Query: OR over i of (Skill_i:Poetry AND Level_i:Excellent). Approach 3: Name: John; SkillAndLevel: Math Excellent Poetry Poor ... Query: SkillAndLevel: Distance(Excellent Poetry)=0.
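The difference between the first two approaches can be made concrete with a short Python sketch. The function names are hypothetical and the matching is done in plain Python rather than in a search engine; the point is only that multi-valued fields (Approach 1) lose the pairing between a skill and its level, while numbered fields (Approach 2) keep it.

```python
doc = {"name": "John", "skills": [("Math", "Excellent"), ("Poetry", "Poor")]}


def flatten_multivalued(d):
    """Approach 1: one multi-valued field per attribute."""
    return {"Name": d["name"],
            "Skill": [skill for skill, _ in d["skills"]],
            "Level": [level for _, level in d["skills"]]}


def flatten_numbered(d):
    """Approach 2: numbered field pairs, Skill_i and Level_i."""
    flat = {"Name": d["name"]}
    for i, (skill, level) in enumerate(d["skills"], start=1):
        flat[f"Skill_{i}"] = skill
        flat[f"Level_{i}"] = level
    return flat


def match_multivalued(flat, skill, level):
    """Skill:<skill> AND Level:<level> over the multi-valued fields."""
    return skill in flat["Skill"] and level in flat["Level"]


def match_numbered(flat, skill, level):
    """OR over i of (Skill_i:<skill> AND Level_i:<level>)."""
    i = 1
    while f"Skill_{i}" in flat:
        if flat[f"Skill_{i}"] == skill and flat[f"Level_{i}"] == level:
            return True
        i += 1
    return False
```

John is poor at poetry, yet the Approach 1 query for an excellent poet matches him because "Poetry" and "Excellent" both occur somewhere in the document; the Approach 2 query does not.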
  86. 86. NOSQL Data Models Typical Scenario using NOSQL (From Slide 69) • Data storage structures are created based on all anticipated data access paths. • Each data structure supports a single data access path. • Example using a Column Family structure: Additional access paths can be supported by • creating secondary indexes (available in the latest versions), or • creating additional column families with different key combinations.
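The "one structure per access path" rule can be sketched for the webinar recording scenario as follows. Plain dicts stand in for column families; the structure names, the `record` function, the day-level bucketing of the date, and the sample IP address (taken from the documentation range 192.0.2.0/24) are all assumptions of this sketch.

```python
from collections import defaultdict

# One structure per anticipated access path.
by_device = defaultdict(list)          # device_ip -> [(program, timestamp)]
count_by_program = defaultdict(int)    # program -> number of recordings
count_by_date = defaultdict(int)       # yyyymmdd -> number of recordings


def record(device_ip, program, timestamp):
    """Write path: every structure is updated when a recording arrives."""
    by_device[device_ip].append((program, timestamp))
    count_by_program[program] += 1
    count_by_date[timestamp[:8]] += 1  # bucket by day (assumed granularity)


record("192.0.2.1", "Hadoop", "20140521230000")
record("192.0.2.1", "NoSQL", "20140522090000")
record("192.0.2.7", "Hadoop", "20140521233000")
```

Each of the three queries from the scenario becomes a direct key lookup, at the price of writing every recording three times.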
  87. 87. NOSQL Data Models Relational to NoSQL Example. Typical Queries: • Get user by user id • Get item by item id • Get all the items that a particular user likes • Get all the users who like a particular item. Relational Model:
  88. 88. Column Family structure NOSQL Data Models Column Key = Column Name
  89. 89. Approaches: 1.0 Normalized entities. • Cannot support join queries. 2.0 Normalized entities with custom indexes. • Supports join operations, but cannot get the details of all attributes. 3.0 Normalized entities with denormalized indexes. • Supports all the queries mentioned. 4.0 Partially denormalized indexes. • Super columns are hard to maintain and it becomes messy. NOSQL Data Models
  90. 90. NOSQL Data Models Typical Data Model for a Column Family database (Approach 3) Title and Name are de-normalized in User_By_Item and in Item_By_User.
  91. 91. NOSQL Data Models Typical Data Model for a Column Family database (Approach 3 with Timestamp) The above model supports time based queries (e.g. Most Recent) in addition.
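Approach 3 with timestamps might look roughly like the Python sketch below. The column family names `Item_By_User` and `User_By_Item` come from the slides; the dict layout, the `(timestamp, id)` column keys, the helper functions and the sample data are assumptions, since the original diagrams are not reproduced here.

```python
users = {}          # user_id -> {"name": ...}
items = {}          # item_id -> {"title": ...}
item_by_user = {}   # user_id -> {(ts, item_id): denormalized item title}
user_by_item = {}   # item_id -> {(ts, user_id): denormalized user name}


def add_user(user_id, name):
    users[user_id] = {"name": name}


def add_item(item_id, title):
    items[item_id] = {"title": title}


def like(user_id, item_id, ts):
    """Record a like in both index column families, copying Title and Name."""
    item_by_user.setdefault(user_id, {})[(ts, item_id)] = items[item_id]["title"]
    user_by_item.setdefault(item_id, {})[(ts, user_id)] = users[user_id]["name"]


def items_liked_by(user_id):
    """Titles of the items a user likes, most recent first."""
    columns = item_by_user.get(user_id, {})
    return [columns[key] for key in sorted(columns, reverse=True)]


def users_who_like(item_id):
    """Names of the users who like an item, most recent first."""
    columns = user_by_item.get(item_id, {})
    return [columns[key] for key in sorted(columns, reverse=True)]


add_user("u1", "Lara")
add_user("u2", "Nicole")
add_item("i1", "Data Vault Basics")
add_item("i2", "Anchor Modeling")
like("u1", "i1", ts=100)
like("u1", "i2", ts=200)
like("u2", "i1", ts=150)
```

All four typical queries are answered without a join. The cost of the denormalization is that renaming a user or retitling an item means updating every copied value in the index column families.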
  92. 92. Questions??
  93. 93. THANK YOU
  94. 94. Community Focused Volunteer Driven Knowledge Share Accelerated Learning Collective Excellence Distilled Knowledge Shared, Non Conflicting Goals Validation / Brainstorm platform Mentor, Guide, Coach Satisfied, Empowered Professional Richer Industry and Academia About Information Excellence Group Progress Information Excellence Towards an Enriched Profession, Business and Society
  95. 95. About Information Excellence Group Reach us at: blog: presentations: linked in: Facebook: Google+: twitter: #infoexcel email: Have you enriched yourself by contributing to the community Knowledge Share..