Data Design Process and Considerations


Data Design Process, Considerations and Practices, from Balaji Venkataraman, Data Architect, Dell. Presented to the Information Excellence Group in November 2013.


  1. Information Excellence Knowledge Share Session, October 2013: "Harvesting Information Excellence". Data Design Process, Considerations and Practices. Balaji Venkataraman, Data Architect, Dell.
  2. Data Design Process, Considerations and Practices. Balaji Venkataraman, Data Architect, Dell. Acknowledgements: Attila Finta, Chief Architect, Dell EBI, my mentor (content sourced from his work); DVR Subrahmanyam, Intel, for support, guidance and encouragement.
  3. Balaji Venkataraman, Data Architect, Dell Information Technology. A Data Warehouse / Business Intelligence professional with 16+ years of industry experience, including serving on the architecture and design teams of data warehouses such as Dell's. He has previously worked for Delphi-TVS, PSI Data Systems and iGate Corporation, and has played several individual-contributor roles: support, developer, ETL designer, data designer, information architect, etc. Currently a member of the Analytics and Business Intelligence Innovation Team at Dell, Bangalore.
  4. Agenda • Data Design and Challenges • Data Design Process and Deliverables • Considerations for Standards and Best Practices • Data Profile, Quality, Metadata, ILM and Columnar
  5. EDW at a glance: petabyte scale, 1000s of modeled entities, 1000s of user-maintained data sets, 100s of schemas, 1000s of users.
  6. Data Design? • Data architecture is defining, organizing and cataloging – logically and physically – the information of the enterprise that is electronically represented, stored and exchanged, in terms of its creation, meaning and utilization. • Data modeling in software engineering is the process of creating a data model for an information system by applying formal data modeling techniques: – To manage data as a resource – For the integration of information systems – For designing databases/data warehouses (aka data repositories) • Data models provide a structure for data used within information systems by providing specific definition and format.
  7. Considerations / Challenges in Data Modeling • Presentation of models to non-technical audiences • Selling the value of data modeling to the business • More focus on business needs, with less focus on implementation and the final product or "the perfect data model" • More emphasis on conceptual modeling, which may also help data modelers be more sought out in the emerging non-relational world • Adaptation of relational approaches to accommodate Big Data and other emerging technologies • Better engagement with NoSQL and other nontraditional databases • More education/training of current practitioners in the newest trends and technologies • Growth of new data architects, data modelers, and other data experts within colleges and universities • Changing from a control-oriented mindset where the model is the only focus to a service-oriented mindset that focuses on communication and marketing. Concerns: • Visibility – data designers need to provide more visibility • Speed – agile adoption • Quality – good rather than perfect • Perspective – incremental better than absolutes • Collaboration – work closely with other stakeholders (ETL architects / DBAs) • Skills – expand skillsets
  8. Types of Data Models • Conceptual – First step in organizing the data requirements – Consists of entity classes, representing kinds of things of significance in the domain, and relationship assertions about associations between pairs of entity classes • Logical – Describes the structure of some domain of information in a normalized fashion, with meaningful, descriptive, non-circular entity and attribute definitions – FK relationships to EDW entities already implemented • Physical – Describes the physical means used to store data; concerned with partitions, CPUs, storage, and the like
  9. Data Designer Core KSEs • Business systems analysis • Industry DW architectures • Data profiling • Data modeling – 3NF and dimensional • Database basics • Data modeling tools – DB features, forward engineering • Source-to-target mapping specification. Aptitudes and interests of a great data architect / analyst / designer: • Curiosity: "Why?", "What if?", "How?" • Ability to move from the conceptual and abstract to the specific and back again • Ability to visualize the big picture and the myriad details, and how the latter affect the former • Ability to communicate clearly: (1) convey, explain, illustrate; (2) hear, listen, elicit, understand – verbally and in writing • Ability to "speak the language" of the business and the technical, and translate between them • C3: communication, coordination, collaboration. "The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man." ~ George Bernard Shaw
  10. Data Design Process • Analysis • Logical modeling • Physical design. Deliverables: • Source data profile • Data model: logical/physical • DDL: generated from the model • Source-to-target (S2T) mapping
  11. Why Do You Need a Logical Data Model? It provides discipline and structure, facilitates communication, and gives a common understanding of business concepts that is cross-functional and unbiased. What goes into a data model? It graphically represents the data requirements and data organization of the business: – The things about which it is important to track information (entities) – Facts about those things (attributes) – Associations between those things (relationships). It is subject-oriented and designed in Third Normal Form: one fact in one place, in the right place.
  12. Data Design Process – Reviews / Checkpoints
  13. Source Data Profiling • Examine the nature, scope, content, meaning and structure of data from the source system proposed for inclusion in the DW • Determine the quality of the data: – Completeness – Consistency – Integrity • Determine its fit into the DW – Does it belong? Is it meaningful, useful, necessary? – How does it integrate with, complement, or supplement what already exists in the DW?
  14. Source Data Profiling – What To Look For? • Does the content match the name and expected information? • Candidates to omit – Columns 100% null, or containing only 1 value • Candidates to transform/conform/default – Code values that are similar or semantically the same as existing code sets in the DW but are inconsistent, e.g. country codes • Candidates to – Normalize – Create FKs – Compress values – Reduce column size – Apply a default value
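The profiling checks above (null rate, distinct values, max length, omission candidates) can be sketched in a few lines. This is a minimal illustration, assuming rows arrive as a list of dicts; the column names are hypothetical, not from the deck.

```python
# Minimal source-data profiling sketch: null rate, distinct count, max length,
# and the deck's omission rule (100% null, or containing only one value).
from collections import Counter

def profile_column(rows, col):
    """Summarize one column of a row set for data-profiling review."""
    values = [r.get(col) for r in rows]
    non_null = [v for v in values if v not in (None, "")]
    counts = Counter(non_null)
    return {
        "null_pct": 1 - len(non_null) / len(values) if values else 1.0,
        "distinct": len(counts),
        "max_len": max((len(str(v)) for v in non_null), default=0),
        "omit_candidate": len(counts) <= 1,   # 100% null or single-valued
    }

rows = [
    {"ctry_cd": "US", "legacy_fld": None},
    {"ctry_cd": "DE", "legacy_fld": None},
    {"ctry_cd": "US", "legacy_fld": None},
]
print(profile_column(rows, "legacy_fld"))  # fully null -> omission candidate
print(profile_column(rows, "ctry_cd"))
```

In practice the same summary would be produced by a profiling tool or SQL aggregates; the point is that each "what to look for" item maps to a cheap, mechanical check.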
  15. Source to Target Specification (S2T) • The data designer – Has examined the source data structure, content, and meaning – Understands the business requirements of the DW – Designs the DW target structures for the data • Therefore the data designer is in the best position to specify the movement and transformation of data from the source structure to the target structure in the DW: "connecting the dots" – Column-specific – Transformation and validation rules in pseudo-code
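A column-specific S2T specification can be thought of as a table of (source column, target column, rule) entries. The sketch below makes that concrete; the deck specifies rules in pseudo-code, and the callables here simply stand in for those rules. All column names are illustrative.

```python
# Hypothetical S2T mapping: one entry per target column, each carrying its
# source column and a transformation rule (a callable standing in for the
# pseudo-code rule a designer would write).
s2t = [
    {"src": "cntry",   "tgt": "CTRY_CD",     "rule": lambda v: (v or "").strip().upper()},
    {"src": "ord_amt", "tgt": "ORD_USD_AMT", "rule": lambda v: round(float(v), 3)},
]

def apply_s2t(src_row):
    """Apply every mapping entry to one source row, producing a target row."""
    return {m["tgt"]: m["rule"](src_row.get(m["src"])) for m in s2t}

print(apply_s2t({"cntry": " us ", "ord_amt": "12.3456"}))
```

Keeping the mapping as data (rather than burying it in ETL code) is what lets the designer own the specification while the ETL team owns the execution.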
  16. Data Warehouse Data Layers • Source – SOR / user-maintained data • STG – copy of raw data from the source within the DW; transient in nature; minimizes impact on the source • Base – DW integration layer; subject-oriented, integrated, single version of the truth, business-validated source for all data in the DW • Package – data marts and custom views; built for BI performance and SLAs; designed to reduce system load
  17. Data Design Principles • Model-driven design – Generating DB executables from the CASE tool – The CASE tool is where the data design is created, maintained, and documented – Source code (DDL) must be generated from the CASE tool – The data model and DDL versions are controlled artifacts • Build with performance in mind – Careful PK selection – Value compression – Minimize row size as much as possible – "Vertically partition" where it makes sense, i.e. split one large table into multiple tables having the same PI and partitioning • Extend existing logical data models – Don't create new models from scratch: extend/modify existing models – Start with the logical data model and make changes there, not in the physical model; allow the tool to derive the physical names
  18. Industry Logical Data Models • Teradata's iLDM – covers Manufacturing, Finance, Banking, Retail, etc. • Oracle • ARDM – Applied Resource Data Management – Health Care • Universal Data Model – Len Silverston • IBM's Advanced Data Model
  19. Model Management • Integrated model library environment, enabling cross-model analysis and reporting • Semi-automated version management, enabling a single data model to have multiple versions stored together with the exact deltas recorded and maintained by the tool, rather than model snapshots that exist as separate files stored in manually maintained folder structures – reduces the number of copies of models and prevents duplication of metadata • Underlying relational repository storing all data model metadata together, enabling not only repository-wide, cross-model reporting but also metadata extraction for integration with (for example) source-to-target mapping metadata and BI layer metadata • An integrated model platform can also enable (in a future phase) – automated monitoring, scorecarding, and other quality checks on a broader scale – streamlining and decentralizing some model administration functions while maintaining necessary coordination and quality controls • Collateral benefit: reducing storage and I/O demand on Pub Shares by transferring data models to a different platform
  20. Basic Data Modeling Standards (1) • Standard audit columns (a.k.a. "plumbing columns") on all tables – E.g. DW_SRC_SITE_ID, DW_LD_GRP_VAL, DW_INS_UPD_DTS • Data naming of entities and attributes – Meaningful and specific business names, sufficiently qualified, e.g. a qualified "… Last Name", not simply "Last Name" – Entity names should be unique across the enterprise and the DW, e.g. a qualified "… Organization", not simply "Organization" – Auto-abbreviation of table and column names by the data modeling tool – Based on the logical, unabbreviated name – Using the standard abbreviations list – Attributes must end with a class word, e.g. ID, CD, DT, AMT, etc.: "Payment" plus a class word, not simply "Payment"
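The two mechanical naming rules on this slide (tool-driven abbreviation from a standard list, and attributes ending in a class word) can be sketched as small checks. The abbreviation list and class-word set below are illustrative stand-ins, not Dell's actual standards.

```python
# Sketch of tool-style auto-abbreviation from a standard abbreviations list,
# plus a class-word suffix check. Both lookup tables are hypothetical examples.
ABBREVIATIONS = {"NUMBER": "NBR", "CODE": "CD", "DATE": "DT", "AMOUNT": "AMT",
                 "IDENTIFIER": "ID", "CUSTOMER": "CUST"}
CLASS_WORDS = {"ID", "CD", "DT", "AMT", "NBR", "QTY", "DTS"}

def physical_name(logical_name):
    """Derive an abbreviated physical column name from the logical name."""
    words = logical_name.upper().split()
    return "_".join(ABBREVIATIONS.get(w, w) for w in words)

def ends_with_class_word(column_name):
    """Enforce the 'attributes must end with a class word' rule."""
    return column_name.rsplit("_", 1)[-1] in CLASS_WORDS

print(physical_name("Customer Number"))            # -> CUST_NBR
print(ends_with_class_word(physical_name("Order Date")))
```

Modeling tools such as ERwin apply exactly this kind of list-driven derivation, which is why the standard insists names be generated from the logical model rather than typed by hand.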
  21. Basic Data Modeling Standards (2) • Use standard domains (pre-defined in the modeling tool) for common attributes – E.g. SO_NBR varchar(20), BU_ID integer, ITM_NBR varchar(30) • Business definitions for all entities and attributes – Non-circular, e.g. not "Order Type Code is a code representing types of orders" – Make a good-faith effort: involve the BSA and the business and source SMEs; give them an XLS to edit – Include sample/common values based on data profiling – If no one can tell you what it means and how it's used, then push back: "Then it is useless for reporting or analysis, and so should be omitted from the DW" – If the PM and BSA still insist it must come into the DW, then document its definition: (1) use this definition: "No definition available from source data stewards or end-users. Do not use this attribute. Its meaning and quality are unknown." and (2) provide sample values in the definition, preferably values that have been used recently and frequently.
  22. Data Modeling Best Practices • Always begin with a normalized model – Normalization requires understanding the data – A normalized data model is inherently optimized for: no data redundancy, a wide variation of access paths, high data integrity, and efficient data maintenance • Normalized means every non-PK attribute in the entity is wholly dependent on the entire PK • Define the unique natural/business key of each table – If the PK is simply a surrogate key (SK) adopted from the source, wherever possible identify the natural key, i.e. what makes the record unique in business terms (which drives the source system to generate a new SK value) – If the NK is different from the PK it can be reflected in the model as an alternate key (unique index)
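The "natural key as alternate key" pattern can be shown directly in DDL. The sketch below uses SQLite for portability; the table, column names, and key choice are illustrative, not the deck's actual schema.

```python
# Sketch: a table keyed by a source-supplied surrogate key, with the natural
# (business) key enforced as an alternate key via a unique index.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE sales_order (
        so_sk  INTEGER PRIMARY KEY,    -- surrogate key adopted from the source
        so_nbr VARCHAR(20) NOT NULL,   -- natural/business key component
        bu_id  INTEGER NOT NULL        -- natural/business key component
    )""")
# Alternate key: what makes the record unique in business terms
con.execute("CREATE UNIQUE INDEX ak_sales_order ON sales_order (so_nbr, bu_id)")

con.execute("INSERT INTO sales_order VALUES (1, 'SO-1001', 10)")
try:
    # Same business key under a different SK: the AK rejects it
    con.execute("INSERT INTO sales_order VALUES (2, 'SO-1001', 10)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```

Declaring the AK in the model (not just in ETL logic) documents the business identity of the row and gives the optimizer a uniqueness guarantee, which the later Uniqueness slide picks up.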
  23. Data Modeling Best Practices • Name new attributes consistently with the rest of the DW • Include the parent/reference entity in the model and create FK relationships to it – Ensures data type compatibility – Inherits names and definitions • Relationships should be explicitly defined in the data model for two reasons – Referential integrity – Depicting join keys • When including a reference entity from another data model (e.g. Sales Order in the Manufacturing model), color it gray and do not generate it
  24. Data Modeling Best Practices • Definitions – Natural key (NK) = the column(s) used as the immutable unique identifier meaningful to business users, i.e. the "business key". The business key is usually the PK used within and provided by the source system of record (SOR). – Surrogate key (SK) = the immutable, non-intelligible, unique numeric identifier (system-generated during ETL) corresponding to a globally unique NK • Surrogate keys – Useful to generate in the DW for: – Type 2 SCD, to support reproducible "as was" point-in-time reporting – Key integration from disparate sources with colliding key values – Come with the overhead of lookups in the load and extra joins in retrieval
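The "key integration from disparate sources with colliding key values" case, and the lookup overhead it implies, can be sketched in a few lines. This is a minimal in-memory illustration; real ETL would back the lookup with a persistent key map table.

```python
# Minimal sketch of ETL-time surrogate key assignment: a lookup keyed by the
# globally unique NK (source system + business key) hands out stable numeric SKs.
class SurrogateKeyGenerator:
    def __init__(self):
        self._next = 1
        self._by_nk = {}

    def sk_for(self, source_system, business_key):
        nk = (source_system, business_key)
        if nk not in self._by_nk:      # the lookup overhead paid on every load
            self._by_nk[nk] = self._next
            self._next += 1
        return self._by_nk[nk]

gen = SurrogateKeyGenerator()
print(gen.sk_for("ERP_A", "CUST-42"))  # new NK -> new SK
print(gen.sk_for("ERP_B", "CUST-42"))  # colliding source key, distinct NK -> distinct SK
print(gen.sk_for("ERP_A", "CUST-42"))  # repeat -> same stable SK
```

The two ERP sources reuse the same literal key value, but because the NK includes the source system the generated SKs stay distinct and stable across loads.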
  25. Data Modeling Best Practices • "Natural keys" or business keys – Facilitate natural joins of disparate data across the DW not consciously designed for integration – Should be used only where the DW team has assurance from Enterprise Architecture and the business segment IT that the business key is global and largely immune to source system changes or additions • When DW-generated SKs can be avoided: – No requirement for Type 2 SCD or key integration – Business keys fulfill the basic requirements of a good PK: unchanging and unique – Avoids the overhead of the SK lookup – Eliminates the need for table joins simply to obtain the business key for a FK
  26. Data Modeling Best Practices • Data that requires versioned history should reside in separate tables from data that does not, even if the identity and granularity of the data are the same – Example: Foobar Header (non-CDC, record updated in place) and Foobar Status History (CDC, changes are inserted as new records) – For the CDC table a DATE attribute (such as status date or update date) can be added to the natural key – Note how the status and status date are also included in the non-CDC table, where they will always be updated with the latest values, making a join to the Status History table unnecessary when the only thing required is the current status.
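The Foobar Header / Foobar Status History split can be shown end to end. The sketch below uses SQLite; the table shapes follow the slide's description (DATE added to the NK on the CDC table, latest values duplicated on the header), while the specific column names are illustrative.

```python
# Sketch of the CDC split: a non-CDC header updated in place, plus a status
# history table where the status date extends the natural key.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE foobar_header (
        foobar_nbr    TEXT PRIMARY KEY,
        cur_status_cd TEXT,                   -- always the latest status
        cur_status_dt TEXT                    -- always the latest status date
    );
    CREATE TABLE foobar_status_hist (
        foobar_nbr TEXT, status_dt TEXT, status_cd TEXT,
        PRIMARY KEY (foobar_nbr, status_dt)   -- DATE added to the natural key
    );
""")

def apply_status(nbr, status_dt, status_cd):
    # Header: update in place; History: insert a new record
    con.execute("INSERT OR REPLACE INTO foobar_header VALUES (?,?,?)",
                (nbr, status_cd, status_dt))
    con.execute("INSERT INTO foobar_status_hist VALUES (?,?,?)",
                (nbr, status_dt, status_cd))

apply_status("FB-1", "2013-10-01", "OPEN")
apply_status("FB-1", "2013-10-05", "CLOSED")
```

Queries that only need the current status read the one-row header; "as was" questions join to the history table instead.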
  27. Data Modeling Best Practices • If a parent/reference entity doesn't exist but it would help portray join paths and enable data type propagation, then create one but do not generate the table
  28. Data Modeling Best Practices • Unintelligible ID and Code attributes – Need a reference entity in the data model to decode the ID/Code value – In some cases it is acceptable to simply document the code descriptions in the attribute definition, if the set of valid values is small and stable – If the value cannot be decoded at all, then the field is of no value in reporting or analysis, and we should exclude it from the DW in transactional data • Every entity should have relationships – Every entity in a data model should have an explicit relationship to at least one other – If not, then how can the data be used? Is it really an island of data that has nothing to do with anything else in the company?
  29. Data Modeling Best Practices • Get input from anyone you can – Ad hoc and informal, with other data designers, SMEs, etc. – Anyone who can understand a data model can contribute valuable input and insight • Aim to maintain readability – Relationships not routed under entities – Entities not piled on top of one another – Minimize relationship lines lying on top of one another – Create custom submodels (ERwin subject areas) to portray particular sets of entities
  30. Design with Performance in Mind • Build performance-enhancing features into the physical data model from the start – Avoid the common mistake of "build first, tune later" – Any design that meets the functional and business requirements but lacks performance will always need to be revisited, creating rework, delays, and user dissatisfaction • Considerations – Load: consider "vertical partitioning", i.e. splitting a table into 2 or more tables with the same PK, if: – There are two separate sources for the entity, for different attributes – Some data for the entity is volatile (e.g. status) while most of the data remains static, or some attributes should have CDC history but others do not – Retrieval: ditto if: – Will data from two particular tables nearly always be retrieved together? – Will 80% of the queries on a 100-column table access only 20% of the columns? And vice versa?
  31. Indexing • Index liberally from the beginning, then monitor usage and pare down – When in doubt, add the additional indexes, then monitor how much each is actually used by the optimizer – If an index is rarely used in normal production access then it is not worth the operational overhead of maintaining it, so drop it – If it is a multi-column index and not all columns are used, drop and recreate the index with just the columns needed • In general, add extra indexes to: – Foreign keys – Other join columns – Frequently used filter columns, e.g. certain date fields
  32. Right-sizing Columns • Make columns the right size: large enough to accommodate all anticipated values, and no larger • The principal guide is the size in the source system: – Field size and max length of actual values – Be aware that the source system may have columns that are much larger than necessary – This often occurs when a company implements a purchased application that has been designed to accommodate the needs of many other purchasers of the application – Example: an EDW receives Customer Hub/Master data from a COTS application called Initiate, where Country Code is defined as Varchar(60). But in the EDW this field only ever contains ISO 2-character country codes, so we don't need to make that column in the Base table Varchar(60) when all it ever contains is Char(2).
  33. Right-sizing Columns (cont'd) • Corollary: "know thy data", i.e. – In the previous example we discovered that the EDW's actual data requires only 3% of the size (not only that, it can be a CHAR with value compression) – Other examples: – A Varchar(30) code column: when we look at the data we see that it always contains one of three values, none longer than 9 bytes, and in the EDW we can make it char(20) compress ('X','Y','Z') – A Varchar(20) column that contains two values, e.g. FOREIGN, DOMESTIC: can this be converted to a Foreign Flag char(1)?
  34. Right-sizing Columns (cont'd) • Why bother with this? – Shorter rows => more rows per block => more efficient I/O – Varchars don't always save space, and have performance impacts • What about the risks? Longer data may be added to the source system later! – Talk with source system and business SMEs – Show them your data profiling results – "Split the difference": for a Varchar(600) column in the source whose max actual length of current values is 60, we can double it and round up, to something like 150 – Key: perform "due diligence"
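The "split the difference" rule (double the profiled max actual length, then round up) is simple enough to express as a one-line formula. The rounding boundary below is an illustrative choice, not a stated Dell standard.

```python
# Sketch of the "split the difference" sizing rule: double the profiled max
# actual length, then round up to a convenient boundary.
import math

def right_size(max_actual_len, round_to=10):
    """Suggested target column size from profiled max actual value length."""
    doubled = max_actual_len * 2
    return math.ceil(doubled / round_to) * round_to

# The slide's example: Varchar(600) in the source, max actual length = 60
print(right_size(60, round_to=50))   # doubled to 120, rounded up to 150
```

The output matches the slide's suggestion of "something like 150" for the Varchar(600)/60 case, while leaving headroom if longer values appear in the source later.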
  35. Data Type / Domain Practices: Amounts • In general use the AMOUNT standard domain, DECIMAL(18,3) – This should suffice for most uses. If the business requires greater scale, it is acceptable to specify (18,4) or (18,5) – Avoid making it larger than Decimal(18) (unless storing the global GDP or the US federal debt) – In many DBMSs, Decimal(18) requires 8 bytes, while Decimal(19) and larger takes 16 bytes • Ensure the currency of the amount is clearly indicated – If the column will contain amounts in varying currencies then a separate column is needed to specify the currency code for the amount – If the amount is always in USD or EUR, then include the currency code in the column name, e.g. TXN_USD_AMT
  36. Data Type / Domain Practices: Dates and Timestamps. When the source provides a Timestamp column (a single field containing both date and time): • Verify that the column truly contains a real, non-zero time value. If it contains only a date value then make the target a DATE-only column, not a Timestamp – Date columns are more efficient than Timestamp columns: the DATE data type requires only 4 bytes, versus 10 for TIMESTAMP • If the Timestamp column contains a real, non-zero time value, and the source does not provide a parallel Date-only column, and it is likely that BI consumption will be concerned mainly with the date regardless of the time component, then create two target columns: a TIMESTAMP with the full date and time, and a DATE column with the date only – This provides more options for queries and joins and reduces the need to use date functions in queries.
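Both decisions on this slide (does the source timestamp ever carry a real time component? and deriving the parallel DATE-only column) are easy to automate during profiling and load. A minimal sketch using the standard library:

```python
# Sketch: detect timestamps that never carry a real time-of-day, and derive
# the parallel DATE-only column from a full timestamp.
from datetime import datetime, time

def has_real_time(ts_values):
    """True if any source value carries a non-zero time component."""
    return any(datetime.fromisoformat(v).time() != time(0, 0) for v in ts_values)

def split_timestamp(ts_str):
    """Produce both target columns: full timestamp plus date-only."""
    ts = datetime.fromisoformat(ts_str)
    return {"full_dts": ts, "date_only": ts.date()}

print(has_real_time(["2013-10-05 00:00:00", "2013-10-06 00:00:00"]))  # date-only source
print(split_timestamp("2013-10-05 14:30:00")["date_only"])
```

If `has_real_time` comes back false across a profiled sample, the slide's first rule applies and the target should be a DATE column only.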
  37. Data Type / Domain Practices: Quantities • If a quantitative numeric column contains data in varying units of measure then a separate column is required identifying the unit of measure for the quantity • If the quantity is always in a single unit of measure then include the unit of measure in the column name, e.g. HARD_DRIVE_CAPACITY_GB_QTY
  38. Uniqueness • Where possible, have a unique index – a PK or AK defined on the table • Traditionally we have refrained from enforcing uniqueness in the DW using DB constraints, but … – The cost-based query optimizer loves unique indexes, because they enable it to create more efficient query plans
  39. Partitioning • Generally refers to "horizontal partitioning" – Easy way to remember: think of a table as a spreadsheet that you are going to cleave down the middle … will you split it vertically or horizontally? • The main idea: the system and the user access the data in the table more efficiently most of the time if we create logical sections – If you have a table with a lot of rows and you know that the data is quite frequently accessed filtering on one or more particular columns, then consider range-partitioning on them, particularly dates – Partition elimination is one of the most powerful query performance boosters, occurring when the partition key is used in the WHERE clause of the SELECT statement – Note that partitioning incurs some overhead on load, and also in queries where the partition key is not part of the WHERE clause
  40. Consider Columnar Tables • Columnar suitability – Does the system have spare CPU resources? – Is using Insert/Select to load tables possible within the environment? • Columnar basics – Primarily for read-optimized environments and applications
  41. Not Null • Wherever possible define non-quantitative columns as NOT NULL, especially FKs – This may seem counter-intuitive because of the overhead of constraint enforcement by the DBMS – But the query optimizer built into the DBMS creates more efficient query plans based on this knowledge • Define quantitative columns – amounts, quantities, counts, etc. – as NOT NULL only if the source data is also not null • Where necessary, designate default values – Specified by the business, or … – 'N/A' or 0 for a column where a value is not expected in 100% of source records, or … – 'Unknown' or -1 for a column where a valid value is expected but was not received
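The default-value rules above can be captured as a small table applied during load. The column names below are hypothetical; the defaults follow the slide's convention ('N/A' where a value is not expected, 'Unknown'/-1 where it was expected but missing).

```python
# Sketch of load-time default substitution so NOT NULL targets never receive
# nulls. Column names are illustrative; defaults follow the slide's rules.
DEFAULTS = {
    "optional_code": "N/A",      # value not expected in 100% of source records
    "expected_code": "Unknown",  # valid value expected but not received
    "expected_id": -1,
}

def apply_defaults(row):
    """Replace null/empty source values with the designated defaults."""
    return {col: (row.get(col) if row.get(col) not in (None, "") else dflt)
            for col, dflt in DEFAULTS.items()}

print(apply_defaults({"optional_code": None, "expected_code": "", "expected_id": 7}))
```

Because every target column is guaranteed a value, the columns can be declared NOT NULL, giving the optimizer the knowledge the slide says it exploits.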
  42. Metadata Search
  43. ILM • ILM = Information Lifecycle Management – All information undergoes a "life" sequence: (1) it is created, (2) it is used/useful and meaningful for a definable period of time, and (3) at some point it is no longer useful and should be destroyed • Archive data to cheaper storage; purge old data • Benefits: – Mitigates the capital investment required to support growth of the data warehouse – Reduced table sizes boost performance on the EDW • Two aspects are actively managed: – Policy – Implementation
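The policy side of ILM reduces to assigning each dated record a disposition: keep, archive to cheaper storage, or purge. A minimal sketch, with the two horizons as illustrative policy parameters rather than stated Dell values:

```python
# Sketch of an ILM policy applied to dated records: rows past the archive
# horizon move to cheaper storage, rows past the purge horizon are destroyed.
from datetime import date

def ilm_disposition(record_dt, today, archive_after_days=365,
                    purge_after_days=365 * 7):
    """Classify one record under the ILM life sequence."""
    age = (today - record_dt).days
    if age > purge_after_days:
        return "purge"       # stage (3): no longer useful
    if age > archive_after_days:
        return "archive"     # still useful, but on cheaper storage
    return "keep"            # stage (2): actively used

today = date(2013, 11, 1)
print(ilm_disposition(date(2013, 10, 1), today))
print(ilm_disposition(date(2011, 10, 1), today))
print(ilm_disposition(date(2005, 10, 1), today))
```

The implementation side (what "archive" and "purge" physically do) stays separate from this policy function, mirroring the slide's split between policy and implementation.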
  44. Questions
  45. About the Information Excellence Group • Community focused, volunteer driven • Knowledge share, accelerated learning, collective excellence • Distilled knowledge shared; non-conflicting goals • A validation / brainstorming platform • Progressing information excellence towards an enriched profession, business and society • Mentor, guide, coach • Satisfied, empowered professionals • A richer industry and academia
  46. You Can Help this Community Grow: Something to Feel Genuinely Happy and Proud About • Host us: a two-hour monthly session, a half-day deep-dive session, or a full-day summit session • Speaker support: recommend speakers, suggest topics • Thank you for hosting us today • All our sessions are free for participants; all support and sponsorship is in non-cash mode
  47. About the Information Excellence Group • Reach us at: blog, LinkedIn, Facebook, presentations, Twitter (#infoexcel), email • Have you enriched yourself by contributing to the community knowledge share?