Data modelingzone geoffrey-clark-v2


I presented these slides at Data Modeling Zone Europe on September 24, 2013.

Published in: Travel, Technology

  • Jeff Kibler @ Infobright

    1. Physical Database Design for MPP and Columnar Databases
       Geoffrey Clark, Principal at Lucidata, Inc.
       September 2013
       Copyright Lucidata, 2013
    2. Conceptual, Logical, Physical
       • Conceptual links to Business Strategy.
         – This is now becoming more quantitative
       • Logical maps to the Business Semantics.
         – Con-way example
       • Physical maps to your Data Stores
         – These will be more varied and heterogeneous in the future, due to specialization.
    3. HBR Business Strategy
       The New Dynamics of Competition, Michael D. Ryall, Harvard Business Review, June 2013
       Michael Porter’s Five Forces has dominated strategic and competitive analysis since 1979. This analysis has largely been conceptual in nature. Quantitative analysis on structured data in context is changing the nature of business culture, and improving business decisions. This drives the demand for data modeling and management.
    4. Design and Evolution
       • Hierarchies
         – 14th Century Europe and the Financial Revolution
         – Aggregations & Allocations
       • Cards, Tapes – physical analog media
       • Computer Science
         – Moore’s Law
           • Processor Speed Improvements
           • Memory Improvements
           • Media Improvements – Punch Cards, Tape, Disk, Memory
       • Design for Context & the Future
         – Character encoding - Internationalization
         – Calendars – Gregorian, Fiscal, Lunar, ... Y2K?
       • Files and Fields
         – Separation of Data and Metadata
         – Modern versions -> XML, JSON
       • Joins!
         – Data Sets
         – Super types, Sub types
         – Associations describe Networks!
    5. Technology’s Improvement Pace
    6. ... and Demand Forecast
    7. Separation of Church and State
       • Operational uses
         – Capture the data, hand-entered <- validation
         – A Data Flow, such as Order to Cash cycle
         – Con-way example of PRO(-gressive) numbers
       • Analytical uses
         – Desire for reports, Reporting crashes the Operational cycle, Cash flow problem.
         – Banished from OLTP, go make an ODS
    8. The Star Schema
       “The purpose of business computers is to sort data. A graphical representation of sorted data is called a ‘Star Schema’.” – Michael Silves, Principal at Datamorphosis
       • The right design at the right time becomes default doctrine for DW
         – Early RDBMS (Relational Data Base Management Systems)
           • Low memory, slow disks, slow CPU
           • Big Demand, with questions that spanned the datasets
           • Performance issues over large datasets
         – Interview Business people to get questions
           • Pre-process the data, based on business questions
         – Separation into Dimensions and Facts/Metrics
           • Link to Business Semantics
           • OLAP (On-Line Analytical Processing)
           • Educate Users on Aggregation and Allocation
           • Conformed Dimensions across Departments to give an Enterprise-wide view of the data.
       • But as technology changes, problems emerge
         – Ad-hoc questions require redesign & rework
         – With business hierarchies when one concept is both a fact & dimension, e.g. Shipment
         – Fact tables become difficult to distribute for MPP ... e.g. Teradata prefers a normalized DW
       • Example – transportation networks
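
The separation into dimensions and facts described on the slide above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the schema from the talk: the table names (dim_carrier, dim_date, fact_shipment), the columns, and the sample rows are all assumptions made up for the example.

```python
import sqlite3

def build_star(conn):
    # Hypothetical star: one fact table keyed to two dimension tables.
    conn.executescript("""
        CREATE TABLE dim_carrier (carrier_key INTEGER PRIMARY KEY, carrier_name TEXT);
        CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, fiscal_month TEXT);
        CREATE TABLE fact_shipment (
            carrier_key INTEGER REFERENCES dim_carrier(carrier_key),
            date_key INTEGER REFERENCES dim_date(date_key),
            weight_kg REAL
        );
    """)
    conn.executemany("INSERT INTO dim_carrier VALUES (?, ?)",
                     [(1, "Carrier A"), (2, "Carrier B")])
    conn.executemany("INSERT INTO dim_date VALUES (?, ?)",
                     [(20130901, "2013-09"), (20131001, "2013-10")])
    conn.executemany("INSERT INTO fact_shipment VALUES (?, ?, ?)",
                     [(1, 20130901, 100.0), (1, 20130901, 50.0),
                      (2, 20130901, 70.0), (2, 20131001, 30.0)])

def weight_by_carrier_and_month(conn):
    # Aggregate the metric in the fact table along two dimensions.
    return conn.execute("""
        SELECT c.carrier_name, d.fiscal_month, SUM(f.weight_kg)
        FROM fact_shipment f
        JOIN dim_carrier c ON c.carrier_key = f.carrier_key
        JOIN dim_date d ON d.date_key = f.date_key
        GROUP BY c.carrier_name, d.fiscal_month
        ORDER BY c.carrier_name, d.fiscal_month
    """).fetchall()

conn = sqlite3.connect(":memory:")
build_star(conn)
rows = weight_by_carrier_and_month(conn)
```

Any question outside the chosen dimensions (for example, weight per container) would need a new fact table or a redesign, which is the ad-hoc rework problem the slide raises.
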
    9. Example – Multi-Modal Freight
       • Shipments are agreements between a Carrier and a Shipper to move goods between two places.
       • Shipments can be split into “ProFreight” (which is assigned a cost via activity-based costing).
       • Shipments/ProFreight are composed of Freight handling units.
       • Freight can be “re-tendered” to another carrier, in which case it is linked to the original and the new Shipment.
       • Freight moves between places on one or many “VFCs” or Containers.
       • Containers are moved between places on Trips.
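
The re-tendering rule above means a freight handling unit can belong to more than one Shipment, so the relationship is many-to-many rather than a simple parent-child hierarchy. A minimal sketch of that one rule follows. The class and field names are hypothetical, and ProFreight, Containers and Trips are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Shipment:
    # Agreement between a carrier and a shipper to move goods between two places.
    shipment_id: str
    carrier: str
    shipper: str
    origin: str
    destination: str

@dataclass
class HandlingUnit:
    # A piece of freight; it keeps links to every shipment it has been part of.
    unit_id: str
    shipments: List[Shipment] = field(default_factory=list)

    def retender(self, new_shipment: Shipment) -> None:
        # Re-tendering keeps the original link and adds the new one.
        self.shipments.append(new_shipment)

    def carriers(self) -> List[str]:
        return [s.carrier for s in self.shipments]

original = Shipment("S1", "Carrier A", "Shipper X", "Portland", "Chicago")
replacement = Shipment("S2", "Carrier B", "Shipper X", "Portland", "Chicago")
unit = HandlingUnit("U1", [original])
unit.retender(replacement)
carriers = unit.carriers()
```

Because one unit now points at two shipments, Shipment cannot serve as a single-valued dimension of a freight fact without a bridge or a choice of "main" shipment.
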
    10. Kimball on Transportation, 3NF
    11. Kimball on Transportation, Star
    12. Table Level DW diagram
    13. Dim Modeling Dogma
        • “Our carefully normalized data model cannot be translated into a star schema...”
          – Dimensional modeling is necessary in order to generate correct queries
          – Any (normalized) data model can be transformed into a dimensional model...
          – ... and there exists an algorithm to do it
    14. Dim Modeling Example
    15. Star option considered
    16. Bridge table (remember, we tried this)
        We tried this with hesmith
        When selecting a main hierarchy has too much of a downside, and you don’t have a weight factor …
    17. Multi-fact option considered
    18. Oracle’s Algorithmic approach
    19. Basic DW diagram
    20. Build Dimensional Model in BI
    21. Freight moves through Networks
    22. Information Factory & MPP
        • Normalized Base
          – Integrate data once
            • Source -> Normalized -> Denormalized -> OK
            • Source -> Denormalized? -> Un-normalized -> ?
          – Detect problems and fix them once!
        • Does not preclude Data Marts
        • Massively Parallel Processing
          – Data distribution
        • Optimizations
          – Broadcast, Co-location, Re-distribution
        • Scalability, the quest for 1:1
        • Normalized data - reduced IO, better match for
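
The data distribution terms on the slide above can be illustrated with a toy model of an MPP cluster. This is a sketch under stated assumptions, not any vendor's implementation: the hash function (CRC32), the three-node cluster, and the function and field names are all made up for the example.

```python
import zlib
from collections import defaultdict

def node_for(key, node_count):
    # Deterministic hash distribution: the same key always lands on the same node.
    return zlib.crc32(str(key).encode("utf-8")) % node_count

def distribute(rows, key_field, node_count):
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[key_field], node_count)].append(row)
    return nodes

def redistribute(nodes, key_field, node_count):
    # Re-distribution: move every row to the node chosen by a different key.
    all_rows = [row for rows in nodes.values() for row in rows]
    return distribute(all_rows, key_field, node_count)

def colocated_join(left_nodes, right_nodes, key_field, node_count):
    # Co-location: each node joins only its local rows, with no data movement.
    joined = []
    for node in range(node_count):
        lookup = defaultdict(list)
        for right in right_nodes.get(node, []):
            lookup[right[key_field]].append(right)
        for left in left_nodes.get(node, []):
            for right in lookup.get(left[key_field], []):
                joined.append({**left, **right})
    return joined

NODES = 3
shipments = [{"shipment_id": i, "carrier": "Carrier A"} for i in range(1, 5)]
freight = [{"shipment_id": 1, "unit": "U1"},
           {"shipment_id": 1, "unit": "U2"},
           {"shipment_id": 3, "unit": "U3"}]

shipment_nodes = distribute(shipments, "shipment_id", NODES)
# Freight starts out distributed on a different key, so it must be moved first.
freight_by_unit = distribute(freight, "unit", NODES)
freight_nodes = redistribute(freight_by_unit, "shipment_id", NODES)
joined = colocated_join(shipment_nodes, freight_nodes, "shipment_id", NODES)
joined_units = sorted(row["unit"] for row in joined)
```

Broadcast is the third option on the slide: instead of re-hashing, a small table is copied to every node so that each node can join locally.
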
    23. Bob Conway’s Rapid Methodology
    24. Core Model with many Roles
        Transaction Tables
        Reference Tables
    25. Power of Conformed Dimensions
    26. Example Data Model & Hierarchy
    27. Data Flow and Usage
    28. Cubes and In-memory BI
        • Multi-Dimensional OLAP (MOLAP)
          – Drag-and-Drop OLAP environment, analysts become capable of self-service.
          – Dealt with Ragged Hierarchies, common in Financial data such as General Ledger (GL)
          – Limited by memory size
          – Pressure for more dimensionality floods cube size, build times from relational sources exceed load windows ...
        • Relational OLAP (ROLAP)
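
The point about dimensionality flooding cube size can be made concrete with simple arithmetic: the number of cells in a fully dense cube is the product of the dimension cardinalities. The dimension names and cardinalities below are invented for illustration, and real cubes are usually sparse, so this is an upper bound rather than a measured size.

```python
from math import prod

def dense_cube_cells(cardinalities):
    # Upper bound: one cell for every combination of dimension members.
    return prod(cardinalities.values())

dims = {"carrier": 50, "terminal": 300, "day": 365}
cells_before = dense_cube_cells(dims)

# Adding a single 10-member dimension multiplies the bound by 10.
dims_with_service = {**dims, "service_level": 10}
cells_after = dense_cube_cells(dims_with_service)
```

Each added dimension multiplies the bound, which is why build times can outgrow the load window well before the source data itself grows.
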
    29. But a network this size choked it
    30. Columnar vs Row-wise
        • Physically store data by Column vs Row
          – Rather like Fifth Normal Form.
          – If Semantically Organized, then Rapid Response to user’s ad-hoc aggregation requests.
          – Prefers batch loading, always loads once per column, even if loading one row.
        • Continues to Appear and Operate as a normal Row-wise cousin.
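
A small in-memory sketch of the two layouts may help here. It is a toy model, not how any particular columnar engine stores data, and the field names and sample rows are assumptions. It shows why an aggregation touches only one column, and why loading a single row costs one write per column.

```python
def to_columns(rows):
    # Pivot row-wise records into one list per column.
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def append_row(columns, row):
    # Loading one row still writes to every column; returns the write count.
    for name, value in row.items():
        columns[name].append(value)
    return len(row)

rows = [
    {"shipment_id": 1, "carrier": "A", "weight_kg": 100.0},
    {"shipment_id": 2, "carrier": "B", "weight_kg": 70.0},
    {"shipment_id": 3, "carrier": "A", "weight_kg": 50.0},
]
columns = to_columns(rows)

# An ad-hoc aggregation reads a single column, not every field of every row.
total_weight = sum(columns["weight_kg"])

writes = append_row(columns, {"shipment_id": 4, "carrier": "C", "weight_kg": 30.0})
```

Batching amortizes that per-column cost over many rows, which matches the slide's note that columnar stores prefer batch loading.
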
    31. Columnar IO example
        Compression becomes much more effective
        Reading a Column is like reading a Row
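
One way to see why compression becomes more effective on a column is run-length encoding, where repeated neighbouring values collapse into (value, count) pairs. Run-length encoding is used here only as a simple stand-in; the slide does not say which compression scheme is meant, and the sample values are invented.

```python
from itertools import groupby

def rle(values):
    # Run-length encode: collapse each run of equal neighbours into (value, count).
    return [(value, len(list(group))) for value, group in groupby(values)]

# Carrier codes in arrival order, as they would be scattered across rows.
arrival_order = ["A", "B", "A", "C", "B", "A"]
# The same values stored together as a sorted column.
sorted_column = sorted(arrival_order)

runs_unsorted = rle(arrival_order)
runs_sorted = rle(sorted_column)
```

In arrival order every value is its own run, so nothing is saved, while the sorted column shrinks from six values to three pairs.
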
    32. Design Pattern for Log Data
        Data Stewards for Master Data
        Data Stewards for Metadata
        Architects integrate data and metadata
        Architects organize data for analysis with physical in mind
        Architects identify levels for analysis, and distribution
        Columnar MPP
    33. Importance of Reference Data
    34. Infobright’s Database Landscape 2011
    35. Analytic Database Comparison
        Actian ParAccel, IBM Netezza, HP Vertica, Greenplum, Teradata, Sybase IQ
    36. Gartner’s Magic Quadrant
    37. Hadoop (Cloudera & Hortonworks)
        “Although it’s true that Hadoop can be valuable as an analytic silo, most organizations will prefer to get the most business value out of Hadoop by integrating it with—or into—their BI, DW, DI, and analytics technology stacks.” – Philip Russom, TDWI
    38. Hadoop for Analytics?
        Analytics performs best on Structured Data, for good reasons.
        Maintain MPP strengths in the solution through Architecture.
    39. Message from Hortonworks (Hadoop)
        “Although it’s true that Hadoop can be valuable as an analytic silo, most organizations will prefer to get the most business value out of Hadoop by integrating it with—or into—their BI, DW, DI, and analytics technology stacks.” – Philip Russom, TDWI
    40. Hadoop as ETL
    41. Data Flow Reference Architecture
    42. Message from Neo4J NoSQL
    43. Message from MongoDB (NoSQL)
    44. Message from Couchbase (NoSQL)