The Thinking Person’s Guide to Data Warehouse Design


  • Ultimately you want a physical engine that obsoletes your need to worry about logical design
  • This is what Teradata was doing way back in the day.
  • Microsoft DATAllegro story
  • Translate to CSV format and then load.
  • When you monitor, you’re monitoring your design for the most part – SQL excluded
  • Drizzle implementing my design. Some patches out there that help with some of this: userstats-v2

    1. The Thinking Person’s Guide to Data Warehouse Design. Robin Schumacher, VP Products, Calpont
    2. Agenda
        • Building a logical design
        • Transitioning to a physical design
        • Monitoring and tuning the design
    3. Building a logical design
    4. Why care about design…?
    5. What is the key component for success?
        In other words, what you do with your MySQL Server – in terms of physical design, schema design, and performance design – will be the biggest factor in whether a BI system hits the mark…
        * Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009.
    6. First – get/use a modeling tool
    7. The logical design for OLTP
    8. Simple reporting databases
        [Diagram: Application Servers and End Users; OLTP Database; Read Shard One; Reporting Database; ETL; Replication]
        Just use the same design on a different box…
    9. Horror story number one…
    10. The logical design for analytics/data warehousing
    11. Logical Design Considerations
        • Data types are defined generically, not tailored to a particular database engine
        • Entities aren’t necessarily designed for performance
        • Redundancy is avoided, but simplicity is still a goal
        • Bottom line: you want to make sure your data is correctly represented and is easily understood (new class of user today)
    12. Manual horizontal partitioning: a modeling technique to overcome large data volumes
    13. Manual vertical partitioning: a modeling technique to overcome wide tables/rows
    14. Pros/cons of manual partitioning
        Pros:
        • Less I/O if the design holds up
        • Easy to prune obsolete data
        • Possibly less object contention
        Cons:
        • More tables to manage
        • More referential integrity to manage
        • More indexes to manage
        • Joins are often needed to satisfy query requests
        • A redesign is often needed because the rows/columns you thought you’d be accessing together change; it’s hard to predict ad-hoc query traffic
    15. The bottom line on logical modeling
        • Use a modeling tool to capture your designs
        • Do not use a third-normal-form design for analytics; keep it simple and understandable
        • Manual partitioning is OK in some cases, but…
        • Let the database engine do the work for you
    16. Transitioning to a physical design
    17. SQL or NoSQL…? Row or column database…? How to scale…? Should I worry about high availability…? Index or no…? How should I partition my data…? Is sharding a good idea…?
    18. General list of top BI database design decisions
        • General architecture / data orientation
        • Storage engine selection
        • Physical table/index partitioning
        • Index creation and placement
        • Optimizing data loads
    19. Divide & conquer is the best approach
        • Whether you choose to go NoSQL, shard with normal or special MySQL engines, use MPP storage engines, or something similar, divide & conquer is your best friend
        • You can scale up and divide & conquer to a point, but you will hit disk, memory, or other limitations
        • Scaling up and out is the best future-proof methodology
    20. Divide & conquer via sharding
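The slide's diagram did not survive in this transcript, so the following is only a minimal sketch of the divide & conquer idea, not the deck's actual topology. It assumes four hypothetical shards, a hypothetical `customer_id` shard key, and CRC32 as the routing hash; rows are divided across shards, aggregated per shard, and the partial results are combined.

```python
import zlib

# Hypothetical shard names; the deck does not specify a topology.
SHARDS = ["shard0", "shard1", "shard2", "shard3"]

def shard_for(key, shards=SHARDS):
    """Map a key to a shard with a stable hash so the same key always lands on the same shard."""
    return shards[zlib.crc32(str(key).encode("utf-8")) % len(shards)]

def partition_rows(rows, key_field, shards=SHARDS):
    """Divide: group rows by destination shard."""
    buckets = {name: [] for name in shards}
    for row in rows:
        buckets[shard_for(row[key_field], shards)].append(row)
    return buckets

def total_amount(buckets):
    """Conquer: aggregate on each shard, then combine the partial sums."""
    partials = {name: sum(r["amount"] for r in rows) for name, rows in buckets.items()}
    return sum(partials.values())

# Illustrative data: 100 rows with amounts 10, 20, ..., 1000.
rows = [{"customer_id": i, "amount": i * 10} for i in range(1, 101)]
buckets = partition_rows(rows, "customer_id")
print(total_amount(buckets))  # 50500
```

A simple modulo scheme like this reshuffles most keys when the shard count changes, which is one reason real deployments often use range or directory-based routing instead.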
    21. What technologies you should be looking at
        * Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009.
    22. Row or column-based engine?
        • Yes, column-based tables: medium to very large data; very dynamic, query patterns change; need very fast loads, little DML; only need a subset of columns for a query
        • Yes, row-based tables: small to medium data; know exactly what to index, won’t change; will be doing lots of single inserts/deletes; will need most columns in a table for a query
    23. Column vs. row orientation
        A column-oriented architecture looks the same on the surface, but stores data differently than legacy/row-based databases…
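To make the storage difference concrete, here is a toy sketch (not how InfiniDB or any real engine lays out data on disk). The table, its column names, and the "values touched" counter are illustrative assumptions; the point is only that a single-column aggregate touches every field in a row layout but just one column in a column layout.

```python
# Three hypothetical rows of a sales table (column names are illustrative).
rows = [
    {"id": 1, "region": "east", "amount": 100},
    {"id": 2, "region": "west", "amount": 250},
    {"id": 3, "region": "east", "amount": 75},
]

# Row orientation: each record's values are stored together.
row_store = [tuple(r.values()) for r in rows]

# Column orientation: each column's values are stored together.
column_store = {name: [r[name] for r in rows] for name in rows[0]}

def sum_amount_row_store(store):
    """Whole rows are read even though only the third field is needed."""
    values_touched = sum(len(rec) for rec in store)
    return sum(rec[2] for rec in store), values_touched

def sum_amount_column_store(store):
    """Only the 'amount' column is read."""
    col = store["amount"]
    return sum(col), len(col)

print(sum_amount_row_store(row_store))        # (425, 9)
print(sum_amount_column_store(column_store))  # (425, 3)
```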
    24. Example: InfiniDB vs. “Leading” row DB
        • InfiniDB takes up 22% less space
        • InfiniDB loaded data 22% faster
        • InfiniDB total query times were 65% less
        • InfiniDB average query times were 59% less
        Notice that the queries are not only faster, but also more predictable
        * Tests run on standalone machine: 16 CPU, 16GB RAM, CentOS 5.4 with 2TB of raw data
    25. Why not use both…?
        • You can create a hybrid system where you use row-based tables and column-based tables in the same instance and same database
        • Use InnoDB for OLTP or MyISAM for certain read operations
        • Use column-based tables for analytics, data marts, or warehouses
        • You can scale out with column tables and use row-based tables locally
    26. Why not use both…?
    27. Most used DW storage engines internal to MySQL
        • MyISAM: high-speed query/insert engine; non-transactional, table locking; good for data marts, small warehouses
        • Archive: compresses data by up to 80%; fastest for data loads; only allows inserts/selects; good for seldom accessed data
        • Memory: main memory tables; good for small dimension tables; B-tree and hash indexes
        • CSV: comma separated values; allows both flat file access and editing as well as SQL query/DML; allows instantaneous data loads
        Also: Merge for pre-5.1 partitioning
    28. What about NoSQL options?
        • Standard model is not relational
        • Typically don’t use SQL to access the data
        • Take up more space than column databases
        • Lack special optimizers / features to reduce I/O
        • Really are row-oriented architectures that store data in ‘column families’, which are expected to be accessed together (remember logical vertical partitioning?). Individual columns cannot be accessed independently
        • Will be faster with individual insert and delete operations
        • Will normally be faster with single row requests
        • Will lag in typical analytic / data warehouse use cases
    29. Partitioning – not ‘if’ but ‘how’
            mysql> CREATE TABLE part_tab
                -> ( c1 int, c2 varchar(30), c3 date )
                -> PARTITION BY RANGE (year(c3)) (PARTITION p0 VALUES LESS THAN (1995),
                -> PARTITION p1 VALUES LESS THAN (1996), PARTITION p2 VALUES LESS THAN (1997),
                -> PARTITION p3 VALUES LESS THAN (1998), PARTITION p4 VALUES LESS THAN (1999),
                -> PARTITION p5 VALUES LESS THAN (2000), PARTITION p6 VALUES LESS THAN (2001),
                -> PARTITION p7 VALUES LESS THAN (2002), PARTITION p8 VALUES LESS THAN (2003),
                -> PARTITION p9 VALUES LESS THAN (2004), PARTITION p10 VALUES LESS THAN (2010),
                -> PARTITION p11 VALUES LESS THAN MAXVALUE );
            mysql> create table no_part_tab (c1 int, c2 varchar(30), c3 date);

            *** Load 8 million rows of data into each table ***

            mysql> select count(*) from no_part_tab where c3 > date '1995-01-01' and c3 < date '1995-12-31';
            +----------+
            | count(*) |
            +----------+
            |   795181 |
            +----------+
            1 row in set (38.30 sec)

            mysql> select count(*) from part_tab where c3 > date '1995-01-01' and c3 < date '1995-12-31';
            +----------+
            | count(*) |
            +----------+
            |   795181 |
            +----------+
            1 row in set (3.88 sec)
        90% response time reduction
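The speedup on this slide comes from the query touching only the partition that can hold 1995 dates. As a rough sketch of that lookup (a model of the `VALUES LESS THAN` rule, not MySQL's optimizer code), the bounds below are copied from the slide's `CREATE TABLE`; the function names are made up for illustration. The last lines recompute the slide's "90%" figure from its two timings.

```python
from bisect import bisect_right

# Upper bounds from the slide's VALUES LESS THAN clauses for p0..p10; p11 is MAXVALUE.
BOUNDS = [1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2010]

def partition_for_year(year):
    """A row goes to the first partition whose bound is strictly greater than year(c3)."""
    return f"p{bisect_right(BOUNDS, year)}"

def partitions_for_range(first_year, last_year):
    """Partitions that must be scanned for a query covering first_year..last_year."""
    lo = bisect_right(BOUNDS, first_year)
    hi = bisect_right(BOUNDS, last_year)
    return [f"p{i}" for i in range(lo, hi + 1)]

print(partition_for_year(1995))          # p1
print(partitions_for_range(1995, 1995))  # ['p1']
print(partitions_for_range(2003, 2012))  # ['p9', 'p10', 'p11']

# Timings reported on the slide, in seconds.
unpartitioned, partitioned = 38.30, 3.88
reduction = (unpartitioned - partitioned) / unpartitioned
print(round(100 * reduction, 1))  # 89.9
```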
    30. Partitioning – Stripe your Partitions
            CREATE TABLE T1 (col1 INT, col2 CHAR(5), col3 DATE) ENGINE=MYISAM
            PARTITION BY HASH(col1)
            (
              PARTITION P1
                DATA DIRECTORY = '/appdata1/data',
              PARTITION P2
                DATA DIRECTORY = '/appdata2/data',
              PARTITION P3
                DATA DIRECTORY = '/appdata3/data',
              PARTITION P4
                DATA DIRECTORY = '/appdata4/data'
            );
        Note that striping only works for some engines (e.g. MyISAM, Archive) and only on certain operating systems (e.g. the option is ignored on Windows). You can use the REORGANIZE PARTITION command to move current partitions to new devices.
    31. Partitioning – Smart Data Pruning
        Most data warehouses have pruning or obsolete data operations that remove unwanted data. Using partitioning allows you to remove obsolete data much more quickly and efficiently.
        DML DELETE:
            mysql> delete from t2 where
                -> c3 > date '1995-01-01' and c3 < date '1995-12-31';
            Query OK, 805114 rows affected (47.41 sec)
        VS. DROP PARTITION:
            mysql> alter table t1 drop partition p1;
            Query OK, 0 rows affected (0.03 sec)
            Records: 0  Duplicates: 0  Warnings: 0
        DROP PARTITION is a DDL operation, which runs much faster than a DML DELETE.
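Why the DDL path is so much cheaper can be seen in a toy model (an illustration only, not how MySQL stores partitions): dropping a partition discards a whole bucket without visiting its rows, while a DELETE has to examine rows one by one. The class, function names, and row counts below are hypothetical.

```python
from collections import defaultdict

class PartitionedTable:
    """Toy table that keeps rows in per-year buckets, standing in for RANGE partitions."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, year, row):
        self.partitions[year].append(row)

    def drop_partition(self, year):
        """DDL-style removal: discard the whole bucket without visiting its rows."""
        return len(self.partitions.pop(year, []))

    def count(self):
        return sum(len(rows) for rows in self.partitions.values())

def delete_year(rows, year):
    """DML-style removal on an unpartitioned list: every row is examined."""
    kept = [r for r in rows if r[0] != year]
    return kept, len(rows)

# Illustrative data: 200 rows for each year 1995..1999.
flat = [(1995 + i % 5, i) for i in range(1000)]
table = PartitionedTable()
for year, value in flat:
    table.insert(year, value)

removed = table.drop_partition(1995)
kept, examined = delete_year(flat, 1995)
print(removed, table.count())  # 200 800
print(len(kept), examined)     # 800 1000
```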
    32. Index Creation and Placement
        • If query patterns are known and predictable, and data is relatively static, then indexing isn’t that difficult
        • In a very ad-hoc environment, indexing becomes more difficult. You must analyze SQL traffic and index the best you can
        • Over-indexing a table that is frequently loaded / refreshed / updated can severely impact load and DML performance. Test dropping and re-creating indexes vs. doing in-place loads and DML. Realize, though, that queries will be affected while the indexes are dropped
        • Index maintenance (rebuilds, etc.) can cause issues in MySQL (locking, etc.)
        • Remember some storage engines don’t support normal indexes (Archive, CSV)
        • Remember that a benefit of (most) column databases is that they do not need or use indexes
    33. Optimizing for data loads
        • The two biggest killers of load performance are (1) very wide tables for row-based tables and (2) many indexes on a table
        • Stating the obvious, LOAD DATA INFILE and the high-speed loaders of column-based engines are the fastest way to load data vs. singleton or array insert statements
        • Column-based tables typically load faster than row-based tables with load utilities; however, they will experience slower insert/delete rates than row-based tables
        • Loading data in primary key order helps some engines (e.g. InnoDB)
    34. Optimizing for data loads
        • Move the data as close to the database as possible; avoid having applications on remote machines do data manipulations and send data across the wire a row at a time, which is perhaps the worst way to load data
        • It is often good to create staging tables, then use procedural language to do data modifications and/or create flat files for high-speed loaders
        • Loading data in time-based order helps some column databases like InfiniDB; logical range partitioning is then possible
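As a small sketch of the "create flat files for high-speed loaders" step, the helper below sorts staged rows by key and renders them as CSV text. The function name, the sample rows, and the choice of the first column as the key are assumptions made for illustration; a real pipeline would write to a file near the database server rather than to a string.

```python
import csv
import io

def to_load_file(rows, key_index=0):
    """Sort rows by key and render them as CSV text for a bulk loader."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in sorted(rows, key=lambda r: r[key_index]):
        writer.writerow(row)
    return buf.getvalue()

# Hypothetical staged rows, deliberately out of key order.
staged = [
    (3, "gamma", "1995-03-01"),
    (1, "alpha", "1995-01-15"),
    (2, "beta, inc", "1995-02-10"),
]
text = to_load_file(staged)
print(text, end="")
# 1,alpha,1995-01-15
# 2,"beta, inc",1995-02-10
# 3,gamma,1995-03-01
```

Because the embedded comma is quoted, a file like this would need the loader's field options to match, for example `FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'` with MySQL's LOAD DATA INFILE.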
    35. Monitoring and tuning the design
    36. Three performance analysis methods
        • Bottleneck analysis
        • Workload analysis
        • Ratio analysis
    37. Bottleneck analysis
        • The focus of this methodology is the answer to the question “what am I waiting on?”
        • With MySQL, unfortunately, it can be difficult to determine latency in the database server
        • Lock contention is rarely an issue in data warehouses
        • The new MySQL Performance Schema has a ways to go, in my opinion, before it is truly useful for bottleneck analysis
        • Problems found in bottleneck analysis translate into better lock handling in the app, partitioning improvements, better indexing, or storage engine replacement
    38. Workload analysis
        • The focus of this methodology is the answer to three questions: (1) Who’s logged on? (2) What are they doing? (3) How is my machine handling it?
        • Monitor active and inactive sessions. Keep in mind idle connections do take up resources
        • I/O and ‘hot objects’ are a key area of analysis
        • Key focus should be on SQL statement monitoring and collection; something that goes beyond standard pre-production EXPLAIN analysis
    39. Horror story number two…
    40. The pain of slow SQL
        * Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009.
    41. Workload analysis
        • SQL analysis basically becomes bottleneck analysis, because you’re asking where your SQL statement is spending its time
        • Once you have collected and identified your ‘top SQL’, the next step is to trace and interrogate each SQL statement to understand its execution
        • Historical analysis is important too; a query that ran fine with 5 million rows may tank with 50 million or with more concurrent users
        • Design changes usually involve data file striping, indexing, partitioning, or parallel processing additions
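One simple way to identify 'top SQL' from collected statements is to total elapsed time per normalized statement and rank the results. The sketch below assumes the statements have already been captured and normalized by some collector; the sample statements and timings are hypothetical, and `top_sql` is an illustrative helper, not part of any MySQL tool.

```python
from collections import defaultdict

# Hypothetical captured statements: (normalized SQL text, elapsed seconds).
samples = [
    ("select count(*) from part_tab where c3 > ? and c3 < ?", 3.9),
    ("select count(*) from no_part_tab where c3 > ? and c3 < ?", 38.3),
    ("select count(*) from no_part_tab where c3 > ? and c3 < ?", 41.1),
    ("select c2 from part_tab where c1 = ?", 0.02),
]

def top_sql(samples, n=2):
    """Rank statements by total elapsed time; return (statement, executions, total seconds)."""
    totals = defaultdict(lambda: [0, 0.0])
    for stmt, secs in samples:
        totals[stmt][0] += 1
        totals[stmt][1] += secs
    ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
    return [(stmt, count, round(total, 2)) for stmt, (count, total) in ranked[:n]]

for stmt, count, total in top_sql(samples):
    print(f"{total:8.2f}s  x{count}  {stmt}")
```

Ranking by total time rather than by single slowest execution surfaces statements that are individually tolerable but run often, which is usually where design changes pay off.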
    42. Ratio analysis
        • Least useful of all the performance analysis methods
        • May be OK to get a general rule of thumb as to how various resources are being used
        • Do not be misled by ratios; for example, a high cache hit ratio is sometimes meaningless. Databases can be brought to their knees by excessive logical I/O
        • Design changes from ratios typically include the altering of configuration parameters and sometimes indexing
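The warning about cache hit ratios can be illustrated with a little arithmetic. The sketch assumes the common definition, hit ratio = 1 − physical reads / logical reads, and two made-up workloads; the numbers are hypothetical and chosen only to show that the workload doing far more logical I/O reports the better ratio.

```python
def hit_ratio(logical_reads, physical_reads):
    """Fraction of read requests served from cache."""
    if logical_reads == 0:
        return 1.0
    return 1.0 - physical_reads / logical_reads

# Two hypothetical workloads answering the same question.
efficient = {"logical_reads": 10_000, "physical_reads": 1_000}
wasteful = {"logical_reads": 50_000_000, "physical_reads": 50_000}

print(round(hit_ratio(**efficient), 3))  # 0.9
print(round(hit_ratio(**wasteful), 3))   # 0.999

# The "better" ratio belongs to the workload doing 5,000x the logical I/O.
print(wasteful["logical_reads"] // efficient["logical_reads"])  # 5000
```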
    43. Conclusions
        • Design is the #1 contributor to the overall performance and availability of a system
        • With MySQL, you have greater flexibility and opportunity than ever before to build well-designed data warehouses
        • With MySQL, you now have more options and features available than ever before
        • The above translates into you being able to design data warehouses that can be future-proofed: they can run as fast as you’d like (hopefully) and store as much data as you need (ditto)
    44. For More Information
        • Download InfiniDB Community Edition
        • Download InfiniDB documentation
        • Read InfiniDB technical white papers
        • Read InfiniDB intro articles on MySQL dev zone
        • Visit InfiniDB online forums
        • Trial the InfiniDB Enterprise Edition:
    45. The Thinking Person’s Guide to Data Warehouse Design. Robin Schumacher [email_address]