NENUG Apr14 Talk - data modeling for netezza
Upcoming SlideShare
Loading in...5
×
 

NENUG Apr14 Talk - data modeling for netezza

on

  • 420 views

 

Statistics

Views

Total Views
420
Views on SlideShare
417
Embed Views
3

Actions

Likes
0
Downloads
16
Comments
0

1 Embed 3

http://www.slideee.com 3

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

NENUG Apr14 Talk - data modeling for netezza NENUG Apr14 Talk - data modeling for netezza Presentation Transcript

  • Data Modeling and Netezza Biju Nair NENUG Talk 30-Apr-2014
  • Motivation and Goal • Motivation – Performance degradation with data volume • Goal – Highlight considerations while modeling for NZ 2
  • Data Modeling • Logical Data Modeling – Business domain data representation – Independent of DBMS technology – NZ is an appliance for analytics • Set based processing vs row based processing • De-normalization • Snow flake/Star schema • Physical Data Modeling – Takes in the DBMS features and constraints – Need to understand the DBMS architecture 3
  • NZ Architecture Host Snippet Processors Snippet Processors Snippet Processors Data Data Data - Parse, Optimize and Compile query - Schedule snippets - Distribute data Executes snippets - Shared nothing MPP - Custom IP backbone - Appliance efficiency is maximized when - snippet processors can run independently i.e. data independence - All snippet processors are utilized uniformly 4
  • Snippet Processors Data Accelerator (FPGA) CPU Compute Host Reads compressed data Un-compress Remove columns Restrict rows (where) Perform computation Send data to host - Disk reads are incredibly slow relative to other components especially seek time - While the CPU overhead is reduced, volume of data read will impact performance 5
  • Data Storage Data Snippet Processors Host Extend Page Extend Extend … Meta-Data - Meta data identifies extends/pages to read or skip 6
  • Modeling Priorities • Utilization of all snippet processors – Need to be able to utilize uniformly • Maximize MPP capability of snippet processors – Ideally snippet processors should be independent • Minimize data read from disk – Minimize data stored • Improve computation in snippet processor – Compounded with data volume will help performance 7
  • Snippet Processor Utilization Data Distribution Host Snippet Processors Snippet Processors Snippet Processors 1,MA,1212,… 3,MA,0414,… 2,CA,0113,… 1,MA,1212,… 2,CA,0113,… 3,MA,0414,… Distribute by state - NZ will pick one of the columns to distribute if none specified in table definition - First column in the table Data Skew 8
  • Snippet Processor Utilization Data Distribution Host Snippet Processors Snippet Processors Snippet Processors 1,MA,1212,… 2,CA,0113,… 3,MA,0414,… 1,MA,1212,… 2,CA,0113,… 3,MA,0414,… Distribute by mo-yr - Snippet processors are utilized uniformly - What if most of the query is on for the current month? - Processing skew 9
  • Snippet Processor Utilization Data Distribution Host Snippet Processors Snippet Processors Snippet Processors 1,MA,1212,… 4,CA,0414,… 2,CA,0113,… 3,MA,0414,… 1,MA,1212,… 2,CA,0113,… 3,MA,0414,… 4,CA,0414,… Distribute random - Snippet processors are utilized uniformly - Helps prevent processing skew 10
  • 3,ORD1,ITEM1,… 4,ORD1,ITEM1,… 2,ORD1,ITEM1,… 1,ORD1,ITEM1,… Snippet Processor Utilization Data Distribution and Table Joins Host Snippet Processors Snippet Processors Snippet Processors 4,CA,0414,… 1,MA,1212,… 3,MA,0414,… 2,CA,0113,… 1,ORD1,ITEM1,… 2,ORD1,ITEM1,… 3,ORD1,ITEM1,… 4,ORD1,ITEM1,… Distribute random - Snippet processors are utilized uniformly - Makes snippet processors dependent on others impacting MPP maximization Need to redistribute data from both tables 11
  • 3,ORD1,ITEM1,…2,ORD1,ITEM1,… 1,ORD1,ITEM1,… 4,ORD1,ITEM1,… Snippet Processor Utilization Data Distribution and Table Joins Host Snippet Processors Snippet Processors Snippet Processors 4,CA,0414,… 1,MA,1212,… 3,MA,0414,… 2,CA,0113,… 1,ORD1,ITEM1,… 2,ORD1,ITEM1,… 3,ORD1,ITEM1,… 4,ORD1,ITEM1,… Distribute on Join column - cid - Snippet processors are utilized uniformly - Makes snippet processors dependent on others impacting MPP maximization - Better than the previous scenario Need to redistribute data from one table 12
  • 3,ORD1,ITEM1,…2,ORD1,ITEM1,… 1,ORD1,ITEM1,… 4,ORD1,ITEM1,… Snippet Processor Utilization Data Distribution and Table Joins Host Snippet Processors Snippet Processors Snippet Processors 1,MA,1212,… 4,CA,0414,… 2,CA,0113,… 3,MA,0414,… Distribute both tables on Join column - cid - Snippet processors are utilized uniformly - Makes snippet processors independent maximizing MPP 13
  • Snippet Processor Utilization Data Distribution • Identify keys to distribute data uniformly – Avoid data and processing skew – Try using join columns as the distribution keys – Choose same data types for join columns • If table size is small random distribution is fine – If one of the join table is small, NZ will broadcast • Redistribution may not be an overkill for small data – For e.g., selecting a small number if columns 14
  • Snippet Processor Utilization Distribution and Query Time 0 0.5 1 1.5 2 2.5 3 3.5 Random 1 Correct Distribution 2 Correct Distribution Query Time For Different Distributions Time (min) 15
  • Snippet Processor Utilization Join Column Type and Query Time 0 0.5 1 1.5 2 2.5 Incorrect Data Types Correct Data Types Join query time - same and diff data types Time (min) 16
  • Minimize Data Read From Disk Zone Maps • Data types which supports Zone Maps – All integer data types • int1 • int2 • int4 • int8 – Date – Timestamp 17 Refer to the product manual for the version of NZ used for the complete list of zone map able data types
  • Data Snippet Processors Host Extend 1 Page Extend 2 Extend 3 … Meta-Data - May end up reading all data from disk Minimize Data Read From Disk Zone Maps Table column (cid) is numeric(10,0) No zone map for cid 18
  • Data Snippet Processors Host Extend Page Extend Extend … Meta-Data - Zone maps can be used to minimize data read from disk Minimize Data Read From Disk Zone Maps Table column (cid) is bigint 19
  • Minimize Data Read From Disk Zone Maps and Query Time 1.3 1.35 1.4 1.45 1.5 1.55 1.6 Incorrect Data Type Correct Data Type Query time with and without zone map Time (min) 20
  • Minimize Data Read From Disk Clustered Base Tables • NZ stores data with same organize keys closely • Additional data types are zone map able – char – varchar – nchar – nvarchar – float – double – bool – time – Interval • Helps improve performance of multi table join 21
  • Extend Extend Extend Minimize Data Read From Disk Clustered Base Tables Table distributed on cid 1,MA,Boston,… 5,CA,LA,… 3,FL,Tampa,,… 1,MA,Salem,… 5,CA,SF,… 3,FL,Orlando,… 1,MA,Lowell,… 5,CA,Pasadena, … 3,FL,Miami,… Extend Extend Extend Table distributed on cid organize on state 1,MA,Boston,… 1,MA,Salem,… 1,MA,Lowell,… 3,FL,Tampa,,… 3,FL,Orlando,… 3,FL,Miami,… 5,CA,LA,… 5,CA,SF,… 5,CA,Pasadena, … Extend 1,AL,Alabama,… 5,CA,California, … 10,FL,Florida,… State 22,MA,Mass,… 22
  • Minimize Data Read From Disk Clustered Base Table 0 0.5 1 1.5 2 2.5 No Org+Correct Dist Org+Correct Dist Query Time with and without org Time (min) 23
  • Minimize Data Read From Disk Materialized Views • View with frequently used columns of base table – Unlike views, materialized view stores data – Reduced data read from disk – Additional storage required – Need to be refreshed if base table data changes • Can be used as an index against base table – Stores location of base table data loc in a column 24
  • Extend Extend Extend Minimize Data Read From Disk Materialized Views T_EMP distributed on cid 1,MA,Mike,… 5,CA,Fally,… 3,FL,Chris,… 4,MA,Robert,… 7,CA,Mary,… 2,FL,Justin,… 6,MA,Harini,… 8,CA,Mike,… 9,FL,Martha,… SELECT ID, NAME FROM T_EMP; Extend MV on T_EMP order by state, name 5,CA,Fally 7,CA,Mary 8,CA,Mike 3,FL,Chris 2,FL,Justin 9,FL,Martha 6,MA,Harini 1,MA,Mike 4,MA,Robert SELECT * FROM T_EMP WHERE STATE = ‘CA’ 25
  • Minimize Data Read From Disk Materialized Views 0 0.2 0.4 0.6 0.8 1 1.2 1.4 No MV With MV Query time on single table -MV vs No MV Time (min) 26
  • Minimize Data Read From Disk Materialized Views 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 No MV With MV Join Query Time - With and without MV Time (min) 27
  • Minimize Data Read From Disk Minimize Data Stored • Choose storage efficient data types – Difference between bigint and int is 4 bytes • Use char instead of varchar if the data length is fixed – varchar has a 2 byte overhead • Define columns as “not null” where possible • Store only the required data in table columns • Encode duplicate data stored in rows 28
  • Improve Computation in Snippet Processor NZ Object Definitions • Define columns as “not null” where possible – Removes logic to check nulls • Define table keys and relationships – Helps NZ query optimizer to generate efficient code 29
  • Slideshare: http://www.slideshare.net/bijugs Contact: bnair@asquareb.com 30