NENUG Apr14 Talk - data modeling for netezza
Published in: Technology

Transcript

  • 1. Data Modeling and Netezza. Biju Nair, NENUG Talk, 30-Apr-2014
  • 2. Motivation and Goal
      • Motivation
          – Performance degradation with data volume
      • Goal
          – Highlight considerations while modeling for NZ
  • 3. Data Modeling
      • Logical Data Modeling
          – Business domain data representation
          – Independent of DBMS technology
          – NZ is an appliance for analytics
              • Set based processing vs row based processing
              • De-normalization
              • Snowflake/Star schema
      • Physical Data Modeling
          – Takes into account the DBMS features and constraints
          – Need to understand the DBMS architecture
  • 4. NZ Architecture
      • Host
          – Parse, optimize and compile query
          – Schedule snippets
          – Distribute data
      • Snippet Processors (each with its own data)
          – Execute snippets
      • Shared nothing MPP
      • Custom IP backbone
      • Appliance efficiency is maximized when
          – Snippet processors can run independently, i.e. data independence
          – All snippet processors are utilized uniformly
  • 5. Snippet Processors
      • Diagram: Data, Accelerator (FPGA), CPU (Compute), Host
      • Steps shown, in order: read compressed data, uncompress, remove columns, restrict rows (where), perform computation, send data to host
      • Disk reads are very slow relative to other components, especially seek time
      • While the CPU overhead is reduced, the volume of data read will impact performance
  • 6. Data Storage
      • Diagram: Host, Snippet Processors, Data (extents, each containing pages), Meta-Data
      • Metadata identifies the extents/pages to read or skip
  • 7. Modeling Priorities
      • Utilization of all snippet processors
          – Need to be able to utilize them uniformly
      • Maximize MPP capability of snippet processors
          – Ideally snippet processors should be independent
      • Minimize data read from disk
          – Minimize data stored
      • Improve computation in snippet processor
          – Compounded with data volume, this will help performance
  • 8. Snippet Processor Utilization: Data Distribution
      • Rows on host: 1,MA,1212,… 2,CA,0113,… 3,MA,0414,…
      • Distribute by state: rows 1,MA,1212,… and 3,MA,0414,… land together, 2,CA,0113,… lands elsewhere
      • Result: data skew
      • NZ will pick one of the columns to distribute on if none is specified in the table definition
          – First column in the table
  • 9. Snippet Processor Utilization: Data Distribution
      • Rows on host: 1,MA,1212,… 2,CA,0113,… 3,MA,0414,…
      • Distribute by mo-yr: each row lands on a different snippet processor
      • Snippet processors are utilized uniformly
      • What if most queries are for the current month?
          – Processing skew
  • 10. Snippet Processor Utilization: Data Distribution
      • Rows on host: 1,MA,1212,… 2,CA,0113,… 3,MA,0414,… 4,CA,0414,…
      • Distribute random: rows are spread across the snippet processors regardless of column values
      • Snippet processors are utilized uniformly
      • Helps prevent processing skew
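
The skew described in slides 8 to 10 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the processor count, the toy hash, and the row mix are hypothetical, and round-robin assignment stands in for random distribution; none of it is the appliance's actual placement logic.

```python
from collections import Counter

NUM_PROCESSORS = 3  # hypothetical number of snippet processors


def distribute_by_key(rows, key_index, num_processors=NUM_PROCESSORS):
    """Count rows per processor when placement depends on one column."""
    counts = Counter()
    for row in rows:
        # Toy deterministic hash (assumption): sum of character codes.
        slot = sum(ord(ch) for ch in str(row[key_index])) % num_processors
        counts[slot] += 1
    return [counts[p] for p in range(num_processors)]


def distribute_round_robin(rows, num_processors=NUM_PROCESSORS):
    """Count rows per processor when rows are dealt out in turn."""
    counts = [0] * num_processors
    for i, _row in enumerate(rows):
        counts[i % num_processors] += 1
    return counts


# Made-up data: (id, state, mo-yr), two thirds of the rows in MA.
rows = [(i, "MA" if i % 3 else "CA", "0414") for i in range(900)]

by_state = distribute_by_key(rows, key_index=1)
evenly = distribute_round_robin(rows)
print("by state:   ", by_state)
print("round robin:", evenly)
```

With this data, distributing on state leaves one processor with twice the rows of another and one with none, while round-robin placement gives each processor 300 rows.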
  • 11. Snippet Processor Utilization: Data Distribution and Table Joins
      • First table rows: 1,MA,1212,… 2,CA,0113,… 3,MA,0414,… 4,CA,0414,…
      • Second table rows: 1,ORD1,ITEM1,… 2,ORD1,ITEM1,… 3,ORD1,ITEM1,… 4,ORD1,ITEM1,…
      • Distribute random (both tables)
      • Need to redistribute data from both tables
      • Snippet processors are utilized uniformly
      • Makes snippet processors dependent on others, impacting MPP maximization
  • 12. Snippet Processor Utilization: Data Distribution and Table Joins
      • First table rows: 1,MA,1212,… 2,CA,0113,… 3,MA,0414,… 4,CA,0414,…
      • Second table rows: 1,ORD1,ITEM1,… 2,ORD1,ITEM1,… 3,ORD1,ITEM1,… 4,ORD1,ITEM1,…
      • Distribute on join column (cid) for one table
      • Need to redistribute data from one table
      • Snippet processors are utilized uniformly
      • Makes snippet processors dependent on others, impacting MPP maximization
      • Better than the previous scenario
  • 13. Snippet Processor Utilization: Data Distribution and Table Joins
      • First table rows: 1,MA,1212,… 2,CA,0113,… 3,MA,0414,… 4,CA,0414,…
      • Second table rows: 1,ORD1,ITEM1,… 2,ORD1,ITEM1,… 3,ORD1,ITEM1,… 4,ORD1,ITEM1,…
      • Distribute both tables on join column (cid)
      • Snippet processors are utilized uniformly
      • Makes snippet processors independent, maximizing MPP
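
The three join scenarios in slides 11 to 13 can be compared by counting how many rows would have to move before matching cid values sit on the same snippet processor. A minimal sketch, assuming a hypothetical placement rule (cid modulo the processor count) and made-up table sizes; the appliance's real hashing and redistribution are not modeled.

```python
NUM_PROCESSORS = 3  # hypothetical number of snippet processors


def processor_for(cid, num_processors=NUM_PROCESSORS):
    """Hypothetical placement rule for a row distributed on cid."""
    return cid % num_processors


def rows_to_move(placements, cids, num_processors=NUM_PROCESSORS):
    """Count rows that are not on the processor their cid maps to."""
    return sum(
        1
        for current, cid in zip(placements, cids)
        if current != processor_for(cid, num_processors)
    )


first_cids = list(range(1, 13))                               # 12 rows, cid 1..12
second_cids = [cid for cid in first_cids for _ in range(2)]   # 2 rows per cid

# Placement that ignores cid (round-robin stands in for random distribution).
first_random = [i % NUM_PROCESSORS for i in range(len(first_cids))]
second_random = [i % NUM_PROCESSORS for i in range(len(second_cids))]

# Placement on the join column.
first_on_cid = [processor_for(c) for c in first_cids]
second_on_cid = [processor_for(c) for c in second_cids]

both_random = (rows_to_move(first_random, first_cids)
               + rows_to_move(second_random, second_cids))
one_on_cid = (rows_to_move(first_on_cid, first_cids)
              + rows_to_move(second_random, second_cids))
both_on_cid = (rows_to_move(first_on_cid, first_cids)
               + rows_to_move(second_on_cid, second_cids))

print(both_random, one_on_cid, both_on_cid)
```

The counts fall from the first scenario to the third, and reach zero only when both tables are distributed on the join column, which is the case the slides describe as making the snippet processors independent.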
  • 14. Snippet Processor Utilization: Data Distribution
      • Identify keys to distribute data uniformly
          – Avoid data and processing skew
          – Try using join columns as the distribution keys
          – Choose the same data types for join columns
      • If the table is small, random distribution is fine
          – If one of the joined tables is small, NZ will broadcast it
      • Redistribution may not be too costly for small data
          – For example, when selecting a small number of columns
  • 15. Snippet Processor Utilization: Distribution and Query Time
      • [Chart: Query Time For Different Distributions. Bars: Random, 1 Correct Distribution, 2 Correct Distribution. Axis: Time (min), 0 to 3.5]
  • 16. Snippet Processor Utilization: Join Column Type and Query Time
      • [Chart: Join query time, same and different data types. Bars: Incorrect Data Types, Correct Data Types. Axis: Time (min), 0 to 2.5]
  • 17. Minimize Data Read From Disk: Zone Maps
      • Data types which support zone maps
          – All integer data types
              • int1
              • int2
              • int4
              • int8
          – Date
          – Timestamp
      • Refer to the product manual for the version of NZ used for the complete list of zone-mappable data types
  • 18. Minimize Data Read From Disk
      • Table column (cid) is numeric(10,0)
      • Diagram: Host, Snippet Processors, Data (Extent 1 with pages, Extent 2, Extent 3, …), Zone Maps, Meta-Data
      • No zone map for cid
      • May end up reading all data from disk
  • 19. Minimize Data Read From Disk
      • Table column (cid) is bigint
      • Diagram: Host, Snippet Processors, Data (extents with pages), Zone Maps, Meta-Data
      • Zone maps can be used to minimize data read from disk
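
The way a zone map lets a scan skip extents (slides 17 to 19) can be sketched as a per-extent minimum and maximum that is checked before any data is read. The extent contents below are made up and the functions are hypothetical helpers for illustration, not the appliance's metadata format.

```python
def build_zone_map(extents):
    """Record the (min, max) of the column values held in each extent."""
    return [(min(values), max(values)) for values in extents]


def extents_to_read(zone_map, wanted):
    """Return indexes of extents whose range could contain `wanted`."""
    return [i for i, (low, high) in enumerate(zone_map) if low <= wanted <= high]


# Made-up cid values stored in three extents.
extents = [[1, 4, 9], [12, 15, 20], [22, 30, 31]]
zone_map = build_zone_map(extents)

hit = extents_to_read(zone_map, 15)    # only the second extent can hold 15
miss = extents_to_read(zone_map, 10)   # no extent range covers 10
print(hit, miss)
```

When the column type has no zone map, as in slide 18, there is no such range to check and every extent is a candidate for reading.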
  • 20. Minimize Data Read From Disk: Zone Maps and Query Time
      • [Chart: Query time with and without zone map. Bars: Incorrect Data Type, Correct Data Type. Axis: Time (min), 1.3 to 1.6]
  • 21. Minimize Data Read From Disk: Clustered Base Tables
      • NZ stores rows with the same organize key values close together
      • Additional data types are zone-mappable
          – char
          – varchar
          – nchar
          – nvarchar
          – float
          – double
          – bool
          – time
          – interval
      • Helps improve performance of multi-table joins
  • 22. Minimize Data Read From Disk: Clustered Base Tables
      • Table distributed on cid
          – Extent: 1,MA,Boston,… 5,CA,LA,… 3,FL,Tampa,…
          – Extent: 1,MA,Salem,… 5,CA,SF,… 3,FL,Orlando,…
          – Extent: 1,MA,Lowell,… 5,CA,Pasadena,… 3,FL,Miami,…
      • Table distributed on cid, organized on state
          – Extent: 1,MA,Boston,… 1,MA,Salem,… 1,MA,Lowell,…
          – Extent: 3,FL,Tampa,… 3,FL,Orlando,… 3,FL,Miami,…
          – Extent: 5,CA,LA,… 5,CA,SF,… 5,CA,Pasadena,…
      • State table
          – Extent: 1,AL,Alabama,… 5,CA,California,… 10,FL,Florida,… 22,MA,Mass,…
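
The effect shown in slide 22 can be reproduced with the slide's own rows: once rows with the same state sit in the same extent, a min/max check on state rules out most extents. This is a rough sketch; a plain sort stands in for whatever organizing method the appliance actually uses, and the helper name is hypothetical.

```python
def extents_matching(extents, state):
    """Indexes of extents whose min/max state range could include `state`."""
    hits = []
    for i, rows in enumerate(extents):
        states = [row[1] for row in rows]
        if min(states) <= state <= max(states):
            hits.append(i)
    return hits


# Rows from the slide: (cid, state, city), three rows per extent.
unorganized = [
    [(1, "MA", "Boston"), (5, "CA", "LA"), (3, "FL", "Tampa")],
    [(1, "MA", "Salem"), (5, "CA", "SF"), (3, "FL", "Orlando")],
    [(1, "MA", "Lowell"), (5, "CA", "Pasadena"), (3, "FL", "Miami")],
]

# Assumption: regrouping by a simple sort on state, three rows per extent.
all_rows = sorted(
    (row for extent in unorganized for row in extent), key=lambda r: r[1]
)
organized = [all_rows[i:i + 3] for i in range(0, len(all_rows), 3)]

before = extents_matching(unorganized, "CA")
after = extents_matching(organized, "CA")
print(len(before), "extents read without organizing,", len(after), "with")
```

Without organizing, every extent spans CA through MA and all three must be read for a query on CA; after regrouping, only one extent qualifies.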
  • 23. Minimize Data Read From Disk: Clustered Base Table
      • [Chart: Query Time with and without org. Bars: No Org+Correct Dist, Org+Correct Dist. Axis: Time (min), 0 to 2.5]
  • 24. Minimize Data Read From Disk: Materialized Views
      • View with frequently used columns of base table
          – Unlike views, a materialized view stores data
          – Reduced data read from disk
          – Additional storage required
          – Needs to be refreshed if base table data changes
      • Can be used as an index against the base table
          – Stores the location of the base table data in a column
  • 25. Minimize Data Read From Disk: Materialized Views
      • T_EMP distributed on cid
          – Extent: 1,MA,Mike,… 5,CA,Fally,… 3,FL,Chris,…
          – Extent: 4,MA,Robert,… 7,CA,Mary,… 2,FL,Justin,…
          – Extent: 6,MA,Harini,… 8,CA,Mike,… 9,FL,Martha,…
      • MV on T_EMP, order by state, name
          – Extent: 5,CA,Fally 7,CA,Mary 8,CA,Mike 3,FL,Chris 2,FL,Justin 9,FL,Martha 6,MA,Harini 1,MA,Mike 4,MA,Robert
      • Queries shown
          – SELECT ID, NAME FROM T_EMP;
          – SELECT * FROM T_EMP WHERE STATE = 'CA'
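
The "MV as an index" idea from slides 24 and 25 can be sketched as a sorted projection that also carries each row's position in the base table. The structure below is an illustrative assumption (a Python list position stands in for the stored base table location), using the slide's T_EMP rows.

```python
from bisect import bisect_left, bisect_right

# Base table rows from the slide: (id, state, name), in stored order.
t_emp = [
    (1, "MA", "Mike"), (5, "CA", "Fally"), (3, "FL", "Chris"),
    (4, "MA", "Robert"), (7, "CA", "Mary"), (2, "FL", "Justin"),
    (6, "MA", "Harini"), (8, "CA", "Mike"), (9, "FL", "Martha"),
]

# Hypothetical materialized view: (state, name, base position), ordered by
# state then name. The position plays the role of the stored row location.
mv = sorted((state, name, pos) for pos, (_id, state, name) in enumerate(t_emp))
mv_states = [entry[0] for entry in mv]


def rows_for_state(state):
    """Find matching entries in the sorted view, then fetch the base rows."""
    low = bisect_left(mv_states, state)
    high = bisect_right(mv_states, state)
    return [t_emp[pos] for _state, _name, pos in mv[low:high]]


ca_rows = rows_for_state("CA")
print(ca_rows)
```

Because the view is sorted on state, the matching entries are contiguous, so only that slice and the base rows it points to need to be touched for `WHERE STATE = 'CA'`.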
  • 26. Minimize Data Read From Disk: Materialized Views
      • [Chart: Query time on single table, MV vs No MV. Bars: No MV, With MV. Axis: Time (min), 0 to 1.4]
  • 27. Minimize Data Read From Disk: Materialized Views
      • [Chart: Join Query Time, with and without MV. Bars: No MV, With MV. Axis: Time (min), 0 to 1.8]
  • 28. Minimize Data Read From Disk: Minimize Data Stored
      • Choose storage efficient data types
          – Difference between bigint and int is 4 bytes
      • Use char instead of varchar if the data length is fixed
          – varchar has a 2 byte overhead
      • Define columns as "not null" where possible
      • Store only the required data in table columns
      • Encode duplicate data stored in rows
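
The per-row figures on slide 28 add up quickly at warehouse scale. A small worked calculation, assuming a hypothetical table of one billion rows and ignoring compression (which would change the absolute numbers):

```python
BIGINT_BYTES = 8        # bigint is 4 bytes wider than int, per the slide
INT_BYTES = 4
VARCHAR_OVERHEAD = 2    # per-value overhead quoted on the slide


def bytes_saved_using_int(row_count):
    """Uncompressed bytes saved by storing one column as int, not bigint."""
    return row_count * (BIGINT_BYTES - INT_BYTES)


def bytes_saved_using_char(row_count):
    """Uncompressed bytes saved on one fixed-length column by using char."""
    return row_count * VARCHAR_OVERHEAD


row_count = 1_000_000_000  # hypothetical table size
int_saving = bytes_saved_using_int(row_count)
char_saving = bytes_saved_using_char(row_count)
print(f"int vs bigint:   {int_saving / 1e9:.1f} GB per column")
print(f"char vs varchar: {char_saving / 1e9:.1f} GB per column")
```

Under these assumptions a single column choice accounts for gigabytes of data that every full scan would otherwise have to read.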
  • 29. Improve Computation in Snippet Processor: NZ Object Definitions
      • Define columns as "not null" where possible
          – Removes the logic to check for nulls
      • Define table keys and relationships
          – Helps the NZ query optimizer generate efficient code
  • 30. Contact: bnair@asquareb.com, blog.asquareb.com, https://github.com/bijugs, @gsbiju
