White paper on Spool space in teradata


This White Paper on Spool Space in Teradata was presented by Nazir Iqbal at Wipro where he works at present.

Teradata spool space is unused perm space that is used for running queries.


Transcript

  • 1. Spool Space in Teradata - Nazir Iqbal, 07-March-2012
  • 2. Table of Contents
    1. Introduction
       1.1 Spool Space
       1.2 Spool Space and Capacity Planning
       1.3 Spool Space Categories
       1.4 Spool Space Allocation
    2. Causes of spool space error and how to minimize it
    3. Know the data
    4. Primary Index
    5. Multiset or Set table
    6. Collect Statistics
    7. Skewing
    8. Conclusion
    9. References
  • 3. 1. INTRODUCTION

    1.1 SPOOL SPACE

    Teradata spool space is unused perm space that is used for running queries. Spool space holds intermediate rows during processing, and the rows in the answer set of a transaction. Teradata recommends allocating 20% of the available perm space for spool, but the right figure varies across applications.

    In the majority of cases, well-written SQL queries should not use huge amounts of spool space. A poor choice of join column, product joins and a lack of statistics are the main reasons for excessive spool space consumption. Each user can be given a spool space limit. In later versions of Teradata, this is often set in the user's profile.

    An insufficient spool error is usually the result of poor table design, poor data distribution, or a poorly written query. Running out of spool space gives the user error code 2646.

    1.2 SPOOL SPACE AND CAPACITY PLANNING

    Spool space and capacity planning are mutually dependent concepts. Spool space is critical to the operation of Teradata RDBMS, yet it is frequently overlooked in capacity planning. Size requirements vary from user to user, table to table and application to application.

    For instance:
    • The spool space of a user holds the response rows of every query run by that user during a session. Thus, each user needs a spool allocation high enough to contain the biggest anticipated answer set.
    • Tables containing huge volumes of data require more available spool space than smaller tables, because intermediate rows are held in spool space during query execution.

    1.3 SPOOL SPACE CATEGORIES

    Spool falls into three categories of space: Volatile, Intermediate and Output.
  • 4. Volatile: This spool is retained until the transaction completes (unless the table was created with ON COMMIT PRESERVE ROWS), the table is dropped manually during the session, the session ends, or Teradata RDBMS resets.

    Intermediate: Intermediate spool results are retained until no longer needed. We can determine when intermediate spool is flushed by examining the output of an EXPLAIN.

    Output: Output spool results are either the final rows returned in the answer set for a query, or the rows updated within, inserted into, or deleted from a base table.

    1.4 SPOOL SPACE ALLOCATION

    Teradata RDBMS allocates spool space dynamically, and only from disk cylinders that are not being used for permanent or temporary data. Permanent, temporary, and spool data blocks cannot co-exist on the same cylinder. Spool space is not reserved. All unused space in the Teradata RDBMS is considered available spool space. When spool is released, the file system returns the cylinders it was using to the free cylinder list.

    We allocate spool space for a database, a user, or a user profile, not at the table level. A SPOOL limit defined in a profile takes effect upon completion of a:
    • CREATE/MODIFY USER statement that assigns the profile to a user.
    • MODIFY PROFILE statement that changes the spool space limit.
    If the user is logged on, the profile specification affects the current session.

    Inefficient SQL queries generally cause capacity planning and spool space allocation to go awry and raise the spool space error, which is one of the most common errors encountered by a Teradata SQL programmer.

    2. CAUSES OF SPOOL SPACE ERROR AND HOW TO MINIMIZE IT

    When resource thresholds are met, for example when spool space is exceeded, the DBAs either issue a warning or abort the query. Different thresholds are set for tactical, decision support, and ad-hoc scenarios. High skew is another cause of spool space being exceeded.

    Not all alerts or warnings indicate a problem, as some transactions use high CPU and spool because of large data volumes and the complexity of the code.

    Often we have encountered scenarios where a SQL query has been running for a long time. The reasons may be:
    • Missing or aged statistics.
    • Large product joins.
    • Merge joins where there is a many-to-many relationship.
  • 5. • Set tables that should be Multiset.
    • Stats reflect zero rows on a table, yet the table is not empty.
    • A change in data volume, which requires additional stats and will generate a different explain plan.
    • Unbalanced parentheses.

    The key is to know the data before writing SQL code.

    3. KNOW THE DATA

    Below are a few questions that should always be considered so that SQL code is efficient and does not exceed the thresholds for spool space or CPU.
    1. How many rows exist on the tables in the query?
    2. What columns are we joining on?
    3. Do we need to add filters or additional joins to reduce volume?
    4. How many unique values exist on columns?
    5. How many rows exist on tables that are duplicated?
    6. Queries having derived tables will often show no confidence because the optimizer does not know how many rows are in a derived table.
    7. A high estimated time can indicate aged stats, i.e. stats should be collected again.
    8. What type of join is performed?

    Product Join: This is a cross join: every row from one table is joined to every row of the second table. The spool file is as large as (no. of rows in table_one * no. of rows in table_two), so large product joins (billions of rows) should be avoided. Product joins are most efficient when a SMALL lookup table is duplicated. Product joins are inefficient when large fact tables are duplicated (this can indicate aged or missing stats).

    Merge Join: Requires a sort of the spool files. Merge joins are efficient when there is not a many-to-many relationship on the columns involved in the join. If there is a many-to-many relationship, try to aggregate the columns on one table to reduce the volume by creating a volatile table, derived table or work table.

    Hash Join: The tables do not have to be sorted, and the smaller table can be much larger than for a product join. The smaller table/spool is "hashed" into memory. Then the larger table is scanned and, for each row, the matching row from the smaller table is looked up in the hashed table that was created in memory. If the smaller table is broken into partitions to fit into memory, the larger table must also be broken into the same partitions prior to the join.

    4. PRIMARY INDEX

    A poor primary index produces a lumpy data distribution, which can cause a query to run for several hours when it should execute in seconds or minutes. Hence, we should choose a single column or multiple columns that distribute the data evenly across all AMPs.
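As a rough illustration of the distribution point in section 4, the sketch below counts how many rows each AMP would receive for two candidate primary index columns. It is not Teradata's hashing: `zlib.crc32` stands in for the row hash, and `rows_per_amp`, the four-AMP system and both sample columns are hypothetical.

```python
import zlib


def rows_per_amp(pi_values, amp_count):
    """Count how many rows land on each AMP for the given primary index values."""
    counts = [0] * amp_count
    for value in pi_values:
        # Stand-in hash (assumption): Teradata uses its own row-hash algorithm, not CRC32.
        amp = zlib.crc32(str(value).encode("utf-8")) % amp_count
        counts[amp] += 1
    return counts


# Hypothetical data: a near-unique column versus a column dominated by one value.
order_ids = list(range(10_000))
status_codes = ["OPEN"] * 9_000 + ["CLOSED"] * 1_000

even = rows_per_amp(order_ids, amp_count=4)
lumpy = rows_per_amp(status_codes, amp_count=4)

print("order_id as primary index:", even)
print("status as primary index:  ", lumpy)
```

With the status column, every "OPEN" row hashes to the same AMP, so at least 9,000 of the 10,000 rows sit on one AMP while the others are nearly idle. That is the lumpy distribution the section warns about.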
  • 6. • Eliminate columns from the primary index that have a lot of null values.
    • The value change rate should be low or zero.
    • The column(s) should be frequently used in join constraints.
    • Teradata is a massively parallel system, so a query runs only as fast as the SLOWEST AMP.
    • If the table joins to a similar table having the same columns, the primary index on both tables should be the same.

    For example, if AMP-4 holds much more data than AMP-1, AMP-2 and AMP-3, the imbalance can cause a spool space error. The primary index should be chosen to avoid such uneven data distribution across AMPs.

    5. MULTISET OR SET TABLE

    A set table performs a duplicate row check. If there are a lot of non-unique values for a primary index, this can be very CPU intensive. For example, when 2000 rows share one primary index value, loading them takes roughly 2,000,000 row comparisons (1 + 2 + ... + 1999), and the cost grows with the square of the row count. This is referred to as chaining: the first record is loaded; the next record with the same PI value checks all the columns of the first one to determine whether it is a duplicate; when the third record is loaded, it is checked against both the first and the second records, and so on.

    A multiset table allows duplicate rows, so the duplicate row check is omitted. If duplicates can be removed using a GROUP BY or filtered programmatically, a load to a multiset table performs better. A multiset table having a NUPI (non-unique primary index) with 500 to 2000 occurrences per value is not bad.

    • For tables having a non-unique primary index with several hundred or a couple of thousand rows for a given primary index value, use a multiset table.
    • For tables having a more unique index, such as 1 to 10 rows for a given primary index value, use a set table.

    Note: the FASTLOAD utility will not load duplicate rows, even if the target table is MULTISET.
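The chaining cost of a set table can be sketched by counting comparisons directly. This is a simplified model, assuming each incoming row is compared with every row already stored under the same primary index value. `load_set_table` and `chained_checks` are illustrative names, not Teradata functions. Under this model n rows sharing one value need n(n-1)/2 comparisons, so the cost grows with the square of the row count, as described above.

```python
def chained_checks(rows_per_pi_value):
    """Comparisons needed to load n distinct rows that share one PI value."""
    n = rows_per_pi_value
    return n * (n - 1) // 2


def load_set_table(rows):
    """Load (pi_value, other_columns) rows, rejecting exact duplicates.

    Returns the stored rows grouped by PI value and the number of
    row-to-row comparisons performed by the duplicate row check.
    """
    stored = {}
    comparisons = 0
    for pi_value, other_columns in rows:
        chain = stored.setdefault(pi_value, [])
        is_duplicate = False
        for existing in chain:
            comparisons += 1
            if existing == other_columns:
                is_duplicate = True
                break
        if not is_duplicate:
            chain.append(other_columns)
    return stored, comparisons


# Hypothetical load: 2000 distinct rows that all share primary index value 7.
rows = [(7, i) for i in range(2000)]
stored, comparisons = load_set_table(rows)
print(comparisons)           # 1999000
print(chained_checks(2000))  # 1999000
```

A multiset load skips the inner loop entirely, which is why it performs better when duplicates are handled elsewhere.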
  • 7. 6. COLLECT STATISTICS

    Poor, missing or aged statistics may cause a spool space error. Teradata recommends that COLLECT STATS should include:
    1. Individual columns in an index.
    2. All columns in an index (multi-column) where the size is less than 16 bytes.
    3. Join columns.
    4. Filter or qualifying columns.
    5. Secondary indexes.

    Statistics are not needed for temp tables that are not joined to other tables and are only used for staging.

    Be careful NOT to over-collect statistics. If a table is updated by several inserts multiple times a day, the statistics do not need to be refreshed after each insert. One collection after the last insert is sufficient. For tables being completely refreshed, the statistics are needed after the refresh.

    TOOL TO CHECK THE EFFICIENCY OF A QUERY

    Run this diagnostic command before the EXPLAIN of the query. At the bottom of the explain output it will list the statistics that are missing.

        DIAGNOSTIC HELPSTATS ON FOR SESSION;
        EXPLAIN <OUR SQL TEXT>;

    WE SHOULD NOT COLLECT STATISTICS ON EMPTY TABLES. This will cause the optimizer to choose an inefficient path based on the information available to the parser. Statistics should be collected when a table is initially loaded and any time the table's demographics change by more than 10%.

    After the initial collect statistics on an object, the user can run the statement below to refresh the table's statistics based on the (new) data.

        COLLECT STATISTICS ON databasename.tablename; -- refreshes all stats, index and column, that were previously gathered on the table

    To see the statistics that exist for a table, run the following:

        HELP STATISTICS databasename.tablename;

    7. SKEWING

    Proper primary index specification should evenly distribute the rows of a table across the AMPs. This prevents skewing. The Query Log information in SQL Assistant and other editors tells us the degree of skewing in a query. A CPU skew > 50 reflects worst-case scenarios, and generally any query having a CPU skew > 4 is considered poorly performing. Hence, by looking at the CPU skew in the Query Log information, a programmer can easily identify which query needs to be fine-tuned to avoid high skew.
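The thresholds above can be applied mechanically once a skew figure is available. The exact formula behind the Query Log figure is not given here, so the sketch below assumes skew is the busiest AMP's CPU time divided by the average CPU time across AMPs. `skew_ratio`, `classify_skew` and the sample numbers are hypothetical.

```python
def skew_ratio(cpu_seconds_per_amp):
    """Busiest AMP's CPU time divided by the average (assumed definition of skew)."""
    if not cpu_seconds_per_amp:
        raise ValueError("need CPU time for at least one AMP")
    average = sum(cpu_seconds_per_amp) / len(cpu_seconds_per_amp)
    if average == 0:
        return 0.0
    return max(cpu_seconds_per_amp) / average


def classify_skew(ratio):
    """Apply the thresholds quoted in the text: > 50 worst case, > 4 poor."""
    if ratio > 50:
        return "worst case"
    if ratio > 4:
        return "poor"
    return "acceptable"


# Hypothetical per-AMP CPU seconds for three queries.
balanced = [10, 10, 10, 10]
one_hot_amp = [1, 1, 1, 1, 1, 1, 1, 33]
single_amp_of_100 = [0] * 99 + [120]

for name, cpu in [("balanced", balanced),
                  ("one hot AMP", one_hot_amp),
                  ("single AMP of 100", single_amp_of_100)]:
    ratio = skew_ratio(cpu)
    print(f"{name}: skew {ratio:.1f} -> {classify_skew(ratio)}")
```

Under this assumed definition the ratio can never exceed the number of AMPs, so a value above 50 can only occur on a system with more than 50 AMPs where almost all of the work lands on one of them.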
  • 8. CONCLUSION

    Most performance-related issues are caused by poor indexing, missing statistics, aged statistics, over-collected statistics, mismatched data types, and missing filters and conditions in the WHERE clause. These can be eliminated if we follow the sound SQL practices discussed above. The key to efficient SQL coding is good knowledge of the database and an understanding of the various join constraints and mappings. Database knowledge, together with disciplined use of Teradata's COLLECT STATS feature, is the key to avoiding the spool space error.

    REFERENCES
    1. http://www.info.teradata.com/Datawarehouse/eBrowseBy.cfm?page=TeradataDatabase
    2. http://www.teradatatech.com/
    3. www.google.com
