Clustered Tables vs Heap Tables
Heap Tables
• If a table has no indexes, or only has nonclustered indexes, it is called a heap.
Clustered Tables
• As the name suggests, these tables have a clustered index. Data is stored in a
specific order based on the clustered index key.
An age-old question is whether or not a table must have a clustered index. The
answer is no, but in most cases it is a good idea to have a clustered index on the
table to store the data in a specific order.
HEAP
• Data is not stored in any particular
order
• Specific data cannot be retrieved
quickly unless there are also
nonclustered indexes.
• Data pages are not linked, so
sequential access needs to refer back
to the index allocation map (IAM)
pages
• Since there is no clustered index,
additional time is not needed to
maintain the index
• Since there is no clustered index, there
is not the need for additional space to
store the clustered index tree
• These tables have an index_id value of 0
in the sys.indexes catalog view
Clustered Table
• Data is stored in order based on the
clustered index key
• Data can be retrieved quickly based on the
clustered index key, if the query uses the
indexed columns
• Data pages are linked for faster sequential
access
• Additional time is needed to maintain
clustered index based on INSERTS,
UPDATES and DELETES
• Additional space is needed to store
clustered index tree
• These tables have an index_id value of 1 in
the sys.indexes catalog view
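As a rough analogy for the comparison above (not a model of SQL Server's storage engine), a heap behaves like an unsorted list that has to be scanned, while a clustered table behaves like a list kept sorted by its key, which can be binary-searched. The rows and the helper names `find_in_heap` and `find_in_clustered` are made up for this sketch.

```python
import bisect

# Hypothetical rows (EmployeeID, LastName), for illustration only.
heap_rows = [(3, "Adams"), (1, "Jones"), (4, "Douglas"), (2, "Smith")]
clustered_rows = sorted(heap_rows)  # kept in clustered-key order


def find_in_heap(rows, key):
    """Scan rows in storage order; return (row, rows_examined)."""
    for examined, row in enumerate(rows, start=1):
        if row[0] == key:
            return row, examined
    return None, len(rows)


def find_in_clustered(rows, key):
    """Binary-search the sorted key column; return the row or None."""
    keys = [row[0] for row in rows]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return rows[i]
    return None
```

In this toy data, finding EmployeeID 2 in the heap examines all four rows, while the sorted copy needs only a few comparisons.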
Types of Indexes
• Clustered index
• Nonclustered index
• Unique index
• Filtered index
• Covering index
• Columnstore index
• Non-Key Index Columns
• Implied indexes
Created by some constraints
i. Primary Key
ii. Unique
Types of Indexes
• Full-text index
A special type of token-based functional index that is built and maintained by
the Microsoft Full-Text Engine for SQL Server. It provides efficient support for
sophisticated word searches in character string data.
• Spatial index
A spatial index provides the ability to perform certain operations more
efficiently on spatial objects (spatial data) in a column of the geometry data
type.
Clustered Index
• The top-most node of this tree is called
the "root node"
• The bottom level of the nodes is called
"leaf nodes"
• Any index level between the root node
and leaf node is called an "intermediate
level"
• The leaf nodes contain the data pages of
the table in the case of a clustered index.
• The root and intermediate nodes
contain index pages holding index
rows.
• Each index row contains a key value and
a pointer to either an intermediate-level page
of the B-tree or a leaf-level page of the index.
• The pages in each level of the index are
linked in a doubly-linked list.
Clustered Index
[Figure: clustered index B-tree]
Root: Abby, Bob, Carol, Dave
Intermediate node: Abby, Ada, Andy, Ann
Database and leaf node: Ada, Alan, Amanda, Amy
• A clustered index
sorts and stores the
data rows of the table
or view in order based
on the clustered index
key.
• The clustered index is
implemented as a B-
tree index structure
that supports fast
retrieval of the rows,
based on their
clustered index key
values.
The basic syntax to create a clustered index is
CREATE CLUSTERED INDEX Index_Name ON Schema.TableName(Column);
• A clustered index stores the data for the table based on the columns defined in the
create index statement. As such, only one clustered index can be defined for the
table because the data can only be stored and sorted one way per table.
Nonclustered Index
• Index Leaf Nodes and Corresponding Table Data
• Each index entry consists of the
indexed columns (the key,
column 2) and refers to the
corresponding table row
(via ROWID or RID).
• Unlike the index, the table data is
stored in a heap structure and is
not sorted at all.
• There is neither a relationship
between the rows stored in the
same table block nor is there any
connection between the blocks.
Nonclustered Index
[Figure: nonclustered index B-tree]
Root: Abby, Bob, Carol, Dave
Intermediate node: Abby, Ada, Andy, Ann
Leaf node: Ada, Alan, Amanda, Amy
Database: Amy, Ada, Amanda, Alan
• A nonclustered index can be
defined on a table or view
with a clustered index or on a
heap.
• Each index row in the
nonclustered index contains
the nonclustered key value
and a row locator
The basic syntax for a nonclustered index is
CREATE INDEX Index_Name ON Schema.TableName(Column);
• SQL Server supports
up to 999
nonclustered
indexes per table.
CLUSTERED VS. NONCLUSTERED INDEXES
• Clustered index: a SQL Server index that sorts and stores data
rows in a table, based on key values.
• Nonclustered index: a SQL Server index which contains a key
value and a pointer to the data in the heap or clustered index.
• The difference between clustered and nonclustered SQL
Server indexes is that
• a clustered index controls the physical order of the data pages.
• The data pages of a clustered index will always include all the columns
in the table, even if you only create the index on one column.
• The column(s) you specify as key columns affect how the pages are
stored in the B-tree index structure
• A nonclustered index does not affect the ordering and storing of the
data
Clustered and Nonclustered Indexes Interact
• Clustered indexes are always unique
– If you don’t specify unique when creating them, SQL Server may
add a “uniqueifier” to the index key
• Only used when there actually is a duplicate
• Adds 4 bytes to the key
• The clustering key is used in nonclustered indexes
– This allows SQL Server to go directly to the record from the
nonclustered index
– If there is no clustered index, a record identifier will be used instead
Leaf node of a clustered index on EmployeeID:
1 Jones John
2 Smith Mary
3 Adams Mark
4 Douglas Susan
Leaf node of a nonclustered index on LastName:
Adams 3
Douglas 4
Jones 1
Smith 2
Clustered and Nonclustered Indexes Interact
(continued)
• Another reason to keep the clustering key small!
• Consider the following query:
SELECT LastName, FirstName
FROM Employee
WHERE LastName = 'Douglas'
• When SQL Server uses the nonclustered index, it
– Traverses the nonclustered index until it finds the desired key
– Picks up the associated clustering key
– Traverses the clustered index to find the data
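The three steps above can be sketched with the Employee rows from the earlier slide. This is only an illustration: a dictionary and a sorted list stand in for the two B-trees, and the names `clustered_index`, `nonclustered_index`, and `lookup_by_last_name` are invented here.

```python
# Leaf level of the clustered index on EmployeeID: clustering key -> full row.
clustered_index = {
    1: ("Jones", "John"),
    2: ("Smith", "Mary"),
    3: ("Adams", "Mark"),
    4: ("Douglas", "Susan"),
}

# Leaf level of the nonclustered index on LastName:
# (LastName, clustering key) pairs kept in LastName order.
nonclustered_index = sorted(
    (last, emp_id) for emp_id, (last, _first) in clustered_index.items()
)


def lookup_by_last_name(last_name):
    """Return (LastName, FirstName) for every employee with this last name."""
    results = []
    for key, emp_id in nonclustered_index:       # 1. find the desired key
        if key == last_name:
            row = clustered_index[emp_id]        # 2-3. follow the clustering key
            results.append(row)
    return results
```

Every entry in the nonclustered index carries a copy of the clustering key, which is why a wide clustering key makes every nonclustered index wider too.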
Deciding what indexes go where?
• Indexes speed up access, but are costly to maintain
– Almost every update to a table requires altering both the data pages
and every index.
• All inserts and deletions affect all indexes
• Many updates will affect nonclustered indexes
• Sometimes less is more
– Not creating an index may sometimes be best
• Does the code for the transaction have a WHERE clause? Which columns are used?
Is a sort required?
• Selectivity
– Indexes, particularly non-clustered indexes, are primarily beneficial in
situations where there is a reasonably HIGH LEVEL of Selectivity within
the index.
• Selectivity is the percentage of values in the column that are unique
• The higher the percentage of unique values, the higher the selectivity
– If 80% of parts are either ‘red’ or ‘green’, the color column is not very selective
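A minimal sketch of the selectivity measure described above, taken here as distinct values divided by total values. The sample columns are invented so that 80% of the parts are red or green, matching the slide's example.

```python
def selectivity(values):
    """Fraction of distinct values in a column (1.0 means every value is unique)."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Hypothetical data: 8 of 10 parts are red or green.
part_colors = ["red"] * 5 + ["green"] * 3 + ["blue", "black"]
part_ids = list(range(10))

color_selectivity = selectivity(part_colors)  # 4 distinct / 10 rows
id_selectivity = selectivity(part_ids)        # 10 distinct / 10 rows
```

Under this measure the color column scores 0.4 and the ID column 1.0, so an index on the ID column narrows a search far more than one on color.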
Deciding what indexes go where?
Choosing Clustered Index
• Only one per table! Choose wisely
• By default, a primary key creates a clustered index
– Do you really want your primary key to be the clustered index?
– Option: create table myfooExample
(column1 int identity
primary key nonclustered,
column2 ….
)
– Changing the clustered index can be costly
• How long will it take? Do I have enough space?
Clustered Indexes Pros & Cons
• Pros
– Clustered indexes are best for queries where the columns in question will
frequently be the subject of
• RANGE queries (e.g., BETWEEN)
• GROUP BY with MAX, MIN, COUNT
– A search can go straight to a particular point in the data and just keep reading
sequentially from there.
– Clustered indexes are helpful with ORDER BY on the clustered key
Clustered Indexes Pros & Cons
• The Cons – two situations
– Don’t create a clustered index on a column just because it seems the thing to do
(e.g., the primary key default)
– Lots of inserts in non-sequential order
• Constant page splits, which involve data pages as well as index pages
• Choose a clustered key whose values are inserted sequentially
• Perhaps don’t use a clustered index at all?
Limits to indexes
There are a few limits to indexes.
• There can be only one clustered index per table.
• SQL Server supports up to 999 nonclustered indexes per table.
• An index key, clustered or nonclustered, can be a maximum of 16 columns and
900 bytes. (SQL Server 2016 and later raise this to 32 key columns, and to
1,700 bytes for nonclustered index keys.)
These are limits, not goals. Every index you create will take up space in your
database. The index will also need to be modified when inserts, updates, and
deletes are performed. This will lead to CPU and disk overhead, so craft indexes
carefully and test them thoroughly.
PRIMARY KEY AS A CLUSTERED INDEX
• Primary key: a constraint to enforce uniqueness in a table. The primary key
columns cannot hold NULL values.
• In SQL Server, when you create a primary key on a table, if a clustered index
is not defined and a nonclustered index is not specified, a unique clustered
index is created to enforce the constraint.
• However, there is no guarantee that this is the best choice for a clustered
index for that table.
• Make sure you are carefully considering this in your indexing strategy.
Unique Index
• An index that ensures the uniqueness of each value in the indexed column.
• If the index is a composite, the uniqueness is enforced across the columns as a whole,
not on the individual columns.
• For example, if you were to create an index on the FirstName and LastName
columns in a table, the names together must be unique, but the
individual names can be duplicated.
• A unique index is automatically created when you define a primary key or unique
constraint:
• Primary key: When you define a primary key constraint on one or more
columns, SQL Server automatically creates a unique, clustered index if a
clustered index does not already exist on the table or view. However, you can
override the default behavior and define a unique, nonclustered index on the
primary key.
• Unique: When you define a unique constraint, SQL Server automatically creates
a unique, nonclustered index. You can specify that a unique clustered index be
created if a clustered index does not already exist on the table.
• A unique index ensures that the index key contains no duplicate values.
Both clustered and nonclustered indexes can be unique.
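The composite-uniqueness rule above can be illustrated with a toy model in which the whole key tuple, not each column, must be unique. The class name `UniqueCompositeIndex` and the sample names are hypothetical.

```python
class UniqueCompositeIndex:
    """Toy model of a unique index on (FirstName, LastName)."""

    def __init__(self):
        self._keys = set()

    def insert(self, first_name, last_name):
        key = (first_name, last_name)
        if key in self._keys:
            # The combination already exists, so the insert is rejected.
            raise ValueError(f"duplicate key {key}")
        self._keys.add(key)


idx = UniqueCompositeIndex()
idx.insert("Mary", "Smith")
idx.insert("Mary", "Jones")   # same first name: allowed
idx.insert("John", "Smith")   # same last name: allowed
```

A second `idx.insert("Mary", "Smith")` would raise `ValueError`, because only the full combination has to be unique.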
Filtered index
• An optimized nonclustered index, especially suited to cover queries that select from a
well-defined subset of data.
• SQL Server 2008 introduced filtered indexes: a filtered index is an index with a WHERE clause
• Filtered indexes can provide the following advantages over full-table indexes:
• Improved query performance and plan quality
• Reduced index maintenance costs
• Reduced index storage costs
A well-designed filtered index improves query performance and execution plan quality
because it is smaller than a full-table nonclustered index and has filtered statistics
An index is maintained only when data manipulation language (DML) statements affect
the data in the index. A filtered index reduces index maintenance costs compared with a
full-table nonclustered index because it is smaller and is only maintained when the data
in the index is changed.
Creating a filtered index can reduce disk storage for nonclustered indexes when a full-table
index is not necessary.
Filtered index
Design Considerations
• When a column only has a small number of relevant values for queries, you can create a
filtered index on the subset of values. For example, when the values in a column are mostly
NULL and the query selects only from the non-NULL values, you can create a filtered index for
the non-NULL data rows. The resulting index will be smaller and cost less to maintain than a
full-table nonclustered index defined on the same key columns.
• When a table has heterogeneous data rows, you can create a filtered index for one or more
categories of data. This can improve the performance of queries on these data rows by narrowing
the focus of a query to a specific area of the table. Again, the resulting index will be smaller and
cost less to maintain than a full-table nonclustered index.
CREATE NONCLUSTERED INDEX FIBillOfMaterialsWithEndDate
ON Production.BillOfMaterials (ComponentID, StartDate)
WHERE EndDate IS NOT NULL;
To ensure that a filtered index is used in a SQL query, name it in an index hint:
SELECT ComponentID, StartDate
FROM Production.BillOfMaterials WITH ( INDEX ( FIBillOfMaterialsWithEndDate ) )
WHERE EndDate IN ('20000825', '20000908', '20000918');
Covering Indexes
• When a nonclustered index includes all the data requested in a query (both the items
in the SELECT list and the WHERE clause), it is called a covering index
• With a covering index, there is no need to access the actual data pages
– Only the leaf nodes of the nonclustered index are accessed
– For example, your query might retrieve the FirstName, LastName and DOB columns from a
table, based on a value in the ContactID column. You can create a covering index on
ContactID that also contains all three retrieved columns.
• Because the leaf node of a clustered index is the data itself, a clustered index covers all
queries
Leaf node of a nonclustered index on LastName, FirstName, Birthdate
Adams Mark 1/14/1956 3
Douglas Susan 12/12/1947 4
Jones John 4/15/1967 1
Smith Mary 7/14/1970 2
The last column is EmployeeID.
Remember that the clustering key
is always included in a
nonclustered index.
Non-Key Index Columns
• SQL Server 2005 and later allow you to include columns in a non-clustered
index that are not part of the key
– Allows the index to cover more queries
– Included columns only appear in the leaf level of the index
– Up to 1,023 additional columns
– Can include data types that cannot be key columns
• Except text, ntext, and image data types
• Syntax
CREATE [ UNIQUE ] NONCLUSTERED INDEX index_name
ON <object> ( column [ ASC | DESC ] [ ,...n ] )
[ INCLUDE ( column_name [ ,...n ] ) ]
• Example
CREATE NONCLUSTERED INDEX NameRegion_IDX
ON Employees(LastName)
INCLUDE (Region)
KEY VS. NONKEY COLUMNS
• Key columns: the columns specified to create a clustered or nonclustered index.
• Nonkey columns: columns added to the INCLUDE clause of a nonclustered index.
• The basic syntax to create a nonclustered index with nonkey columns is:
• CREATE INDEX Index_Name ON Schema.TableName(Column) INCLUDE
(ColumnA, ColumnB);
• A column cannot be both a key and a non-key. It is either a key column or a non-
key, included column.
• The difference lies in where the data about the column is stored in the B-tree.
Clustered and nonclustered key columns are stored at every level of the index –
the columns appear on the leaf and all intermediate levels. A nonkey column will
only be stored at the leaf level, however.
• There are benefits to using non-key columns.
• Columns can be accessed with an index scan.
• Data types not allowed in key columns are allowed in nonkey columns. All data
types but text, ntext, and image are allowed.
• Included columns do not count against the 900 byte index key limit enforced by
SQL Server.
COVERING INDEXES EXAMPLES
The query we want to use is
SELECT ProductID, Name, ProductNumber, Color
FROM dbo.Products
WHERE Color = 'Black';
The first index is nonclustered, with two key columns:
CREATE INDEX IX_Products_Name_ProductNumber ON dbo.Products(Name,
ProductNumber);
The second is also nonclustered, with two key columns and three nonkey
columns:
CREATE INDEX IX_Products_Name_ProductNumber_ColorClassStyle ON
dbo.Products(Name, ProductNumber)
INCLUDE (Color, Class, Style);
In this case, the first index would not be a covering index for that query, because
it does not contain Color. The second index would be a covering index for that
specific query, assuming ProductID is the clustering key and is therefore carried in
the nonclustered index. Because Color is not a key column, the second index is read
with an index scan rather than a seek.
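The covering test for the two Products indexes can be written as a simple set check. This is a sketch under one assumption that the slides do not state: ProductID is the clustering key, so it is available in every nonclustered index. The function name `covers` is made up.

```python
def covers(key_cols, include_cols, clustering_key, needed_cols):
    """True if every column the query needs is present in the index leaf level."""
    available = set(key_cols) | set(include_cols) | set(clustering_key)
    return set(needed_cols) <= available


# Columns used by the query (SELECT list plus WHERE clause).
needed = ["ProductID", "Name", "ProductNumber", "Color"]
clustering_key = ["ProductID"]  # assumption, not stated in the slides

first_covers = covers(["Name", "ProductNumber"], [], clustering_key, needed)
second_covers = covers(
    ["Name", "ProductNumber"], ["Color", "Class", "Style"], clustering_key, needed
)
```

Here `first_covers` is `False` (Color is missing) and `second_covers` is `True`.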
Column Store Index Basic
There are two types of storage available in the database: RowStore and ColumnStore.
In RowStore, data rows are placed sequentially on a page, while in ColumnStore values
from a single column, but from multiple rows, are stored adjacently. A ColumnStore
index works using ColumnStore storage.
In SQL Server 2012, we cannot perform DML (INSERT, UPDATE, DELETE) operations on a table
that has a nonclustered ColumnStore index, because the index puts the table in read-only
mode. Later versions relax this: clustered ColumnStore indexes are updatable from
SQL Server 2014, and nonclustered ones from SQL Server 2016.
So this feature is a good fit for a data warehouse, where most operations
are read only.
Creating Column Store Index
Creating a ColumnStore Index is the same as creating a NonClustered Index except we
need to add the ColumnStore keyword as shown below.
The syntax of a ColumnStore Index is:
CREATE NONCLUSTERED COLUMNSTORE INDEX Index_Name ON Table_Name
(Column1, Column2, ... Column N)
Example:
-- Creating a nonclustered ColumnStore index on 3 columns
CREATE NONCLUSTERED COLUMNSTORE INDEX [ColumnStore__Test_Person] ON
[dbo].[Test_Person]([FirstName], [MiddleName], [LastName]);
• In this example, the query cost when using the ColumnStore index is about 4 times
lower than with the traditional nonclustered index; the gain varies with the data and the query.
Fill Factor
• When you create an index, the fill factor option indicates how full the
leaf-level pages are when the index is created or rebuilt.
• Valid values are 0 to 100.
• A fill factor of 0 (the default) behaves like 100: all of the leaf-level pages are filled.
• If data is always inserted at the end of the table, then the fill factor could be
between 90 and 100 percent, since the data will never be inserted into the
middle of a page.
• If the data can be inserted anywhere in the table, then a fill factor of 60 to 80
percent could be appropriate, based on the INSERT, UPDATE and DELETE
activity.
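To get a feel for the space cost of a lower fill factor, here is a simplified estimate of how many leaf pages an index needs after a rebuild. It assumes uniform row sizes and ignores page headers; the function name and the figure of 100 rows per full page are illustrative assumptions.

```python
import math


def leaf_pages_needed(row_count, rows_per_full_page, fill_factor):
    """Estimate leaf pages after a rebuild at the given fill factor (0 to 100)."""
    if not 0 <= fill_factor <= 100:
        raise ValueError("fill factor must be between 0 and 100")
    effective = 100 if fill_factor == 0 else fill_factor  # 0 is treated as 100
    rows_per_page = max(1, rows_per_full_page * effective // 100)
    return math.ceil(row_count / rows_per_page)


pages_at_100 = leaf_pages_needed(100_000, 100, 100)
pages_at_80 = leaf_pages_needed(100_000, 100, 80)
pages_at_60 = leaf_pages_needed(100_000, 100, 60)
```

In this model the index grows from 1,000 pages at fill factor 100 to 1,250 at 80 and 1,667 at 60. That extra space is what buys room for inserts into the middle of pages.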
B-Tree Index Data Structure
• SQL Server indexes are based on B-trees
– Special records called nodes that allow keyed access to data
– Two kinds of nodes are special
• Root
• Leaf
[Figure: B-tree with a root node, intermediate nodes, leaf nodes, and data pages]
• If there are enough records, intermediate levels may be added as well.
• Clustered index leaf-level pages contain the data in the table.
• Nonclustered index leaf-level pages contain the key value and a pointer to the
data row in the clustered index or heap.
SQL Server B-Tree Rules
• Root and intermediate nodes point only to other nodes
• Only leaf nodes point to data
• The number of nodes between the root and any leaf is the same for all leaves
• A B+tree node can hold more than one key; in practice a node typically stores thousands
of keys, so the branching factor of a B+tree is very large.
• B-trees are always sorted
• The tree will be maintained during insertion, deletion, and updating so that these rules are
met
– When records are inserted or updated, nodes may split
– When records are deleted, nodes may be collapsed
• B+trees have all the key values in their leaf nodes. All the leaf nodes of a B+tree are at
the same height, which implies that every index lookup will take the same number of B+tree
lookups to find a value.
• Within a B+tree, all leaf nodes are linked together in a linked list, left to right, and since
the values at the leaf nodes are sorted, range lookups are very efficient.
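The benefit of linked, sorted leaf pages for range lookups can be sketched as follows. For brevity the scan starts at the first leaf instead of descending from the root, and the class `LeafPage` and the page contents are invented for illustration.

```python
class LeafPage:
    """A leaf page: sorted keys plus a link to the next leaf page."""

    def __init__(self, keys):
        self.keys = keys
        self.next = None


def link(pages):
    """Chain the pages left to right and return the first one."""
    for left, right in zip(pages, pages[1:]):
        left.next = right
    return pages[0]


def range_scan(start_page, low, high):
    """Collect keys in [low, high] by walking the leaf chain in order."""
    found = []
    page = start_page
    while page is not None:
        for key in page.keys:
            if key > high:
                return found          # keys are sorted, so the scan can stop
            if key >= low:
                found.append(key)
        page = page.next
    return found


first_leaf = link([
    LeafPage(["Abby", "Ada"]),
    LeafPage(["Alan", "Alice"]),
    LeafPage(["Amanda", "Amy"]),
])
names = range_scan(first_leaf, "Ada", "Amanda")
```

Once the first qualifying key is found, the scan only follows `next` links and never has to go back up the tree.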
What Is a Node?
• A page that contains key and pointer pairs
[Figure: a node page holding a list of entries, each a Key and a Pointer]
Splitting a B-Tree Node
[Figure: starting tree; in the lines below, a bar | marks a page that has split in two]
Root (Level 0): Abby, Bob, Carol, Dave
Node (Level 1): Abby, Ada, Andy, Ann
Leaf (Level 2): Ada, Alan, Amanda, Amy
DB: Bob, Alan, Amanda, Carol, Amy, Dave, Ada
Let’s Add Alice
• Step 1: Split the leaf node
Leaf: Ada, Alan, Alice | Amanda, Amy
DB: Bob, Alan, Amanda, Carol, Amy, Dave, Ada, Alice
Adding Alice (continued)
• Step 2: Split the next level up
Node: Abby, Ada, Amanda | Andy, Ann
Leaf: Ada, Alan, Alice | Amanda, Amy
Adding Alice (continued)
• Step 3: Split the root
Root: Abby, Andy, Bob | Carol, Dave
Node: Abby, Ada, Amanda | Andy, Ann
Leaf: Ada, Alan, Alice | Amanda, Amy
Adding Alice (continued)
• When the root splits, the tree grows another level
Root (Level 0): Abby, Carol
Node (Level 1): Abby, Andy, Bob | Carol, Dave
Node (Level 2): Abby, Ada, Amanda | Andy, Ann
Leaf (Level 3): Ada, Alan, Alice | Amanda, Amy
DB: Bob, Alan, Amanda, Carol, Amy, Dave, Ada, Alice
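The cascade of splits in the Alice example can be replayed with a small sketch. It assumes each page holds at most four entries (which is how the slides appear to be drawn) and that a parent stores the lowest key of each child; `insert_with_split` is a made-up helper, not SQL Server's actual split algorithm.

```python
import bisect

PAGE_CAPACITY = 4  # assumption: four entries per page, as in the diagrams


def insert_with_split(page, key, capacity=PAGE_CAPACITY):
    """Insert key into a sorted page; return one page, or two if it overflowed."""
    entries = list(page)
    bisect.insort(entries, key)
    if len(entries) <= capacity:
        return [entries]
    mid = (len(entries) + 1) // 2  # the left page keeps the extra entry
    return [entries[:mid], entries[mid:]]


# Step 1: the leaf is full, so adding Alice splits it.
leaf_pages = insert_with_split(["Ada", "Alan", "Amanda", "Amy"], "Alice")
# Step 2: the new right-hand leaf needs an entry (its lowest key) in the parent.
node_pages = insert_with_split(["Abby", "Ada", "Andy", "Ann"], leaf_pages[1][0])
# Step 3: the new right-hand node needs an entry in the root, which also splits.
root_pages = insert_with_split(["Abby", "Bob", "Carol", "Dave"], node_pages[1][0])
# A new root is created above the two halves, so the tree grows one level.
new_root = [page[0] for page in root_pages]
```

One inserted row touched a page at every level here, which is the worst case; most inserts land on a page with free space and split nothing.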
Page splits cause fragmentation
• Two types of fragmentation
– Data pages in a clustered table
– Index pages in all indexes
• Fragmentation happens because these pages must be kept in order
• Data page fragmentation happens when a new record must be added to a page that is full
– Consider an Employee table with a clustered index on LastName, FirstName
– A new employee, Peter Dent, is hired
Extent
Adams, Carol
Ally, Kent
Baccus, Mary
David, Sue
Dulles, Kelly
Edom, Mike
Farly, Lee
Frank, Joe
Ollen, Carol
Oppus, Larry...
Index Fragmentation
• Index page fragmentation occurs when a new key-pointer pair must be
added to an index page that is full
– Consider an Employee table with a nonclustered index on Social
Security Number
• Employee 048-12-9875 is added
036-11-9987, pointer
036-33-9874, pointer
038-87-8373, pointer
046-11-9987, pointer
048-33-9874, pointer
052-87-8373, pointer
116-11-9987, pointer
116-33-9874, pointer ...
124-11-9987, pointer
124-33-9874, pointer
125-87-8373, pointer
Extent
Why use B+tree?
• B+trees are used for an obvious reason: speed.
• Memory is limited, so not all of the data can reside in memory, and
the majority of the data has to be
saved on disk.
• Disk is a lot slower than memory because it has
moving parts.
• If there were no tree structure to do the lookup, then to find a value in a
database, the DBMS would have to do a sequential scan of all the records.
• Now imagine a data size of a billion rows, and you can clearly see that a
sequential scan is going to take very long.
• But with a B+tree, it is possible to store a
billion key values (with pointers to a billion
rows) at a height of 3, 4 or 5, so that every
key lookup out of the billion keys is going
to take 3, 4 or 5 disk accesses, which is a
huge saving.
This goes to show the effectiveness of a
B+tree index, more than 16 million key
values can be stored in a B+tree of height
1 and every key value can be accessed in
exactly 2 lookups.
How is B+tree structured?
• B+trees are normally structured in such a
way that the size of a node is chosen
according to the page size.
• Why? Because whenever data is accessed
on disk, instead of reading a few bits, a
whole page of data is read, because that is
much cheaper.
• Let us look at an example. Consider InnoDB, whose page size is 16KB,
and suppose we have an index on an integer
column of size 4 bytes.
• Ignoring pointer and page-header overhead, a node can contain at most
16 * 1024 / 4 = 4096 keys, and a node can have at
most 4097 children.
• So for a B+tree of height 1, the root node
has 4096 keys and the nodes at height 1
(the leaf nodes) hold 4096 * 4097 =
16,781,312 key values.
• So the size of the index values has a direct bearing on performance!
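The arithmetic of the InnoDB example can be checked with a few lines. This follows the slide's simplified model (a page holds only keys, with one more child pointer than keys); the function name `leaf_key_capacity` is an invention for this sketch.

```python
def leaf_key_capacity(page_size, key_size, height):
    """Key values held at the leaf level of a full tree of the given height.

    Simplified model from the example: a node holds page_size // key_size
    keys and has one more child than it has keys.
    """
    keys_per_node = page_size // key_size
    children_per_node = keys_per_node + 1
    return keys_per_node * children_per_node ** height


PAGE_SIZE = 16 * 1024   # 16KB page
KEY_SIZE = 4            # 4-byte integer key

keys_in_root_only = leaf_key_capacity(PAGE_SIZE, KEY_SIZE, 0)
keys_at_height_1 = leaf_key_capacity(PAGE_SIZE, KEY_SIZE, 1)
```

A real page also stores pointers and header data, so actual capacities are lower than this upper bound.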
How important is the size of the index values?
As can be seen from the above example, the size of the index values plays a very
important role for the following reasons:
• The longer the index key, the fewer values fit in a node,
and hence the taller the B+tree.
• The taller the tree, the more disk accesses are needed.
• The more disk accesses, the lower the performance.
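To make the chain of reasoning above concrete, this sketch compares the tree height needed for a billion keys with a 4-byte key and with a 64-byte key. It reuses the simplified capacity model from the InnoDB example (no pointer or header overhead); the 64-byte key size and the function name `min_height` are assumptions chosen for illustration.

```python
def min_height(total_keys, page_size, key_size):
    """Smallest tree height whose leaf level can hold total_keys key values."""
    keys_per_node = page_size // key_size
    children_per_node = keys_per_node + 1
    height = 0
    capacity = keys_per_node          # a tree of height 0 is a single node
    while capacity < total_keys:
        height += 1
        capacity *= children_per_node
    return height


PAGE_SIZE = 16 * 1024
BILLION = 10 ** 9

height_small_key = min_height(BILLION, PAGE_SIZE, 4)    # 4-byte key
height_large_key = min_height(BILLION, PAGE_SIZE, 64)   # 64-byte key
```

In this model the 4-byte key needs height 2 and the 64-byte key needs height 3, so every lookup on the wider key costs one more page access.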
Index Design
• For tables that are heavily updated, use as few columns as possible in the
index, and don’t over-index the tables.
• If a table contains a lot of data but data modifications are low, use as many
indexes as necessary to improve query performance
• For clustered indexes, try to keep the length of the indexed columns as short
as possible. Ideally, try to implement your clustered indexes on unique
columns that do not permit null values.
• The uniqueness of values in a column affects index performance. In general,
the more duplicate values you have in a column, the more poorly the index
performs.
Index design should take into account a number of considerations.
Index Design
• In addition, indexes are automatically updated when the data rows themselves
are updated, which can lead to additional overhead and can affect
performance.
• Due to the storage and sorting impacts, be sure to carefully determine the best
column for the clustered index.
• The number of columns in the clustered (or non clustered) index can have
significant performance implications with heavy INSERT, UPDATE and DELETE
activity in your database.
• For composite indexes, take into consideration the order of the columns in the
index definition. Columns that will be used in comparison expressions in the
WHERE clause (such as WHERE FirstName = 'Charlie') should be listed first.
• You can also index computed columns if they meet certain requirements. For
example, the expression used to generate the values must be deterministic
(which means it always returns the same result for a specified set of inputs).
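Why column order matters in a composite index can be seen by modelling the index as a list of tuples sorted by (FirstName, LastName). A predicate on the leading column can jump straight to the matching entries, while a predicate on the second column alone has to look at every entry. The data and function names are made up for this sketch.

```python
import bisect

# Index entries sorted by (FirstName, LastName); FirstName is the leading column.
index = sorted([
    ("Charlie", "Brown"), ("Alice", "Smith"), ("Charlie", "Adams"),
    ("Bob", "Jones"), ("Dave", "Smith"),
])


def seek_leading(entries, first_name):
    """Binary-search to the first match, then read forward while it matches."""
    start = bisect.bisect_left(entries, (first_name,))
    matches = []
    for entry in entries[start:]:
        if entry[0] != first_name:
            break
        matches.append(entry)
    return matches


def scan_second(entries, last_name):
    """No ordering on LastName alone, so every entry must be checked."""
    return [entry for entry in entries if entry[1] == last_name]


charlies = seek_leading(index, "Charlie")
smiths = scan_second(index, "Smith")
```

This is the reasoning behind listing the column used in WHERE comparisons first: only the leading column's order can be used to position directly within the index.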
Resolving Fragmentation
Heap Tables:
• For heap tables, resolving fragmentation is not as easy as for clustered tables.
The following are different options you can take to resolve the fragmentation:
• Create a clustered index
• Create a new table and insert data from the heap table into the new table
based on some sort order
• Export the data, truncate the table and import the data back into the table
Clustered Tables:
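• Resolving the fragmentation for a clustered table can be done easily by
rebuilding or reorganizing your clustered index. This was shown in this
previous tip: SQL Server 2000 to 2005 Crosswalk - Index Rebuilds.
The SQL Server 2000 commands are listed here; from SQL Server 2005 onward,
ALTER INDEX ... REBUILD and ALTER INDEX ... REORGANIZE replace them.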
DBCC DBREINDEX
DBCC INDEXDEFRAG
( { database_name | database_id | 0 }
, { table_name | table_id}
, { index_name | index_id }
)