Physical Database
Design and Tuning
R&G - Chapter 20
Contents
• Physical Database Design
• Database Workloads
• Physical Design and Tuning Decisions
• Need for Tuning
• Guidelines for Index Selection
• Clustering & Tools for Index Selection
• Database Tuning: Tuning Indexes
• Tuning the Conceptual Schema
• Tuning Queries and Views
• Impact of Concurrency
• Benchmarking
Physical Database Design
• Process of producing a description of the implementation
of the database on secondary storage.
• It describes the base relations, file organizations, and
indexes used to achieve efficient access to the data, and
any associated integrity constraints and security
measures.
Physical Database Design
• We will describe the plan for how to build the tables,
including appropriate data types, field sizes, attribute
domains, and indexes.
• The plan should have enough detail that if someone else
were to use the plan to build a database, the database they
build is the same as the one you are intending to create.
• The conceptual design and logical design were
independent of physical considerations. Now, we not only
know that we want a relational model, we have selected a
database management system (DBMS) such as Access or
Oracle, and we focus on those physical considerations.
Logical vs. Physical Design:
• Logical database design is concerned with what to
store;
• physical database design is concerned with how to
store it.
Introduction
• We will be talking at length about “database
design”
– Conceptual Schema: info to capture, tables, columns,
views, etc.
– Physical Schema: indexes, clustering, etc.
• Physical design linked tightly to query optimization
– So we’ll study this “bottom up”
– But note: DB design is usually “top-down”
• conceptual then physical. Then iterate.
• We must begin by understanding the workload:
– The most important queries and how often they arise.
– The most important updates and how often they arise.
– The desired performance for these queries and updates.
Understanding the Workload
• For each query in the workload:
– Which relations does it access?
– Which attributes are retrieved?
– Which attributes are involved in selection/join conditions?
How selective are these conditions likely to be?
• For each update in the workload:
– Which attributes are involved in selection/join conditions?
How selective are these conditions likely to be?
– The type of update (INSERT/DELETE/UPDATE), and the
attributes that are affected.
– For UPDATE commands, the fields that are modified.
Creating an ISUD Chart
Employee Table
Transaction    Frequency   % of Table   Name   Salary   Address
Payroll Run    monthly     100          S      S        S
Add Emps       daily       0.1          I      I        I
Delete Emps    daily       0.1          D      D        D
Give Raises    monthly     10           S      U        NA

(I = Insert, S = Select, U = Update, D = Delete)
Physical Design and Tuning
Decisions
• Choice of indexes to create
– Which relations to index, and on which field(s)
– What field(s) should be the search key
– Should we build several indexes?
– For each index, should it be clustered or unclustered?
• Tuning the conceptual schema
– Alternative normalization
– Denormalization
– Vertical partitioning
– Views
• Query and transaction tuning
– Frequently executed queries are rewritten to run faster.
Need for Database Tuning
• It is hard to get a detailed workload at initial design time.
• The distinction between design and tuning is somewhat arbitrary:
– The design process ends once a conceptual schema and a
set of clustering and indexing decisions are made.
– The tuning process covers all subsequent changes to the
conceptual schema or the indexes.
Index Selection
• One approach:
– Consider most important queries.
– Consider best plan using the current indexes, and see if
better plan is possible with an additional index.
– If so, create it.
• Before creating an index, must also consider the
impact on updates in the workload.
– Trade-off of slowing some updates in order to speed up
some queries.
Whether to Index (Guideline 1)
• Do not build an index unless some query (including the
query components of updates) benefits from it.
• Whenever possible, choose indexes that speed up more than
one query.
Multi-attribute Search Keys (Guideline 3)
– Consider a multi-attribute search key in two situations:
1. A WHERE clause includes conditions on
more than one attribute of a relation.
2. It enables index-only evaluation
strategies (i.e., accessing the relation can be
avoided) for important queries.
Whether to Cluster (Guideline 4)
• As a rule of thumb, range queries are likely
to benefit the most from clustering.
• If an index enables an index-only evaluation
strategy for the query it is intended to speed
up, the index need not be clustered.
Hash versus Tree Index (Guideline 5)
– A hash index is better in the following situations:
• The index is intended to support an index
nested loops join; the indexed relation is the
inner relation, and the search key includes
the join columns.
• There is a very important equality query, and
no range queries, involving the search key
attributes.
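As a hedged sketch (PostgreSQL syntax; the index name is illustrative, and the Emp table follows the examples below), a hash index supporting an index nested loops join on dno could be:

CREATE INDEX emp_dno_hidx ON Emp USING HASH (dno);
-- equality probes on E.dno during the join use the hash index;
-- a range query on dno would instead need the default B+ tree index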
Balancing the Cost of Index
Maintenance (Guideline 6)
• If maintaining an index slows down frequent
update operations, consider dropping the
index.
• Keep in mind, however, that adding an index
may well speed up a given update operation.
– E.g., an index on employee IDs could speed up the
operation of increasing the salary of an employee
(specified by ID).
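A minimal sketch of this point (assuming an Emp table with eid and sal fields; names are illustrative):

CREATE INDEX emp_eid_idx ON Emp (eid);

-- the index locates the single tuple directly instead of scanning all of Emp:
UPDATE Emp
SET sal = sal * 1.10
WHERE eid = 12345;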
Example 1
• Hash index on D.dname supports ‘Toy’ selection.
– Given this, index on D.dno is not needed.
• Hash index on E.dno allows us to get matching
(inner) Emp tuples for each selected (outer) Dept
tuple.
• What if the WHERE clause also included: “... AND E.age=25”?
– Could retrieve Emp tuples using index on E.age, then join
with Dept tuples satisfying dname selection. Comparable to
strategy that used E.dno index.
– So, if E.age index is already created, this query provides
much less motivation for adding an E.dno index.
SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE D.dname='Toy' AND E.dno=D.dno
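For concreteness, a sketch of the indexes this plan assumes (PostgreSQL syntax; index names illustrative):

CREATE INDEX dept_dname_hidx ON Dept USING HASH (dname);  -- supports the 'Toy' selection
CREATE INDEX emp_dno_hidx ON Emp USING HASH (dno);        -- probes matching Emp tuples per Dept tuple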
Example 2
• All selections are on Emp so it should be the outer
relation in any Index NL join.
– Suggests that we build a B+ tree index on D.dno.
• What index should we build on Emp?
– B+ tree on E.sal could be used, OR an index on E.hobby
could be used. Only one of these is needed, and which is
better depends upon the selectivity of the conditions.
• As a rule of thumb, equality selections are more selective than
range selections.
• As both examples indicate, our choice of indexes is
guided by the plan(s) that we expect an optimizer to
consider for a query. Have to understand optimizers!
SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE E.sal BETWEEN 10000 AND 20000
AND E.hobby='Stamps' AND E.dno=D.dno
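Again as a sketch (names illustrative), we would build the D.dno index plus only one of the two Emp indexes, whichever condition is more selective:

CREATE INDEX dept_dno_idx ON Dept (dno);      -- B+ tree for the index NL join
CREATE INDEX emp_hobby_idx ON Emp (hobby);    -- if the hobby equality is more selective
-- or: CREATE INDEX emp_sal_idx ON Emp (sal); -- if the salary range is more selective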
Clustering and Indexing
• Clustered indexes can be especially
important when accessing the inner relation
in an index nested loops join.
• Revisit the same example: should the
indexes used be clustered?
• An unclustered index on D.dname suffices,
since the selection retrieves only a few Dept tuples.
• On the other hand, Emp is the inner relation
in an index NL join, and dno is not a candidate
key, so many Emp tuples can match each dno value.
• The index on E.dno should therefore be clustered.
SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE D.dname='Toy' AND E.dno=D.dno
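Declaring a clustered index is DBMS-specific; as a sketch (names illustrative), in Microsoft SQL Server:

CREATE CLUSTERED INDEX emp_dno_cidx ON Emp (dno);

-- PostgreSQL approximates this with a one-time physical reordering:
-- CREATE INDEX emp_dno_idx ON Emp (dno);
-- CLUSTER Emp USING emp_dno_idx;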
Examples of Clustering
• B+ tree index on E.age can be
used to get qualifying tuples.
– How selective is the condition?
– Is the index clustered?
• Consider the GROUP BY query.
– If many tuples have E.age > 10,
using E.age index and sorting the
retrieved tuples may be costly.
– Clustered E.dno index may be
better!
• Equality queries and duplicates:
– Clustering on E.hobby helps!
SELECT E.dno
FROM Emp E
WHERE E.age>40
SELECT E.dno, COUNT (*)
FROM Emp E
WHERE E.age>10
GROUP BY E.dno
SELECT E.dno
FROM Emp E
WHERE E.hobby='Stamps'
Impact of Clustering
• The benefit of an unclustered index drops rapidly as the
fraction of tuples retrieved grows; beyond a small fraction
of the relation, a full scan becomes cheaper. A clustered
index remains useful over a much wider range.
Co-clustering Two Relations
• Co-clustering can speed up joins, in particular
key/foreign-key joins corresponding to 1:N
relationships.
• A sequential scan of either relation becomes
slower.
• All inserts, deletes, and updates that alter
record lengths become slower, due to the
overhead involved in maintaining the
clustering.
Index-Only Plans
• A number of queries can be answered without retrieving
any tuples from one or more of the relations involved,
if a suitable index is available.

SELECT D.mgr
FROM Dept D, Emp E
WHERE D.dno=E.dno
-- index-only on Emp with <E.dno>

SELECT D.mgr, E.eid
FROM Dept D, Emp E
WHERE D.dno=E.dno
-- index-only on Emp with <E.dno, E.eid>

SELECT E.dno, COUNT(*)
FROM Emp E
GROUP BY E.dno
-- index-only with <E.dno>

SELECT E.dno, MIN(E.sal)
FROM Emp E
GROUP BY E.dno
-- index-only with <E.dno, E.sal> (B-tree trick: within each dno group,
-- the first sal entry in the B+ tree is the minimum)

SELECT AVG(E.sal)
FROM Emp E
WHERE E.age=25 AND
E.sal BETWEEN 3000 AND 5000
-- index-only with <E.age, E.sal> or <E.sal, E.age>
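As a sketch (index name illustrative), the composite index behind the MIN query above could be created as:

CREATE INDEX emp_dno_sal_idx ON Emp (dno, sal);
-- a scan of this index alone answers the GROUP BY query: entries are sorted
-- by dno, then sal, so the first entry of each dno group is its minimum sal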
Tools to Assist in Index Selection
• First generation of such tools:
– Index tuning wizards, or
– Index advisors
• Drawback of these systems:
– They had to replicate the database query
optimizer's cost model.
• The DB2 Index Advisor
– A tool for automatic index recommendation, given a
workload.
– The workload is described in a table called ADVISE_WORKLOAD.
– It is populated either:
• with SQL statements from the DB2 dynamic SQL statement
cache (i.e., recently executed statements),
• with SQL statements statically compiled in packages, or
• with SQL statements captured by the online monitor called
Query Patroller.
• Output: SQL DDL statements whose execution creates
the recommended indexes.
Tools to Assist in Index Selection
• The Microsoft SQL Server 2000 Index Tuning
Wizard
– A tuning wizard integrated with the database query
optimizer.
– Three tuning modes let the user trade off the running
time of the analysis against the number of candidate index
configurations examined: fast, medium, and thorough,
with fast having the lowest running time and thorough
examining the largest number of configurations.
– The user can bound the maximum space allowed for indexes.
– A sampling mode can reduce the running time of the analysis.
– Table scaling lets the user specify anticipated table sizes
for the analysis.
Overview of Database Tuning
• Actual use of the DB provides a valuable source
of detailed information that can be used to
refine the initial design:
– Original assumptions are replaced by observed behavior.
– The initial workload is validated.
– Initial guesses about the size of data can be
replaced with actual statistics.
• Tuning is important for getting the best possible performance.
• Three kinds of tuning: tuning indexes, tuning
the conceptual schema, and tuning queries.
Tuning Indexes
• Queries and updates considered important at
initial design time may turn out not to be very frequent.
• The observed workload may also identify some
new queries and updates.
• The initial choice of indexes has to be reviewed
in light of this new information.
• Some original indexes may be dropped and
new ones added.
Tuning Indexes continues…
SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE D.dname='Toy' AND E.dno=D.dno
• Suppose the original plan uses an index-only scan,
with Emp as the inner relation.
• If this query takes an unexpectedly long time to
execute, consider replacing that plan by building
a clustered index on the dno field.
Tuning Indexes continues…
• In addition, we have to periodically reorganize
indexes:
– E.g., a static index (such as an ISAM index) may have
developed long overflow chains; dropping and rebuilding it,
if feasible, improves access times through this index.
– For a dynamic structure (a B+ tree): if the implementation does
not merge pages on deletes, space occupancy can
decrease considerably in some situations. This in turn
makes the size of the index (in pages) larger than
necessary, and could increase the height and therefore
the access time.
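As a sketch of the rebuild operation (PostgreSQL syntax; the index name is illustrative):

REINDEX INDEX emp_dno_idx;   -- rebuild one index, discarding dead space
REINDEX TABLE Emp;           -- or rebuild every index on the table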
Tuning the Conceptual Schema
• If the initial schema does not meet our performance
objectives for the given workload under any set of
physical design choices, we may have to redesign the
conceptual schema.
• Such a change is called schema evolution.
• Issues involved in tuning the conceptual schema:
– We may decide to settle for a 3NF design instead of BCNF.
– Between 3NF and BCNF, our choice should be guided by
the workload.
– Sometimes we might decide to further decompose a
relation that is already in BCNF.
– We might denormalize.
– We might partition the relation horizontally.
Tuning Queries and Views
• If a query runs slower than expected, check if an index needs
to be re-built, or if statistics are too old and need to be refreshed.
• Sometimes, the DBMS may not be executing the plan you had
in mind. Common areas of optimizer weakness:
– Selections involving null values (bad selectivity estimates)
– Selections involving arithmetic or string expressions (ditto)
– Selections involving OR conditions (ditto)
– Complex, correlated subqueries
– Lack of evaluation features like index-only strategies or certain join
methods or poor size estimation.
• Check the plan that is being used! Then adjust the choice of
indexes or rewrite the query/view.
– E.g. check via POSTGRES “Explain” command
– Some systems rewrite for you under the covers (e.g. DB2)
• Can be confusing and/or helpful!
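For example (PostgreSQL; Emp as in the earlier examples), the chosen plan can be inspected with:

EXPLAIN
SELECT E.dno FROM Emp E WHERE E.age > 40;
-- EXPLAIN ANALYZE additionally executes the query and reports actual
-- row counts, which helps expose bad selectivity estimates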
More Guidelines for Query Tuning
• Minimize the use of DISTINCT: don’t need it if
duplicates are acceptable, or if answer contains a
key.
• Minimize the use of GROUP BY and HAVING:
SELECT MIN (E.age)
FROM Employee E
GROUP BY E.dno
HAVING E.dno=102
SELECT MIN (E.age)
FROM Employee E
WHERE E.dno=102
• Consider the DBMS's use of indexes when writing arithmetic
expressions: E.age=2*D.age can benefit from an index on
E.age, but might not benefit from an index on D.age!
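One remedy (a sketch; whether an optimizer does this automatically varies by DBMS) is to rewrite the predicate so the column whose index we want appears alone on one side:

SELECT E.ename
FROM Emp E, Dept D
WHERE E.age = 2*D.age;    -- an index on E.age is usable; one on D.age is not

SELECT E.ename
FROM Emp E, Dept D
WHERE E.age/2 = D.age;    -- rewritten: an index on D.age is usable instead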
Guidelines for Query Tuning (Contd.)
• Avoid using intermediate
relations:
SELECT * INTO Temp
FROM Emp E, Dept D
WHERE E.dno=D.dno
AND D.mgrname=‘Joe’
SELECT T.dno, AVG(T.sal)
FROM Temp T
GROUP BY T.dno
vs.
SELECT E.dno, AVG(E.sal)
FROM Emp E, Dept D
WHERE E.dno=D.dno
AND D.mgrname=‘Joe’
GROUP BY E.dno
The second query avoids materializing the intermediate relation Temp.
Choices in Tuning the Conceptual Schema
– Consider the following schema:
• Contracts(cid: integer, supplierid: integer, projectid: integer,
deptid: integer, partid: integer, qty: integer, value: real)
• Departments(did: integer, budget: real, annualreport:
varchar)
• Parts(pid: integer, cost: integer)
• Projects(jid: integer, mgr: char(20))
• Suppliers(sid: integer, address: char(50))
Choices in Tuning the Conceptual Schema contd…
• The relation Contracts is denoted CSJDPQV.
– The meaning of a tuple in this relation: the contract
with cid C is an agreement that supplier S (with sid equal to
supplierid) will supply Q items of part P (with pid equal to
partid) to project J (with jid equal to projectid) associated
with department D (with deptid equal to did), and the
value V of this contract is equal to value.
Choices in Tuning the Conceptual Schema contd…
• There are two known integrity constraints with
respect to Contracts:
1. A project purchases a given part using a
single contract:
JP → C
2. A department purchases at most one part
from any given supplier:
SD → P
Settling for a Weaker Normal Form
• Consider the Contracts relation. What normal form is it in?
– The candidate keys for this relation are C and JP.
– The only nonkey dependency is SD → P, and P is
a prime attribute because it is part of the
candidate key JP.
– So Contracts is in 3NF.
• We can decompose it to convert it into BCNF:
– We obtain a lossless-join and dependency-preserving
decomposition into BCNF by decomposing the schema
into CJP, SDP, and CSJDQV.
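A sketch of the decomposition as DDL (types follow the schema above; the key declarations mirror JP → C, SD → P, and C → CSJDQV):

CREATE TABLE CJP (
  cid INTEGER UNIQUE,
  projectid INTEGER,
  partid INTEGER,
  PRIMARY KEY (projectid, partid)    -- enforces JP -> C
);

CREATE TABLE SDP (
  supplierid INTEGER,
  deptid INTEGER,
  partid INTEGER,
  PRIMARY KEY (supplierid, deptid)   -- enforces SD -> P
);

CREATE TABLE CSJDQV (
  cid INTEGER PRIMARY KEY,           -- C determines all attributes
  supplierid INTEGER,
  projectid INTEGER,
  deptid INTEGER,
  qty INTEGER,
  value REAL
);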
Horizontal Decompositions
• Usual Def. of decomposition: Relation is replaced by
collection of relations that are projections. Most
important case.
– We will talk about this at length as part of Conceptual DB
Design
• Sometimes, might want to replace relation by a
collection of relations that are selections.
– Each new relation has same schema as original, but subset
of rows.
– Collectively, new relations contain all rows of the original.
– Typically, the new relations are disjoint.
Horizontal Decompositions (Contd.)
• Contracts (Cid, Sid, Jid, Did, Pid, Qty, Val)
• Suppose that contracts with value > 10000 are
subject to different rules.
– So queries on Contracts will often say WHERE val>10000.
• One approach: clustered B+ tree index on the val
field.
• Second approach: replace contracts by two new
relations, LargeContracts and SmallContracts, with
the same attributes (CSJDPQV).
– Performs like index on such queries, but no index overhead.
– Can build clustered indexes on other attributes, in addition!
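As a sketch (names follow the slide; CREATE TABLE ... AS is widely supported), the replacement relations could be created and populated with:

CREATE TABLE LargeContracts AS
SELECT * FROM Contracts WHERE val > 10000;

CREATE TABLE SmallContracts AS
SELECT * FROM Contracts WHERE val <= 10000;

-- clustered indexes on other attributes can now be built on each part separately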
Masking Conceptual Schema Changes
• The horizontal decomposition from above can be
masked by a view.
– NOTE: queries with the condition val>10000 must be asked
against LargeContracts for efficiency, so some users may have to
be aware of the change.
• I.e., the users who were having performance problems.
• Arguably that's OK: they wanted a solution!
CREATE VIEW Contracts(cid, sid, jid, did, pid, qty, val)
AS SELECT *
FROM LargeContracts
UNION
SELECT *
FROM SmallContracts
Impact of Concurrency
• In a system with many concurrent users,
several additional points must be considered.
• A transaction obtains locks on the pages that
it reads or writes, and other transactions may be blocked.
• Two specific ways to reduce blocking:
– Reducing the time that transactions hold locks
– Reducing hot spots
Reducing Lock Durations
• Delay lock requests:
– Tune the transaction by writing to local program variables
and deferring changes to the database until the end of
the transaction.
• Make transactions faster:
– Tune indexing and rewrite queries.
– Carefully partition the tuples in a relation and its
associated indexes across a collection of disks.
• Replace long transactions by short ones:
– Rewrite into two or more smaller transactions.
Reducing Lock Durations contd…
• Build a warehouse:
– Complex queries, such as statistical analyses of business
trends, can hold shared locks for a long time.
– They can often run on a copy of the data that is a
little out of date.
• Consider a lower isolation level:
– In many situations, such as queries generating
aggregate information or statistical summaries,
– a lower SQL isolation level such as REPEATABLE
READ or READ COMMITTED can be used.
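A sketch of running a summary query at a lower isolation level (standard SQL, with minor variations across DBMSs):

BEGIN;
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- under lock-based implementations, this holds fewer/shorter read locks
-- than SERIALIZABLE would:
SELECT E.dno, AVG(E.sal)
FROM Emp E
GROUP BY E.dno;
COMMIT;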
Reducing Hot Spots
• Delay operations on hot spots:
– i.e., on frequently used objects.
• Optimize access patterns:
– e.g., the pattern of updates.
• Partition operations on hot spots:
– e.g., batch appends.
• Choice of index:
– In a frequently updated relation, B+ tree indexes can
become a bottleneck, because the root and upper-level
index pages become hot spots.
– Specialized locking protocols (fine-granularity
locks) help.
– An ISAM index avoids the problem, since its upper
levels are static and only the leaf pages need locking.
DBMS Benchmarking
• Includes benchmarks for measuring the
performance of a certain class of applications
(e.g., the TPC benchmarks) and
• benchmarks for measuring how well a DBMS
performs various operations (e.g., the Wisconsin
benchmark)
– Benchmarks should be portable, easy to understand,
and scale naturally to larger problem instances. They
should measure peak performance (e.g., transactions
per second, or tps) as well as price/performance ratios
(e.g., $/tps) for typical workloads in a given application
domain
DBMS Benchmarking
• The Transaction Processing Council (TPC)
was created to define benchmarks for
transaction processing and database
systems
• Well-Known DBMS Benchmarks
– The TPC-A and TPC-B benchmarks constitute the
standard definitions of the tps and $/tps measures
– TPC-A measures the performance and price of a
computer network in addition to the DBMS,
– whereas the TPC-B benchmark considers the
DBMS by itself
DBMS Benchmarking
– The TPC-C benchmark is a more complex suite of
transactional tasks than TPC-A and TPC-B.
– It models a warehouse that tracks items supplied to
customers, and involves five types of transactions.
– It is much more expensive to run than TPC-A and TPC-B,
– and exercises a much wider range of system capabilities.
– TPC-D represents a broad range of decision
support (DS) applications that require complex, long-running
queries against large, complex data structures.
DBMS Benchmarking
• The TPC Benchmark™H (TPC-H) is a decision support
benchmark.
• It consists of a suite of business-oriented ad-hoc queries and
concurrent data modifications.
• The queries and the data populating the database have been
chosen to have broad industry-wide relevance.
• This benchmark illustrates decision support systems that
examine large volumes of data, execute queries with a high
degree of complexity, and give answers to critical business
questions.
Points to Remember
• Indexes must be chosen to speed up important
queries (and perhaps some updates!).
– Index maintenance overhead on updates to key fields.
– Choose indexes that can help many queries, if possible.
– Build indexes to support index-only strategies.
– Clustering is an important decision; only one index on a
given relation can be clustered!
– Order of fields in composite index key can be important.
• Static indexes may have to be periodically re-built.
• Statistics have to be periodically updated.
Points to Remember (Contd.)
• Over time, indexes have to be fine-tuned (dropped,
created, re-clustered, ...) for performance.
– Should determine the plan used by the system, and adjust
the choice of indexes appropriately.
• System may still not find a good plan:
– Only left-deep plans?
– Null values, arithmetic conditions, string expressions, the
use of ORs, nested queries, etc. can confuse an optimizer.
• So, may have to rewrite the query/view:
– Avoid nested queries, temporary relations, complex
conditions, and operations like DISTINCT and GROUP BY.