Physical Database
Design and Tuning
R&G - Chapter 20
Contents
• Physical Database Design
• Database Workloads
• Physical Design and Tuning Decisions
• Need for Tuning
• Guidelines for Index Selection
• Clustering & Tools for Index Selection
• Database Tuning: Tuning Indexes
• Tuning the Conceptual Schema
• Tuning Queries and Views
• Impact of Concurrency
• Benchmarking
Physical Database Design
• Process of producing a description of the implementation
of the database on secondary storage.
• It describes the base relations, file organizations, and
indexes used to achieve efficient access to the data, and
any associated integrity constraints and security
measures.
Physical Database Design
• We will describe the plan for how to build the tables,
including appropriate data types, field sizes, attribute
domains, and indexes.
• The plan should have enough detail that if someone else
were to use the plan to build a database, the database they
build is the same as the one you are intending to create.
• The conceptual design and logical design were
independent of physical considerations. Now, we not only
know that we want a relational model, we have selected a
database management system (DBMS) such as Access or
Oracle, and we focus on those physical considerations.
Logical vs. Physical Design:
• Logical database design is concerned with what to
store;
• physical database design is concerned with how to
store it.
Introduction
• We will be talking at length about “database
design”
– Conceptual Schema: info to capture, tables, columns,
views, etc.
– Physical Schema: indexes, clustering, etc.
• Physical design linked tightly to query optimization
– So we’ll study this “bottom up”
– But note: DB design is usually “top-down”
• conceptual then physical. Then iterate.
• We must begin by understanding the workload:
– The most important queries and how often they arise.
– The most important updates and how often they arise.
– The desired performance for these queries and updates.
Understanding the Workload
• For each query in the workload:
– Which relations does it access?
– Which attributes are retrieved?
– Which attributes are involved in selection/join conditions?
How selective are these conditions likely to be?
• For each update in the workload:
– Which attributes are involved in selection/join conditions?
How selective are these conditions likely to be?
– The type of update (INSERT/DELETE/UPDATE), and the
attributes that are affected.
– For UPDATE commands, the fields that are modified.
Creating an ISUD Chart
Employee Table
Transaction    Frequency   % of Table   Name   Salary   Address
Payroll Run    monthly     100          S      S        S
Add Emps       daily       0.1          I      I        I
Delete Emps    daily       0.1          D      D        D
Give Raises    monthly     10           S      U        NA

(I = Insert, S = Select, U = Update, D = Delete)
Physical Design and Tuning
Decisions
• Choice of indexes to create
– Which relations to index, and on which field(s)
– What field(s) should be the search key
– Should we build several indexes?
– For each index, should it be clustered or unclustered?
• Tuning the conceptual schema
– Alternative normalization
– Denormalization
– Vertical partitioning
– Views
• Query and transaction tuning
– Frequently executed queries are rewritten to run faster.
Need for Database Tuning
• It is hard to get a detailed workload at initial design time.
• The distinction between design and tuning is somewhat arbitrary:
– The design process ends once a conceptual schema and a
set of clustering and indexing decisions are made.
– The tuning process covers all subsequent changes to the
conceptual schema or the indexes.
Index Selection
• One approach:
– Consider most important queries.
– Consider best plan using the current indexes, and see if
better plan is possible with an additional index.
– If so, create it.
• Before creating an index, must also consider the
impact on updates in the workload.
– Trade-off of slowing some updates in order to speed up
some queries.
Whether to Index (Guideline 1)
• Do not build an index unless some query (including the
query components of updates) benefits from it.
• Whenever possible, choose indexes that speed up more than
one query.
Multi-attribute Search Keys (Guideline 3)
– Consider a multi-attribute search key in two situations:
1. A WHERE clause includes conditions on
more than one attribute of a relation.
2. It enables index-only evaluation
strategies (i.e., accessing the relation can be
avoided) for important queries.
Whether to Cluster (Guideline 4)
• As a rule of thumb, range queries are likely
to benefit the most from clustering.
• If an index enables an index-only evaluation
strategy for the query it is intended to speed
up, the index need not be clustered.
Hash versus Tree Index (Guideline 5)
– A hash index is better in the following situations:
• The index is intended to support an index
nested loops join; the indexed relation is the
inner relation, and the search key includes
the join columns.
• There is a very important equality query, and
no range queries, involving the search key
attributes.
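As a hedged sketch (PostgreSQL syntax; the index name is illustrative, and the Emp table follows the examples below), a hash index supporting an index nested loops join on dno could be:

CREATE INDEX emp_dno_hidx ON Emp USING HASH (dno);
-- equality probes on E.dno during the join use the hash index;
-- a range query on dno would instead need the default B+ tree index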
Balancing the Cost of Index
Maintenance (Guideline 6)
• If maintaining an index slows down frequent
update operations, consider dropping the
index.
• Keep in mind, however, that adding an index
may well speed up a given update operation.
– E.g., an index on employee IDs could speed up the
operation of increasing the salary of an employee
(specified by ID).
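A minimal sketch of this point (assuming an Emp table with eid and sal fields; names are illustrative):

CREATE INDEX emp_eid_idx ON Emp (eid);

-- the index locates the single tuple directly instead of scanning all of Emp:
UPDATE Emp
SET sal = sal * 1.10
WHERE eid = 12345;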
Example 1
• Hash index on D.dname supports ‘Toy’ selection.
– Given this, index on D.dno is not needed.
• Hash index on E.dno allows us to get matching
(inner) Emp tuples for each selected (outer) Dept
tuple.
• What if the WHERE clause also included: “... AND E.age=25”?
– Could retrieve Emp tuples using index on E.age, then join
with Dept tuples satisfying dname selection. Comparable to
strategy that used E.dno index.
– So, if E.age index is already created, this query provides
much less motivation for adding an E.dno index.
SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE D.dname='Toy' AND E.dno=D.dno
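For concreteness, a sketch of the indexes this plan assumes (PostgreSQL syntax; index names illustrative):

CREATE INDEX dept_dname_hidx ON Dept USING HASH (dname);  -- supports the 'Toy' selection
CREATE INDEX emp_dno_hidx ON Emp USING HASH (dno);        -- probes matching Emp tuples per Dept tuple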
Example 2
• All selections are on Emp so it should be the outer
relation in any Index NL join.
– Suggests that we build a B+ tree index on D.dno.
• What index should we build on Emp?
– B+ tree on E.sal could be used, OR an index on E.hobby
could be used. Only one of these is needed, and which is
better depends upon the selectivity of the conditions.
• As a rule of thumb, equality selections are more selective than
range selections.
• As both examples indicate, our choice of indexes is
guided by the plan(s) that we expect an optimizer to
consider for a query. Have to understand optimizers!
SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE E.sal BETWEEN 10000 AND 20000
AND E.hobby='Stamps' AND E.dno=D.dno
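Again as a sketch (names illustrative), we would build the D.dno index plus only one of the two Emp indexes, whichever condition is more selective:

CREATE INDEX dept_dno_idx ON Dept (dno);      -- B+ tree for the index NL join
CREATE INDEX emp_hobby_idx ON Emp (hobby);    -- if the hobby equality is more selective
-- or: CREATE INDEX emp_sal_idx ON Emp (sal); -- if the salary range is more selective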
Clustering and Indexing
• Clustered indexes can be especially
important when accessing the inner relation
in an index nested loops join.
• Revisit the same example: should the
indexes used be clustered?
• An unclustered index on D.dname suffices,
since the selection retrieves only a few Dept tuples.
• On the other hand, Emp is the inner relation
in an index NL join, and dno is not a candidate
key, so many Emp tuples can match each dno value.
• The index on E.dno should therefore be clustered.
SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE D.dname='Toy' AND E.dno=D.dno
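Declaring a clustered index is DBMS-specific; as a sketch (names illustrative), in Microsoft SQL Server:

CREATE CLUSTERED INDEX emp_dno_cidx ON Emp (dno);

-- PostgreSQL approximates this with a one-time physical reordering:
-- CREATE INDEX emp_dno_idx ON Emp (dno);
-- CLUSTER Emp USING emp_dno_idx;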
Examples of Clustering
• B+ tree index on E.age can be
used to get qualifying tuples.
– How selective is the condition?
– Is the index clustered?
• Consider the GROUP BY query.
– If many tuples have E.age > 10,
using E.age index and sorting the
retrieved tuples may be costly.
– Clustered E.dno index may be
better!
• Equality queries and duplicates:
– Clustering on E.hobby helps!
SELECT E.dno
FROM Emp E
WHERE E.age>40
SELECT E.dno, COUNT (*)
FROM Emp E
WHERE E.age>10
GROUP BY E.dno
SELECT E.dno
FROM Emp E
WHERE E.hobby='Stamps'
Impact of Clustering
• The benefit of an unclustered index drops rapidly as the
fraction of tuples retrieved grows; beyond a small fraction
of the relation, a full scan becomes cheaper. A clustered
index remains useful over a much wider range.
Co-clustering Two Relations
• Co-clustering can speed up joins, in particular
key/foreign-key joins corresponding to 1:N
relationships.
• A sequential scan of either relation becomes
slower.
• All inserts, deletes, and updates that alter
record lengths become slower, due to the
overhead involved in maintaining the
clustering.
Index-Only Plans
• A number of queries can be answered without retrieving
any tuples from one or more of the relations involved,
if a suitable index is available.

SELECT D.mgr
FROM Dept D, Emp E
WHERE D.dno=E.dno
-- index-only on Emp with <E.dno>

SELECT D.mgr, E.eid
FROM Dept D, Emp E
WHERE D.dno=E.dno
-- index-only on Emp with <E.dno, E.eid>

SELECT E.dno, COUNT(*)
FROM Emp E
GROUP BY E.dno
-- index-only with <E.dno>

SELECT E.dno, MIN(E.sal)
FROM Emp E
GROUP BY E.dno
-- index-only with <E.dno, E.sal> (B-tree trick: within each dno group,
-- the first sal entry in the B+ tree is the minimum)

SELECT AVG(E.sal)
FROM Emp E
WHERE E.age=25 AND
E.sal BETWEEN 3000 AND 5000
-- index-only with <E.age, E.sal> or <E.sal, E.age>
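As a sketch (index name illustrative), the composite index behind the MIN query above could be created as:

CREATE INDEX emp_dno_sal_idx ON Emp (dno, sal);
-- a scan of this index alone answers the GROUP BY query: entries are sorted
-- by dno, then sal, so the first entry of each dno group is its minimum sal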
Tools to Assist in Index Selection
• First generation of such tools:
– Index tuning wizards, or
– Index advisors
• Drawback of these systems:
– They had to replicate the database query
optimizer's cost model.
• The DB2 Index Advisor
– A tool for automatic index recommendation, given a
workload.
– The workload is described in a table called ADVISE_WORKLOAD.
– It is populated either:
• with SQL statements from the DB2 dynamic SQL statement
cache (i.e., recently executed statements),
• with SQL statements statically compiled in packages, or
• with SQL statements captured by the online monitor called
Query Patroller.
• Output: SQL DDL statements whose execution creates
the recommended indexes.
Tools to Assist in Index Selection
• The Microsoft SQL Server 2000 Index Tuning
Wizard
– A tuning wizard integrated with the database query
optimizer.
– Three tuning modes let the user trade off the running
time of the analysis against the number of candidate index
configurations examined: fast, medium, and thorough,
with fast having the lowest running time and thorough
examining the largest number of configurations.
– The user can bound the maximum space allowed for indexes.
– A sampling mode can reduce the running time of the analysis.
– Table scaling lets the user specify anticipated table sizes
for the analysis.
Overview of Database Tuning
• Actual use of the DB provides a valuable source
of detailed information that can be used to
refine the initial design:
– Original assumptions are replaced by observed behavior.
– The initial workload is validated.
– Initial guesses about the size of data can be
replaced with actual statistics.
• Tuning is important for getting the best possible performance.
• Three kinds of tuning: tuning indexes, tuning
the conceptual schema, and tuning queries.
Tuning Indexes
• Queries and updates considered important at
initial design time may turn out not to be very frequent.
• The observed workload may also identify some
new queries and updates.
• The initial choice of indexes has to be reviewed
in light of this new information.
• Some original indexes may be dropped and
new ones added.
Tuning Indexes continues…
SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE D.dname='Toy' AND E.dno=D.dno
• Suppose the original plan uses an index-only scan,
with Emp as the inner relation.
• If this query takes an unexpectedly long time to
execute, consider replacing that plan by building
a clustered index on the dno field.
Tuning Indexes continues…
• In addition, we have to periodically reorganize
indexes:
– E.g., a static index (such as an ISAM index) may have
developed long overflow chains; dropping and rebuilding it,
if feasible, improves access times through this index.
– For a dynamic structure (a B+ tree): if the implementation does
not merge pages on deletes, space occupancy can
decrease considerably in some situations. This in turn
makes the size of the index (in pages) larger than
necessary, and could increase the height and therefore
the access time.
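As a sketch of the rebuild operation (PostgreSQL syntax; the index name is illustrative):

REINDEX INDEX emp_dno_idx;   -- rebuild one index, discarding dead space
REINDEX TABLE Emp;           -- or rebuild every index on the table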
Tuning the Conceptual Schema
• If the initial schema does not meet our performance
objectives for the given workload under any set of
physical design choices, we may have to redesign the
conceptual schema.
• Such a change is called schema evolution.
• Issues involved in tuning the conceptual schema:
– We may decide to settle for a 3NF design instead of BCNF.
– Between 3NF and BCNF, our choice should be guided by
the workload.
– Sometimes we might decide to further decompose a
relation that is already in BCNF.
– We might denormalize.
– We might partition the relation horizontally.
Tuning Queries and Views
• If a query runs slower than expected, check if an index needs
to be re-built, or if statistics are too old and need to be refreshed.
• Sometimes, the DBMS may not be executing the plan you had
in mind. Common areas of optimizer weakness:
– Selections involving null values (bad selectivity estimates)
– Selections involving arithmetic or string expressions (ditto)
– Selections involving OR conditions (ditto)
– Complex, correlated subqueries
– Lack of evaluation features like index-only strategies or certain join
methods or poor size estimation.
• Check the plan that is being used! Then adjust the choice of
indexes or rewrite the query/view.
– E.g. check via POSTGRES “Explain” command
– Some systems rewrite for you under the covers (e.g. DB2)
• Can be confusing and/or helpful!
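For example (PostgreSQL; Emp as in the earlier examples), the chosen plan can be inspected with:

EXPLAIN
SELECT E.dno FROM Emp E WHERE E.age > 40;
-- EXPLAIN ANALYZE additionally executes the query and reports actual
-- row counts, which helps expose bad selectivity estimates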
More Guidelines for Query Tuning
• Minimize the use of DISTINCT: don’t need it if
duplicates are acceptable, or if answer contains a
key.
• Minimize the use of GROUP BY and HAVING:
SELECT MIN (E.age)
FROM Employee E
GROUP BY E.dno
HAVING E.dno=102
SELECT MIN (E.age)
FROM Employee E
WHERE E.dno=102
• Consider the DBMS's use of indexes when writing arithmetic
expressions: E.age=2*D.age can benefit from an index on
E.age, but might not benefit from an index on D.age!
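One remedy (a sketch; whether an optimizer does this automatically varies by DBMS) is to rewrite the predicate so the column whose index we want appears alone on one side:

SELECT E.ename
FROM Emp E, Dept D
WHERE E.age = 2*D.age;    -- an index on E.age is usable; one on D.age is not

SELECT E.ename
FROM Emp E, Dept D
WHERE E.age/2 = D.age;    -- rewritten: an index on D.age is usable instead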
Guidelines for Query Tuning (Contd.)
• Avoid using intermediate
relations:
SELECT * INTO Temp
FROM Emp E, Dept D
WHERE E.dno=D.dno
AND D.mgrname=‘Joe’
SELECT T.dno, AVG(T.sal)
FROM Temp T
GROUP BY T.dno
vs.
SELECT E.dno, AVG(E.sal)
FROM Emp E, Dept D
WHERE E.dno=D.dno
AND D.mgrname=‘Joe’
GROUP BY E.dno
The second query avoids materializing the intermediate relation Temp.
Choices in Tuning the Conceptual Schema
– Consider the following schema:
• Contracts(cid: integer, supplierid: integer, projectid: integer,
deptid: integer, partid: integer, qty: integer, value: real)
• Departments(did: integer, budget: real, annualreport:
varchar)
• Parts(pid: integer, cost: integer)
• Projects(jid: integer, mgr: char(20))
• Suppliers(sid: integer, address: char(50))
Choices in Tuning the Conceptual Schema contd…
• The relation Contracts is denoted CSJDPQV.
– The meaning of a tuple in this relation: the contract
with cid C is an agreement that supplier S (with sid equal to
supplierid) will supply Q items of part P (with pid equal to
partid) to project J (with jid equal to projectid) associated
with department D (with deptid equal to did), and the
value V of this contract is equal to value.
Choices in Tuning the Conceptual Schema contd…
• There are two known integrity constraints with
respect to Contracts:
1. A project purchases a given part using a
single contract:
JP → C
2. A department purchases at most one part
from any given supplier:
SD → P
Settling for a Weaker Normal Form
• Consider the Contracts relation. What normal form is it in?
– The candidate keys for this relation are C and JP.
– The only nonkey dependency is SD → P, and P is
a prime attribute because it is part of the
candidate key JP.
– So Contracts is in 3NF.
• We can decompose it to convert it into BCNF:
– We obtain a lossless-join and dependency-preserving
decomposition into BCNF by decomposing the schema
into CJP, SDP, and CSJDQV.
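A sketch of the decomposition as DDL (types follow the schema above; the key declarations mirror JP → C, SD → P, and C → CSJDQV):

CREATE TABLE CJP (
  cid INTEGER UNIQUE,
  projectid INTEGER,
  partid INTEGER,
  PRIMARY KEY (projectid, partid)    -- enforces JP -> C
);

CREATE TABLE SDP (
  supplierid INTEGER,
  deptid INTEGER,
  partid INTEGER,
  PRIMARY KEY (supplierid, deptid)   -- enforces SD -> P
);

CREATE TABLE CSJDQV (
  cid INTEGER PRIMARY KEY,           -- C determines all attributes
  supplierid INTEGER,
  projectid INTEGER,
  deptid INTEGER,
  qty INTEGER,
  value REAL
);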
Horizontal Decompositions
• Usual Def. of decomposition: Relation is replaced by
collection of relations that are projections. Most
important case.
– We will talk about this at length as part of Conceptual DB
Design
• Sometimes, might want to replace relation by a
collection of relations that are selections.
– Each new relation has same schema as original, but subset
of rows.
– Collectively, new relations contain all rows of the original.
– Typically, the new relations are disjoint.
Horizontal Decompositions (Contd.)
• Contracts (Cid, Sid, Jid, Did, Pid, Qty, Val)
• Suppose that contracts with value > 10000 are
subject to different rules.
– So queries on Contracts will often say WHERE val>10000.
• One approach: clustered B+ tree index on the val
field.
• Second approach: replace contracts by two new
relations, LargeContracts and SmallContracts, with
the same attributes (CSJDPQV).
– Performs like index on such queries, but no index overhead.
– Can build clustered indexes on other attributes, in addition!
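As a sketch (names follow the slide; CREATE TABLE ... AS is widely supported), the replacement relations could be created and populated with:

CREATE TABLE LargeContracts AS
SELECT * FROM Contracts WHERE val > 10000;

CREATE TABLE SmallContracts AS
SELECT * FROM Contracts WHERE val <= 10000;

-- clustered indexes on other attributes can now be built on each part separately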
Masking Conceptual Schema Changes
• The horizontal decomposition from above can be
masked by a view.
– NOTE: queries with the condition val>10000 must be asked
against LargeContracts for efficiency, so some users may have to
be aware of the change.
• I.e., the users who were having performance problems.
• Arguably that's OK: they wanted a solution!
CREATE VIEW Contracts(cid, sid, jid, did, pid, qty, val)
AS SELECT *
FROM LargeContracts
UNION
SELECT *
FROM SmallContracts
Impact of Concurrency
• In a system with many concurrent users,
several additional points must be considered.
• A transaction obtains locks on the pages that
it reads or writes, and other transactions may be blocked.
• Two specific ways to reduce blocking:
– Reducing the time that transactions hold locks
– Reducing hot spots
Reducing Lock Durations
• Delay lock requests:
– Tune the transaction by writing to local program variables
and deferring changes to the database until the end of
the transaction.
• Make transactions faster:
– Tune indexing and rewrite queries.
– Carefully partition the tuples in a relation and its
associated indexes across a collection of disks.
• Replace long transactions by short ones:
– Rewrite into two or more smaller transactions.
Reducing Lock Durations contd…
• Build a warehouse:
– Complex queries, such as statistical analyses of business
trends, can hold shared locks for a long time.
– They can often run on a copy of the data that is a
little out of date.
• Consider a lower isolation level:
– In many situations, such as queries generating
aggregate information or statistical summaries,
– a lower SQL isolation level such as REPEATABLE
READ or READ COMMITTED can be used.
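A sketch of running a summary query at a lower isolation level (standard SQL, with minor variations across DBMSs):

BEGIN;
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- under lock-based implementations, this holds fewer/shorter read locks
-- than SERIALIZABLE would:
SELECT E.dno, AVG(E.sal)
FROM Emp E
GROUP BY E.dno;
COMMIT;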
Reducing Hot Spots
• Delay operations on hot spots:
– i.e., on frequently used objects.
• Optimize access patterns:
– e.g., the pattern of updates.
• Partition operations on hot spots:
– e.g., batch appends.
• Choice of index:
– In a frequently updated relation, B+ tree indexes can
become a bottleneck, because the root and upper-level
index pages become hot spots.
– Specialized locking protocols (fine-granularity
locks) help.
– An ISAM index avoids the problem, since its upper
levels are static and only the leaf pages need locking.
DBMS Benchmarking
• Includes benchmarks for measuring the
performance of a certain class of applications
(e.g., the TPC benchmarks) and
• benchmarks for measuring how well a DBMS
performs various operations (e.g., the Wisconsin
benchmark)
– Benchmarks should be portable, easy to understand,
and scale naturally to larger problem instances. They
should measure peak performance (e.g., transactions
per second, or tps) as well as price/performance ratios
(e.g., $/tps) for typical workloads in a given application
domain
DBMS Benchmarking
• The Transaction Processing Council (TPC)
was created to define benchmarks for
transaction processing and database
systems
• Well-Known DBMS Benchmarks
– The TPC-A and TPC-B benchmarks constitute the
standard definitions of the tps and $/tps measures
– TPC-A measures the performance and price of a
computer network in addition to the DBMS,
– whereas the TPC-B benchmark considers the
DBMS by itself
DBMS Benchmarking
– The TPC-C benchmark is a more complex suite of
transactional tasks than TPC-A and TPC-B.
– It models a warehouse that tracks items supplied to
customers, and involves five types of transactions.
– It is much more expensive to run than TPC-A and TPC-B,
– and exercises a much wider range of system capabilities.
– TPC-D represents a broad range of decision
support (DS) applications that require complex, long-running
queries against large, complex data structures.
DBMS Benchmarking
• The TPC Benchmark™H (TPC-H) is a decision support
benchmark.
• It consists of a suite of business-oriented ad-hoc queries and
concurrent data modifications.
• The queries and the data populating the database have been
chosen to have broad industry-wide relevance.
• This benchmark illustrates decision support systems that
examine large volumes of data, execute queries with a high
degree of complexity, and give answers to critical business
questions.
Points to Remember
• Indexes must be chosen to speed up important
queries (and perhaps some updates!).
– Index maintenance overhead on updates to key fields.
– Choose indexes that can help many queries, if possible.
– Build indexes to support index-only strategies.
– Clustering is an important decision; only one index on a
given relation can be clustered!
– Order of fields in composite index key can be important.
• Static indexes may have to be periodically re-built.
• Statistics have to be periodically updated.
Points to Remember (Contd.)
• Over time, indexes have to be fine-tuned (dropped,
created, re-clustered, ...) for performance.
– Should determine the plan used by the system, and adjust
the choice of indexes appropriately.
• System may still not find a good plan:
– Only left-deep plans?
– Null values, arithmetic conditions, string expressions, the
use of ORs, nested queries, etc. can confuse an optimizer.
• So, may have to rewrite the query/view:
– Avoid nested queries, temporary relations, complex
conditions, and operations like DISTINCT and GROUP BY.