CSCD34 - Data Management Systems - A. Vaisman 1
Overview of Storage and Indexing

Data on External Storage
• Disks: Can retrieve a random page at fixed cost.
  - But reading several consecutive pages is much cheaper than reading them in random order.
• Tapes: Can only read pages in sequence.
  - Cheaper than disks; used for archival storage.
• File organization: Method of arranging a file of records on external storage.
  - A record id (rid) is sufficient to physically locate a record.
  - Indexes are data structures that allow us to find the record ids of records with given values in the index search key fields.
• Architecture: The buffer manager stages pages from external storage into the main-memory buffer pool. File and index layers make calls to the buffer manager. Page: typically 4 Kbytes.
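
The staging role of the buffer manager can be sketched in a few lines. This is only a toy model: the `BufferPool` class, the dictionary standing in for the disk, and the least-recently-used eviction policy are all assumptions made for illustration; the slide does not say which replacement policy is used.

```python
from collections import OrderedDict

class BufferPool:
    """Toy buffer manager: caches up to `capacity` pages in memory."""

    def __init__(self, disk, capacity):
        self.disk = disk              # hypothetical "disk": page_id -> page contents
        self.capacity = capacity
        self.frames = OrderedDict()   # page_id -> contents, least recently used first
        self.disk_reads = 0

    def get_page(self, page_id):
        if page_id in self.frames:
            self.frames.move_to_end(page_id)   # mark as most recently used
            return self.frames[page_id]
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)    # evict the least recently used page (assumed policy)
        self.disk_reads += 1                   # page must be staged from "disk"
        self.frames[page_id] = self.disk[page_id]
        return self.frames[page_id]

disk = {i: f"page-{i}" for i in range(10)}
pool = BufferPool(disk, capacity=2)
for pid in [0, 1, 0, 2, 1]:
    pool.get_page(pid)
print(pool.disk_reads)
```

The file and index layers would call `get_page` and never touch the disk directly; only misses in the pool cost an I/O.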

Alternative File Organizations
Many alternatives exist, each ideal for some situations and not so good in others:
• Heap (random order) files: Suitable when the typical access is a file scan retrieving all records.
• Sorted files: Best if records must be retrieved in some order, or only a "range" of records is needed.
• Indexes: Data structures to organize records via trees or hashing.
  - Like sorted files, they speed up searches for a subset of records, based on values in certain ("search key") fields.
  - Updates are much faster than in sorted files.

Indexes
• An index on a file speeds up selections on the search key fields for the index.
  - Any subset of the fields of a relation can be the search key for an index on the relation.
  - Search key is not the same as key (a minimal set of fields that uniquely identifies a record in a relation).
• An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k.
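
As a minimal sketch of the idea, an index can be modelled as a map from a search key value k to the rids of the matching records. The sample records, the rid numbering, and the `build_index` helper are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict

# Hypothetical heap file: rid -> record (name, age, sal).
records = {
    0: ("bob", 12, 10),
    1: ("cal", 11, 80),
    2: ("joe", 12, 20),
    3: ("sue", 13, 75),
}

def build_index(records, field):
    """Map each search key value k to the list of rids of matching records."""
    index = defaultdict(list)
    for rid, record in records.items():
        index[record[field]].append(rid)
    return index

# Search key: age. It is not a key of the relation, so one value can match many records.
age_index = build_index(records, field=1)
rids = age_index[12]
matches = [records[rid] for rid in rids]
print(rids, matches)
```

Here the data entries are <key, rid-list> pairs; the selection age = 12 is answered from the index without scanning the file.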

Index Classification
• Primary vs. secondary: If the search key contains the primary key, the index is called a primary index.
  - Unique index: The search key contains a candidate key.
• Clustered vs. unclustered: If the order of data records is the same as the order of data entries, the index is called a clustered index.
  - A file can be clustered on at most one search key.
  - The cost of retrieving data records through an index varies greatly based on whether the index is clustered or not!

Index Classification (Contd.)
• Dense vs. sparse: If there is an entry in the index for each key value, the index is dense (unclustered indexes are dense). If there is an entry for each page, the index is sparse.
[Figure: a data file sorted by id (1 Brown, 2 Smith, 3 White, 4 Yu, 5 Chen, 6 Peterson, 7 Rhodes, ...), with a sparse index on id (entries 1, 5, ...) and a dense index on name (Brown, Chen, Peterson, Rhodes, Smith, White, Yu).]
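
The two lookups in the figure can be sketched as follows, using the seven records shown there. The page size of four records is an assumption inferred from the sparse entries 1 and 5; the function names are hypothetical.

```python
from bisect import bisect_right

# Data file sorted by id, split into pages (page size of 4 is an assumption).
pages = [
    [(1, "Brown"), (2, "Smith"), (3, "White"), (4, "Yu")],
    [(5, "Chen"), (6, "Peterson"), (7, "Rhodes")],
]

# Sparse index: one entry per page, holding the smallest id on that page.
sparse_index = [page[0][0] for page in pages]

# Dense index on name: one entry per record, pointing at (page, slot).
dense_index = {
    name: (p, s)
    for p, page in enumerate(pages)
    for s, (_, name) in enumerate(page)
}

def lookup_by_id(key):
    """Use the sparse index to pick the page, then search within the page."""
    p = bisect_right(sparse_index, key) - 1
    if p < 0:
        return None
    for rec_id, name in pages[p]:
        if rec_id == key:
            return name
    return None

def lookup_by_name(name):
    """Use the dense index to jump straight to the record."""
    loc = dense_index.get(name)
    if loc is None:
        return None
    p, s = loc
    return pages[p][s]

print(lookup_by_id(6), lookup_by_name("Yu"))
```

The sparse index only works because the file is sorted on id; the dense index on name needs one entry per record because the file is not sorted on name.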

Clustered vs. Unclustered Index
• To build a clustered index, first sort the heap file (with some free space on each page for future inserts).
  - Overflow pages may be needed for inserts. (Thus, the order of data records is "close to", but not identical to, the sort order.)
[Figure: two panels, CLUSTERED and UNCLUSTERED. In each, index entries in the index file direct the search for data entries, and the data entries point to data records in the data file.]
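
A rough way to see why clustering matters for range selections is to count how many distinct data pages hold the matching records. The page capacity, the key range, and the shuffled layout standing in for an unclustered file are all assumptions made for this sketch.

```python
import random

RECORDS_PER_PAGE = 4   # assumed page capacity

def pages_touched(keys_in_file_order, lo, hi):
    """Count distinct data pages holding records with lo <= key <= hi."""
    return len({
        pos // RECORDS_PER_PAGE
        for pos, k in enumerate(keys_in_file_order)
        if lo <= k <= hi
    })

keys = list(range(40))
clustered = sorted(keys)        # data records stored in search-key order

random.seed(0)
unclustered = keys[:]
random.shuffle(unclustered)     # data records stored in arbitrary order

c = pages_touched(clustered, 8, 15)
u = pages_touched(unclustered, 8, 15)
print(c, u)
```

With the clustered layout the eight matching records sit on two adjacent pages; with the arbitrary layout they are typically spread over more pages, and without buffering each matching record can cost its own page read.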

Example B+ Tree
• Good for range queries.
• Insert/delete: Find the data entry in a leaf, then change it. The parent sometimes needs to be adjusted.
• All leaves are at the same height.
[Figure: example B+ tree]
  Root: 17 (entries <= 17 to the left, entries > 17 to the right)
  Index nodes below the root: [5, 13] and [27, 30]
  Leaves, left to right: 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29* | 33* 34* 38* 39*
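
A range search on the example tree can be sketched as: descend from the root to the leaf that could hold the lower bound, then walk the leaves left to right. The tuple-based node layout is a simplification chosen for this sketch, and the rule that a key equal to a separator goes to the right child is an assumed convention (it only matters for keys equal to a separator; 17 does not occur in the example).

```python
from bisect import bisect_right

# Leaves hold data entries (the k* values); leaf i+1 is the right sibling of leaf i.
leaves = [[2, 3], [5, 7, 8], [14, 16], [22, 24], [27, 29], [33, 34, 38, 39]]

# Index nodes: (separator keys, children). A child is either another index
# node or an integer position into `leaves`.
left = ([5, 13], [0, 1, 2])
right = ([27, 30], [3, 4, 5])
root = ([17], [left, right])

def find_leaf(node, key):
    """Descend from the root to the leaf that may hold `key`."""
    while not isinstance(node, int):
        keys, children = node
        # Assumed convention: keys >= a separator go to its right child.
        node = children[bisect_right(keys, key)]
    return node

def range_search(low, high):
    """Return all data entries k with low <= k <= high."""
    result = []
    i = find_leaf(root, low)
    while i < len(leaves):
        for k in leaves[i]:
            if k > high:
                return result
            if k >= low:
                result.append(k)
        i += 1   # follow the sibling link to the next leaf
    return result

print(range_search(15, 28))
```

Only one root-to-leaf descent is needed; the rest of the range is read sequentially from the leaf level, which is why the structure suits range queries.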

Hash-Based Indexes
• Good for equality selections.
  - The index is a collection of buckets. Bucket = primary page plus zero or more overflow pages.
  - Hashing function h: h(r) = bucket in which record r belongs. h looks at the search key fields of r.
• Buckets may contain the data records or just the rids.
• Hash-based indexes are best for equality selections. They cannot support range searches.

Static Hashing
• The number of primary pages is fixed; they are allocated sequentially and never de-allocated; overflow pages are added if needed.
• h(k) mod N = bucket to which the data entry with key k belongs (N = number of buckets).
• Long overflow chains can develop and degrade performance.
  - Extendible and Linear Hashing: Dynamic techniques to fix this.
[Figure: the key is fed to h; h(key) mod N selects one of the primary bucket pages 0 .. N-1, each of which may be followed by overflow pages.]
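
The following sketch shows a static hash file with a fixed number of buckets and overflow pages chained behind each primary page. The class name, the page capacity of two entries, and the constants a = 31 and b = 7 in the linear hash function are assumptions picked to make the overflow behaviour visible, not tuned values.

```python
PAGE_CAPACITY = 2   # data entries per page (assumed, kept tiny on purpose)

class StaticHashFile:
    def __init__(self, n_buckets, a=31, b=7):
        self.n = n_buckets
        self.a, self.b = a, b   # constants of h; the values here are arbitrary
        # Each bucket is a chain of pages: the primary page plus overflow pages.
        self.buckets = [[[]] for _ in range(n_buckets)]

    def h(self, key):
        return self.a * key + self.b

    def bucket_of(self, key):
        return self.h(key) % self.n

    def insert(self, key, rid):
        chain = self.buckets[self.bucket_of(key)]
        if len(chain[-1]) >= PAGE_CAPACITY:
            chain.append([])    # allocate a new overflow page
        chain[-1].append((key, rid))

    def search(self, key):
        """Return (matching rids, number of pages read)."""
        chain = self.buckets[self.bucket_of(key)]
        rids = [rid for page in chain for k, rid in page if k == key]
        return rids, len(chain)

f = StaticHashFile(n_buckets=4)
for rid, key in enumerate([10, 14, 18, 22, 26, 3]):
    f.insert(key, rid)
print(f.search(26), f.search(3))
```

The keys 10, 14, 18, 22 and 26 all land in the same bucket, so an equality search there reads three pages instead of one. This is the overflow-chain degradation that the dynamic schemes are meant to avoid.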

Static Hashing (Contd.)
• Buckets contain data entries.
• The hash function works on the search key field of record r. It must distribute values over the range 0 ... M-1.
  - h(key) = (a * key + b) usually works well.
  - a and b are constants; a lot is known about how to tune h.

Cost Model for Our Analysis
We ignore CPU costs, for simplicity:
  - B: The number of data pages
  - R: Number of records per page
  - D: (Average) time to read or write a disk page
• Measuring the number of page I/Os ignores the gains of pre-fetching a sequence of pages; thus, even I/O cost is only approximated.
• Average-case analysis; based on several simplistic assumptions.
• Good enough to show the overall trends!

Comparing File Organizations
• Heap files (random order; insert at end of file)
• Sorted files, sorted on <age, sal>
• Clustered B+ tree file, Alternative (1), search key <age, sal>
• Heap file with unclustered B+ tree index on search key <age, sal>
• Heap file with unclustered hash index on search key <age, sal>

Operations to Compare
• Scan: Fetch all records from disk
• Equality search
• Range selection
• Insert a record
• Delete a record

Cost of Operations
                            (a) Scan       (b) Equality         (c) Range                   (d) Insert           (e) Delete
(1) Heap                    BD             0.5BD                BD                          2D                   Search + D
(2) Sorted                  BD             D log_2 B            D log_2 B + # matches       Search + BD          Search + BD
(3) Clustered               1.5BD          D log_F 1.5B         D log_F 1.5B + # matches    Search + D           Search + D
(4) Unclustered tree index  BD(R + 0.15)   D(1 + log_F 0.15B)   D log_F 0.15B + # matches   D(3 + log_F 0.15B)   Search + 2D
(5) Unclustered hash index  BD(R + 0.125)  2D                   BD                          4D                   Search + 2D
• Several assumptions underlie these (rough) estimates!
B: The number of data pages
R: Number of records per page
D: (Average) time to read or write a disk page
F: Fan-out of the tree index (number of children per index node)
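
To get a feel for the magnitudes, the scan and equality-search columns of the table can be evaluated numerically. The function names and the parameter values (1,000 pages, 100 records per page, 15 ms per I/O, fan-out F = 100) are hypothetical; F is taken to be the fan-out of the tree index.

```python
import math

def scan_cost(B, R, D):
    """I/O cost of a full scan (cost table, column a)."""
    return {
        "heap": B * D,
        "sorted": B * D,
        "clustered": 1.5 * B * D,
        "unclustered_tree": B * D * (R + 0.15),
        "unclustered_hash": B * D * (R + 0.125),
    }

def equality_search_cost(B, D, F):
    """I/O cost of an equality search (cost table, column b)."""
    return {
        "heap": 0.5 * B * D,
        "sorted": D * math.log2(B),
        "clustered": D * math.log(1.5 * B, F),
        "unclustered_tree": D * (1 + math.log(0.15 * B, F)),
        "unclustered_hash": 2 * D,
    }

# Hypothetical parameters: 1,000 pages, 100 records/page, 15 ms per I/O, fan-out 100.
B, R, D, F = 1000, 100, 0.015, 100
scans = scan_cost(B, R, D)
lookups = equality_search_cost(B, D, F)

for name in scans:
    print(f"{name:18s} scan={scans[name]:10.3f}s  equality={lookups[name]:8.4f}s")
```

Under these assumed numbers a heap file is as cheap as anything for scans but orders of magnitude slower for equality search, while scanning through an unclustered index is far more expensive than scanning the file directly.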

Choice of Indexes
• What indexes should we create?
• One approach: Consider the most important queries in turn. Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it.
  - Obviously, this implies that we must understand how a DBMS evaluates queries and creates query evaluation plans!
  - For now, we discuss simple 1-table queries.
• Before creating an index, we must also consider the impact on updates in the workload!
  - Trade-off: Indexes can make queries go faster and updates slower. They require disk space, too.

Index Selection Guidelines
• Attributes in the WHERE clause are candidates for index keys.
  - An exact match condition suggests a hash index.
  - A range query suggests a tree index.
    - Clustering is especially useful for range queries; it can also help on equality queries if there are many duplicates.
• Multi-attribute search keys should be considered when a WHERE clause contains several conditions.
• Try to choose indexes that benefit as many queries as possible. Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering.

Examples of Clustered Indexes
• A B+ tree index on E.age can be used to get qualifying tuples.
  - How selective is the condition?
  - Is the index clustered?
    SELECT E.dno
    FROM Emp E
    WHERE E.age > 40
• Consider the GROUP BY query.
  - If many tuples have E.age > 10, using the E.age index and sorting the retrieved tuples may be costly.
  - A clustered E.dno index may be better!
    SELECT E.dno, COUNT(*)
    FROM Emp E
    WHERE E.age > 10
    GROUP BY E.dno
• Equality queries and duplicates:
  - Clustering on E.hobby helps!
    SELECT E.dno
    FROM Emp E
    WHERE E.hobby = 'Stamps'

Indexes with Composite Search Keys
• Composite search keys: Search on a combination of fields.
  - Equality query: Every field value is equal to a constant value. E.g., with respect to a <sal,age> index: age=20 and sal=75.
  - Range query: Some field value is not a constant. E.g.: age=20; or age=20 and sal > 10.
• Data entries in the index are sorted by search key to support range queries.
  - Order of attributes is relevant.
[Figure: Examples of composite key indexes]
  Data records sorted by name:
    name  age  sal
    bob   12   10
    cal   11   80
    joe   12   20
    sue   13   75
  Data entries in index sorted by <age, sal>: 11,80  12,10  12,20  13,75
  Data entries in index sorted by <sal, age>: 10,12  20,12  75,13  80,11
  Data entries sorted by <age>: 11  12  12  13
  Data entries sorted by <sal>: 10  20  75  80
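
The four orderings in the figure can be reproduced by sorting the same records on different search keys. This is a small illustration using the figure's records; the variable names are arbitrary.

```python
# Records from the figure: (name, age, sal), stored sorted by name.
records = [("bob", 12, 10), ("cal", 11, 80), ("joe", 12, 20), ("sue", 13, 75)]

# Data entries for each index are the search key values in sorted order.
age_sal = sorted((age, sal) for _, age, sal in records)
sal_age = sorted((sal, age) for _, age, sal in records)
age_only = sorted(age for _, age, _ in records)
sal_only = sorted(sal for _, _, sal in records)

print(age_sal)
print(sal_age)
print(age_only)
print(sal_only)
```

The <age, sal> and <sal, age> indexes contain the same pairs but in different orders, which is why the order of attributes determines which range conditions an index can serve efficiently.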

Composite Search Keys
• To retrieve Emp records with age=30 AND sal=4000, an index on <age,sal> would be better than an index on age or an index on sal.
  - Choice of index key is orthogonal to clustering etc.
• If the condition is 20<age<30 AND 3000<sal<5000:
  - A clustered tree index on <age,sal> or <sal,age> is best.
• If the condition is age=30 AND 3000<sal<5000:
  - A clustered <age,sal> index is much better than a <sal,age> index!
• Composite indexes are larger and updated more often.
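
Why <age,sal> beats <sal,age> for age=30 AND 3000<sal<5000 can be seen by counting how many data entries each index has to examine. The sample (age, sal) pairs and the two scan helpers below are hypothetical and only meant to illustrate the effect.

```python
from bisect import bisect_left, bisect_right

# Hypothetical Emp data entries as (age, sal) pairs.
emps = [(25, 4500), (30, 2000), (30, 3500), (30, 4800),
        (30, 6000), (35, 4000), (40, 3200)]

age_sal = sorted(emps)                               # index on <age, sal>
sal_age = sorted((sal, age) for age, sal in emps)    # index on <sal, age>

def scan_age_sal(age, sal_lo, sal_hi):
    """age = const AND sal_lo < sal < sal_hi using the <age, sal> index."""
    start = bisect_right(age_sal, (age, sal_lo))
    end = bisect_left(age_sal, (age, sal_hi))
    return age_sal[start:end], end - start           # matches, entries examined

def scan_sal_age(age, sal_lo, sal_hi):
    """Same condition using the <sal, age> index: scan the sal range, filter on age."""
    start = bisect_right(sal_age, (sal_lo, float("inf")))
    end = bisect_left(sal_age, (sal_hi, float("-inf")))
    examined = sal_age[start:end]
    return [(a, s) for s, a in examined if a == age], len(examined)

print(scan_age_sal(30, 3000, 5000))
print(scan_sal_age(30, 3000, 5000))
```

Both indexes return the same two matches, but on <age,sal> the qualifying entries are contiguous, whereas on <sal,age> every entry in the salary range has to be examined and filtered on age.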

Summary (Contd.)
• Data entries can be actual data records, <key, rid> pairs, or <key, rid-list> pairs.
  - This choice is orthogonal to the indexing technique used to locate data entries with a given key value.
• There can be several indexes on a given file of data records, each with a different search key.
• Indexes can be classified as clustered vs. unclustered, primary vs. secondary, and dense vs. sparse. The differences have important consequences for utility/performance.

Summary (Contd.)
• Understanding the nature of the workload for the application, and the performance goals, is essential to developing a good design.
  - What are the important queries and updates? What attributes/relations are involved?
• Indexes must be chosen to speed up important queries (and perhaps some updates!).
  - Index maintenance adds overhead on updates to key fields.
  - Choose indexes that can help many queries, if possible.
  - Build indexes to support index-only strategies.
  - Clustering is an important decision; only one index on a given relation can be clustered!
  - Order of fields in a composite index key can be important.
More Related Content

What's hot

Database Performance
Database PerformanceDatabase Performance
Database Performance
Boris Hristov
 
File Structures(Part 2)
File Structures(Part 2)File Structures(Part 2)
File Structures(Part 2)
SURBHI SAROHA
 
Lecture12 abap on line
Lecture12 abap on lineLecture12 abap on line
Lecture12 abap on lineMilind Patil
 
Analysing Large Citation Network
Analysing Large Citation NetworkAnalysing Large Citation Network
Analysing Large Citation Network
Milad Alshomary
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
Revolution Analytics
 
Unit 3
Unit 3Unit 3
Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7
Rohit Agrawal
 
Indexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .netIndexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .net
Stephen Lorello
 
Pa2 session 2
Pa2 session 2Pa2 session 2
Pa2 session 2
aiclub_slides
 
Introduction to PIG
Introduction to PIG Introduction to PIG
Introduction to PIG
Shanmathy Prabakaran
 
Analytics and Access to the UK web archive
Analytics and Access to the UK web archiveAnalytics and Access to the UK web archive
Analytics and Access to the UK web archiveLewis Crawford
 
Hive and HiveQL - Module6
Hive and HiveQL - Module6Hive and HiveQL - Module6
Hive and HiveQL - Module6
Rohit Agrawal
 
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...
Ahmed Saleh
 
Clustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache SparkClustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache Spark
Thamme Gowda
 
IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
IEEE IRI 16 - Clustering Web Pages based on Structure and Style SimilarityIEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
Thamme Gowda
 
Spark SQL Bucketing at Facebook
 Spark SQL Bucketing at Facebook Spark SQL Bucketing at Facebook
Spark SQL Bucketing at Facebook
Databricks
 

What's hot (20)

Database Performance
Database PerformanceDatabase Performance
Database Performance
 
File Structures(Part 2)
File Structures(Part 2)File Structures(Part 2)
File Structures(Part 2)
 
Lecture12 abap on line
Lecture12 abap on lineLecture12 abap on line
Lecture12 abap on line
 
Ardbms
ArdbmsArdbms
Ardbms
 
Analysing Large Citation Network
Analysing Large Citation NetworkAnalysing Large Citation Network
Analysing Large Citation Network
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
 
Unit 3
Unit 3Unit 3
Unit 3
 
Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7
 
Pandas
PandasPandas
Pandas
 
Indexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .netIndexing, searching, and aggregation with redi search and .net
Indexing, searching, and aggregation with redi search and .net
 
03 preprocessing
03 preprocessing03 preprocessing
03 preprocessing
 
Pa2 session 2
Pa2 session 2Pa2 session 2
Pa2 session 2
 
Preprocess
PreprocessPreprocess
Preprocess
 
Introduction to PIG
Introduction to PIG Introduction to PIG
Introduction to PIG
 
Analytics and Access to the UK web archive
Analytics and Access to the UK web archiveAnalytics and Access to the UK web archive
Analytics and Access to the UK web archive
 
Hive and HiveQL - Module6
Hive and HiveQL - Module6Hive and HiveQL - Module6
Hive and HiveQL - Module6
 
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...
 
Clustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache SparkClustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache Spark
 
IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
IEEE IRI 16 - Clustering Web Pages based on Structure and Style SimilarityIEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
IEEE IRI 16 - Clustering Web Pages based on Structure and Style Similarity
 
Spark SQL Bucketing at Facebook
 Spark SQL Bucketing at Facebook Spark SQL Bucketing at Facebook
Spark SQL Bucketing at Facebook
 

Similar to Queryproc2

Unit 4 data storage and querying
Unit 4   data storage and queryingUnit 4   data storage and querying
Unit 4 data storage and querying
Ravindran Kannan
 
Indexing and hashing
Indexing and hashingIndexing and hashing
Indexing and hashing
Abdul mannan Karim
 
Cs437 lecture 14_15
Cs437 lecture 14_15Cs437 lecture 14_15
Cs437 lecture 14_15
Aneeb_Khawar
 
DMBS Indexes.pptx
DMBS Indexes.pptxDMBS Indexes.pptx
DMBS Indexes.pptx
husainsadikarvy
 
Dbms relational model
Dbms relational modelDbms relational model
Dbms relational model
Chirag vasava
 
dbms-file-organization.jvhvhvhvhvyvyctctctct
dbms-file-organization.jvhvhvhvhvyvyctctctctdbms-file-organization.jvhvhvhvhvyvyctctctct
dbms-file-organization.jvhvhvhvhvyvyctctctct
vk5985399
 
dbms-file-organization.pdf for managment
dbms-file-organization.pdf for managmentdbms-file-organization.pdf for managment
dbms-file-organization.pdf for managment
abhaysonone0
 
Data storage and indexing
Data storage and indexingData storage and indexing
Data storage and indexing
pradeepa velmurugan
 
Modern Database Systems - Lecture 01
Modern Database Systems - Lecture 01Modern Database Systems - Lecture 01
Modern Database Systems - Lecture 01
Michael Mathioudakis
 
Query-porcessing-& Query optimization
Query-porcessing-& Query optimizationQuery-porcessing-& Query optimization
Query-porcessing-& Query optimization
Saranya Natarajan
 
Annotating search results from web databases-IEEE Transaction Paper 2013
Annotating search results from web databases-IEEE Transaction Paper 2013Annotating search results from web databases-IEEE Transaction Paper 2013
Annotating search results from web databases-IEEE Transaction Paper 2013
Yadhu Kiran
 
Algorithm ch13.ppt
Algorithm ch13.pptAlgorithm ch13.ppt
Algorithm ch13.ppt
Dreamless2
 
indexing and hashing
indexing and hashingindexing and hashing
indexing and hashing
University of Potsdam
 
Data indexing presentation
Data indexing presentationData indexing presentation
Data indexing presentation
gmbmanikandan
 
Overview of Storage and Indexing ...
Overview of Storage and Indexing                                             ...Overview of Storage and Indexing                                             ...
Overview of Storage and Indexing ...
Javed Khan
 
Intro to Data warehousing lecture 11
Intro to Data warehousing   lecture 11Intro to Data warehousing   lecture 11
Intro to Data warehousing lecture 11
AnwarrChaudary
 

Similar to Queryproc2 (20)

Unit08 dbms
Unit08 dbmsUnit08 dbms
Unit08 dbms
 
Unit 4 data storage and querying
Unit 4   data storage and queryingUnit 4   data storage and querying
Unit 4 data storage and querying
 
Indexing and hashing
Indexing and hashingIndexing and hashing
Indexing and hashing
 
Unit 08 dbms
Unit 08 dbmsUnit 08 dbms
Unit 08 dbms
 
Cs437 lecture 14_15
Cs437 lecture 14_15Cs437 lecture 14_15
Cs437 lecture 14_15
 
DMBS Indexes.pptx
DMBS Indexes.pptxDMBS Indexes.pptx
DMBS Indexes.pptx
 
Dbms relational model
Dbms relational modelDbms relational model
Dbms relational model
 
dbms-file-organization.jvhvhvhvhvyvyctctctct
dbms-file-organization.jvhvhvhvhvyvyctctctctdbms-file-organization.jvhvhvhvhvyvyctctctct
dbms-file-organization.jvhvhvhvhvyvyctctctct
 
dbms-file-organization.pdf for managment
dbms-file-organization.pdf for managmentdbms-file-organization.pdf for managment
dbms-file-organization.pdf for managment
 
Data storage and indexing
Data storage and indexingData storage and indexing
Data storage and indexing
 
Storage struct
Storage structStorage struct
Storage struct
 
Modern Database Systems - Lecture 01
Modern Database Systems - Lecture 01Modern Database Systems - Lecture 01
Modern Database Systems - Lecture 01
 
Query-porcessing-& Query optimization
Query-porcessing-& Query optimizationQuery-porcessing-& Query optimization
Query-porcessing-& Query optimization
 
Annotating search results from web databases-IEEE Transaction Paper 2013
Annotating search results from web databases-IEEE Transaction Paper 2013Annotating search results from web databases-IEEE Transaction Paper 2013
Annotating search results from web databases-IEEE Transaction Paper 2013
 
Introduction to DataMining
Introduction to DataMiningIntroduction to DataMining
Introduction to DataMining
 
Algorithm ch13.ppt
Algorithm ch13.pptAlgorithm ch13.ppt
Algorithm ch13.ppt
 
indexing and hashing
indexing and hashingindexing and hashing
indexing and hashing
 
Data indexing presentation
Data indexing presentationData indexing presentation
Data indexing presentation
 
Overview of Storage and Indexing ...
Overview of Storage and Indexing                                             ...Overview of Storage and Indexing                                             ...
Overview of Storage and Indexing ...
 
Intro to Data warehousing lecture 11
Intro to Data warehousing   lecture 11Intro to Data warehousing   lecture 11
Intro to Data warehousing lecture 11
 

Recently uploaded

J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
Divya Somashekar
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.
PrashantGoswami42
 
Halogenation process of chemical process industries
Halogenation process of chemical process industriesHalogenation process of chemical process industries
Halogenation process of chemical process industries
MuhammadTufail242431
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
Pipe Restoration Solutions
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
Pratik Pawar
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Dr.Costas Sachpazis
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
Robbie Edward Sayers
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
bakpo1
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
Kamal Acharya
 
Democratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek AryaDemocratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek Arya
abh.arya
 
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfCOLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
Kamal Acharya
 

Recently uploaded (20)

J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.
 
Halogenation process of chemical process industries
Halogenation process of chemical process industriesHalogenation process of chemical process industries
Halogenation process of chemical process industries
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
 
Democratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek AryaDemocratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek Arya
 
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfCOLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
 

Queryproc2

  • 1. CSCD34 - Data Management Systems - A. Vaisman 1 Overview of Storage and Indexing
  • 2. CSCD34 - Data Management Systems - A. Vaisman 2 Data on External Storage  Disks: Can retrieve random page at fixed cost  But reading several consecutive pages is much cheaper than reading them in random order  Tapes: Can only read pages in sequence  Cheaper than disks; used for archival storage  File organization: Method of arranging a file of records on external storage.  Record id (rid) is sufficient to physically locate record  Indexes are data structures that allow us to find the record ids of records with given values in index search key fields  Architecture: Buffer manager stages pages from external storage to main memory buffer pool. File and index layers make calls to the buffer manager. Page: typically 4 Kbytes.
  • 3. CSCD34 - Data Management Systems - A. Vaisman 3 Alternative File Organizations Many alternatives exist, each ideal for some situations, and not so good in others:  Heap (random order) files: Suitable when typical access is a file scan retrieving all records.  Sorted Files: Best if records must be retrieved in some order, or only a `range’ of records is needed.  Indexes: Data structures to organize records via trees or hashing. • Like sorted files, they speed up searches for a subset of records, based on values in certain (“search key”) fields • Updates are much faster than in sorted files.
  • 4. CSCD34 - Data Management Systems - A. Vaisman 4 Indexes  An index on a file speeds up selections on the search key fields for the index.  Any subset of the fields of a relation can be the search key for an index on the relation.  Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation).  An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k.
  • 5. CSCD34 - Data Management Systems - A. Vaisman 5 Index Classification  Primary vs. secondary: If search key contains primary key, then called primary index.  Unique index: Search key contains a candidate key.  Clustered vs. unclustered: If order of data records is the same as order of data entries, then called clustered index.  A file can be clustered on at most one search key.  Cost of retrieving data records through index varies greatly based on whether index is clustered or not!
  • 6. CSCD34 - Data Management Systems - A. Vaisman 6 Index Classification  Dense vs Sparse: If there is an entry in the index for each key value -> dense index (unclustered indices are dense). If there is an entry for each page -> sparse index. 1 5 .. .. 1 Brown .. 2 Smith.. 3 White .. 4 Yu .. 6 Peterson.. 7 Rhodes.. 5 Chen .. ……….. Brown Chen Peterson Rhodes Smith Yu White
  • 7. Clustered vs. Unclustered Index
      • To build a clustered index, first sort the heap file (with some free space on each page for future inserts).
      • Overflow pages may be needed for inserts. (Thus, the order of data records is 'close to', but not identical to, the sort order.)
      [Figure: CLUSTERED vs. UNCLUSTERED. In both, index entries (index file) direct the search for data entries, and data entries point to data records (data file).]
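To make the cost difference concrete, the following sketch counts how many distinct data pages a range selection touches when the file is clustered on the search key versus when records are placed in random order. All numbers (`R`, `N`, the matching range) and all names are assumptions chosen for illustration, not values from the slides.

```python
import random

def pages_touched(record_pages):
    """Number of distinct data pages needed to fetch the given records."""
    return len(set(record_pages))

R = 10     # records per page (assumed)
N = 1000   # total records (assumed)
keys = list(range(N))

# Clustered: the record with key k sits at position k in the file.
clustered_page = {k: k // R for k in keys}

# Unclustered: records sit in random order, unrelated to the key.
rng = random.Random(0)
shuffled = keys[:]
rng.shuffle(shuffled)
unclustered_page = {k: pos // R for pos, k in enumerate(shuffled)}

# Range selection that matches 100 records.
matching = range(200, 300)
clustered_io = pages_touched(clustered_page[k] for k in matching)
unclustered_io = pages_touched(unclustered_page[k] for k in matching)
```

With the clustered layout the 100 matching records occupy 10 adjacent pages; with the random layout they are spread over many more. Counting distinct pages is itself optimistic for the unclustered case: if a page is evicted from the buffer pool between visits, the cost approaches one I/O per matching record.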
  • 8. Example B+ Tree
      • Good for range queries.
      • Insert/delete: Find the data entry in a leaf, then change it. Need to adjust the parent sometimes. All leaves are at the same height.
      [Figure: example B+ tree. The root holds the key 17; entries <= 17 are in the left subtree and entries > 17 in the right subtree. Leaf pages hold data entries such as 2*, 3*, 5*, 7*, 8*, 14*, 16*, 22*, 24*, 27*, 29*, 33*, 34*, 38*, 39*.]
  • 9. Hash-Based Indexes
      • Good for equality selections.
          - Index is a collection of buckets. Bucket = primary page plus zero or more overflow pages.
          - Hashing function h: h(r) = bucket in which record r belongs. h looks at the search key fields of r.
      • Buckets may contain the data records or just the rids.
      • Hash-based indexes are best for equality selections. They cannot support range searches.
  • 10. Static Hashing
      • # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed.
      • h(k) mod N = bucket to which the data entry with key k belongs. (N = # of buckets)
      • Long overflow chains can develop and degrade performance.
      • Extendible and Linear Hashing: Dynamic techniques to fix this.
      [Figure: a key is passed through h; h(key) mod N selects one of the primary bucket pages 0 ... N-1, each of which may have a chain of overflow pages.]
  • 11. Static Hashing (Contd.)
      • Buckets contain data entries.
      • Hash function works on the search key field of record r. Must distribute values over range 0 ... M-1.
      • h(key) = (a * key + b) usually works well.
      • a and b are constants; lots known about how to tune h.
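The static hashing scheme described on the last two slides can be sketched as follows. This is a minimal in-memory illustration, not a disk-based implementation: the class name `StaticHashFile`, the page capacity, and the constants `a = 31` and `b = 7` are assumptions made for the example.

```python
class StaticHashFile:
    """Static hashing sketch: N primary pages, overflow pages chained on demand."""

    def __init__(self, num_buckets, page_capacity, a=31, b=7):
        self.n = num_buckets
        self.capacity = page_capacity
        self.a, self.b = a, b  # assumed constants for h(key) = a * key + b
        # Each bucket is a chain of pages; the first page is the primary page.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def _bucket(self, key):
        # h(key) mod N picks the bucket, as on the slide.
        return (self.a * key + self.b) % self.n

    def insert(self, key, rid):
        chain = self.buckets[self._bucket(key)]
        if len(chain[-1]) >= self.capacity:
            chain.append([])  # last page is full: allocate an overflow page
        chain[-1].append((key, rid))

    def search(self, key):
        """Return (matching rids, number of pages read in the bucket's chain)."""
        chain = self.buckets[self._bucket(key)]
        rids = [rid for page in chain for k, rid in page if k == key]
        return rids, len(chain)


f = StaticHashFile(num_buckets=4, page_capacity=2)
for k in range(20):
    f.insert(k, f"rid{k}")
rids, pages_read = f.search(5)
```

Because the number of primary pages never changes, inserting 20 entries into 4 buckets of capacity 2 forces every bucket to grow an overflow chain, and each equality search must read the whole chain. That is the degradation extendible and linear hashing are meant to avoid.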
  • 12. Cost Model for Our Analysis
      We ignore CPU costs, for simplicity:
      • B: The number of data pages
      • R: Number of records per page
      • D: (Average) time to read or write a disk page
      • Measuring the number of page I/Os ignores gains of pre-fetching a sequence of pages; thus, even I/O cost is only approximated.
      • Average-case analysis; based on several simplistic assumptions.
      • Good enough to show the overall trends!
  • 13. Comparing File Organizations
      • Heap files (random order; insert at eof)
      • Sorted files, sorted on <age, sal>
      • Clustered B+ tree file, Alternative (1), search key <age, sal>
      • Heap file with unclustered B+ tree index on search key <age, sal>
      • Heap file with unclustered hash index on search key <age, sal>
  • 14. Operations to Compare
      • Scan: Fetch all records from disk
      • Equality search
      • Range selection
      • Insert a record
      • Delete a record
  • 15. Cost of Operations
      File organization         | (a) Scan     | (b) Equality          | (c) Range                   | (d) Insert           | (e) Delete
      (1) Heap                  | BD           | 0.5BD                 | BD                          | 2D                   | Search + D
      (2) Sorted                | BD           | D log_2 B             | D log_2 B + # matches       | Search + BD          | Search + BD
      (3) Clustered             | 1.5BD        | D log_F 1.5B          | D log_F 1.5B + # matches    | Search + D           | Search + D
      (4) Unclustered tree index| BD(R + 0.15) | D(1 + log_F 0.15B)    | D log_F 0.15B + # matches   | D(3 + log_F 0.15B)   | Search + 2D
      (5) Unclustered hash index| BD(R + 0.125)| 2D                    | BD                          | 4D                   | Search + 2D
      • Several assumptions underlie these (rough) estimates!
      B: The number of data pages
      R: Number of records per page
      D: (Average) time to read or write a disk page
      F: Fan-out of the tree index
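To get a feel for the magnitudes in the table, the scan and equality-search formulas can be evaluated for one set of parameters. The values below (B = 10,000 pages, R = 100 records per page, D = 15 ms, F = 100) are assumptions picked for illustration, and `costs` is a hypothetical helper, not part of the course material.

```python
import math

def costs(B, R, D, F):
    """Scan and equality-search cost (in the time unit of D) per file organization."""
    def log_f(x):
        return math.log(x, F)  # logarithm to the base of the tree fan-out

    return {
        "heap":             {"scan": B * D,               "equality": 0.5 * B * D},
        "sorted":           {"scan": B * D,               "equality": D * math.log2(B)},
        "clustered":        {"scan": 1.5 * B * D,         "equality": D * log_f(1.5 * B)},
        "unclustered_tree": {"scan": B * D * (R + 0.15),  "equality": D * (1 + log_f(0.15 * B))},
        "unclustered_hash": {"scan": B * D * (R + 0.125), "equality": 2 * D},
    }

# Assumed parameters: 10,000 pages, 100 records/page, 15 ms per I/O, fan-out 100.
c = costs(B=10_000, R=100, D=0.015, F=100)
for name, row in c.items():
    print(f"{name:17s} scan={row['scan']:10.2f}s  equality={row['equality']:.4f}s")
```

Under these assumptions every index makes equality search orders of magnitude cheaper than a heap scan, while scanning the whole file through an unclustered index is far more expensive than scanning the file directly, since it costs roughly one I/O per record.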
  • 16. Choice of Indexes
      • What indexes should we create?
      • One approach: Consider the most important queries in turn. Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it.
      • Obviously, this implies that we must understand how a DBMS evaluates queries and creates query evaluation plans!
      • For now, we discuss simple 1-table queries.
      • Before creating an index, we must also consider the impact on updates in the workload!
      • Trade-off: Indexes can make queries go faster, updates slower. They require disk space, too.
  • 17. Index Selection Guidelines
      • Attributes in the WHERE clause are candidates for index keys.
      • Exact match condition suggests a hash index.
      • Range query suggests a tree index.
          - Clustering is especially useful for range queries; it can also help on equality queries if there are many duplicates.
      • Multi-attribute search keys should be considered when a WHERE clause contains several conditions.
      • Try to choose indexes that benefit as many queries as possible. Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering.
  • 18. Examples of Clustered Indexes
      • B+ tree index on E.age can be used to get qualifying tuples.
      • How selective is the condition?
      • Is the index clustered?
      • Consider the GROUP BY query.
      • If many tuples have E.age > 10, using the E.age index and sorting the retrieved tuples may be costly.
      • Clustered E.dno index may be better!
      • Equality queries and duplicates:
      • Clustering on E.hobby helps!
      Queries shown on the slide:
          SELECT E.dno FROM Emp E WHERE E.age > 40
          SELECT E.dno, COUNT(*) FROM Emp E WHERE E.age > 10 GROUP BY E.dno
          SELECT E.dno FROM Emp E WHERE E.hobby = 'Stamps'
  • 19. Indexes with Composite Search Keys
      • Composite search keys: Search on a combination of fields.
      • Equality query: Every field value is equal to a constant value. E.g. wrt <sal,age> index:
          - age=20 and sal=75
      • Range query: Some field value is not a constant. E.g.:
          - age=20; or age=20 and sal > 10
      • Data entries in the index are sorted by search key to support range queries.
      • Order of attributes is relevant.
      [Figure: Examples of composite key indexes]
      Data records sorted by name (name, age, sal): bob 12 10 | cal 11 80 | joe 12 20 | sue 13 75
      Data entries in index sorted by <age, sal>: 11,80 | 12,10 | 12,20 | 13,75
      Data entries in index sorted by <sal, age>: 10,12 | 20,12 | 75,13 | 80,11
      Data entries sorted by <age>: 11 | 12 | 12 | 13
      Data entries sorted by <sal>: 10 | 20 | 75 | 80
  • 20. Composite Search Keys
      • To retrieve Emp records with age=30 AND sal=4000, an index on <age,sal> would be better than an index on age or an index on sal.
      • Choice of index key is orthogonal to clustering etc.
      • If the condition is 20<age<30 AND 3000<sal<5000:
      • Clustered tree index on <age,sal> or <sal,age> is best.
      • If the condition is age=30 AND 3000<sal<5000:
      • Clustered <age,sal> index is much better than <sal,age> index!
      • Composite indexes are larger, updated more often.
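Why field order matters for the condition age=30 AND 3000<sal<5000 can be seen by treating each composite index as a sorted list of tuples. The sketch below uses made-up data (ages 25 to 35, salaries 2000 to 6000 in steps of 500) and hypothetical function names; sorted Python lists stand in for the leaf level of a tree index.

```python
from bisect import bisect_left, bisect_right

# Assumed data: one (age, sal) record for each combination below.
records = [(age, sal) for age in range(25, 36) for sal in range(2000, 6001, 500)]

age_sal = sorted(records)                             # entry order of an <age, sal> index
sal_age = sorted((sal, age) for age, sal in records)  # entry order of a <sal, age> index

def search_age_sal(age, lo, hi):
    """age = `age` AND lo < sal < hi: matches form one contiguous run of entries."""
    start = bisect_right(age_sal, (age, lo))
    end = bisect_left(age_sal, (age, hi))
    run = age_sal[start:end]
    return run, len(run)  # every entry examined qualifies

def search_sal_age(age, lo, hi):
    """Same condition on <sal, age>: scan the whole sal range, then filter on age."""
    start = bisect_right(sal_age, (lo, float("inf")))
    end = bisect_left(sal_age, (hi, float("-inf")))
    run = sal_age[start:end]
    matches = [(a, s) for s, a in run if a == age]
    return matches, len(run)  # many examined entries are discarded

m1, examined1 = search_age_sal(30, 3000, 5000)
m2, examined2 = search_sal_age(30, 3000, 5000)
```

Both searches return the same three records, but the <age, sal> ordering examines only those three entries, while the <sal, age> ordering walks every entry in the salary range and discards most of them.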
  • 21. Summary (Contd.)
      • Data entries can be actual data records, <key, rid> pairs, or <key, rid-list> pairs.
      • Choice is orthogonal to the indexing technique used to locate data entries with a given key value.
      • Can have several indexes on a given file of data records, each with a different search key.
      • Indexes can be classified as clustered vs. unclustered, primary vs. secondary, and dense vs. sparse. Differences have important consequences for utility/performance.
  • 22. Summary (Contd.)
      • Understanding the nature of the workload for the application, and the performance goals, is essential to developing a good design.
      • What are the important queries and updates? What attributes/relations are involved?
      • Indexes must be chosen to speed up important queries (and perhaps some updates!).
      • Index maintenance overhead on updates to key fields.
      • Choose indexes that can help many queries, if possible.
      • Build indexes to support index-only strategies.
      • Clustering is an important decision; only one index on a given relation can be clustered!
      • Order of fields in a composite index key can be important.
