SlideShare a Scribd company logo
1 of 58
Download to read offline
I N D E X M A N A G E M E N TI N D E P T H
A n d re a G i u l i a n o
@ b i t _ s h a r k
D ATA B A S E M A N A G E M E N T S Y S T E M S
T H E A R C H I T E C T U R E
D A TA B A S E M A N A G E M E N T S Y S T E M S
web forms Application Front Ends SQL Interface
SQL COMMANDS
Plan Executor Parser
OptimizerOperator Evaluator
Access File Manager
SQL Engine
Buffer Manager
Disk Manager
Transaction Manager
Lock Manager
Recovery
Manager
Concurrency Control
DBMS
S Q L E N G I N E
A R C H I T E C T U R E O F A D B M S
web forms Application Front Ends SQL Interface
SQL COMMANDS
Plan Executor Parser
OptimizerOperator Evaluator
Access File Manager
SQL Engine
Buffer Manager
Disk Manager
Transaction Manager
Lock Manager
Recovery
Manager
Concurrency Control
DBMS
A C C E S S F I L E M A N A G E R
A R C H I T E C T U R E O F A D B M S
web forms Application Front Ends SQL Interface
SQL COMMANDS
Plan Executor Parser
OptimizerOperator Evaluator
Access File Manager
SQL Engine
Buffer Manager
Disk Manager
Transaction Manager
Lock Manager
Recovery
Manager
Concurrency Control
DBMS
B U F F E R M A N A G E R
A R C H I T E C T U R E O F A D B M S
web forms Application Front Ends SQL Interface
SQL COMMANDS
Plan Executor Parser
OptimizerOperator Evaluator
Access File Manager
SQL Engine
Buffer Manager
Disk Manager
Transaction Manager
Lock Manager
Recovery
Manager
Concurrency Control
DBMS
D I S K M A N A G E R
A R C H I T E C T U R E O F A D B M S
web forms Application Front Ends SQL Interface
SQL COMMANDS
Plan Executor Parser
OptimizerOperator Evaluator
Access File Manager
SQL Engine
Buffer Manager
Disk Manager
Transaction Manager
Lock Manager
Recovery
Manager
Concurrency Control
DBMS
U N I T O F I N F O R M AT I O N
Page
size ranges from 2Kb to 64Kb
I/O of pages dominates the cost of the operations
F I L E O R G A N I Z AT I O N
method of arranging the records in a file
–Sars, Roebuck, and Co, Consumers’ Guide, 1897
“If you don’t find it in the index, look very carefully
through the entire catalog.”
I N D E X E S
organize data records on disk to optimize retrieval operations
I N D E X E S
data entry
record stored in an index file
data record
record stored in a database file
P R O P E R T I E S O F A N I N D E X
1. Structure of data entries
2. Clustered/unclustered
3. Primary/secondary
4. Dense/sparse
5. Organization of the index
S T R U C T U R E O F D ATA E N T R I E S
k*: data entry whose search key value is k
1. k* is a data record
-extreme case
2. k* is a pair (k, rid)
-the index file is independent from the data file
3. k* is a pair <k, rid-list>
-the index file is independent from the data file
-better use of space but variable-length data entries
P R O P E R T I E S O F A N I N D E X
C L U S T E R E D / U N C L U S T E R E D I N D E X
unclustered clustered
data record
file
index file
data records
data
entries
data
entries
index entries
data records
P R O P E R T I E S O F A N I N D E X
P E R F O R M A N C E
data record
file
index file
data records
data
entries
Clustered
only few pages
have to be retrieved
Unclustered
as many data page I/Os as
the number of data entries
C L U S T E R E D / U N C L U S T E R E D I N D E X
P R I M A RY A N D S E C O N D A RY I N D E X
P R O P E R T I E S O F A N I N D E X
10
20
30
40
10
20
30
40
page pointers
page in index file
page in data file
P R O P E R T I E S O F A N I N D E X
D E N S E I N D E X
10
30
50
70
10
20
30
40
50
60
70
80
page pointers
page in index file
page in data file
P R O P E R T I E S O F A N I N D E X
S PA R S E I N D E X
P R O P E R T I E S O F A N I N D E X
O R G A N I Z AT I O N O F T H E I N D E X
sorted index
the index is a sorted file
tree-based index
the index is a tree
hash-based index
the index is a function from search key values
to record addresses
I S A M A N D B +
- T R E E
T R E E - B A S E D I N D E X
ISAM
used when the relation is static: no insertion
and deletion on the tree
b+-tree
effective in dynamic situations with insertion
and deletion
Data Pages
Index Pages
Overflow Pages
Page Allocation in ISAM
…
…
Non-leaf
pages
Leaf
pages
Primary pages
Overflow page
I S A M
T R E E - B A S E D I N D E X
I S A M : O P E R AT I O N S
T R E E - B A S E D I N D E X
search
identical to b+-tree (more on this soon).
insertion
find the right position on the tree and write the
key (possible overflow pages).
deletion
remove the entry and the empty overflow page if
needed. Leave the empty primary page as it is.
B +
- T R E E
T R E E - B A S E D I N D E X
Index entries
Data entries
Index file
12 78
3 9 19 56 86 94
33 44
Daniel, 22, 6003
Ashby, 25, 3000
Basu, 33, 4003 Rossi, 44, 3000
Bianchi, 50, 5004
B +
- T R E E
T R E E - B A S E D I N D E X
… … … … … … ……
12 78
3 9 19 56 86 94
33 44
Daniel, 22, 6003
Ashby, 25, 3000
Basu, 33, 4003 Rossi, 44, 3000
Bianchi, 50, 5004
… … … … … … ……
B +
- T R E E : S E A R C H
T R E E - B A S E D I N D E X
Start search
B +
- T R E E : S E A R C H
T R E E - B A S E D I N D E X
12 78
3 9 19 56 86 94
33 44
Daniel, 22, 6003
Ashby, 25, 3000
Basu, 33, 4003 Rossi, 44, 3000
Bianchi, 50, 5004
… … … … … … ……
Start search
A
B
L1 L2 L3
find all data entries with
24 < key < 50
12 78
3 9 19 56
33 44
Daniel, 22, 6003
Ashby, 25, 3000
Basu, 33, 4003 Rossi, 44, 3000
Bianchi, 50, 5004
… … … …
…
…
Start search
A
B
L1 L2 L3
S E A R C H : C O S T
T R E E - B A S E D I N D E X
f: fanout
h: height
m: leaves (f h
)
Cost of a search
[logF m]
T R E E - B A S E D I N D E X
12 78
3 9 19 56
33 44
Daniel, 22, 6003
Ashby, 25, 3000
Basu, 33, 4003 Rossi, 44, 3000
Bianchi, 50, 5004
… … … …
…
…
Start search
A
B
L1 L2 L3
S E A R C H : C O S T
f = 3
h = 3
m = 27
I/Os
[log3 27] = 3
B +
- T R E E : I N S E R T
T R E E - B A S E D I N D E X
13 17 24 30
2* 3* 5* 7*
14* 16* 19* 20* 22* 24* 27* 29*
33* 34* 38* 39*
Insertion of a data record with search key value 8
2* 3* 5* 7* 8*
B +
- T R E E : I N S E R T
T R E E - B A S E D I N D E X
…
5
data entry is copied up
B +
- T R E E : I N S E R T
T R E E - B A S E D I N D E X
17
data entry is pushed up
5 13 24 30
B +
- T R E E : I N S E R T
T R E E - B A S E D I N D E X
17
5* 7* 8* 14
*
16
*
19
*
20
*
22
*
24
*
27
*
29
*
33
*
34
*
38
*
39
*
5 13 24 30
2* 3*
The resulting tree after the insertion of a data record with search key value 8.
B +
- T R E E : D E L E T E
T R E E - B A S E D I N D E X
17
5* 7* 8* 14
*
16
*
22 24 27 29 33
*
34
*
38
*
39
*
5 13 27 30
2* 3*
The resulting tree after deleting entries 19* and 20*
- 2.000.000 records
- 200.000 pages
- 10 data record in a page
- 200 records with the same value of the attribute cost
(on average)
- dense unclustered b+
-tree index with search key cost
- alternative 2 (k*, rid)
Book
code
author
cost
publisher
E X A M P L E
D E N S E U N C L U S T E R E D I N D E X
SELECT code, author, publisher
FROM Book
WHERE cost = %cost%
- 2.000.000 records
- 200.000 pages
- 10 data record in a page
- 200 records with the same value of the attribute cost
(on average)
- dense unclustered b+
-tree index with search key cost
- alternative 2 (k*, rid)
Book
code
author
cost
publisher
E X A M P L E
D E N S E U N C L U S T E R E D I N D E X
SELECT code, author, publisher
FROM Book
WHERE cost = %cost%
- 2.000.000 records
- 200.000 pages
- 10 data record in a page
- 200 records with the same value of the attribute cost
(on average)
- dense unclustered b+
-tree index with search key cost
- alternative 2 (k*, rid)
Book
code
author
cost
publisher
how many page accesses
do we need to answer to the query?
E X A M P L E
D E N S E U N C L U S T E R E D I N D E X
E X A M P L E
D E N S E U N C L U S T E R E D I N D E X
- 2.000.000 records
- 200.000 pages
- 10 data record in a page
- 200 records with the same value of the attribute cost
(on average)
- dense unclustered b+
-tree index with search key cost
- alternative 2 (k*, rid)
Book
code
author
cost
publisher
Let’s build the index structure
each record has 4 field so in each page there are 40 fields
20 data entries fit in one leaf page of the index
the tree has a fan-out of 20
E X A M P L E
D E N S E U N C L U S T E R E D I N D E X
fanout: 20
occupancy factor of 67%
leads to 13 data entries in the leaves
How many leaves are there in the tree?
2.000.000/13 = 153.846
log20 (153.846) = 4 I/O page accesses
E X A M P L E
D E N S E U N C L U S T E R E D I N D E X
How many I/Os are needed to go to the leaves?
leaves: 153.846
fanout: 20
E X A M P L E
D E N S E U N C L U S T E R E D I N D E X
- 200 records with the same value of the attribute cost
(on average)
- 13 data entries in the leaves
- dense unclustered b+
-tree index with search key cost
Book
code
author
cost
publisher
SELECT code, author, publisher
FROM Book
WHERE cost = %cost%
- 15 pages (200/13) to visit for reaching data
records with the same cost value (on average)
~ 3 sec
E X A M P L E
D E N S E U N C L U S T E R E D I N D E X
Costs
path to the leaves: 4 I/Os
leaves access: 15 I/Os
data records: 200 I/Os
Total cost
4 + 15 + 200 = 219 I/Os
- 2.000.000 records
- 200.000 pages
- 10 data record in a page
- 200 records with the same value of the attribute cost
(on average)
- sparse clustered b+
-tree index with search key cost
- alternative 2 (k*, rid)
Book
code
author
cost
publisher
E X A M P L E
S PA R S E C L U S T E R E D I N D E X
- 2.000.000 records
- 200.000 pages
- 10 data record in a page
- 200 records with the same value of the attribute cost
(on average)
- sparse clustered b+
-tree index with search key cost
- alternative 2 (k*, rid)
Book
code
author
cost
publisher
E X A M P L E
S PA R S E C L U S T E R E D I N D E X
SELECT code, author, publisher
FROM Book
WHERE cost = %cost%
- 2.000.000 records
- 200.000 pages
- 10 data record in a page
- 200 records with the same value of the attribute cost
(on average)
- sparse clustered b+
-tree index with search key cost
- alternative 2 (k*, rid)
Book
code
author
cost
publisher
E X A M P L E
S PA R S E C L U S T E R E D I N D E X
SELECT code, author, publisher
FROM Book
WHERE cost = %cost%
how many page accesses
do we need to answer to the query?
- 2.000.000 records
- 200.000 pages
- 10 data record in a page
- 200 records with the same value of the attribute cost
(on average)
- sparse clustered b+
-tree index with search key cost
- alternative 2 (k*, rid)
Book
code
author
cost
publisher
E X A M P L E
S PA R S E C L U S T E R E D I N D E X
Let’s build the index structure
each record has 4 field so in each page there are 40 fields
20 data entries fit in one leaf page of the index
the tree has a fan-out of 20
E X A M P L E
S PA R S E C L U S T E R E D I N D E X
fanout: 20
How many pages store 2.000.000 data records?
2.000.000/10 = 200.000
each data entry points to a
data record page
E X A M P L E
S PA R S E C L U S T E R E D I N D E X
fanout: 20
occupancy factor of 67%
leads to 13 data entries in the leaves
How many leaves are there in the tree?
200.000/13 = 15.384
log20 (15.384) = 3 I/O page accesses
E X A M P L E
S PA R S E C L U S T E R E D I N D E X
How many I/Os are needed to go to the leaves?
leaves: 15.384
fanout: 20
fanout: 20
E X A M P L E
S PA R S E C L U S T E R E D I N D E X
- 200 records with the same value of the attribute cost
(on average)
- 10 data records in a page
- sparse clustered b+
-tree index with search key cost
Book
code
author
cost
publisher
SELECT code, author, publisher
FROM Book
WHERE cost = %cost%
- 20 pages (200/10) of data records to visit 

(on average)
~ 0.3 sec
E X A M P L E
S PA R S E C L U S T E R E D I N D E X
Costs
path to the leaves: 3 I/Os
data records: 20 I/Os
Total cost
3 + 20 = 23 I/Os
E X A M P L E
A VA R I A N T
What if the attributes were part of the search key?
SELECT code, author, publisher
FROM Book
WHERE cost = %cost%
E X A M P L E
W I T H O U T I N D E X
In the worst case we have to visit 2.000.000 records
SELECT code, author, publisher
FROM Book
WHERE cost = %cost%
~ 50 min
$ W H O A M I
Andrea Giuliano
@bit_shark
www.andreagiuliano.it
joind.in/13333
Please rate the talk!
Ramakrishan, Gehrke “Database Management Systems”
Assets:
https://farm4.staticflickr.com/3577/3492185538_a39dbb4511_b_d.jpg
https://farm3.staticflickr.com/2852/10740309163_12f6a671cc_k_d.jpg
https://farm3.staticflickr.com/2456/3835365695_5e515a3492_b_d.jpg
https://farm5.staticflickr.com/4048/4332381194_3cfbed7f8e_b_d.jpg
https://farm5.staticflickr.com/4117/4800819674_3cf963deaa_b_d.jpg
https://farm5.staticflickr.com/4141/4772464179_0672159bbd_b.jpg
https://farm9.staticflickr.com/8530/8574154090_bd14f9ccbf_o_d.jpg
https://farm8.staticflickr.com/7370/10847923014_2b3fc30cea_k_d.jpg
R E F E R E N C E S

More Related Content

Similar to Index management in depth

Postgres indexes: how to make them work for your application
Postgres indexes: how to make them work for your applicationPostgres indexes: how to make them work for your application
Postgres indexes: how to make them work for your applicationBartosz Sypytkowski
 
Intro to Data warehousing lecture 11
Intro to Data warehousing   lecture 11Intro to Data warehousing   lecture 11
Intro to Data warehousing lecture 11AnwarrChaudary
 
Intro to Data warehousing lecture 14
Intro to Data warehousing   lecture 14Intro to Data warehousing   lecture 14
Intro to Data warehousing lecture 14AnwarrChaudary
 
Intro to Data warehousing lecture 19
Intro to Data warehousing   lecture 19Intro to Data warehousing   lecture 19
Intro to Data warehousing lecture 19AnwarrChaudary
 
R-Trees and Geospatial Data Structures
R-Trees and Geospatial Data StructuresR-Trees and Geospatial Data Structures
R-Trees and Geospatial Data StructuresAmrinder Arora
 
What to do when one size does not fit all?!
What to do when one size does not fit all?!What to do when one size does not fit all?!
What to do when one size does not fit all?!Arjen de Vries
 
Modern Database Systems - Lecture 02
Modern Database Systems - Lecture 02Modern Database Systems - Lecture 02
Modern Database Systems - Lecture 02Michael Mathioudakis
 
Modern Database Systems - Lecture 01
Modern Database Systems - Lecture 01Modern Database Systems - Lecture 01
Modern Database Systems - Lecture 01Michael Mathioudakis
 
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...DataStax
 
OakTable World 2015 - Using XMLType content with the Oracle In-Memory Column...
OakTable World 2015  - Using XMLType content with the Oracle In-Memory Column...OakTable World 2015  - Using XMLType content with the Oracle In-Memory Column...
OakTable World 2015 - Using XMLType content with the Oracle In-Memory Column...Marco Gralike
 
ARM_03_FPtreefrequency pattern data warehousing .ppt
ARM_03_FPtreefrequency pattern data warehousing .pptARM_03_FPtreefrequency pattern data warehousing .ppt
ARM_03_FPtreefrequency pattern data warehousing .pptChellamuthuHaripriya
 
Advanced MongoDB Aggregation Pipelines
Advanced MongoDB Aggregation PipelinesAdvanced MongoDB Aggregation Pipelines
Advanced MongoDB Aggregation PipelinesTom Schreiber
 
MongoDB Europe 2016 - Advanced MongoDB Aggregation Pipelines
MongoDB Europe 2016 - Advanced MongoDB Aggregation PipelinesMongoDB Europe 2016 - Advanced MongoDB Aggregation Pipelines
MongoDB Europe 2016 - Advanced MongoDB Aggregation PipelinesMongoDB
 
New Features in Apache Pinot
New Features in Apache PinotNew Features in Apache Pinot
New Features in Apache PinotSiddharth Teotia
 
Rdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimationRdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimationINRIA-OAK
 
DATA STRUCTURES
DATA STRUCTURESDATA STRUCTURES
DATA STRUCTURESbca2010
 
bca data structure
bca data structurebca data structure
bca data structureshini
 

Similar to Index management in depth (20)

Postgres indexes: how to make them work for your application
Postgres indexes: how to make them work for your applicationPostgres indexes: how to make them work for your application
Postgres indexes: how to make them work for your application
 
Intro to Data warehousing lecture 11
Intro to Data warehousing   lecture 11Intro to Data warehousing   lecture 11
Intro to Data warehousing lecture 11
 
Intro to Data warehousing lecture 14
Intro to Data warehousing   lecture 14Intro to Data warehousing   lecture 14
Intro to Data warehousing lecture 14
 
Intro to Data warehousing lecture 19
Intro to Data warehousing   lecture 19Intro to Data warehousing   lecture 19
Intro to Data warehousing lecture 19
 
R-Trees and Geospatial Data Structures
R-Trees and Geospatial Data StructuresR-Trees and Geospatial Data Structures
R-Trees and Geospatial Data Structures
 
What to do when one size does not fit all?!
What to do when one size does not fit all?!What to do when one size does not fit all?!
What to do when one size does not fit all?!
 
Modern Database Systems - Lecture 02
Modern Database Systems - Lecture 02Modern Database Systems - Lecture 02
Modern Database Systems - Lecture 02
 
Modern Database Systems - Lecture 01
Modern Database Systems - Lecture 01Modern Database Systems - Lecture 01
Modern Database Systems - Lecture 01
 
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
What is in All of Those SSTable Files Not Just the Data One but All the Rest ...
 
OakTable World 2015 - Using XMLType content with the Oracle In-Memory Column...
OakTable World 2015  - Using XMLType content with the Oracle In-Memory Column...OakTable World 2015  - Using XMLType content with the Oracle In-Memory Column...
OakTable World 2015 - Using XMLType content with the Oracle In-Memory Column...
 
ARM_03_FPtreefrequency pattern data warehousing .ppt
ARM_03_FPtreefrequency pattern data warehousing .pptARM_03_FPtreefrequency pattern data warehousing .ppt
ARM_03_FPtreefrequency pattern data warehousing .ppt
 
Advanced MongoDB Aggregation Pipelines
Advanced MongoDB Aggregation PipelinesAdvanced MongoDB Aggregation Pipelines
Advanced MongoDB Aggregation Pipelines
 
MongoDB Europe 2016 - Advanced MongoDB Aggregation Pipelines
MongoDB Europe 2016 - Advanced MongoDB Aggregation PipelinesMongoDB Europe 2016 - Advanced MongoDB Aggregation Pipelines
MongoDB Europe 2016 - Advanced MongoDB Aggregation Pipelines
 
New Features in Apache Pinot
New Features in Apache PinotNew Features in Apache Pinot
New Features in Apache Pinot
 
Syntaxdirected
SyntaxdirectedSyntaxdirected
Syntaxdirected
 
Syntaxdirected
SyntaxdirectedSyntaxdirected
Syntaxdirected
 
Syntaxdirected (1)
Syntaxdirected (1)Syntaxdirected (1)
Syntaxdirected (1)
 
Rdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimationRdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimation
 
DATA STRUCTURES
DATA STRUCTURESDATA STRUCTURES
DATA STRUCTURES
 
bca data structure
bca data structurebca data structure
bca data structure
 

More from Andrea Giuliano

CQRS, ReactJS, Docker in a nutshell
CQRS, ReactJS, Docker in a nutshellCQRS, ReactJS, Docker in a nutshell
CQRS, ReactJS, Docker in a nutshellAndrea Giuliano
 
Go fast in a graph world
Go fast in a graph worldGo fast in a graph world
Go fast in a graph worldAndrea Giuliano
 
Concurrent test frameworks
Concurrent test frameworksConcurrent test frameworks
Concurrent test frameworksAndrea Giuliano
 
Consistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your ChoiceConsistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your ChoiceAndrea Giuliano
 
Asynchronous data processing
Asynchronous data processingAsynchronous data processing
Asynchronous data processingAndrea Giuliano
 
Index management in shallow depth
Index management in shallow depthIndex management in shallow depth
Index management in shallow depthAndrea Giuliano
 
Everything you always wanted to know about forms* *but were afraid to ask
Everything you always wanted to know about forms* *but were afraid to askEverything you always wanted to know about forms* *but were afraid to ask
Everything you always wanted to know about forms* *but were afraid to askAndrea Giuliano
 

More from Andrea Giuliano (9)

CQRS, ReactJS, Docker in a nutshell
CQRS, ReactJS, Docker in a nutshellCQRS, ReactJS, Docker in a nutshell
CQRS, ReactJS, Docker in a nutshell
 
Go fast in a graph world
Go fast in a graph worldGo fast in a graph world
Go fast in a graph world
 
Concurrent test frameworks
Concurrent test frameworksConcurrent test frameworks
Concurrent test frameworks
 
Consistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your ChoiceConsistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your Choice
 
Asynchronous data processing
Asynchronous data processingAsynchronous data processing
Asynchronous data processing
 
Index management in shallow depth
Index management in shallow depthIndex management in shallow depth
Index management in shallow depth
 
Everything you always wanted to know about forms* *but were afraid to ask
Everything you always wanted to know about forms* *but were afraid to askEverything you always wanted to know about forms* *but were afraid to ask
Everything you always wanted to know about forms* *but were afraid to ask
 
Stub you!
Stub you!Stub you!
Stub you!
 
Let's test!
Let's test!Let's test!
Let's test!
 

Recently uploaded

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 

Recently uploaded (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Index management in depth

  • 1. I N D E X M A N A G E M E N TI N D E P T H A n d re a G i u l i a n o @ b i t _ s h a r k
  • 2. D ATA B A S E M A N A G E M E N T S Y S T E M S
  • 3. T H E A R C H I T E C T U R E D A TA B A S E M A N A G E M E N T S Y S T E M S web forms Application Front Ends SQL Interface SQL COMMANDS Plan Executor Parser OptimizerOperator Evaluator Access File Manager SQL Engine Buffer Manager Disk Manager Transaction Manager Lock Manager Recovery Manager Concurrency Control DBMS
  • 4. S Q L E N G I N E A R C H I T E C T U R E O F A D B M S web forms Application Front Ends SQL Interface SQL COMMANDS Plan Executor Parser OptimizerOperator Evaluator Access File Manager SQL Engine Buffer Manager Disk Manager Transaction Manager Lock Manager Recovery Manager Concurrency Control DBMS
  • 5. A C C E S S F I L E M A N A G E R A R C H I T E C T U R E O F A D B M S web forms Application Front Ends SQL Interface SQL COMMANDS Plan Executor Parser OptimizerOperator Evaluator Access File Manager SQL Engine Buffer Manager Disk Manager Transaction Manager Lock Manager Recovery Manager Concurrency Control DBMS
  • 6. B U F F E R M A N A G E R A R C H I T E C T U R E O F A D B M S web forms Application Front Ends SQL Interface SQL COMMANDS Plan Executor Parser OptimizerOperator Evaluator Access File Manager SQL Engine Buffer Manager Disk Manager Transaction Manager Lock Manager Recovery Manager Concurrency Control DBMS
  • 7. D I S K M A N A G E R A R C H I T E C T U R E O F A D B M S web forms Application Front Ends SQL Interface SQL COMMANDS Plan Executor Parser OptimizerOperator Evaluator Access File Manager SQL Engine Buffer Manager Disk Manager Transaction Manager Lock Manager Recovery Manager Concurrency Control DBMS
  • 8. U N I T O F I N F O R M AT I O N Page size ranges from 2Kb to 64Kb I/O of pages dominates the cost of the operations
  • 9. F I L E O R G A N I Z AT I O N method of arranging the records in a file
  • 10. –Sars, Roebuck, and Co, Consumers’ Guide, 1897 “If you don’t find it in the index, look very carefully through the entire catalog.”
  • 11. I N D E X E S organize data records on disk to optimize retrieval operations
  • 12. I N D E X E S data entry record stored in an index file data record record stored in a database file
  • 13. P R O P E R T I E S O F A N I N D E X 1. Structure of data entries 2. Clustered/unclustered 3. Primary/secondary 4. Dense/sparse 5. Organization of the index
  • 14. S T R U C T U R E O F D ATA E N T R I E S k*: data entry whose search key value is k 1. k* is a data record -extreme case 2. k* is a pair (k, rid) -the index file is independent from the data file 3. k* is a pair <k, rid-list> -the index file is independent from the data file -better use of space but variable-length data entries P R O P E R T I E S O F A N I N D E X
  • 15. C L U S T E R E D / U N C L U S T E R E D I N D E X unclustered clustered data record file index file data records data entries data entries index entries data records P R O P E R T I E S O F A N I N D E X
  • 16. P E R F O R M A N C E data record file index file data records data entries Clustered only few pages have to be retrieved Unclustered as many data page I/Os as the number of data entries C L U S T E R E D / U N C L U S T E R E D I N D E X
  • 17. P R I M A RY A N D S E C O N D A RY I N D E X P R O P E R T I E S O F A N I N D E X
  • 18. 10 20 30 40 10 20 30 40 page pointers page in index file page in data file P R O P E R T I E S O F A N I N D E X D E N S E I N D E X
  • 19. 10 30 50 70 10 20 30 40 50 60 70 80 page pointers page in index file page in data file P R O P E R T I E S O F A N I N D E X S PA R S E I N D E X
  • 20. P R O P E R T I E S O F A N I N D E X O R G A N I Z AT I O N O F T H E I N D E X sorted index the index is a sorted file tree-based index the index is a tree hash-based index the index is a function from search key values to record addresses
  • 21. I S A M A N D B + - T R E E T R E E - B A S E D I N D E X ISAM used when the relation is static: no insertion and deletion on the tree b+-tree effective in dynamic situations with insertion and deletion
  • 22. Data Pages Index Pages Overflow Pages Page Allocation in ISAM … … Non-leaf pages Leaf pages Primary pages Overflow page I S A M T R E E - B A S E D I N D E X
  • 23. I S A M : O P E R AT I O N S T R E E - B A S E D I N D E X search identical to b+-tree (more on this soon). insertion find the right position on the tree and write the key (possible overflow pages). deletion remove the entry and the empty overflow page if needed. Leave the empty primary page as it is.
  • 24. B + - T R E E T R E E - B A S E D I N D E X Index entries Data entries Index file
  • 25. 12 78 3 9 19 56 86 94 33 44 Daniel, 22, 6003 Ashby, 25, 3000 Basu, 33, 4003 Rossi, 44, 3000 Bianchi, 50, 5004 B + - T R E E T R E E - B A S E D I N D E X … … … … … … ……
  • 26. 12 78 3 9 19 56 86 94 33 44 Daniel, 22, 6003 Ashby, 25, 3000 Basu, 33, 4003 Rossi, 44, 3000 Bianchi, 50, 5004 … … … … … … …… B + - T R E E : S E A R C H T R E E - B A S E D I N D E X Start search
  • 27. B + - T R E E : S E A R C H T R E E - B A S E D I N D E X 12 78 3 9 19 56 86 94 33 44 Daniel, 22, 6003 Ashby, 25, 3000 Basu, 33, 4003 Rossi, 44, 3000 Bianchi, 50, 5004 … … … … … … …… Start search A B L1 L2 L3 find all data entries with 24 < key < 50
  • 28. 12 78 3 9 19 56 33 44 Daniel, 22, 6003 Ashby, 25, 3000 Basu, 33, 4003 Rossi, 44, 3000 Bianchi, 50, 5004 … … … … … … Start search A B L1 L2 L3 S E A R C H : C O S T T R E E - B A S E D I N D E X f: fanout h: height m: leaves (f h ) Cost of a search [logF m]
  • 29. T R E E - B A S E D I N D E X 12 78 3 9 19 56 33 44 Daniel, 22, 6003 Ashby, 25, 3000 Basu, 33, 4003 Rossi, 44, 3000 Bianchi, 50, 5004 … … … … … … Start search A B L1 L2 L3 S E A R C H : C O S T f = 3 h = 3 m = 27 I/Os [log3 27] = 3
  • 30. B + - T R E E : I N S E R T T R E E - B A S E D I N D E X 13 17 24 30 2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* Insertion of a data record with search key value 8
  • 31. 2* 3* 5* 7* 8* B + - T R E E : I N S E R T T R E E - B A S E D I N D E X … 5 data entry is copied up
  • 32. B + - T R E E : I N S E R T T R E E - B A S E D I N D E X 17 data entry is pushed up 5 13 24 30
  • 33. B + - T R E E : I N S E R T T R E E - B A S E D I N D E X 17 5* 7* 8* 14 * 16 * 19 * 20 * 22 * 24 * 27 * 29 * 33 * 34 * 38 * 39 * 5 13 24 30 2* 3* The resulting tree after the insertion of a data record with search key value 8.
  • 34. B + - T R E E : D E L E T E T R E E - B A S E D I N D E X 17 5* 7* 8* 14 * 16 * 22 24 27 29 33 * 34 * 38 * 39 * 5 13 27 30 2* 3* The resulting tree after deleting entries 19* and 20*
  • 35.
  • 36. - 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost (on average) - dense unclustered b+ -tree index with search key cost - alternative 2 (k*, rid) Book code author cost publisher E X A M P L E D E N S E U N C L U S T E R E D I N D E X
  • 37. SELECT code, author, publisher FROM Book WHERE cost = %cost% - 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost (on average) - dense unclustered b+ -tree index with search key cost - alternative 2 (k*, rid) Book code author cost publisher E X A M P L E D E N S E U N C L U S T E R E D I N D E X
  • 38. SELECT code, author, publisher FROM Book WHERE cost = %cost% - 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost (on average) - dense unclustered b+ -tree index with search key cost - alternative 2 (k*, rid) Book code author cost publisher how many page accesses do we need to answer to the query? E X A M P L E D E N S E U N C L U S T E R E D I N D E X
  • 39. E X A M P L E D E N S E U N C L U S T E R E D I N D E X - 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost (on average) - dense unclustered b+ -tree index with search key cost - alternative 2 (k*, rid) Book code author cost publisher Let’s build the index structure each record has 4 field so in each page there are 40 fields 20 data entries fit in one leaf page of the index the tree has a fan-out of 20
  • 40. E X A M P L E D E N S E U N C L U S T E R E D I N D E X fanout: 20 occupancy factor of 67% leads to 13 data entries in the leaves How many leaves are there in the tree? 2.000.000/13 = 153.846
  • 41. log20 (153.846) = 4 I/O page accesses E X A M P L E D E N S E U N C L U S T E R E D I N D E X How many I/Os are needed to go to the leaves? leaves: 153.846 fanout: 20
  • 42. E X A M P L E D E N S E U N C L U S T E R E D I N D E X - 200 records with the same value of the attribute cost (on average) - 13 data entries in the leaves - dense unclustered b+ -tree index with search key cost Book code author cost publisher SELECT code, author, publisher FROM Book WHERE cost = %cost% - 15 pages (200/13) to visit for reaching data records with the same cost value (on average)
  • 43. ~ 3 sec E X A M P L E D E N S E U N C L U S T E R E D I N D E X Costs path to the leaves: 4 I/Os leaves access: 15 I/Os data records: 200 I/Os Total cost 4 + 15 + 200 = 219 I/Os
  • 44. - 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost (on average) - sparse clustered b+ -tree index with search key cost - alternative 2 (k*, rid) Book code author cost publisher E X A M P L E S PA R S E C L U S T E R E D I N D E X
  • 45. - 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost (on average) - sparse clustered b+ -tree index with search key cost - alternative 2 (k*, rid) Book code author cost publisher E X A M P L E S PA R S E C L U S T E R E D I N D E X SELECT code, author, publisher FROM Book WHERE cost = %cost%
  • 46. - 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost (on average) - sparse clustered b+ -tree index with search key cost - alternative 2 (k*, rid) Book code author cost publisher E X A M P L E S PA R S E C L U S T E R E D I N D E X SELECT code, author, publisher FROM Book WHERE cost = %cost% how many page accesses do we need to answer to the query?
  • 47. - 2.000.000 records - 200.000 pages - 10 data record in a page - 200 records with the same value of the attribute cost (on average) - sparse clustered b+ -tree index with search key cost - alternative 2 (k*, rid) Book code author cost publisher E X A M P L E S PA R S E C L U S T E R E D I N D E X Let’s build the index structure each record has 4 field so in each page there are 40 fields 20 data entries fit in one leaf page of the index the tree has a fan-out of 20
  • 48. E X A M P L E S PA R S E C L U S T E R E D I N D E X fanout: 20 How many pages store 2.000.000 data records? 2.000.000/10 = 200.000 each data entry points to a data record page
  • 49. E X A M P L E S PA R S E C L U S T E R E D I N D E X fanout: 20 occupancy factor of 67% leads to 13 data entries in the leaves How many leaves are there in the tree? 200.000/13 = 15.384
  • 50. log20 (15.384) = 3 I/O page accesses E X A M P L E S PA R S E C L U S T E R E D I N D E X How many I/Os are needed to go to the leaves? leaves: 15.384 fanout: 20 fanout: 20
  • 51. E X A M P L E S PA R S E C L U S T E R E D I N D E X - 200 records with the same value of the attribute cost (on average) - 10 data records in a page - sparse clustered b+ -tree index with search key cost Book code author cost publisher SELECT code, author, publisher FROM Book WHERE cost = %cost% - 20 pages (200/10) of data records to visit 
 (on average)
  • 52. ~ 0.3 sec E X A M P L E S PA R S E C L U S T E R E D I N D E X Costs path to the leaves: 3 I/Os data records: 20 I/Os Total cost 3 + 20 = 23 I/Os
  • 53. E X A M P L E A VA R I A N T What if the attributes were part of the search key? SELECT code, author, publisher FROM Book WHERE cost = %cost%
  • 54. E X A M P L E W I T H O U T I N D E X In the worst case we have to visit 2.000.000 records SELECT code, author, publisher FROM Book WHERE cost = %cost% ~ 50 min
  • 55.
  • 56. $ W H O A M I Andrea Giuliano @bit_shark www.andreagiuliano.it
  • 58. Ramakrishan, Gehrke “Database Management Systems” Assets: https://farm4.staticflickr.com/3577/3492185538_a39dbb4511_b_d.jpg https://farm3.staticflickr.com/2852/10740309163_12f6a671cc_k_d.jpg https://farm3.staticflickr.com/2456/3835365695_5e515a3492_b_d.jpg https://farm5.staticflickr.com/4048/4332381194_3cfbed7f8e_b_d.jpg https://farm5.staticflickr.com/4117/4800819674_3cf963deaa_b_d.jpg https://farm5.staticflickr.com/4141/4772464179_0672159bbd_b.jpg https://farm9.staticflickr.com/8530/8574154090_bd14f9ccbf_o_d.jpg https://farm8.staticflickr.com/7370/10847923014_2b3fc30cea_k_d.jpg R E F E R E N C E S