The document provides an overview of storage and indexing in database systems. It discusses different file organizations like heap files, sorted files, and indexes. It describes two common index structures - B+ trees and hash indexes. It covers concepts like primary vs secondary indexes, clustered vs unclustered indexes, and dense vs sparse indexes. The document also provides examples of how to choose appropriate indexes for different types of queries and discusses design trade-offs around indexing.
Presentation on data preparation with pandasAkshitaKanther
Data preparation is the first step after you get your hands on any kind of dataset. This is the step when you pre-process raw data into a form that can be easily and accurately analyzed. Proper data preparation allows for efficient analysis - it can eliminate errors and inaccuracies that could have occurred during the data gathering process and can thus help in removing some bias resulting from poor data quality. Therefore a lot of an analyst's time is spent on this vital step.
Presentation on data preparation with pandasAkshitaKanther
Data preparation is the first step after you get your hands on any kind of dataset. This is the step when you pre-process raw data into a form that can be easily and accurately analyzed. Proper data preparation allows for efficient analysis - it can eliminate errors and inaccuracies that could have occurred during the data gathering process and can thus help in removing some bias resulting from poor data quality. Therefore a lot of an analyst's time is spent on this vital step.
For the past several decades the rising tide of technology -- especially the increasing speed of single processors -- has allowed the same data analysis code to run faster and on bigger data sets. That happy era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of I/O, and of RAM. To deal with this, we need software that can use multiple cores, multiple hard drives, and multiple computers.
That is, we need scalable data analysis software. It needs to scale from small data sets to huge ones, from using one core and one hard drive on one computer to using many cores and many hard drives on many computers, and from using local hardware to using remote clouds.
R is the ideal platform for scalable data analysis software. It is easy to add new functionality in the R environment, and easy to integrate it into existing functionality. R is also powerful, flexible and forgiving.
I will discuss the approach to scalability we have taken at Revolution Analytics with our package RevoScaleR. A key part of this approach is to efficiently operate on "chunks" of data -- sets of rows of data for selected columns. I will discuss this approach from the point of view of:
- Storing data on disk
- Importing data from other sources
- Reading and writing of chunks of data
- Handling data in memory
- Using multiple cores on single computers
- Using multiple computers
- Automatically parallelizing "external memory" algorithms
Learning Objectives - In this module, you will understand Advance Hive concepts such as UDF. Understanding columnar database HBase. Comparing SQL and NoSQL approach.
Learning Objectives - This module will help you in understanding Apache Hive Installation, Loading and Querying Data in Hive and so on.
Topics - Hive Architecture and Installation, Comparison with Traditional Database, HiveQL: Data Types, Operators and Functions, Hive Tables (Managed Tables and External Tables, Partitions and Buckets, Storage Formats, Importing Data, Altering Tables, Dropping Tables), Querying Data (Sorting And Aggregating, Map Reduce Scripts, Joins & Subqueries, Views, Map and Reduce side Joins to optimize Query).
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...Ahmed Saleh
While there are many studies on information retrieval models using full-text, there are presently no comparison studies of full-text retrieval vs. retrieval only over the titles of documents.
On the one hand, the full-text of documents like scientific papers is not always available due to, e.\,g., copyright policies of academic publishers.
On the other hand, conducting a search based on titles alone has strong limitations. Titles are short and therefore may not contain enough information to yield satisfactory search results.
In this paper, we compare different retrieval models regarding their search performance on the full-text vs. only titles of documents.
We use different datasets, including the three digital library datasets: EconBiz, IREON, and PubMed.
The results show that it is possible to build effective title-based retrieval models that provide competitive results comparable to full-text retrieval. The difference between the average evaluation results of the best title-based retrieval models is only 3% less than those of the best full-text-based retrieval models.
IEEE IRI 16 - Clustering Web Pages based on Structure and Style SimilarityThamme Gowda
The structural similarity of HTML pages is measured by using Tree Edit Distance measure on DOM trees. The stylistic similarity is measured by using Jaccard similarity on CSS class names. An aggregated similarity measure is computed by combining structural and stylistic measures. A clustering method is then applied to this aggregated similarity measure to group the documents.
Bucketing is a popular data partitioning technique to pre-shuffle and (optionally) pre-sort data during writes. This is ideal for a variety of write-once and read-many datasets at Facebook, where Spark can automatically avoid expensive shuffles/sorts (when the underlying data is joined/aggregated on its bucketed keys) resulting in substantial savings in both CPU and IO.
Over the last year, we’ve added a series of optimizations in Apache Spark as a means towards achieving feature parity with Hive and Spark. These include avoiding shuffle/sort when joining/aggregating/inserting on tables with mismatching buckets, allowing user to skip shuffle/sort when writing to bucketed tables, adding data validators before writing bucketed data, among many others. As a direct consequence of these efforts, we’ve witnessed over 10x growth (spanning 40% of total compute) in queries that read one or more bucketed tables across the entire data warehouse at Facebook.
In this talk, we’ll take a deep dive into the internals of bucketing support in SparkSQL, describe use-cases where bucketing is useful, touch upon some of the on-going work to automatically suggest bucketing tables based on query column lineage, and summarize the lessons learned from developing bucketing support in Spark at Facebook over the last 2 years
For the past several decades the rising tide of technology -- especially the increasing speed of single processors -- has allowed the same data analysis code to run faster and on bigger data sets. That happy era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of I/O, and of RAM. To deal with this, we need software that can use multiple cores, multiple hard drives, and multiple computers.
That is, we need scalable data analysis software. It needs to scale from small data sets to huge ones, from using one core and one hard drive on one computer to using many cores and many hard drives on many computers, and from using local hardware to using remote clouds.
R is the ideal platform for scalable data analysis software. It is easy to add new functionality in the R environment, and easy to integrate it into existing functionality. R is also powerful, flexible and forgiving.
I will discuss the approach to scalability we have taken at Revolution Analytics with our package RevoScaleR. A key part of this approach is to efficiently operate on "chunks" of data -- sets of rows of data for selected columns. I will discuss this approach from the point of view of:
- Storing data on disk
- Importing data from other sources
- Reading and writing of chunks of data
- Handling data in memory
- Using multiple cores on single computers
- Using multiple computers
- Automatically parallelizing "external memory" algorithms
Learning Objectives - In this module, you will understand Advance Hive concepts such as UDF. Understanding columnar database HBase. Comparing SQL and NoSQL approach.
Learning Objectives - This module will help you in understanding Apache Hive Installation, Loading and Querying Data in Hive and so on.
Topics - Hive Architecture and Installation, Comparison with Traditional Database, HiveQL: Data Types, Operators and Functions, Hive Tables (Managed Tables and External Tables, Partitions and Buckets, Storage Formats, Importing Data, Altering Tables, Dropping Tables), Querying Data (Sorting And Aggregating, Map Reduce Scripts, Joins & Subqueries, Views, Map and Reduce side Joins to optimize Query).
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...Ahmed Saleh
While there are many studies on information retrieval models using full-text, there are presently no comparison studies of full-text retrieval vs. retrieval only over the titles of documents.
On the one hand, the full-text of documents like scientific papers is not always available due to, e.\,g., copyright policies of academic publishers.
On the other hand, conducting a search based on titles alone has strong limitations. Titles are short and therefore may not contain enough information to yield satisfactory search results.
In this paper, we compare different retrieval models regarding their search performance on the full-text vs. only titles of documents.
We use different datasets, including the three digital library datasets: EconBiz, IREON, and PubMed.
The results show that it is possible to build effective title-based retrieval models that provide competitive results comparable to full-text retrieval. The difference between the average evaluation results of the best title-based retrieval models is only 3% less than those of the best full-text-based retrieval models.
IEEE IRI 16 - Clustering Web Pages based on Structure and Style SimilarityThamme Gowda
The structural similarity of HTML pages is measured by using Tree Edit Distance measure on DOM trees. The stylistic similarity is measured by using Jaccard similarity on CSS class names. An aggregated similarity measure is computed by combining structural and stylistic measures. A clustering method is then applied to this aggregated similarity measure to group the documents.
Bucketing is a popular data partitioning technique to pre-shuffle and (optionally) pre-sort data during writes. This is ideal for a variety of write-once and read-many datasets at Facebook, where Spark can automatically avoid expensive shuffles/sorts (when the underlying data is joined/aggregated on its bucketed keys) resulting in substantial savings in both CPU and IO.
Over the last year, we’ve added a series of optimizations in Apache Spark as a means towards achieving feature parity with Hive and Spark. These include avoiding shuffle/sort when joining/aggregating/inserting on tables with mismatching buckets, allowing user to skip shuffle/sort when writing to bucketed tables, adding data validators before writing bucketed data, among many others. As a direct consequence of these efforts, we’ve witnessed over 10x growth (spanning 40% of total compute) in queries that read one or more bucketed tables across the entire data warehouse at Facebook.
In this talk, we’ll take a deep dive into the internals of bucketing support in SparkSQL, describe use-cases where bucketing is useful, touch upon some of the on-going work to automatically suggest bucketing tables based on query column lineage, and summarize the lessons learned from developing bucketing support in Spark at Facebook over the last 2 years
Annotating search results from web databases-IEEE Transaction Paper 2013Yadhu Kiran
Abstract—An increasing number of databases have become web accessible through HTML form-based search interfaces. The data units returned from the underlying database are usually encoded into the result pages dynamically for human browsing. For the encoded data units to be machine processable, which is essential for many applications such as deep web data collection and Internet comparison shopping, they need to be extracted out and assigned meaningful labels. In this paper, we present an automatic
annotation approach that first aligns the data units on a result page into different groups such that the data in the same group have the same semantic. Then, for each group we annotate it from different aspects and aggregate the different annotations to predict a final annotation label for it. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result pages from the same web database. Our experiments indicate that the proposed approach is highly effective.
Data Warehousing and Business Intelligence is one of the hottest skills today, and is the cornerstone for reporting, data science, and analytics. This course teaches the fundamentals with examples plus a project to fully illustrate the concepts.
Immunizing Image Classifiers Against Localized Adversary Attacksgerogepatton
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks
(CNN)s, to adversarial attacks and presents a proactive training technique designed to counter them. We
introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations.
When combined with 3D convolution and deep curriculum learning optimization (CLO), itsignificantly improves
the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach
using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10
and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing
accuracy improvements over previous techniques. The results indicate that the combination of the volumetric
input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating
adversary training.
Quality defects in TMT Bars, Possible causes and Potential Solutions.PrashantGoswami42
Maintaining high-quality standards in the production of TMT bars is crucial for ensuring structural integrity in construction. Addressing common defects through careful monitoring, standardized processes, and advanced technology can significantly improve the quality of TMT bars. Continuous training and adherence to quality control measures will also play a pivotal role in minimizing these defects.
Explore the innovative world of trenchless pipe repair with our comprehensive guide, "The Benefits and Techniques of Trenchless Pipe Repair." This document delves into the modern methods of repairing underground pipes without the need for extensive excavation, highlighting the numerous advantages and the latest techniques used in the industry.
Learn about the cost savings, reduced environmental impact, and minimal disruption associated with trenchless technology. Discover detailed explanations of popular techniques such as pipe bursting, cured-in-place pipe (CIPP) lining, and directional drilling. Understand how these methods can be applied to various types of infrastructure, from residential plumbing to large-scale municipal systems.
Ideal for homeowners, contractors, engineers, and anyone interested in modern plumbing solutions, this guide provides valuable insights into why trenchless pipe repair is becoming the preferred choice for pipe rehabilitation. Stay informed about the latest advancements and best practices in the field.
Welcome to WIPAC Monthly the magazine brought to you by the LinkedIn Group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news to celebrate the 13 years since the group was created we have articles including
A case study of the used of Advanced Process Control at the Wastewater Treatment works at Lleida in Spain
A look back on an article on smart wastewater networks in order to see how the industry has measured up in the interim around the adoption of Digital Transformation in the Water Industry.
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)MdTanvirMahtab2
This presentation is about the working procedure of Shahjalal Fertilizer Company Limited (SFCL). A Govt. owned Company of Bangladesh Chemical Industries Corporation under Ministry of Industries.
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The Calculation HTML Code included.
Overview of the fundamental roles in Hydropower generation and the components involved in wider Electrical Engineering.
This paper presents the design and construction of hydroelectric dams from the hydrologist’s survey of the valley before construction, all aspects and involved disciplines, fluid dynamics, structural engineering, generation and mains frequency regulation to the very transmission of power through the network in the United Kingdom.
Author: Robbie Edward Sayers
Collaborators and co editors: Charlie Sims and Connor Healey.
(C) 2024 Robbie E. Sayers
Automobile Management System Project Report.pdfKamal Acharya
The proposed project is developed to manage the automobile in the automobile dealer company. The main module in this project is login, automobile management, customer management, sales, complaints and reports. The first module is the login. The automobile showroom owner should login to the project for usage. The username and password are verified and if it is correct, next form opens. If the username and password are not correct, it shows the error message.
When a customer search for a automobile, if the automobile is available, they will be taken to a page that shows the details of the automobile including automobile name, automobile ID, quantity, price etc. “Automobile Management System” is useful for maintaining automobiles, customers effectively and hence helps for establishing good relation between customer and automobile organization. It contains various customized modules for effectively maintaining automobiles and stock information accurately and safely.
When the automobile is sold to the customer, stock will be reduced automatically. When a new purchase is made, stock will be increased automatically. While selecting automobiles for sale, the proposed software will automatically check for total number of available stock of that particular item, if the total stock of that particular item is less than 5, software will notify the user to purchase the particular item.
Also when the user tries to sale items which are not in stock, the system will prompt the user that the stock is not enough. Customers of this system can search for a automobile; can purchase a automobile easily by selecting fast. On the other hand the stock of automobiles can be maintained perfectly by the automobile shop manager overcoming the drawbacks of existing system.
Democratizing Fuzzing at Scale by Abhishek Aryaabh.arya
Presented at NUS: Fuzzing and Software Security Summer School 2024
This keynote talks about the democratization of fuzzing at scale, highlighting the collaboration between open source communities, academia, and industry to advance the field of fuzzing. It delves into the history of fuzzing, the development of scalable fuzzing platforms, and the empowerment of community-driven research. The talk will further discuss recent advancements leveraging AI/ML and offer insights into the future evolution of the fuzzing landscape.
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfKamal Acharya
The College Bus Management system is completely developed by Visual Basic .NET Version. The application is connect with most secured database language MS SQL Server. The application is develop by using best combination of front-end and back-end languages. The application is totally design like flat user interface. This flat user interface is more attractive user interface in 2017. The application is gives more important to the system functionality. The application is to manage the student’s details, driver’s details, bus details, bus route details, bus fees details and more. The application has only one unit for admin. The admin can manage the entire application. The admin can login into the application by using username and password of the admin. The application is develop for big and small colleges. It is more user friendly for non-computer person. Even they can easily learn how to manage the application within hours. The application is more secure by the admin. The system will give an effective output for the VB.Net and SQL Server given as input to the system. The compiled java program given as input to the system, after scanning the program will generate different reports. The application generates the report for users. The admin can view and download the report of the data. The application deliver the excel format reports. Because, excel formatted reports is very easy to understand the income and expense of the college bus. This application is mainly develop for windows operating system users. In 2017, 73% of people enterprises are using windows operating system. So the application will easily install for all the windows operating system users. The application-developed size is very low. The application consumes very low space in disk. Therefore, the user can allocate very minimum local disk space for this application.
1. CSCD34 - Data Management Systems - A. Vaisman 1
Overview of Storage and Indexing
2. CSCD34 - Data Management Systems - A. Vaisman 2
Data on External Storage
Disks: Can retrieve random page at fixed cost
But reading several consecutive pages is much cheaper than
reading them in random order
Tapes: Can only read pages in sequence
Cheaper than disks; used for archival storage
File organization: Method of arranging a file of records
on external storage.
Record id (rid) is sufficient to physically locate record
Indexes are data structures that allow us to find the record ids
of records with given values in index search key fields
Architecture: Buffer manager stages pages from external
storage to main memory buffer pool. File and index
layers make calls to the buffer manager. Page: typically
4 Kbytes.
3. CSCD34 - Data Management Systems - A. Vaisman 3
Alternative File Organizations
Many alternatives exist, each ideal for some
situations, and not so good in others:
Heap (random order) files: Suitable when typical
access is a file scan retrieving all records.
Sorted Files: Best if records must be retrieved in
some order, or only a `range’ of records is needed.
Indexes: Data structures to organize records via
trees or hashing.
• Like sorted files, they speed up searches for a subset of
records, based on values in certain (“search key”) fields
• Updates are much faster than in sorted files.
4. CSCD34 - Data Management Systems - A. Vaisman 4
Indexes
An index on a file speeds up selections on the
search key fields for the index.
Any subset of the fields of a relation can be the
search key for an index on the relation.
Search key is not the same as key (minimal set of
fields that uniquely identify a record in a relation).
An index contains a collection of data entries,
and supports efficient retrieval of all data
entries k* with a given key value k.
5. CSCD34 - Data Management Systems - A. Vaisman 5
Index Classification
Primary vs. secondary: If search key contains
primary key, then called primary index.
Unique index: Search key contains a candidate key.
Clustered vs. unclustered: If order of data records
is the same as order of data entries, then called
clustered index.
A file can be clustered on at most one search key.
Cost of retrieving data records through index varies
greatly based on whether index is clustered or not!
6. CSCD34 - Data Management Systems - A. Vaisman 6
Index Classification
Dense vs Sparse: If there is an entry in the index
for each key value -> dense index (unclustered
indices are dense). If there is an entry for each
page -> sparse index.
1
5
..
..
1 Brown ..
2 Smith..
3 White ..
4 Yu ..
6 Peterson..
7 Rhodes..
5 Chen ..
………..
Brown
Chen
Peterson
Rhodes
Smith
Yu
White
7. CSCD34 - Data Management Systems - A. Vaisman 7
Clustered vs. Unclustered Index
To build clustered index, first sort the Heap file (with
some free space on each page for future inserts).
Overflow pages may be needed for inserts. (Thus, order of
data recs is `close to’, but not identical to, the sort order.)
Index entries
Data entries
direct search for
(Index File)
(Data file)
Data Records
data entries
Data entries
Data Records
CLUSTERED UNCLUSTERED
8. CSCD34 - Data Management Systems - A. Vaisman 8
Example B+ Tree
Good for range queries.
Insert/delete: Find data entry in leaf, then
change it. Need to adjust parent sometimes.
All leaves at he same height.
2* 3*
Root
17
30
14* 16* 33* 34* 38* 39*
135
7*5* 8* 22* 24*
27
27* 29*
Entries <= 17 Entries > 17
9. CSCD34 - Data Management Systems - A. Vaisman 9
Hash-Based Indexes
Good for equality selections.
•Index is a collection of buckets. Bucket = primary
page plus zero or more overflow pages.
•Hashing function h: h(r) = bucket in which
record r belongs. h looks at the search key fields
of r.
Buckets may contain the data records or just
the rids.
Hash-based indexes are best for equality
selections. Cannot support range searches
10. CSCD34 - Data Management Systems - A. Vaisman 10
Static Hashing
# primary pages fixed, allocated sequentially, never de-allocated;
overflow pages if needed.
h(k) mod N = bucket to which data entry with key k belongs. (N =
# of buckets)
Long overflow chains can develop and degrade performance.
Extendible and Linear Hashing: Dynamic techniques to fix this.
h(key) mod N
h
key
Primary bucket pages Overflow pages
2
0
N-1
11. CSCD34 - Data Management Systems - A. Vaisman 11
Static Hashing (Contd.)
Buckets contain data entries.
Hash fn works on search key field of record r. Must
distribute values over range 0 ... M-1.
h(key) = (a * key + b) usually works well.
a and b are constants; lots known about how to tune h.
12. CSCD34 - Data Management Systems - A. Vaisman 12
Cost Model for Our Analysis
We ignore CPU costs, for simplicity:
B: The number of data pages
R: Number of records per page
D: (Average) time to read or write disk page
Measuring number of page I/O’s ignores gains of
pre-fetching a sequence of pages; thus, even I/O
cost is only approximated.
Average-case analysis; based on several simplistic
assumptions.
Good enough to show the overall trends!
13. CSCD34 - Data Management Systems - A. Vaisman 13
Comparing File Organizations
Heap files (random order; insert at eof)
Sorted files, sorted on <age, sal>
Clustered B+ tree file, Alternative (1), search
key <age, sal>
Heap file with unclustered B + tree index on
search key <age, sal>
Heap file with unclustered hash index on
search key <age, sal>
14. CSCD34 - Data Management Systems - A. Vaisman 14
Operations to Compare
Scan: Fetch all records from disk
Equality search
Range selection
Insert a record
Delete a record
15. CSCD34 - Data Management Systems - A. Vaisman 15
Cost of Operations
(a) Scan (b) Equality (c ) Range (d) Insert (e) Delete
(1) Heap BD 0.5BD BD 2D Search
+D
(2) Sorted BD Dlog 2B Dlog 2 B +
# matches
Search
+ BD
Search
+BD
(3) Clustered 1.5BD Dlog F 1.5B Dlog 2 1.5B
+ # matches
Search
+ D
Search
+D
(4) Unclustered
Tree index
BD(R+0.15) D(1 +log F
0.15B)
Dlog F
0.15B
+ # matches
D(3 +log F
0.15B)
Search
+ 2D
(5) Unclustered
Hash index
BD(R+0.1
25)
2D BD 4D Search
+ 2D
Several
assumptions underlie
these (rough)
estimates!
B: The number of data pages
R: Number of records per page
D: (Average) time to read or write disk page
16. CSCD34 - Data Management Systems - A. Vaisman 16
Choice of Indexes
What indexes should we create?
One approach: Consider the most important queries
in turn. Consider the best plan using the current
indexes, and see if a better plan is possible with an
additional index. If so, create it.
Obviously, this implies that we must understand how a
DBMS evaluates queries and creates query evaluation plans!
For now, we discuss simple 1-table queries.
Before creating an index, must also consider the
impact on updates in the workload!
Trade-off: Indexes can make queries go faster, updates
slower. Require disk space, too.
17. CSCD34 - Data Management Systems - A. Vaisman 17
Index Selection Guidelines
Attributes in WHERE clause are candidates for index keys.
Exact match condition suggests hash index.
Range query suggests tree index.
• Clustering is especially useful for range queries; can also help on
equality queries if there are many duplicates.
Multi-attribute search keys should be considered when a
WHERE clause contains several conditions.
Try to choose indexes that benefit as many queries as
possible. Since only one index can be clustered per relation,
choose it based on important queries that would benefit the
most from clustering.
18. CSCD34 - Data Management Systems - A. Vaisman 18
Examples of Clustered Indexes
B+ tree index on E.age can be
used to get qualifying tuples.
How selective is the condition?
Is the index clustered?
Consider the GROUP BY query.
If many tuples have E.age > 10, using
E.age index and sorting the retrieved
tuples may be costly.
Clustered E.dno index may be better!
Equality queries and duplicates:
Clustering on E.hobby helps!
SELECT E.dno
FROM Emp E
WHERE E.age>40
SELECT E.dno, COUNT (*)
FROM Emp E
WHERE E.age>10
GROUP BY E.dno
SELECT E.dno
FROM Emp E
WHERE E.hobby=Stamps
19. CSCD34 - Data Management Systems - A. Vaisman 19
Indexes with Composite Search Keys
Composite Search Keys: Search
on a combination of fields.
Equality query: Every field
value is equal to a constant
value. E.g. wrt <sal,age> index:
• age=20 and sal =75
Range query: Some field value
is not a constant. E.g.:
• age =20; or age=20 and sal > 10
Data entries in index sorted
by search key to support
range queries.
Order or attributes is
relevant.
sue 13 75
bob
cal
joe 12
10
20
8011
12
name age sal
<sal, age>
<age, sal> <age>
<sal>
12,20
12,10
11,80
13,75
20,12
10,12
75,13
80,11
11
12
12
13
10
20
75
80
Data records
sorted by name
Data entries in index
sorted by <sal,age>
Data entries
sorted by <sal>
Examples of composite key
20. CSCD34 - Data Management Systems - A. Vaisman 20
Composite Search Keys
To retrieve Emp records with age=30 AND sal=4000,
an index on <age,sal> would be better than an index
on age or an index on sal.
Choice of index key orthogonal to clustering etc.
If condition is: 20<age<30 AND 3000<sal<5000:
Clustered tree index on <age,sal> or <sal,age> is best.
If condition is: age=30 AND 3000<sal<5000:
Clustered <age,sal> index much better than <sal,age>
index!
Composite indexes are larger, updated more often.
21. CSCD34 - Data Management Systems - A. Vaisman 21
Summary (Contd.)
Data entries can be actual data records, <key,
rid> pairs, or <key, rid-list> pairs.
Choice orthogonal to indexing technique used to
locate data entries with a given key value.
Can have several indexes on a given file of
data records, each with a different search key.
Indexes can be classified as clustered vs.
unclustered, primary vs. secondary, and
dense vs. sparse. Differences have important
consequences for utility/performance.
22. CSCD34 - Data Management Systems - A. Vaisman 22
Summary (Contd.)
Understanding the nature of the workload for the
application, and the performance goals, is essential
to developing a good design.
What are the important queries and updates? What
attributes/relations are involved?
Indexes must be chosen to speed up important
queries (and perhaps some updates!).
Index maintenance overhead on updates to key fields.
Choose indexes that can help many queries, if possible.
Build indexes to support index-only strategies.
Clustering is an important decision; only one index on a
given relation can be clustered!
Order of fields in composite index key can be important.