Introduction to Query Processing and Optimization
Michael L. Rupley, Jr.
Indiana University at South Bend
mrupleyj@iusb.edu

ABSTRACT

All database systems must be able to respond to requests for information from the user, i.e. process queries. Obtaining the desired information from a database system in a predictable and reliable fashion is the scientific art of Query Processing. Getting these results back in a timely manner deals with the technique of Query Optimization. This paper will introduce the reader to the basic concepts of query processing and query optimization in the relational database domain. How a database processes a query, as well as some of the algorithms and rule-sets utilized to produce more efficient queries, will also be presented. In the last section, I will discuss the implementation plan to extend the capabilities of my Mini-Database Engine program to include some of the query optimization techniques and algorithms covered in this paper.

1. INTRODUCTION

Query processing and optimization is a fundamental, if not critical, part of any DBMS. To be utilized effectively, the results of queries must be available in the timeframe needed by the submitting user—be it a person, a robotic assembly machine or even another distinct and separate DBMS. How a DBMS processes queries and the methods it uses to optimize their performance are topics that will be covered in this paper.

In certain sections of this paper, various concepts will be illustrated with reference to an example database of vehicles and drivers. Each vehicle and driver is unique, each vehicle can have 0 or more drivers but only one owner, and a driver can own and drive multiple vehicles. There are 3 relations, vehicles, drivers and car_driver, with the following attributes:

The vehicles relation:
  Attribute     Length  Key?
  vehicle_id    15      Yes
  make          20      No
  model         20      No
  year          4       No
  owned_by      10      No

The drivers relation:
  Attribute     Length  Key?
  driver_id     10      Yes
  first_name    20      No
  last_name     20      No
  age           2       No

The car_driver relation:
  Attribute     Length  Key?
  cd_car_name   15      Yes*
  cd_driver_id  10      Yes*

2. WHAT IS A QUERY?

A database query is the vehicle for instructing a DBMS to update or retrieve specific data to/from the physically stored medium. The actual updating and retrieval of data is performed through various "low-level" operations. Examples of such operations for a relational DBMS are relational algebra operations such as project, join, select, Cartesian product, etc. While the DBMS is designed to process these low-level operations efficiently, it can be quite a burden for a user to submit requests to the DBMS in these formats. Consider the following request:

  "Give me the vehicle ids of all Chevrolet Camaros built in the year 1977."

While this is easily understandable by a human, a DBMS must be presented with a format it can understand, such as this SQL statement:

  select vehicle_id
  from   vehicles
  where  make = "Chevrolet" and model = "Camaro" and year = 1977

Note that this SQL statement will still need to be translated further by the DBMS so that the functions/methods within the DBMS program can not only process the request, but do it in a timely manner.
3. THE QUERY PROCESSOR

There are three phases [12] that a query passes through during the DBMS' processing of that query:

  1. Parsing and translation
  2. Optimization
  3. Evaluation

Most queries submitted to a DBMS are in a high-level language such as SQL. During the parsing and translation stage, the human-readable form of the query is translated into forms usable by the DBMS. These can be in the form of a relational algebra expression, query tree or query graph. Consider the following SQL query:

  select make
  from   vehicles
  where  make = "Ford"

This can be translated into either of the following relational algebra expressions:

  σ make="Ford" (π make (vehicles))
  π make (σ make="Ford" (vehicles))

Each of these can also be represented as a query tree:

  σ make="Ford"          π make
        |                   |
     π make           σ make="Ford"
        |                   |
    vehicles            vehicles

[Query graph: the relation node vehicles is connected to the constant "Ford" through the condition make = "Ford".]

After parsing and translation into a relational algebra expression, the query is then transformed into a form, usually a query tree or graph, that can be handled by the optimization engine. The optimization engine then performs various analyses on the query data, generating a number of valid evaluation plans. From there, it determines the most appropriate evaluation plan to execute. After the evaluation plan has been selected, it is passed into the DBMS' query-execution engine [12] (also referred to as the runtime database processor [5]), where the plan is executed and the results are returned.

3.1 Parsing and Translating the Query

The first step in processing a query submitted to a DBMS is to convert the query into a form usable by the query processing engine. High-level query languages such as SQL represent a query as a string, or sequence, of characters. Certain sequences of characters represent various types of tokens such as keywords, operators, operands, literal strings, etc. Like all languages, there are rules (syntax and grammar) that govern how the tokens can be combined into understandable (i.e. valid) statements.

The primary job of the parser is to extract the tokens from the raw string of characters and translate them into the corresponding internal data elements (i.e. relational algebra operations and operands) and structures (i.e. query tree, query graph). The last job of the parser is to verify the validity and syntax of the original query string.

3.2 Optimizing the Query

In this stage, the query processor applies rules to the internal data structures of the query to transform these structures into equivalent, but more efficient, representations. The rules can be based upon mathematical models of the relational algebra expression and tree (heuristics), upon cost estimates of different algorithms applied to operations, or upon the semantics within the query and the relations it involves. Selecting the proper rules to apply, when to apply them and how they are applied is the function of the query optimization engine.

3.3 Evaluating the Query

The final step in processing a query is the evaluation phase. The best evaluation plan candidate generated by the optimization engine is selected and then executed. Note that there can exist multiple methods of executing a query. Besides processing a query in a simple sequential manner, some of a query's individual operations can be processed in parallel—either as independent processes or as interdependent pipelines of processes or threads. Regardless of the method chosen, the actual results should be the same.
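To make the query-tree representation more concrete, the following is a small C# sketch of how σ and π nodes for the make = "Ford" query above might be modeled. It is illustrative only; the class and member names are assumptions made for this example and are not part of any DBMS described in this paper.

  using System;
  using System.Collections.Generic;

  // Minimal, illustrative query-tree node types.
  abstract class QueryNode
  {
      public List<QueryNode> Children { get; } = new List<QueryNode>();
  }

  class RelationNode : QueryNode              // leaf: a base relation
  {
      public string Name;
      public RelationNode(string name) { Name = name; }
  }

  class SelectNode : QueryNode                // σ condition (child)
  {
      public string Condition;
      public SelectNode(string condition, QueryNode child)
      { Condition = condition; Children.Add(child); }
  }

  class ProjectNode : QueryNode               // π attributeList (child)
  {
      public string[] Attributes;
      public ProjectNode(string[] attrs, QueryNode child)
      { Attributes = attrs; Children.Add(child); }
  }

  static class QueryTreeDemo
  {
      static void Main()
      {
          // Builds the tree for: π make (σ make="Ford" (vehicles))
          var tree = new ProjectNode(new[] { "make" },
                         new SelectNode("make = \"Ford\"",
                             new RelationNode("vehicles")));
          Console.WriteLine(tree.Attributes[0]);   // prints: make
      }
  }

An evaluation plan is then produced by walking such a tree from the leaves upward, as discussed further in section 7.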
4. QUERY METRICS: COST

The execution time of a query depends on the resources needed to perform the required operations: disk accesses, CPU cycles, RAM and, in the case of parallel and distributed systems, thread and process communication (which will not be considered in this paper). Since data transfer to/from disk is substantially slower than memory-based transfers, disk accesses usually represent an overwhelming majority of the total cost—particularly for very large databases that cannot be pre-loaded into memory. With today's computers, the CPU cost can also be insignificant compared to disk access for many operations.

The cost to access a disk is usually measured in terms of the number of blocks transferred from and to the disk, which will be the unit of measure referred to in the remainder of this paper.
5. THE ROLE OF INDEXES

The utilization of indexes can dramatically reduce the execution time of various operations such as select and join. Let us review some of the types of index file structures and the roles they play in reducing execution time and overhead:

Dense Index: The data file is ordered by the search key and every search key value has a separate index record. This structure requires only a single seek to find the first occurrence of a set of contiguous records with the desired search value.

Sparse Index: The data file is ordered by the index search key and only some of the search key values have corresponding index records. Each index record's data-file pointer points to the first data-file record with the search key value. While this structure can be less efficient (in terms of the number of disk accesses) than a dense index for finding the desired records, it requires less storage space and less overhead during insertion and deletion operations.

Primary Index: The data file is ordered by the attribute that is also the search key in the index file. Primary indices can be dense or sparse. This is also referred to as an Index-Sequential File [5]. For scanning through a relation's records in sequential order by a key value, this is one of the fastest and most efficient structures: locating a record has a cost of 1 seek, and the contiguous makeup of the records in sorted order minimizes the number of blocks that have to be read. However, after large numbers of insertions and deletions the performance can degrade quite quickly, and the only way to restore it is to perform a reorganization.

Secondary Index: The data file is ordered by an attribute that is different from the search key in the index file. Secondary indices must be dense.

Multi-Level Index: An index structure consisting of 2 or more tiers of records, where an upper tier's records point to associated index records in the tier below. The bottom tier's index records contain the pointers to the data-file records. Multi-level indices can be used, for instance, to reduce the number of disk block reads needed during a binary search.

Clustering Index: A two-level index structure where the records in the first level contain the clustering field value in one field and a second field pointing to a block [of second-level records] in the second level. The records in the second level have one field that points to an actual data-file record or to another second-level block.

B+-tree Index: A multi-level index with a balanced-tree structure. The cost of finding a search key value in a B+-tree is proportional to the height of the tree, which grows only logarithmically with the number of records. While this, on average, is more than a single-level, dense index that requires only one seek, the B+-tree structure has a distinct advantage in that it does not require reorganization: it is self-optimizing because the tree is kept balanced during insertions and deletions. Many mission-critical applications require high performance with near-100% uptime, which cannot be achieved with structures requiring reorganization. The leaves of the B+-tree can also be used to organize the data file itself.

6. QUERY ALGORITHMS

Queries are ultimately reduced to a number of file scan operations on the underlying physical file structures. For each relational operation, there can exist several different access paths to the particular records needed. The query execution engine can have a multitude of specialized algorithms designed to process particular combinations of relational operation and access path. We will look at some examples of algorithms for both the select and join operations.
6.1 Selection Algorithms

The select operation must search through the data files for records meeting the selection criteria. The following are some examples of simple (one-attribute) selection algorithms [13]:

S1. Linear search: Every record from the file is read and compared to the selection criteria. The execution cost for searching on a non-key attribute is br, where br is the number of blocks in the file representing relation r. On a key attribute, the average cost is br / 2, with a worst case of br.

S2. Binary search on primary key: A binary search, on equality, performed on a primary key attribute (with the file ordered by the key) has a worst-case cost of ⌈lg(br)⌉. This can be significantly more efficient than the linear search, particularly for a large number of records.

S3. Search using a primary index on equality: With a B+-tree index, an equality comparison on a key attribute will have a worst-case cost of the height of the tree (in the index file) plus one to retrieve the record from the data file. An equality comparison on a non-key attribute will be the same, except that multiple records may meet the condition, in which case we add the number of blocks containing the records to the cost.

S4. Search using a primary index on comparison: When the comparison operators (<, ≤, >, ≥) are used to retrieve multiple records from a file sorted by the search attribute, the first record satisfying the condition is located and the total number of blocks before (<, ≤) or after (>, ≥) it is added to the cost of locating the first record.

S5. Search using a secondary index on equality: Retrieve one record with an equality comparison on a key attribute, or retrieve a set of records on a non-key attribute. For a single record, the cost will be equal to the cost of locating the search key in the index file plus one for retrieving the data record. For multiple records, the cost will be equal to the cost of locating the search key in the index file plus one block access for each data record retrieved, since the data file is not ordered on the search attribute.
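To tie the cost formulas above together, here is a small C# sketch (illustrative only; the names and the block count are assumptions) that evaluates the S1 and S2 estimates for a relation occupying a given number of blocks:

  using System;

  static class SelectionCost
  {
      // Estimated block accesses for the simple selection algorithms above.
      static double LinearSearchNonKey(long br) => br;          // S1, non-key attribute: full scan
      static double LinearSearchKeyAvg(long br) => br / 2.0;    // S1, key attribute: average case
      static double BinarySearchKey(long br)                    // S2, file ordered by the key
          => Math.Ceiling(Math.Log(br, 2));

      static void Main()
      {
          long br = 1_000;   // hypothetical block count for relation r
          Console.WriteLine($"S1 (non-key): {LinearSearchNonKey(br)} blocks");
          Console.WriteLine($"S1 (key, average): {LinearSearchKeyAvg(br)} blocks");
          Console.WriteLine($"S2 (binary search): {BinarySearchKey(br)} blocks");
      }
  }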
6.2 Join Algorithms

Like selection, the join operation can be implemented in a variety of ways. In terms of disk accesses, join operations can be very expensive, so implementing and utilizing efficient join algorithms is critical in minimizing a query's execution time. The following are 4 well-known types of join algorithms:

J1. Nested-Loop Join: This algorithm consists of an inner for loop nested within an outer for loop. To illustrate this algorithm, we will use the following notation:

  r, s   Relations r and s
  tr     Tuple (record) in relation r
  ts     Tuple (record) in relation s
  nr     Number of records in relation r
  ns     Number of records in relation s
  br     Number of blocks with records in relation r
  bs     Number of blocks with records in relation s

Here is a sample pseudo-code listing for joining the two relations r and s utilizing the nested for loop [12]:

  for each tuple tr in r
      for each tuple ts in s
          if join condition is true for (tr, ts)
              add tr + ts to the result

Each record in the outer relation r is scanned once, and each record in the inner relation s is scanned nr times, resulting in nr * ns total record scans. If only one block of each relation can fit into memory, then the cost (number of block accesses) is nr * bs + br [12]. If all blocks in both relations can fit into memory, then the cost is br + bs [12]. If all of the blocks in relation s (the inner relation) can fit into memory, then the cost is identical to both relations fitting in memory: br + bs [12]. Thus, if one of the relations can fit entirely in memory, it is advantageous for the query optimizer to select that relation as the inner one.

Even though the worst case for the nested-loop join is quite expensive, it has an advantage in that it does not impose any restrictions on the access paths for either relation, regardless of the join condition.
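The pseudo-code for J1 translates almost directly into running code. Below is a minimal C# sketch of a tuple-at-a-time nested-loop join, under the assumption that tuples are modeled as attribute-name/value dictionaries and the join condition is equality on one attribute from each relation; the names and record layout are illustrative, not the paper's engine.

  using System;
  using System.Collections.Generic;

  static class NestedLoopJoin
  {
      static List<Dictionary<string, object>> Join(
          IEnumerable<Dictionary<string, object>> r,
          IEnumerable<Dictionary<string, object>> s,
          string rAttr, string sAttr)
      {
          var result = new List<Dictionary<string, object>>();
          foreach (var tr in r)                        // outer relation scanned once
              foreach (var ts in s)                    // inner relation scanned nr times
                  if (Equals(tr[rAttr], ts[sAttr]))    // join condition
                  {
                      // add tr + ts to the result (concatenate the two tuples)
                      var joined = new Dictionary<string, object>(tr);
                      foreach (var kv in ts) joined["s." + kv.Key] = kv.Value;
                      result.Add(joined);
                  }
          return result;
      }

      static void Main()
      {
          var vehicles = new[] {
              new Dictionary<string, object> { ["vehicle_id"] = "V1", ["owned_by"] = "D1" } };
          var drivers = new[] {
              new Dictionary<string, object> { ["driver_id"] = "D1", ["last_name"] = "Smith" } };
          Console.WriteLine(Join(vehicles, drivers, "owned_by", "driver_id").Count);  // prints: 1
      }
  }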
J2. Index Nested-Loop Join: This algorithm is the same as the Nested-Loop Join, except that an index file on the inner relation's (s) join attribute is used instead of a data-file scan on s; each index lookup in the inner loop is essentially an equality selection on s utilizing one of the selection algorithms (e.g. S2, S3, S5). Let c be the cost of the lookup; then the worst-case cost of joining r and s is br + nr * c [12].

J3. Sort-Merge Join: This algorithm can be used to perform natural joins and equi-joins and requires that each relation (r and s) be sorted by the common attributes between them (R ∩ S) [12]. The details of how this algorithm works can be found in [5] and [12] and will not be presented here. However, it is notable that each record in r and s is scanned only once, thus producing a worst- and best-case cost of br + bs [12]. Variations of the Sort-Merge Join algorithm are used, for instance, when the data files are in unsorted order but there exist secondary indices for the two relations.

J4. Hash Join: Like the sort-merge join, the hash join algorithm can be used to perform natural joins and equi-joins [12]. The hash join utilizes two hash table file structures (one for each relation) to partition each relation's records into sets containing identical hash values on the join attributes. Each relation is scanned and its corresponding hash table on the join attribute values is built. Note that collisions may occur, resulting in some of the partitions containing different sets of records with matching join attribute values. After the two hash tables are built, for each matching partition in the hash tables, an in-memory hash index of the smaller relation's (the build relation's) records is built, and a nested-loop join is performed against the corresponding records in the other relation, writing out to the result for each join.

Note that the above works only if the required amount of memory is available to hold the hash index and the records in any partition of the build relation. If not, then a process known as recursive partitioning is performed; see [5] or [12] for details.

The cost for the hash join without recursive partitioning is 3(br + bs) + 4nh, where nh is the number of partitions in the hash table [12]. The cost for the hash join with recursive partitioning is 2(br + bs)⌈log M−1 (bs) − 1⌉ + br + bs, where M is the number of memory blocks used.
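To illustrate the build/probe idea behind J4, here is a simplified C# sketch of a purely in-memory hash join. It omits the partitioned hash-table files and recursive partitioning described above, and the type and method names are assumptions made for the example.

  using System;
  using System.Collections.Generic;

  static class HashJoin
  {
      // Build a hash index on the smaller (build) relation,
      // then probe it with each tuple of the larger relation.
      static List<Dictionary<string, object>> Join(
          List<Dictionary<string, object>> build, string buildAttr,
          List<Dictionary<string, object>> probe, string probeAttr)
      {
          // Build phase: hash the build relation on its join attribute.
          var index = new Dictionary<object, List<Dictionary<string, object>>>();
          foreach (var t in build)
          {
              if (!index.TryGetValue(t[buildAttr], out var bucket))
                  index[t[buildAttr]] = bucket = new List<Dictionary<string, object>>();
              bucket.Add(t);
          }

          // Probe phase: each probe tuple is matched only against its own bucket.
          var result = new List<Dictionary<string, object>>();
          foreach (var tp in probe)
              if (index.TryGetValue(tp[probeAttr], out var bucket))
                  foreach (var tb in bucket)
                  {
                      var joined = new Dictionary<string, object>(tb);
                      foreach (var kv in tp) joined["probe." + kv.Key] = kv.Value;
                      result.Add(joined);
                  }
          return result;
      }

      static void Main()
      {
          var drivers = new List<Dictionary<string, object>> {
              new Dictionary<string, object> { ["driver_id"] = "D1", ["last_name"] = "Smith" } };
          var vehicles = new List<Dictionary<string, object>> {
              new Dictionary<string, object> { ["vehicle_id"] = "V1", ["owned_by"] = "D1" } };
          Console.WriteLine(Join(drivers, "driver_id", vehicles, "owned_by").Count);  // prints: 1
      }
  }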
7. QUERY OPTIMIZATION

The function of a DBMS' query optimization engine is to find an evaluation plan that reduces the overall execution cost of a query. We have seen in the previous sections that the costs of performing particular operations such as select and join can vary quite dramatically. As an example, consider 2 relations r and s with the following characteristics:

  nr = 10,000   Number of tuples in r
  ns =  1,000   Number of tuples in s
  br =  1,000   Number of blocks with tuples in r
  bs =    100   Number of blocks with tuples in s

Selecting a single record from r on a key attribute can have a cost of ⌈lg(br)⌉ = 10 block accesses (binary search) or an average cost of br / 2 = 500 (linear search). Joining r and s can have a cost of nr * bs + br = 1,001,000 (nested-loop join) [13] or a cost of 3(br + bs) + 4nh = 43,300 (hash join, where nh = 10,000) [13].

Notice that the costs of the two selects differ by a factor of 50, and the two joins by a factor of roughly 23. Clearly, selecting lower-cost methods can result in substantially better performance. This process of selecting a lower-cost mechanism is known as cost-based optimization. Other strategies for lowering the execution time of queries include heuristic-based optimization and semantic-based optimization.
In heuristic-based optimization, mathematical rules are applied to the components of the query to generate an evaluation plan that, theoretically, will result in a lower execution time. Typically, these components are the data elements within an internal data structure, such as a query tree, that the query parser has generated from a higher-level representation of the query (i.e. SQL).

The internal nodes of a query tree represent specific relational algebra operations to be performed on the relation(s) passed into them from the child node(s) directly below. The leaves of the tree are the relation(s). The tree is evaluated from the bottom up, creating a specific evaluation plan. In section 3, we saw that a query's query tree can be constructed in multiple, equivalent ways. In many instances, there will be at least one of these equivalent trees that produces a faster, "optimized" execution plan. Section 7.2 will illustrate this concept.

Another way of optimizing a query is semantic-based query optimization. In many cases, the data within and between relations contain "rules" and patterns that are based upon "real-world" situations that the DBMS does not "know" about. For example, vehicles like the DeLorean were not made after 1990, so a query like "Retrieve all vehicles with make equal to DeLorean and year > 2000" will produce zero records. Injecting these types of semantic rules into a DBMS can thus further reduce a query's execution time.

7.1 Statistics of Expression Results

In order to estimate the various costs of query operations, the query optimizer utilizes a fairly extensive amount of metadata associated with the relations and their corresponding file structures. These data are collected during and after various database operations (such as queries) and stored in the DBMS catalog. These data include [5, 12]:

  nr   Number of records (tuples) in a relation r. Knowing the number of records in a relation is a critical piece of data utilized in nearly all cost estimations of operations.

  fr   Blocking factor (number of records per block) for relation r. This value is used in calculating the number of blocks, br, and is also useful in determining the proper size and number of memory buffers.

  br   Number of blocks in relation r's data file. Also a critical and commonly used datum, br is a calculated value equal to nr / fr.

  lr   Length of a record, in bytes, in relation r. The record size is another important data item used in many operations, particularly when the values differ significantly for two relations involved in an operation. For variable-length records, the actual length value used—either the average or the maximum—depends on the type of operation to be performed.

  dAr  Number of distinct values of attribute A in relation r. This value is important in calculating the number of resulting records for a projection operation and for aggregate functions like sum, count and average.

  x    Number of levels in a multi-level index (B+-tree, cluster index, etc.). This data item is used in estimating the number of block accesses needed in various search algorithms. Note that for a B+-tree, x will be equal to the height of the tree.

  sA   Selection cardinality of an attribute A. This is a calculated value equal to nr / dAr; when A is a key attribute, sA = 1. The selection cardinality allows the query optimizer to determine the "average number of records that will satisfy an equality selection condition on that attribute" [5].

The query optimizer also depends on other important data such as the ordering of the data file, the type of index structures available and the attributes involved in these file organization structures. Knowing whether certain access structures exist allows the query optimizer to select the appropriate algorithm(s) to use for particular operations.
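As a small illustration of how these catalog statistics are used, the following C# sketch derives br and sA from stored values for a hypothetical relation; the structure, names and numbers are assumptions for the example, not part of the paper's engine.

  using System;

  // Illustrative catalog entry holding the per-relation statistics described above.
  class RelationStats
  {
      public long Nr;    // number of records (tuples) in the relation
      public int  Fr;    // blocking factor: records per block
      public long DAr;   // number of distinct values of attribute A

      // br: number of blocks in the data file, derived from nr and fr.
      public long Br => (long)Math.Ceiling((double)Nr / Fr);

      // sA: selection cardinality, the average number of records expected
      // to satisfy an equality condition on attribute A (1 for a key attribute).
      public double SA => (double)Nr / DAr;
  }

  static class StatsDemo
  {
      static void Main()
      {
          var vehicles = new RelationStats { Nr = 10_000, Fr = 20, DAr = 50 };
          Console.WriteLine($"br = {vehicles.Br}, sA = {vehicles.SA}");   // br = 500, sA = 200
      }
  }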
7.2 Expression and Tree Transformations

After a high-level query (i.e. SQL statement) has been parsed into an equivalent relational algebra expression, the query optimizer can apply heuristic rules to the expression and tree in order to transform them into equivalent, but optimized, forms. As an example, consider the following SQL query:

  select first_name, last_name
  from   drivers, vehicles
  where  make = "Chevrolet" and owned_by = driver_id

A corresponding relational algebra expression is:

  π first_name,last_name (σ make="Chevrolet" (σ owned_by=driver_id (vehicles X drivers)))

[Canonical query tree: π first_name,last_name at the root, over σ make="Chevrolet", over σ owned_by=driver_id, over the Cartesian product (X) of vehicles and drivers.]

Suppose the vehicles and drivers relations both have 10,000 records each, and the number of Chevrolet vehicles is 5,000. Note that the Cartesian product, resulting in 100,000,000 records, can be reduced by 50% if the σ make="Chevrolet" operation is performed first. We can also combine the σ owned_by=driver_id and Cartesian product operations into a more efficient join operation,
as well as eliminate any unneeded columns before the expensive join is performed. The diagram below shows this better, "optimized" version of the tree:

[Optimized query tree: π first_name,last_name at the root, over a join on owned_by = driver_id; the left branch is π make,owned_by over σ make="Chevrolet" over vehicles, and the right branch is π first_name,last_name,driver_id over drivers.]

In relational algebra, there are several definitions and theorems the query optimizer can use to transform the query. For instance, the definition of equivalent relations states that the set of attributes (domain) of each relation must be the same—because they are sets, the order does not matter. Here is a partial list of relational algebra theorems from the Elmasri/Navathe textbook [5]:

1. Cascade of σ: A select with conjunctive conditions on the attribute list is equivalent to a cascade of selects upon selects:
   σ A1∧A2∧...∧An (R) ≡ σ A1 (σ A2 (...(σ An (R))...))

2. Commutativity of σ: The select operation is commutative:
   σ A1 (σ A2 (R)) ≡ σ A2 (σ A1 (R))

3. Cascade of π: A cascade of project operations is equivalent to the last project operation of the cascade:
   π AList1 (π AList2 (...(π AListn (R))...)) ≡ π AList1 (R)

4. Commuting σ with π: Given a π's and σ's attribute list of A1, A2, ..., An, the π and σ operations can be commuted:
   π A1,A2,...,An (σ c (R)) ≡ σ c (π A1,A2,...,An (R))

5. Commutativity of ⋈ or X: The join and Cartesian product operations are commutative:
   R ⋈ S ≡ S ⋈ R and R X S ≡ S X R
6. Commuting σ with ⋈ or X: Select can be commuted with join (or Cartesian product) as follows:
   a. If all of the attributes in the select's condition are in relation R, then σ c (R ⋈ S) ≡ (σ c (R)) ⋈ S
   b. Given a select condition c composed of conditions c1 and c2, where c1 contains only attributes from R and c2 contains only attributes from S, then σ c (R ⋈ S) ≡ (σ c1 (R)) ⋈ (σ c2 (S))

7. Commutativity of set operations (∪, ∩, −): Union and intersection operations are commutative, but the difference operation is not:
   R ∪ S ≡ S ∪ R,  R ∩ S ≡ S ∩ R,  R − S ≠ S − R

8. Associativity of ⋈, X, ∪ and ∩: All four of these operations are individually associative. Let θ be any one of these operators; then:
   (R θ S) θ T ≡ R θ (S θ T)

9. Commuting σ with set operations (∪, ∩, −): Let θ be any one of the three set operators; then:
   σ c (R θ S) ≡ (σ c (R)) θ (σ c (S))

10. Commuting π with ∪: Project and union operations can be commuted:
    π AList (R ∪ S) ≡ (π AList (R)) ∪ (π AList (S))

Using these theorems, an algorithm can be defined to transform the original query expression/tree created by the parser into a more optimized query. A detailed example of such an algorithm can be found in the Elmasri/Navathe textbook [5]—some of the key concepts can be summarized as follows:

1. One primary objective is to reduce the size of the intermediate relations, both in terms of bytes per record as well as number of records, as soon as possible, so that subsequent operations will have less data to process and thus execute quicker.

2. Operations, such as conjunctive selections, should be broken down into their equivalent set of smaller units to allow the individual units to be moved into "better" positions within the query tree.

3. Combine Cartesian products with corresponding selects to create joins—utilizing optimized join algorithms like the sort-merge join and hash join can be orders of magnitude more efficient.

4. Move selects and projects as far down the tree as possible, as these operations will produce smaller intermediate relations that can be processed more quickly by the operations above.
7.3 Choice of Evaluation Plans

The query optimization engine typically generates a set of candidate evaluation plans. Some will, in heuristic theory, produce a faster, more efficient execution. Others may, by prior historical results, be more efficient than the theoretical models—this can very well be the case for queries dependent on the semantic nature of the data to be processed. Still others can be more efficient due to "outside agencies" such as network congestion, competing applications on the same CPU, etc. Thus, a plethora of data can exist from which the query execution engine can probe for the best evaluation plan to execute at any given time.

10. CONCLUSION

One of the most critical functional requirements of a DBMS is its ability to process queries in a timely manner. This is particularly true for very large, mission-critical applications such as weather forecasting, banking systems and aeronautical applications, which can contain millions and even trillions of records. The need for faster and faster, "immediate" results never ceases. Thus, a great deal of research and resources is spent on creating smarter, highly efficient query optimization engines. Some of the basic techniques of query processing and optimization have been presented in this paper. Other, more advanced topics are the subjects of many research papers and projects. Some examples include XML query processing [3, 11], query containment [2], utilizing materialized views [13], sequence queries [9, 10] and many others.

11. EXTENDING MY MINI-DB ENGINE

My primary goal in enhancing my "mini" database engine application is to speed up the processing of queries. Before we get into the details behind the implementation plan for accomplishing this goal, let's look at what data structures, file structures and algorithms are currently in place. Only the most significant ones will be discussed.
11.1 Current Implementation

File Structures
• Record-number-ordered index. This index is "low-level" and is not "seen" by the relational algebra methods (i.e. select, project, etc.). Its purpose is to provide a primary access mechanism to the data-file records. It is also the "owner" of the state of each data-file record (i.e. active or deleted).
• Hash-based, unordered, single-level, secondary index. This index structure provides single-seek access to data-file records. Because it is unordered, only equality-based comparisons can be utilized in locating records.
• Sequential, unordered data file. Due to the fact that the current indices are unordered, this file cannot be put into a physically sorted format, even after a reorganization.
• Meta-data file. This file stores all of the information, in standard XML-formatted text, pertaining to its corresponding relation. This includes: table name; number of fields (attributes); list of fields with their name, size and type; primary key field identifier; foreign key identifiers; and a list of indices with the index name, index field name and type of index (unique/key or clustered/secondary). The access algorithm for this file is flexible enough to handle any additional data (singular or nested).

Data Structures
• Each of the file structures has a corresponding class with methods that handle file access (open, close, rename, etc.) as well as the reading, updating, inserting and deleting of records.
• Wrapping around the file structure classes is the DBRelation class, with methods that handle the creation, opening and updating of the associated files. This class also has high-level methods one would associate with single-relation operations, such as insert, delete, update and search.
• Wrapping around the DBRelation class is the Mini_Rel_Algebra class, with methods that perform the following relational algebra operations: select, project, Cartesian product, union, intersection, difference and join.

Algorithms
• DBRelation.Search(): This method performs the actual search for record(s) in a given relation. It can accept multiple search fields, conditions and search values for performing the equivalent of a conjunctive select. The search values can be either constant string values or references to a particular field's value. The search algorithm itself can operate in two ways: (1) linear search, where every record in the relation's data file is scanned and compared to the condition, and (2) index lookup, by equality, on an attribute in the condition list. The index lookup is essentially equivalent to the S5 search algorithm detailed in section 6.1. Note that if the record is found using the index, then the remaining search conditions, if any, are evaluated for the current record. If the record is not found, then the condition with the indexed attribute is false, and the remaining conditions do not have to be evaluated.

• Mini_Rel_Algebra.Join(): The current implementation of the join operation is performed by executing a select operation followed by a Cartesian product operation.

11.2 Proposed Enhancements

There are 6 main query execution speed enhancements I plan on implementing:

1. Implement a "record generator" so that a large number (>100,000) of records can be populated into the database. This will allow performance comparisons to be made between the current and new implementations.

2. Replace the current join algorithm with the J4 hash join algorithm discussed in section 6.2. I expect a very significant boost in performance, and, so that speed comparisons can be made, I will keep the old join method (renamed to OldJoin). The syntax of the new Join() call will be the same: Join(string relationName1, string relationName2, string joinField1, string joinField2).

3. Create a new HJIndex class to handle the hash table file structure that will be needed by the new hash join algorithm. Since most of the current DBIndex class' existing access methods (add, delete, modify) will most likely not need any changes (and even if they do, the changes will be minor), HJIndex will inherit from the DBIndex class.

4. Finish the implementation of the cluster index. This new index structure will allow multiple records to be retrieved utilizing the concepts of the S5 search algorithm in section 6.1.

5. Modify the DBRelation.Search() method to utilize the new cluster index.

6. Create a new DBQuery class. This will allow the user to build and execute a query consisting of a sequence of relational operations. This class will be a simplified version in that it will only handle a sequential list of operations; if time permits, I may create a "true" query tree structure. The following instance variables and methods will be implemented (see the sketch following this list):

   ArrayList opList: Instance variable holding the sequential list of relational operations.
   DBOperation opObj: Object holding a relational operation. DBOperation will be either a structure or a class that holds all of the possible parameters involved in the various types of relational operations.
   DBQuery(string queryName): The constructor method.
   bool AddOp(string opName, string param1, string param2, ...): Adds an operation object to the opList array. This method will be overloaded to handle the various parameter lists of the different relational operations. For instance, a project operation needs three parameters: string opName, string relationName, string attributeList.
   void Clear(): Clears the current query (opList.Clear()).
   string Execute(): Executes the current query. The returned string will be the name of the resulting relation.
   string ToString(): Returns a multi-line string of the current list of operations and their parameters.
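Based on the member list in item 6, a speculative C# skeleton of the planned DBQuery class might look as follows. The bodies are placeholders, since the actual implementation is future work; only the member names above come from the plan itself.

  using System;
  using System.Collections;
  using System.Text;

  // Placeholder for the structure/class holding one relational operation's parameters.
  class DBOperation
  {
      public string OpName;
      public string[] Params;
      public DBOperation(string opName, params string[] parameters)
      { OpName = opName; Params = parameters; }
  }

  class DBQuery
  {
      ArrayList opList = new ArrayList();   // sequential list of relational operations
      string queryName;

      public DBQuery(string queryName) { this.queryName = queryName; }

      // Adds an operation to the list; a real version would be overloaded
      // per relational operation type, as described above.
      public bool AddOp(string opName, params string[] parameters)
      {
          opList.Add(new DBOperation(opName, parameters));
          return true;
      }

      public void Clear() { opList.Clear(); }

      // Executes the operations in sequence; here it only returns a placeholder
      // relation name, since the real engine would invoke Mini_Rel_Algebra methods.
      public string Execute() { return queryName + "_result"; }

      public override string ToString()
      {
          var sb = new StringBuilder();
          foreach (DBOperation op in opList)
              sb.AppendLine(op.OpName + "(" + string.Join(", ", op.Params) + ")");
          return sb.ToString();
      }
  }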
11.3 Query Performance Comparisons

Time permitting, I will build a test database consisting of several thousand records. This test database will then be used to time the execution speeds of "identical" queries in the existing and new versions of the Mini DBEngine application. The test results will be compiled into a comparison table and included in the report for the final version of the application.
REFERENCES

[1] Henk Ernst Blok, Djoerd Hiemstra, Sunil Choenni, Franciska de Jong, Henk M. Blanken and Peter M.G. Apers. Predicting the cost-quality trade-off for information retrieval queries: Facilitating database design and query optimization. Proceedings of the tenth international conference on Information and knowledge management, October 2001, pages 207-214.

[2] D. Calvanese, G. De Giacomo, M. Lenzerini and M. Y. Vardi. Reasoning on Regular Path Queries. ACM SIGMOD Record, Vol. 32, No. 4, December 2003.

[3] Andrew Eisenberg and Jim Melton. Advancements in SQL/XML. ACM SIGMOD Record, Vol. 33, No. 3, September 2004.

[4] Andrew Eisenberg and Jim Melton. An Early Look at XQuery API for Java (XQJ). ACM SIGMOD Record, Vol. 33, No. 2, June 2004.

[5] Ramez Elmasri and Shamkant B. Navathe. Fundamentals of Database Systems, second edition. Addison-Wesley Publishing Company, 1994.

[6] Donald Kossmann and Konrad Stocker. Iterative Dynamic Programming: A New Class of Query Optimization Algorithms. ACM Transactions on Database Systems, Vol. 25, No. 1, March 2000, pages 43-82.

[7] Chiang Lee, Chi-Sheng Shih and Yaw-Huei Chen. A graph-theoretic model for optimizing queries involving methods. The VLDB Journal, Vol. 9, Issue 4, April 2001, pages 327-343.

[8] Hsiao-Fei Liu, Ya-Hui Chang and Kun-Mao Chao. An Optimal Algorithm for Querying Tree Structures and its Applications in Bioinformatics. ACM SIGMOD Record, Vol. 33, No. 2, June 2004.

[9] Reza Sadri, Carlo Zaniolo, Amir Zarkesh and Jafar Adibi. Expressing and Optimizing Sequence Queries in Database Systems. ACM Transactions on Database Systems, Vol. 29, Issue 2, June 2004, pages 282-318.

[10] Reza Sadri, Carlo Zaniolo, Amir Zarkesh and Jafar Adibi. Optimization of Sequence Queries in Database Systems. Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, May 2001, pages 71-81.

[11] Thomas Schwentick. XPath Query Containment. ACM SIGMOD Record, Vol. 33, No. 1, March 2004.

[12] Avi Silberschatz, Hank Korth and S. Sudarshan. Database System Concepts, 4th Edition. McGraw-Hill, 2002.

[13] Dimitri Theodoratos and Wugang Xu. Constructing Search Spaces for Materialized View Selection. Proceedings of the 7th ACM International Workshop on Data Warehousing and OLAP, November 2004, pages 112-121.

[14] Jingren Zhou and Kenneth A. Ross. Buffering Database Operations for Enhanced Instruction Cache Performance. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, June 2004, pages 191-202.