A Comprehensive Study of Non-Blocking Joining Techniques



International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367(Print), ISSN 0976 – 6375(Online), Volume 1, Number 2, Sept – Oct (2010), pp. 57-68, © IAEME, http://www.iaeme.com/ijcet.html

A COMPREHENSIVE STUDY OF NON-BLOCKING JOINING TECHNIQUES

Glory Birru, Computer Science and Engineering, Karunya University, Tamil Nadu. E-mail: Glory.Birru@live.com
Silja Varghese, Computer Science and Engineering, Karunya University, Tamil Nadu. E-mail: varghesesilja287@gmail.com
Ms. G. Hemalatha, Assistant Professor, CSE Dept., Karunya University, Coimbatore, India. E-mail: hema_latha207@yahoo.com

ABSTRACT: The huge amount of available data requires that data be stored at different locations with the least memory requirement and easy retrieval; this gave birth to databases and DBMSs. Retrieval is simple and quick when the data is stored at a single location (logical or physical); it becomes non-trivial when the data is not in one place. The technique of bringing data from different locations (here, tables) together for use is called joining. Joining has been used since the development of databases, and many techniques have since been introduced, some modifying existing ones and some taking a different approach altogether. In a real-time query-execution environment, when the number of tuples is large, it is the join that takes the most time and CPU usage.

In this paper we explain and compare the non-blocking joining techniques and their approaches. The techniques are compared on execution time, flushing policy, memory requirements, I/O complexity, and other factors that make one algorithm preferable to another in the appropriate environment. The ability
of the techniques to handle multiple inputs and continuous tuples, giving excellent results for the available resources, is of much significance.

Keywords: Blocking, Non-blocking, CPU usage, Memory usage, Execution time.

INTRODUCTION: Traditional joining techniques all make the basic assumption that the tuples or relations to be joined are available in memory before joining begins; this assumption, though simple, cannot always be met. The availability of large amounts of real-time data necessitates that joining be done as the tuples arrive. This introduces the distinction between blocking and non-blocking joining algorithms: the former requires all the input beforehand, while the latter does not. Blocking algorithms, though popular, cannot be used in real-time environments, and thus non-blocking algorithms came into existence. This paper explains some of the non-blocking techniques for joining the tuples in a relation.

1. SYMMETRIC HASH JOIN

The symmetric hash join (SHJ) is a non-blocking algorithm. The SHJ operator maintains two hash tables, one for each relation, each using a different hash function, and it supports the traditional demand-pull pipeline interface. Read a tuple from the inner relation and insert it into the inner relation's hash table, using the inner relation's hash function. Then use the new tuple to probe the outer relation's hash table for matches, using the outer relation's hash function. When probing with the inner tuple finds no more matches, read a tuple from the outer relation, insert it into the outer relation's hash table using the outer relation's hash function, and then use the outer tuple to probe the inner relation's hash table for matches, using the inner table's hash function.
These two steps are repeated until there are no more tuples to be read from either of the two input relations.

Figure 1 Symmetric hash joins
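As a concrete illustration, the insert-then-probe loop described above can be sketched in Python. This is not the paper's operator verbatim: it reads the two inputs alternately instead of through the demand-pull interface, and it uses one Python dict per relation in place of two distinct hash functions.

```python
from collections import defaultdict

def symmetric_hash_join(left, right, key_left, key_right):
    """Non-blocking symmetric hash join over two tuple iterators (sketch).

    Each arriving tuple is inserted into its own relation's hash table and
    then probes the opposite table, so results are emitted as soon as a
    matching pair is in memory.
    """
    left_table = defaultdict(list)   # hash table for the left relation
    right_table = defaultdict(list)  # hash table for the right relation
    left_it, right_it = iter(left), iter(right)
    exhausted = [False, False]
    while not all(exhausted):
        for i, (it, own, other, key) in enumerate([
                (left_it, left_table, right_table, key_left),
                (right_it, right_table, left_table, key_right)]):
            if exhausted[i]:
                continue
            t = next(it, None)
            if t is None:
                exhausted[i] = True
                continue
            k = key(t)
            own[k].append(t)            # insert into own hash table
            for m in other.get(k, []):  # probe the opposite hash table
                yield (t, m) if i == 0 else (m, t)
```

Once both iterators are exhausted, all pairs that join have been emitted, mirroring the "repeat until no more tuples" rule above.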
SHJ aims at producing its output tuples as early as possible while the join is being computed, without decreasing the performance of the join operation itself.

2. XJOIN: A REACTIVELY-SCHEDULED PIPELINED JOIN OPERATOR

XJoin is a non-blocking join operator based on the symmetric hash join algorithm. It is optimized to produce initial results quickly and to hide intermittent delays in data arrival by reactively scheduling background processing. XJoin is based on two fundamental principles:
1. It is optimized for producing results incrementally as they become available.
2. It allows progress to be made even when one or more sources experience delays.

Algorithm Details: XJoin works in three stages, run as separate threads. The first and second stages run while there are still tuples coming from either source; the third stage is a cleanup executed after all the tuples have been received. The first stage hashes tuples into partitions and then probes the complementary memory partition for a match. If the memory allocated to the join has been exhausted, tuples are flushed to disk to make room for incoming tuples.

Figure 2 Handling the partitions

If both sources become blocked, the first stage yields to the second. This stage chooses a disk partition, reads the tuples it contains into memory, and probes the corresponding memory partition of the other relation. The tuples in this disk partition cannot be discarded at this point because they may still join with inputs that have not yet
arrived. The second and third stages avoid producing spurious duplicates by keeping timestamps of when the second stage was run for a particular disk partition.

Figure 3 Memory-to-memory joins

Thus, if a tuple in a disk partition is repeatedly run against the same tuples in a memory partition, the timestamps show that the two have already been matched, and the match is dropped. The third stage is a cleanup stage: for each set of partitions, it loads all of one into memory and then streams the corresponding disk and memory partitions past it. Once all the partitions have been processed, the join is complete.

MEMORY OVERFLOW HANDLING: XJoin flushes the largest single partition, from only one source. The flushing policy affects the duplicate-detection strategy of the join algorithm, and it also affects performance in two ways:
Join output rate - the number of results generated as input is being received; this depends on the tuples in memory.
Overall execution time - the total time may change depending on the cost of flushing and post-join cleanup.

3. PROGRESSIVE MERGE JOIN: GENERIC APPROACH AND NON-BLOCKING SORT-BASED JOIN ALGORITHM

Progressive Merge Join (PMJ) is derived from sort-merge join. PMJ computes results already during the sorting phase: it sorts both input sets simultaneously and joins the data items that are in main memory at the same time, so the first result can be produced before sorting completes.
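The interleaving of sorting and joining just described can be sketched as one pass of PMJ's first phase: load a memory-full from each input, sort both subsets, and merge-join them immediately. The `memory_tuples` limit and the single-attribute equality key are illustrative assumptions, and the run-merging second phase is omitted.

```python
def pmj_first_phase(left, right, memory_tuples, key=lambda t: t[0]):
    """One memory-load of PMJ's first phase (sketch).

    Sorts up to `memory_tuples` tuples from each input, merge-joins the two
    sorted subsets in memory, and returns (results, run_left, run_right);
    the runs would be written to disk and merged later in phase two.
    """
    run_l = sorted(left[:memory_tuples], key=key)
    run_r = sorted(right[:memory_tuples], key=key)
    results, i, j = [], 0, 0
    while i < len(run_l) and j < len(run_r):
        kl, kr = key(run_l[i]), key(run_r[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # emit the cross product of the two equal-key groups
            gi = i
            while gi < len(run_l) and key(run_l[gi]) == kl:
                gi += 1
            gj = j
            while gj < len(run_r) and key(run_r[gj]) == kl:
                gj += 1
            for a in run_l[i:gi]:
                for b in run_r[j:gj]:
                    results.append((a, b))
            i, j = gi, gj
    return results, run_l, run_r
```

Repeating this per memory-load yields early results while the sorted runs accumulate for the merging phase.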
PMJ ALGORITHM

PMJ consists of two phases. In the first phase, PMJ reads as much data as possible from the two input sets into available memory. Both subsets are then sorted using an internal algorithm such as Quicksort, and the sorted sequences are joined using an in-memory join algorithm. After that, both sequences are written temporarily to external memory. PMJ continues loading subsets into memory from the remaining input, sorting and joining them, until the input is completely processed. In the second phase, PMJ generates longer runs by merging the sequences that were temporarily written to external memory.

MEMORY OVERFLOW HANDLING: In PMJ, memory overflow is handled by the flushing policy: the whole memory is flushed, writing large buckets to disk. Because of this kind of flushing, the I/O performance of PMJ is good.

4. HASH MERGE JOIN: A NON-BLOCKING JOIN ALGORITHM FOR PRODUCING FAST AND EARLY JOIN RESULTS

The Hash Merge Join (HMJ) algorithm deals with data items arriving from remote sources over unpredictable, slow, or bursty network traffic. HMJ is designed with two goals in mind: (1) minimize the time to produce the first few results, and (2) produce join results even if the two sources of the join operator occasionally get blocked.

HMJ ALGORITHM

The hash-merge join algorithm has two phases: the hashing phase and the merging phase. The hashing phase employs an in-memory hash-based join algorithm that produces join results as quickly as data arrives. Once memory fills, certain parts of it are flushed to disk to free space for the newly incoming tuples. If one of the sources is blocked for any reason, e.g., due to slow or bursty network traffic, the hashing phase can still produce join results from the unblocked source.
If the two input sources are both blocked, the HMJ algorithm starts its merging phase, in which previously flushed parts on disk are joined together using a sort-merge-like join algorithm. Thus, HMJ can produce join results even while both sources are blocked. Once the blocking of either source is resolved, the HMJ
algorithm switches back to the hashing phase. HMJ switches back and forth between the two phases until all data items have been received from the remote sources; then the whole memory is flushed to disk and the merging phase produces the final part of the join result. The hash merge join algorithm is shown diagrammatically in the figure.

Figure 4 Hash Merge Join

MEMORY OVERFLOW HANDLING

The HMJ algorithm uses an adaptive flushing policy to handle memory overflow. The policy aims to balance memory so that it holds a similar number of tuples from each source, and an acceptable bucket size can be set. The policy flushes partition pairs: it chooses two victim buckets, one from each source, with the same hash value. Because partitions are flushed in pairs, timestamps are not required to prevent duplicates. A flushed partition is sorted before being written to disk, since the blocking phase performs a modified progressive merge join to produce results while both sources are blocked.

5. RPJ: PRODUCING FAST JOIN RESULTS ON STREAMS THROUGH RATE-BASED OPTIMIZATION

Rate-based Progressive Join (RPJ) maximizes the output rate by optimizing its execution according to the characteristics of the join relations, for example the data distribution and the tuple-arrival pattern. The objectives are to
(i) generate the first result as early as possible (soon after data transmission begins), and (ii) output the remaining results at a fast rate (as tuples continuously arrive). Like the other algorithms, RPJ assumes that memory is not large enough to accommodate all the tuples received from the input streams, so part of the data must be migrated to disk.

RPJ ALGORITHM

During the online phase, RPJ performs as HMJ does. When memory is full, it applies its flushing policy. When both relations become blocked, RPJ begins its reactive phase, which combines the XJoin and HMJ reactive phases: the tuples from one of the disk buckets of either relation can join with the corresponding memory bucket of the opposite relation, as in HMJ and PMJ. The algorithm chooses the task with the highest output rate. During its cleanup phase, RPJ joins the disk buckets; the duplicate-avoidance strategy is similar to that applied by XJoin. The algorithm is depicted in figure 5.

Figure 5 Rate-based progressive join

MEMORY OVERFLOW HANDLING

RPJ uses an optimal flushing policy to handle memory overflow. When memory is full, it tries to estimate which tuples have the smallest chance of participating in future joins. Its flushing policy is based on estimating the probability that a new incoming tuple belongs to a given relation and falls into a given bucket. Once all probabilities are calculated, the flushing policy is applied; if the victim bucket does not contain enough tuples, the bucket with the next-smallest probability is also chosen. All the tuples flushed together from the same relation form a sorted "segment", as in HMJ.
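A much-simplified sketch of this rate-based victim choice: a memory-resident bucket is useful in proportion to how often new tuples of the opposite relation hash into it, so the buckets with the smallest such probability are flushed first. The uniform per-bucket arrival estimate below is an assumption; the actual RPJ derives these probabilities from a more detailed arrival model.

```python
def rpj_choose_victims(arrivals, bucket_sizes, n_flush):
    """Pick flush victims by estimated future join probability (sketch).

    arrivals[(rel, bucket)]  - recent arrival counts per relation/bucket
    bucket_sizes[(rel, bucket)] - memory-resident tuples per bucket
    A tuple in bucket b of relation r can only join future arrivals in
    bucket b of the OTHER relation, so that bucket's arrival share is
    used as its usefulness; least useful buckets are flushed first.
    """
    total = sum(arrivals.values()) or 1

    def usefulness(rel_bucket):
        rel, bucket = rel_bucket
        return arrivals.get((1 - rel, bucket), 0) / total

    ranked = sorted(bucket_sizes, key=usefulness)  # ascending usefulness
    victims, freed = [], 0
    for rb in ranked:  # flush whole buckets until enough space is freed
        if freed >= n_flush:
            break
        victims.append(rb)
        freed += bucket_sizes[rb]
    return victims
```

Flushing whole buckets from one relation, sorted, matches the "sorted segment" behaviour mentioned above.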
6. MAXIMIZING THE OUTPUT RATE OF MULTI-WAY JOIN QUERIES OVER STREAMING INFORMATION SOURCES

This work explores the complementary approach of allowing non-binary trees; that is, it generalizes existing streaming binary join algorithms into a multi-way streaming join operator, called MJoin, that works over more than two inputs. Using a single multi-way join, an arrival from any input source can be used to generate and propagate results in a single step, without having to pass these results through a multi-stage binary execution pipeline. Furthermore, since the operator is completely symmetric with respect to its inputs, there is no need to restructure the query plan in response to changing input arrival rates.

MULTI-WAY JOIN ALGORITHM

The algorithm first creates as many hash tables as there are inputs. When a new tuple arrives at an input, it is inserted into the corresponding hash table and used to probe the remaining hash tables. This generates every possible result tuple that can be produced by joining the new arrival with the memory-resident tuples of the other relations. Not all hash tables are probed for every arrival: the sequence of probes stops whenever a probe of a hash table finds no matches, since in that case no answer tuple can be produced. For instance, for the second probe operation to execute, the first one has to produce matches. The probe sequence is organized so that the most selective predicates are evaluated first, and it is different for each input; this ensures that the smallest number of temporary tuples is generated.

MEMORY OVERFLOW HANDLING

The technique of "coordinated flushing" can improve the output rate in the presence of overflow, and the work also addresses the problem of deciding how best to partition a large multi-way join into a set of one or more MJoin operators.
Under coordinated flushing, when a new tuple arrives on any input stream, if it falls into an in-memory partition, it is immediately probed against the in-memory partitions of the other streams; if it falls into a disk-resident partition, it is added to an output buffer for that partition and is not probed against the other streams.
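The per-arrival probe sequence with early termination can be sketched as follows. The single-attribute join key (`tup[0]`) and the fixed per-input probe orders are illustrative assumptions; disk partitions and coordinated flushing are omitted.

```python
from collections import defaultdict

def make_mjoin(n_inputs, probe_orders):
    """Multi-way symmetric hash join (MJoin) over n streaming inputs (sketch).

    One hash table per input. A new tuple is inserted into its own table
    and probes the others in the order given by probe_orders[src] (most
    selective first, as in the paper); probing stops at the first table
    with no match, since no result tuple can then be produced.
    """
    tables = [defaultdict(list) for _ in range(n_inputs)]

    def on_arrival(src, tup):
        key = tup[0]
        tables[src][key].append(tup)
        partials = [{src: tup}]          # partial results start with the new tuple
        for other in probe_orders[src]:
            matches = tables[other].get(key, [])
            if not matches:              # early exit: sequence of probes stops
                return []
            partials = [{**p, other: m} for p in partials for m in matches]
        # order each complete result by input index
        return [tuple(p[i] for i in range(n_inputs)) for p in partials]

    return on_arrival
```

A single arrival thus propagates results in one step, with no intermediate binary-pipeline state.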
7. EARLY HASH JOIN: A CONFIGURABLE ALGORITHM FOR THE EFFICIENT AND EARLY PRODUCTION OF JOIN RESULTS

Early Hash Join (EHJ) is a hash-based join algorithm specifically designed for interactive query processing: it has a fast response time like other early join algorithms, with an overall execution time that is significantly shorter. It is a customizable hash join algorithm that produces results early without a major penalty in total execution time. EHJ reduces the total execution time and the number of I/O operations by biasing the reading strategy and flushing policy toward the smaller relation.

EARLY HASH JOIN (EHJ) ALGORITHM

The EHJ algorithm allows the optimizer to dynamically customize its performance, trading off early production of results against minimal total execution time. EHJ is based on symmetric hash join and uses one hash table for each input. A hash table consists of P partitions; each partition consists of B buckets; a bucket stores a linked list of pages, where each page holds a fixed number of tuples. When a tuple arrives from an input, it is first used to probe the hash table of the other input to generate matches, and is then placed in the hash table for its own input. In this first, in-memory phase, alternate reading is used by default, as it was shown to be the best fixed reading strategy; however, different reading strategies (that favor R) can be selected if the bias is toward minimizing total execution time. At any time, the user or optimizer can change the reading policy and know the expected output rate. Once memory is full, the algorithm enters its second phase, called the flushing phase.

In the flushing phase, the algorithm uses biased flushing to favor buffering as much of R in memory as possible.
By default, it also increases the reading rate in favor of reading more of R; this reduces the expected output rate but decreases the total execution time. In both phases, the optimizations that discard tuples when performing one-to-many joins, and many-to-many joins once all of R has been read, are applied. Note that for one-to-many joins, if a tuple from R matches tuple(s) of S in the hash table, those S tuples must be deleted from the hash table. For mediator joins, a concurrent background process can be activated if the inputs are slow. After all of R and S have been read, the algorithm performs a cleanup join to generate all possible join results.
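The one-to-many discard rule noted above can be sketched as a probe helper: matched S tuples are emitted and removed in a single step, since in a one-to-many join each S tuple can match at most one R tuple and will never match again. The `s_table` layout (a plain dict of key to tuple list, standing in for EHJ's partitioned hash table) is an assumption.

```python
def ehj_probe_one_to_many(r_tuple, s_table, key=lambda t: t[0]):
    """EHJ's one-to-many discard optimization (sketch).

    Probes S's hash table with an arriving R tuple; the matching S tuples
    are returned as join results and simultaneously deleted from the table,
    freeing memory and shrinking any later flushes.
    """
    k = key(r_tuple)
    matches = s_table.pop(k, [])  # emit and discard in one step
    return [(r_tuple, s) for s in matches]
```

Unmatched keys are untouched, so the rest of S stays available for later probes or the cleanup join.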
MEMORY OVERFLOW HANDLING: The biased flushing policy favors flushing partitions of S before partitions of R, transitioning the algorithm into a form of dynamic hash join. It uses these rules to select a victim partition whenever memory must be freed:
1. Select the largest non-frozen partition of S.
2. If no such partition of S exists, select the smallest non-frozen partition of R.

Once a partition is flushed, all buckets of its hash table are removed and replaced by a single page buffer. The partition is then considered frozen (non-replacement): it cannot buffer any tuples in memory (except in the single page buffer) and cannot be probed. If a tuple hashes to this partition, it is placed in the page buffer, which is flushed when filled. If a tuple from the other input hashes to this partition index, no probe is performed.

READING STRATEGY

The reading policies are configurable by the optimizer and can also be changed interactively as the join progresses, or after a certain number of output results have been generated. During the flushing phase, a 5:1 reading strategy is used to continue producing results while lowering overall execution time; it is also possible to minimize total execution time by reading all of R once memory is full. These settings are chosen because, in interactive querying, the first few results have much higher priority than later query results. Further, EHJ can behave exactly like dynamic hash join by using a reading policy that reads all of R before any of S.

CONCLUSION: With the increase in the number of users of the World Wide Web and of various real-world applications, there is a huge amount of available data that requires processing.
Joining the tuples in a relation has become common practice in most applications, and it now takes a significant place in a transaction. Responding to queries in real time necessitates speeding up query processing, of which joining takes the most time; hence speeding up the joining of relations has become of prime importance. This paper surveys some of these techniques; Table 1 below compares them. Reducing the cost of a
join query execution is an issue that is still open for improvement. As observed from the studied techniques for joining the tuples in a relation, it is evident that to reduce CPU use we need to use more memory, and to save memory we must increase the CPU work or the number of input/output operations required. The best join technique depends on the environment in which the join is applied and on which resource is more valuable. Future work can also address performing join operations on streams of continuous inputs instead of relations.

Table 1 Comparison of the non-blocking joining techniques.

Flushing policy:    SHJ - no flushing; XJoin - flush largest; PMJ - flush all; HMJ - adaptive; RPJ - optimal; MJoin - coordinated; EHJ - biased.
I/O complexity:     SHJ - not applicable; XJoin - high; PMJ - less; HMJ - less; RPJ - less; MJoin - reduced by the reading strategy; EHJ - moderate.
Duplicate handling: SHJ - no duplicates; XJoin - timestamp check; PMJ - additional check; HMJ - no duplicates; RPJ - timestamps; MJoin - timestamps; EHJ - timestamps.
Range predicates:   allowed only by EHJ; not allowed by the other techniques.
Memory requirement: SHJ - high; XJoin - comparatively less; PMJ - not efficient; HMJ - less; RPJ - less; MJoin - optimum; EHJ - efficient use of available memory.
Execution time:     SHJ - high; XJoin - high I/O; PMJ - less than XJoin; HMJ - lower than XJoin and PMJ; RPJ - lower than HMJ; MJoin - high, since recomputation is required; EHJ - fast, but more than DHJ.

REFERENCES:

1. J. Dittrich, B. Seeger, and D. Taylor. "Progressive Merge Join: A Generic and Non-blocking Sort-based Join Algorithm". In Proceedings of VLDB, 2002.
2. T. Urhan and M. J. Franklin. "XJoin: A Reactively-Scheduled Pipelined Join Operator". IEEE Data Eng. Bull., 23(2), 2000.
3. M. F. Mokbel, M. Lu, and W. G. Aref. "Hash-Merge Join: A Non-blocking Join Algorithm for Producing Fast and Early Join Results". In ICDE Conf., 2004.
4. Y. Tao, M. L. Yiu, D. Papadias, M. Hadjieleftheriou, and N. Mamoulis. "RPJ: Producing Fast Join Results on Streams Through Rate-based Optimization". In Proceedings of the ACM SIGMOD Conference, 2005.
5. S. D. Viglas, J. F. Naughton, and J. Burger. "Maximizing the Output Rate of Multi-way Join Queries over Streaming Information Sources". In VLDB '03: Proceedings of the 29th International Conference on Very Large Data Bases, pages 285-296. VLDB Endowment, 2003.
6. Rahman, Nurazzah Abd; Saad, Tareq Salahi. "Early Hash Join: A Configurable Algorithm for the Efficient and Early Production of Join Results". ITSim 2008, International Symposium on Information Technology, 28 Aug. 2008.