Qicheng yu

330 views

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
330
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
2
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Qicheng yu

  1. 1. An Agent-based Adaptive Join Algorithm for Distributed Data Warehousing 1ST ANNUAL POSTGRADUATE RESEARCH STUDENT CONFERENCE Qicheng Yu FoC of London Metropolitan University Dr. Fang Fang Cai FoC of London Metropolitan University Dr. Julie A McCann DoC of Imperial College Supervisors:
  2. 2. Introduction Data warehousing continues to play an important role in global information systems for businesses. Applications of data warehousing have evolvedApplications of data warehousing have evolved from reporting and decision support systems to mission critical decision making systems. This requires data warehouses to combine both historical and current data from operational systems.
  3. 3. Introduction (cont’d) The primary objective of this research is to seeking effective and efficient join techniques for distributed data warehousing. This is because: Joins are a frequently occurring operation in DW queries.queries. Joins are one of the most expensive operations that a DW performs. Joining two large tables could consume a significant amount of the system’s CPU cycles, disk or network bandwidth, and buffer memory. Traditional join algorithms caused significant performance issues in a distributed and dynamic network environment .
  4. 4. Theories and Techniques Software Agents software entities that have an internal goal and acts on behalf of a user. Autonomy AdaptivenessAdaptiveness Collaboration Mobility They are well suited to applications which involve distributed computation or communication between components, sensing or monitoring of their environment, or autonomous operation.
  5. 5. Theories and Techniques Adaptive Join Algorithms focus on using runtime feedback to modify query processing in a way that provides better response time or more efficient CPU utilization. Good examples are: Nested Loop Ripple JoinNested Loop Ripple Join Hash Ripple Join XJoin. Distributed Query Technique Semi-join was introduced for reducing data transmission in processing distributed queries. It has been traditionally used to great advantage in distributed systems.
  6. 6. The Main Idea of AJoin The main idea of AJoin is based on software agents and extend the ripple join and semi-join techniques into a multi-agent system where a join task is decomposed into smaller independent units which will be assigned to software agents.software agents. It takes data warehousing features into account It utilises intelligent software agents for the dynamic optimisation and coordination of join processing.
  7. 7. AJoin Framework S R1AIndex R2AIndex R1S R2S Join Results Agent coordinator R1 R2 Filter C1 Filter C2 S1 S2 R1 Input Buffer R1S R2S R1S Results Buffer 2 2 3 3 4 4 5 5 1 1 R2S Results Buffer R2 Input Buffer
  8. 8. AJoin Algorithm Join coordinate at local site S: 1.Initialization. Perform cost-benefit analysis; Allocate initial memory and buffer. 2. Activate RIA at sites S1 and S2 (see table 4) Start Remote Adapting 3. Activate MTA and OA at site S (see table 2) 4. Coordinate and monitor join process 5. Iterate from step 4 until join completed 6. Deactivate MTA, RIA, and OA 7. End AJoin Table1 - Join Coordinator Agent (JCA)
  9. 9. AJoin Algorithm (cont’d) MTA for R1 at local site S: 1. Receive a tuple from IB (see table 3) 2. If a tuple is received from R1 3. If full-join method is selected 4. Save R1S at local site S and add its LAP and R1AS A to index table for R1A 5. Else // semi-join is selected 6. Add negative RAP and R1A to index table for R1A 7. Using R1A to probe R2A in index table for R2A 8. If a match is found 9. For each matched R1A and R2A TABLE 2-1. TUPLE MATCHING AGENT (MTA)
  10. 10. AJoin Algorithm (cont’d) MTA for R1 at local site S: 10. If the access pointer of RS is RAP 11. Call RIA at S1 and/or S2 (see table 5) to retrieve R1S and/or R2S respectivelyS S 12. Else // RS accessible from the local storage 13. Retrieve R1S and/or R2S locally 14. Output the matched R1S and R2S 15. Iterate from step 1 until no more tuples from R1 16.Notify JCA to end the process TABLE 2-2. TUPLE MATCHING AGENT (MTA)
  11. 11. AJoin Algorithm (cont’d) Receive tuple procedure at local site S: 1. If all tuples from both relation R1 and R2 are received 2. Join completed 3. If both R1 and R2 buffer are not empty 4. Retrieve a tuple from R1 and R2 IB in turn 5. Else5. Else 6. If both R1 and R2 IB are empty 7. Waiting tuples to arrive 8. Else // one of R1 orR2 IB is not empty 9. If R1 IB is not empty 10. Retrieve a tuple from R1 IB 11. Else // R2 IB is not empty 12. Retrieve a tuple from R2 IB 13. Return TABLE 3. RECEIVE A TUPLE PROCEDURE
  12. 12. AJoin Algorithm (cont’d) Tuple filter and join adapting at remote site S1: 1. Receive join parameter (C and join selection threshold) 2. Retrieve a tuple from R1 3. If the tuple is not meet the condition C1 4. Repeat step 2 5. Evaluate join cost-benefit 6. If semi-join is selected 7. Send R1A and its remote access pointer to R1 IB at site S 8. Else // full-join is selected 9. Send R1A and R1s to R1 IB at site S 10. Iterate from step 2 until no more tuples from R1 TABLE 4. REMOTE INFORMATION AGENT (RIA)
  13. 13. AJoin Algorithm (cont’d) Retrieve RS at remote site S1 (or S2): 1. Get the remote access pointer 2. Retrieve the tuple according to its access pointer 3. Return R to R buffer at site S3. Return RS to RS buffer at site S TABLE 5. RIA RETRIEVE RS SERVICE
  14. 14. Cost and Benefit Analysis cost of full ripple join: Cost F (R1⋈AR2) = (1) where T0 is the cost to start-up a new network connection and V is the network speed. |R1| and |R2| to indicate the cardinality of R1 and R2 respectively, ⋈ 0 | 1| ( 1) | 2| ( 2)R Size R R Size R T V × + × + cost of semi ripple join: CostS (R1⋈AR2) = (2) where Size(P) is the size in byte of the RAP. 0 | 1| ( 1 ) ( 1) ( ( 1 ) ( )) | 1| ( 2 ) ( 2) ( ( 2 ) ( )) A S A S R Size R JC R Size R Size P T V R Size R JC R Size R Size P V × + × + × + × + + +
  15. 15. Cost and Benefit Analysis Cost and Benefit : Cost F (R1⋈AR2) - CostS (R1⋈AR2) = (3) | 1| ( 1) | 1| ( 1 ) ( 1) ( ( 1 ) ( )) | 2| ( 2) | 1| ( 2 ) ( 2) ( ( 2 ) ( )) A S A S R Size R R Size R JC R Size R Size P V R Size R R Size R JC R Size R Size P V × − × − × + × − × − × + + In AJoin, full-join or semi-join can be applied independently to each relation. The semi-join method will be chosen only when: (4) or CR(R) × SR(R) < 1 (5) where CR(R) = denotes as join cardinality ratio as %; SR(R) = denotes as attribute selection ratio as % V+ | | ( ) | | ( ) ( ) ( ( ) ( )) 0 A SR Size R R Size R JC R Size R Size P V × − × − × + > ( ) | | JC R R ( ) ( ) ( ) ( ) S A Size R Size P Size R Size R + −
  16. 16. AJoin for Data Warehousing Multiple one-to-many joins between the fact table and the dimensional tables. referential integrity constraints are applied e.g. all join attributes in fact table must have a matching joinattributes in fact table must have a matching join attributes in the dimensional tables. Therefore, CR(R1) =100% full-join method will be used for R1 unless SR(R1) is low. run-time data of the fact table only represents a small portion of the fact table. Therefore, CR(R2) << 100% the semi-join method should be chosen for R2.
  17. 17. Cost and Benefit in High Bandwidth Network The join cost of local processing cannot be ignored any more. But, since the cost of joins at a local site using full or semi-join method is at an equivalent level. The semi-join method will be chosen only when: | | ( ) | | ( ) ( ) ( ( ) ( ))R Size R R Size R JC R Size R Size P× − × − × + (6) V< (7) where V’ is the desk data access speed and V is the join switch threshold. | | ( ) | | ( ) ( ) ( ( ) ( )) ( ) ( ) ' A S S R Size R R Size R JC R Size R Size P V JC R Size R V × − × − × + × > | | ( ) | | ( ) ( ) ( ( ) ( )) ' ( ) ( ) A S S R Size R R Size R JC R Size R Size P V JC R Size R × − × − × + × ×
  18. 18. Performance Evaluation Join under 250K-2Mbps network 200 250 300 Time(sec) 0 50 100 150 10 30 50 70 90 110 130 150 170 190 210 230 250 Matched Tuples Time(sec) AJoin Hash Ripple Join
  19. 19. Performance Evaluation (cont’d) Join under 250K-2Mbps network with 1 sec delay per 1000 tuples 300 350 400 0 50 100 150 200 250 300 10 30 50 70 90 110 130 150 170 190 210 230 250 Matched Tuples Time(sec) AJoin Hash Ripple Join
  20. 20. Performance Evaluation (cont’d) Join under 100M-1Gbps network 25 30 35 0 5 10 15 20 25 10 30 50 70 90 110 130 150 170 190 210 230 250 Matched Tuples Time(sec) AJoin Hash Ripple Join
  21. 21. Performance Evaluation (cont’d) Join under 100M-1Gbps network with 1 sec delay per 1000 tuples 100 120 0 20 40 60 80 10 30 50 70 90 110 130 150 170 190 210 230 250 Matched tuples time(sec) AJoin Hash Ripple Join
  22. 22. Summary The main work undertaken is summarized below: investigation the feasibility and effectiveness of utilising software agent technology to address some specific issues in data warehouses. conduction an experimental study on the performance of modern join algorithm for distributed environment. implement of a framework for an adaptive join algorithm using intelligent agents. evaluation the agent-based join approach against current approaches in distributed and dynamic data warehouse environments.
  23. 23. Questions & Comments

×