Slides for presentation of "A reuse repository with automated synonym support and cluster generation"


Having a code reuse repository available can be a great asset for a programmer. But locating components can be difficult if only static documentation is available, due to vocabulary mismatch. Identifying informal synonyms used in documentation can help alleviate this mismatch. The cost of creating a reuse support system is usually fairly high, as much manual effort goes into its construction.

This project has resulted in a fully functional reuse support system with clustering of search results. By automating the construction of a reuse support system from an existing code reuse repository, and giving the end user a familiar interface, the reuse support system constructed in this project makes the desired functionality available. The constructed system has an easy-to-use interface, due to a familiar browser-based front-end. An automated method called LSI is used to handle synonyms, and to some degree polysemous words, in indexed components.

In the course of this project, the reuse support system has been tested using components from two sources, and its retrieval performance has been measured and found acceptable. Clustering usability has been evaluated, and the clusters are found to be generally helpful, even though some fine-tuning still has to be done.


Transcript

  • 1. A reuse repository with automated synonym support and cluster generation. Laust Rud Jensen, Århus University, 2004
  • 2. Outline: 1. Introduction, 2. Problem, 3. Solution, 4. Experiments, 5. Conclusion
  • 3. 1. Introduction
    • Constructed a reuse support system
    • Fully functional prototype with performance enabling interactive use
    • Usable for other applications, as it is a general system
  • 4. Outline: 1. Introduction, 2. Problem, 3. Solution, 4. Experiments, 5. Conclusion
  • 5. 2. Problem
    • Reuse is a Good Thing, but not done enough
    • Code reuse repository available, but needs a search function
  • 6. Java
    • Platform independent
    • Easy to use
    • Well documented using Javadoc
  • 7. Javadoc problems
    • Browsing for relevant components is insufficient:
      • Assumes existing knowledge
      • Information overload
  • 8. Simple: keyword search
    • Exact word matching
    • Too literal, but stemming can help (see the sketch below)
    • Vocabulary mismatch, <20% agreement [Furnas, 1987]
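A minimal illustration of the mismatch and of how stemming helps. This is a sketch in Python (an illustrative choice; the slides show no code), and the suffix-stripping stemmer is a deliberately crude stand-in for a real one such as Porter's, not the stemmer used in the project; the component descriptions are hypothetical, loosely Javadoc-like:

    docs = {
        # Hypothetical component descriptions (illustrative only).
        "mkdir":     "creates the directory named by this abstract pathname",
        "listFiles": "returns the files contained in the directories denoted",
    }

    def toy_stem(word):
        # Very rough suffix stripping; a real system would use a proper stemmer.
        if word.endswith("ies"):
            return word[:-3] + "y"
        if word.endswith("s"):
            return word[:-1]
        return word

    def search(query, stem=False):
        norm = toy_stem if stem else (lambda w: w)
        wanted = {norm(w) for w in query.lower().split()}
        return [name for name, text in docs.items()
                if wanted & {norm(w) for w in text.lower().split()}]

    print(search("directory"))             # ['mkdir'] -- misses listFiles
    print(search("directory", stem=True))  # ['mkdir', 'listFiles']

Exact matching misses "directories" when the query says "directory"; even this toy normalization recovers the match, though it does nothing for true synonyms, which is where LSI comes in.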
  • 9. Outline: 1. Introduction, 2. Problem, 3. Solution, 4. Experiments, 5. Conclusion
  • 10. 3. Solution
    • Create a search engine
    • Use automated indexing methods
    • Automatic synonym handling
    • Grouping search results to assist in information discernment
  • 11. Search engine technology
    • Information Retrieval method
    • Vector Space Model
    • Latent Semantic Indexing
    • Clustering done by existing Open Source system
  • 12. Vector Space Model

                   d1  d2  d3  d4  d5  d6
        cosmonaut   1   0   1   0   0   0
        astronaut   0   1   0   0   0   0
        moon        1   1   0   0   0   0
        car         1   0   0   1   1   0
        truck       0   0   0   1   0   1

    Example document contents with simple binary term weightings. The documents seem to cover two different topics (space and vehicles), and there seems to be some overlap between documents; of course the choice of words within the similar documents differs. Compare d5 and d6: apparently these documents have nothing in common.
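For readers who want to play with the example, the matrix and the vector-space similarity can be reproduced in a few lines. This is an illustrative sketch in Python with NumPy, not code from the project:

    import numpy as np

    # Binary term-document matrix from the slide; columns are d1..d6,
    # rows are [cosmonaut, astronaut, moon, car, truck].
    X = np.array([
        [1, 0, 1, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 0],
        [1, 0, 0, 1, 1, 0],
        [0, 0, 0, 1, 0, 1],
    ], dtype=float)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # d5 and d6 share no terms, so plain vector-space matching sees no
    # relation between them, although both are about vehicles.
    print(cosine(X[:, 4], X[:, 5]))  # 0.0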
  • 13. Latent Semantic Indexing

    Performing SVD on X gives X = T0 S0 D0^T. The example T0 matrix resulting from SVD being performed on X:

             0.44  0.30 −0.57  0.58  0.25
             0.13  0.33  0.59  0.00  0.73
        T0 = 0.48  0.51  0.37  0.00 −0.61
             0.70 −0.35 −0.15 −0.58  0.16
             0.26 −0.65  0.41  0.58 −0.09

    The example S0 matrix; the diagonal contains the singular values of X:

             2.16  0.00  0.00  0.00  0.00
             0.00  1.59  0.00  0.00  0.00
        S0 = 0.00  0.00  1.28  0.00  0.00
             0.00  0.00  0.00  1.00  0.00
             0.00  0.00  0.00  0.00  0.39

    For the reduced representation, only the k = 2 largest singular values are kept.
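The decomposition on the slide is an ordinary singular value decomposition, so it can be checked directly, continuing the NumPy sketch above. Note that the signs of individual singular vectors are arbitrary, so they may differ from the slide:

    # SVD of X: T0 holds the term vectors, s the singular values,
    # and D0t is the transpose of the document matrix D0.
    T0, s, D0t = np.linalg.svd(X, full_matrices=False)
    print(s.round(2))  # [2.16 1.59 1.28 1.   0.39], the diagonal of S0

    # Keep only the k = 2 largest singular values for the reduced space.
    k = 2
    T, S, D = T0[:, :k], np.diag(s[:k]), D0t[:k].T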
  • 14. Latent Semantic Indexing (continued)

    The example D0 matrix; the shaded part (the first k = 2 columns) is the D matrix:

              0.75  0.29 −0.28 −0.00 −0.53
              0.28  0.53  0.75  0.00  0.29
        D0 =  0.20  0.19 −0.45  0.58  0.63
              0.45 −0.63  0.20 −0.00  0.19
              0.33 −0.22 −0.12 −0.58  0.41
              0.12 −0.41  0.33  0.58 −0.22

    Recombining only the k = 2 largest singular values gives the rank-2 approximation X̂ = T S D^T:

                      d1    d2    d3    d4    d5    d6
        cosmonaut   0.85  0.52  0.28  0.13  0.21 −0.08
        astronaut   0.36  0.36  0.16 −0.21 −0.03 −0.18
        moon        1.00  0.71  0.36 −0.05  0.16 −0.21
        car         0.98  0.13  0.21  1.03  0.62  0.41
        truck       0.13 −0.39 −0.08  0.90  0.41  0.49

    Documents d5 and d6, which had nothing in common in X, are now related through the latent vehicle topic.
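Continuing the sketch, the rank-2 approximation is just the truncated product, and it recovers the latent relation between d5 and d6:

    # Rank-k reconstruction: X_hat = T S D^T, as on the slide.
    X_hat = T @ S @ D.T
    print(X_hat.round(2))  # matches the slide's X_hat up to rounding

    # In X these two documents were orthogonal; in X_hat they correlate
    # through the latent "vehicle" dimension.
    print(round(cosine(X_hat[:, 4], X_hat[:, 5]), 2))  # clearly positive, about 0.7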
  • 15. Matching a query
    • Query: "moon astronaut"
    • With term order [cosmonaut, astronaut, moon, car, truck], the query vector is Xq = (0, 1, 1, 0, 0)^T
    • Folding the query into the LSI space: Dq = Xq^T T S^{-1} = (0.28, 0.53)

    Similarity between query document and original documents:

              d1    d2    d3    d4    d5    d6
        X   0.41  1.00  0.00  0.00  0.00  0.00
        X̂   0.75  1.00  0.94 −0.45 −0.11 −0.71

    In X, only documents sharing a query term match; in the LSI space, d3 also matches strongly even though it only contains "cosmonaut".
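The folding step on the slide is the standard LSI query projection, Dq = Xq^T T S^{-1}. A sketch of it, continuing from above; the signs of Dq may differ from the slide's (0.28, 0.53) depending on the SVD's sign convention, but the cosine similarities are unaffected and reproduce the slide's X̂ row:

    # Query "moon astronaut" over terms [cosmonaut, astronaut, moon, car, truck].
    Xq = np.array([0, 1, 1, 0, 0], dtype=float)

    # Project the query into the k-dimensional LSI space.
    Dq = Xq @ T @ np.linalg.inv(S)  # ~ (0.28, 0.53) up to sign

    # Cosine similarity against each document's LSI coordinates (rows of D):
    print([round(cosine(Dq, d), 2) for d in D])
    # ~ [0.75, 1.0, 0.94, -0.45, -0.11, -0.71], matching the slide's table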
  • 16. Clustering: Carrot. [Figure: data flow when processing in Carrot from input to output. From the manual, available from the Carrot homepage.]
  • 17. Outline: 1. Introduction, 2. Problem, 3. Solution, 4. Experiments, 5. Conclusion
  • 18. 4. Experiments
    • Performance measurement
    • Tuning representation to data
    • Evaluating clusters
  • 19. Precision and recall

    Precision and recall are the traditional measurements for gauging the performance of an IR system. Precision is the proportion of the retrieved material which is actually relevant to a given query, and recall is the proportion of relevant material actually retrieved:

        precision = #relevant retrieved / #total retrieved
        recall    = #relevant retrieved / #total relevant in collection

    As the two measurements are defined in terms of each other, what is usually reported is the interpolated precision at recall levels of 10%, 20%, ..., 100%. These are then plotted as a graph. Another measurement is average precision, shown on the following slides.
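A small self-contained sketch of both measures and the interpolation step, assuming a ranked result list and a known set of relevant documents (all names here are illustrative):

    def precision_recall(ranked, relevant):
        """(precision, recall) after each position in the ranked result list."""
        hits, points = 0, []
        for i, doc in enumerate(ranked, start=1):
            hits += doc in relevant
            points.append((hits / i, hits / len(relevant)))
        return points

    def interpolated_precision(points, level):
        """Highest precision achieved at any recall >= level."""
        return max((p for p, r in points if r >= level), default=0.0)

    points = precision_recall(ranked=["d2", "d3", "d1", "d5"],
                              relevant={"d1", "d2", "d3"})
    for lvl in [i / 10 for i in range(1, 11)]:
        print(f"recall {lvl:.0%}: "
              f"precision {interpolated_precision(points, lvl):.2f}")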
  • 20. Performance. [Plot: average precision (0 to 1) versus number of factors (0 to 400), for normal and stemmed indexing.]
  • 21. Precision/Recall. [Plot: interpolated precision versus recall, for unstemmed and stemmed indexing.]
  • 22. Average precision. [Plot: precision (0 to 0.2) versus number of documents retrieved (10 to 1000), for unstemmed and stemmed indexing.]
  • 23. Evaluating clusters

    Cluster   Elements   Cluster title
    Lingo
    L1        6          Denoted by this Abstract Pathname
    L2        3          Parent Directory
    L3        2          File Objects
    L4        2          Attributes
    L5        2          Value
    L6        5          Array of Strings
    L7        5
    (Other)   7
    Total listed: 23
    STC
    S1        16         files, directories, pathname
  • 24. Evaluating clusters (continued). Membership (•) of the retrieved methods in the Lingo clusters (L1 to L7) and STC clusters (S1 to S3), and their relevance (Rel.):

    #   Method                   L1 L2 L3 L4 L5 L6 L7  S1 S2 S3  Rel.
    1   mkdir()                  •  •  ◦  ◦  ◦  ◦  ◦   ◦  •  ◦   •
    2   mkdirs()                 •  ◦  •  ◦  ◦  ◦  ◦   ◦  •  ◦   •
    3   createSubcontext()       ◦  ◦  ◦  ◦  •  ◦  ◦   ◦  ◦  ◦   •
    4   isDirectory()            ◦  •  ◦  ◦  ◦  ◦  ◦   ◦  •  ◦   ◦
    5   setCacheDirectory()      ◦  ◦  ◦  ◦  ◦  •  ◦   ◦  •  •   •
    6   isFile()                 ◦  ◦  ◦  ◦  ◦  ◦  ◦   •  •  ◦   •
    7   getCanonicalPath()       ◦  ◦  ◦  ◦  ◦  ◦  ◦   •  •  ◦   ◦
    8   delete()                 ◦  •  ◦  ◦  ◦  ◦  ◦   ◦  •  ◦   ◦
    9   createSubcontext()       ◦  ◦  ◦  ◦  ◦  ◦  ◦   ◦  ◦  ◦   ◦
    10  createNewFolder()        ◦  ◦  ◦  ◦  •  ◦  ◦   •  ◦  •   •
    11  create environment()     ◦  ◦  ◦  •  ◦  ◦  ◦   ◦  ◦  ◦   •
    12  listFiles()              ◦  •  ◦  ◦  ◦  ◦  •   ◦  •  •   ◦
    13  getParentDirectory()     ◦  ◦  •  •  ◦  ◦  ◦   ◦  •  ◦   ◦
    14  list()                   ◦  •  ◦  ◦  ◦  ◦  •   ◦  •  •   ◦
    15  createTempFile()         ◦  ◦  ◦  ◦  ◦  ◦  ◦   •  •  •   •
    16  setCurrentDirectory()    ◦  ◦  •  ◦  ◦  ◦  ◦   ◦  •  •   ◦
    17  length()                 ◦  •  ◦  ◦  ◦  •  ◦   ◦  ◦  ◦   ◦
    18  createFileSystemRoot()   ◦  ◦  ◦  ◦  ◦  ◦  ◦   ◦  •  ◦   •
    19  createTempFile()         ◦  ◦  ◦  ◦  ◦  ◦  ◦   •  •  •   •
    20  listRoots()              ◦  ◦  ◦  •  ◦  ◦  ◦   ◦  •  ◦   ◦
  • 25. Outline: 1. Introduction, 2. Problem, 3. Solution, 4. Experiments, 5. Conclusion
  • 26. Future work
    • Extensions
      • Feedback mechanism
      • Additional experiments: stop-words, weighting
    • Applications
      • Two-way Javadoc integration
      • Other applications; more text
  • 27. Conclusion
    • Fully functional prototype
    • Clustering helpful but needs more work
    • LSI handles synonymy automatically and, to some degree, polysemy