Slides for presentation of "A reuse repository with automated synonym support and cluster generation"

Having a code reuse repository available can be a great asset for a programmer. But locating components can be difficult if only static documentation is available, due to vocabulary mismatch. Identifying informal synonyms used in documentation can help alleviate this mismatch. The cost of creating a reuse support system is usually fairly high, as much manual effort goes into its construction.

This project has resulted in a fully functional reuse support system with clustering of search results. Construction of the reuse support system from an existing code reuse repository is automated, and the end user is given an easy-to-use, familiar browser-based front-end. An automated method called LSI is used to handle synonyms, and to some degree polysemous words, in the indexed components.

In the course of this project, the reuse support system has been tested using components from two sources, and its retrieval performance has been measured and found acceptable. Clustering usability has been evaluated, and the clusters are found to be generally helpful, even though some fine-tuning remains to be done.


  1. A reuse repository with automated synonym support and cluster generation. Laust Rud Jensen, Århus University, 2004
  2. Outline: 1. Introduction, 2. Problem, 3. Solution, 4. Experiments, 5. Conclusion
  3. 1. Introduction • Constructed a reuse support system • Fully functional prototype with performance enabling interactive use • Usable for other applications, as it is a general system
  4. Outline: 1. Introduction, 2. Problem, 3. Solution, 4. Experiments, 5. Conclusion
  5. 2. Problem • Reuse is a Good Thing, but not done enough • Code reuse repository available, but needs a search function
  6. Java • Platform independent • Easy to use • Well documented using Javadoc
  7. Javadoc problems • Browsing for relevant components is insufficient: assumes existing knowledge; information overload
  8. Simple: keyword search • Exact word matching • Too literal, but stemming can help • Vocabulary mismatch: <20% agreement [Furnas, 1987]
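To make the stemming point on slide 8 concrete, here is a minimal Python sketch. The toy stem() function is purely illustrative (a real system would use a proper stemmer such as Porter's), and the document and query strings are hypothetical examples, not data from the thesis.

    # Toy stemmer: naive suffix stripping, for illustration only.
    def stem(word: str) -> str:
        if word.endswith("ies"):
            return word[:-3] + "y"        # directories -> directory
        if word.endswith("ing") and len(word) > 5:
            return word[:-3]              # matching -> match
        if word.endswith("ed") and len(word) > 4:
            return word[:-2]              # matched -> match
        if word.endswith("s") and not word.endswith("ss"):
            return word[:-1]              # creates -> create
        return word

    doc = "creates directories and parent directories".split()
    query = {"create", "directory"}

    exact_hits = [w for w in doc if w in query]
    stemmed_hits = [w for w in doc if stem(w) in {stem(q) for q in query}]

    print(exact_hits)    # [] -- exact matching finds nothing
    print(stemmed_hits)  # ['creates', 'directories', 'directories']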
  9. Outline: 1. Introduction, 2. Problem, 3. Solution, 4. Experiments, 5. Conclusion
  10. 3. Solution • Create a search engine • Use automated indexing methods • Automatic synonym handling • Grouping search results to assist in information discernment
  11. Search engine technology • Information Retrieval method • Vector Space Model • Latent Semantic Indexing • Clustering done by existing Open Source system
  12. Vector Space Model. Example document contents with simple binary term weightings (rows: cosmonaut, astronaut, moon, car, truck; columns: documents d_1 to d_6):

      X = \begin{pmatrix} 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 \end{pmatrix}
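As a minimal sketch of how the Vector Space Model uses this matrix (assuming numpy; this is an illustration, not the project's code), the snippet below treats the columns of X as document vectors and measures similarity as the cosine of the angle between them.

    import numpy as np

    # The slide's binary term-document matrix (terms x documents).
    X = np.array([
        [1, 0, 1, 0, 0, 0],   # cosmonaut
        [0, 1, 0, 0, 0, 0],   # astronaut
        [1, 1, 0, 0, 0, 0],   # moon
        [1, 0, 0, 1, 1, 0],   # car
        [0, 0, 0, 1, 0, 1],   # truck
    ], dtype=float)

    def cosine(a, b):
        # Cosine similarity between two document vectors.
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(round(cosine(X[:, 0], X[:, 1]), 2))  # 0.41 -- d1, d2 share "moon"
    print(round(cosine(X[:, 1], X[:, 2]), 2))  # 0.0  -- d2, d3 share nothing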
  13. Latent Semantic Indexing. Performing SVD on the example X from the previous slide gives X = T_0 S_0 D_0^T, where

      T_0 = \begin{pmatrix} 0.44 & 0.30 & -0.57 & 0.58 & 0.25 \\ 0.13 & 0.33 & 0.59 & 0.00 & 0.73 \\ 0.48 & 0.51 & 0.37 & 0.00 & -0.61 \\ 0.70 & -0.35 & -0.15 & -0.58 & 0.16 \\ 0.26 & -0.65 & 0.41 & 0.58 & -0.09 \end{pmatrix}

      S_0 = \mathrm{diag}(2.16,\ 1.59,\ 1.28,\ 1.00,\ 0.39)

      The diagonal of S_0 contains the singular values of X; the reduction keeps the k = 2 largest.
  14. Latent Semantic Indexing (continued). The documents cover two different topics (space-related and vehicle-related terms), even though the choice of words within similar documents differs: compare d_5 and d_6, which have no terms in common. The example D_0 matrix, with the shaded part being the k = 2 reduction D:

      D_0 = \begin{pmatrix} 0.75 & 0.29 & -0.28 & -0.00 & -0.53 \\ 0.28 & 0.53 & 0.75 & 0.00 & 0.29 \\ 0.20 & 0.19 & -0.45 & 0.58 & 0.63 \\ 0.45 & -0.63 & 0.20 & -0.00 & 0.19 \\ 0.33 & -0.22 & -0.12 & -0.58 & 0.41 \\ 0.12 & -0.41 & 0.33 & 0.58 & -0.22 \end{pmatrix}

      Recombining the reduced matrices gives the rank-2 approximation (rows: cosmonaut, astronaut, moon, car, truck; columns: d_1 to d_6):

      \hat{X} = \begin{pmatrix} 0.85 & 0.52 & 0.28 & 0.13 & 0.21 & -0.08 \\ 0.36 & 0.36 & 0.16 & -0.21 & -0.03 & -0.18 \\ 1.00 & 0.71 & 0.36 & -0.05 & 0.16 & -0.21 \\ 0.98 & 0.13 & 0.21 & 1.03 & 0.62 & 0.41 \\ 0.13 & -0.39 & -0.08 & 0.90 & 0.41 & 0.49 \end{pmatrix}
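The decomposition on these two slides can be checked with a few lines of numpy. A minimal sketch (note that column signs of T_0 and D_0 are arbitrary in an SVD and may come out flipped relative to the slides):

    import numpy as np

    X = np.array([[1, 0, 1, 0, 0, 0],   # cosmonaut
                  [0, 1, 0, 0, 0, 0],   # astronaut
                  [1, 1, 0, 0, 0, 0],   # moon
                  [1, 0, 0, 1, 1, 0],   # car
                  [0, 0, 0, 1, 0, 1]],  # truck
                 dtype=float)

    # Full SVD: X = T0 @ diag(s) @ D0t.
    T0, s, D0t = np.linalg.svd(X, full_matrices=False)
    print(s.round(2))          # [2.16 1.59 1.28 1.   0.39] = diag(S0)

    # Rank-2 approximation: keep only the two largest singular values.
    k = 2
    X_hat = T0[:, :k] @ np.diag(s[:k]) @ D0t[:k, :]
    print(X_hat.round(2)[0])   # ~[ 0.85  0.52  0.28  0.13  0.21 -0.08]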
  15. Matching a query • Query: "moon astronaut" • Term order: [cosmonaut, astronaut, moon, car, truck], so the query vector is X_q = (0, 1, 1, 0, 0)^T. Forming X_q and performing the calculations leading to the vector D_q for the query document:

      D_q = X_q^T T S^{-1} = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} 0.44 & 0.30 \\ 0.13 & 0.33 \\ 0.48 & 0.51 \\ 0.70 & -0.35 \\ 0.26 & -0.65 \end{pmatrix} \begin{pmatrix} 2.16 & 0.00 \\ 0.00 & 1.59 \end{pmatrix}^{-1} = \begin{pmatrix} 0.28 & 0.53 \end{pmatrix}

      Similarity between the query document and the original documents:

                 d_1    d_2    d_3    d_4    d_5    d_6
      X          0.41   1.00   0.00   0.00   0.00   0.00
      X-hat      0.75   1.00   0.94  -0.45  -0.11  -0.71
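A short numpy sketch of the query step above, again as an illustration rather than the project's code (D_q is printed as absolute values because SVD sign conventions differ between libraries; the cosine similarities themselves are sign-invariant):

    import numpy as np

    X = np.array([[1, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0],
                  [1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 1, 0],
                  [0, 0, 0, 1, 0, 1]], dtype=float)
    T0, s, D0t = np.linalg.svd(X, full_matrices=False)
    k = 2

    # Query "moon astronaut" over [cosmonaut, astronaut, moon, car, truck].
    Xq = np.array([0, 1, 1, 0, 0], dtype=float)

    # Fold the query into the k-dimensional space: Dq = Xq^T T_k S_k^-1.
    Dq = Xq @ T0[:, :k] / s[:k]
    print(np.abs(Dq).round(2))   # [0.28 0.53]

    # Documents live in the same space as the rows of D; rank by cosine.
    D = D0t[:k, :].T
    sims = D @ Dq / (np.linalg.norm(D, axis=1) * np.linalg.norm(Dq))
    print(sims.round(2))         # [ 0.75  1.    0.94 -0.45 -0.11 -0.71]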
  16. Clustering: Carrot. [Figure 5.4: Data flow when processing in Carrot, from input to user-interface output. From the Carrot manual, available from the Carrot homepage.]
  17. Outline: 1. Introduction, 2. Problem, 3. Solution, 4. Experiments, 5. Conclusion
  18. 4. Experiments • Performance measurement • Tuning representation to data • Evaluating clusters
  19. Precision and recall. Precision and recall are the traditional measurements for gauging the performance of an IR system. Precision is the proportion of retrieved material which is actually relevant to a given query, and recall is the proportion of relevant material actually retrieved:

      \text{precision} = \frac{\#\text{relevant retrieved}}{\#\text{total retrieved}} \qquad \text{recall} = \frac{\#\text{relevant retrieved}}{\#\text{total relevant in collection}}

      As the two measurements are defined in terms of each other, what is reported is interpolated precision at recall levels of 10%, 20%, ..., 100%, which is then plotted as a graph. Another measurement is average precision.
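The two measures are easy to state in code. A small Python sketch for a single query; the retrieved and relevant lists are hypothetical example data, not results from the experiments:

    def precision(retrieved, relevant):
        # Proportion of retrieved items that are actually relevant.
        hits = len(set(retrieved) & set(relevant))
        return hits / len(retrieved)

    def recall(retrieved, relevant):
        # Proportion of relevant items that were actually retrieved.
        hits = len(set(retrieved) & set(relevant))
        return hits / len(relevant)

    retrieved = ["mkdir()", "isFile()", "length()", "mkdirs()"]
    relevant = ["mkdir()", "mkdirs()", "createNewFolder()"]

    print(precision(retrieved, relevant))  # 0.5      (2 of 4 retrieved)
    print(recall(retrieved, relevant))     # 0.666... (2 of 3 relevant)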
  20. Performance. [Graph: average precision, normal vs. stemmed, as a function of the number of factors (0 to 400).]
  21. Precision/Recall. [Graph: interpolated precision plotted against recall, unstemmed vs. stemmed.]
  22. Average precision. [Graph: precision against number of documents retrieved (10 to 1000, logarithmic scale), unstemmed vs. stemmed.]
  23. Evaluating clusters. Clusters produced for one query:

      Algorithm   Cluster   Elements   Cluster title
      Lingo       L1        6          Denoted by this Abstract Pathname
                  L2        3          Parent Directory
                  L3        2          File Objects
                  L4        2          Attributes
                  L5        2          Value
                  L6        5          Array of Strings
                  L7        5          (Other)
                  Total listed: 7 clusters, 23 elements
      STC         S1        16         files, directories, pathname
  24. Evaluating clusters (continued). Method-to-cluster assignments (•: member, ◦: not a member; Rel.: relevant):

      #   Method                    L1  L2  L3  L4  L5  L6  L7  S1  S2  S3  Rel.
      1   mkdir()                   •   •   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   •
      2   mkdirs()                  •   ◦   •   ◦   ◦   ◦   ◦   ◦   •   ◦   •
      3   createSubcontext()        ◦   ◦   ◦   ◦   •   ◦   ◦   ◦   ◦   ◦   •
      4   isDirectory()             ◦   •   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   ◦
      5   setCacheDirectory()       ◦   ◦   ◦   ◦   ◦   •   ◦   ◦   •   •   •
      6   isFile()                  ◦   ◦   ◦   ◦   ◦   ◦   ◦   •   •   ◦   •
      7   getCanonicalPath()        ◦   ◦   ◦   ◦   ◦   ◦   ◦   •   •   ◦   ◦
      8   delete()                  ◦   •   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   ◦
      9   createSubcontext()        ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦
      10  createNewFolder()         ◦   ◦   ◦   ◦   •   ◦   ◦   •   ◦   •   •
      11  create environment()      ◦   ◦   ◦   •   ◦   ◦   ◦   ◦   ◦   ◦   •
      12  listFiles()               ◦   •   ◦   ◦   ◦   ◦   •   ◦   •   •   ◦
      13  getParentDirectory()      ◦   ◦   •   •   ◦   ◦   ◦   ◦   •   ◦   ◦
      14  list()                    ◦   •   ◦   ◦   ◦   ◦   •   ◦   •   •   ◦
      15  createTempFile()          ◦   ◦   ◦   ◦   ◦   ◦   ◦   •   •   •   •
      16  setCurrentDirectory()     ◦   ◦   •   ◦   ◦   ◦   ◦   ◦   •   •   ◦
      17  length()                  ◦   •   ◦   ◦   ◦   •   ◦   ◦   ◦   ◦   ◦
      18  createFileSystemRoot()    ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   •
      19  createTempFile()          ◦   ◦   ◦   ◦   ◦   ◦   ◦   •   •   •   •
      20  listRoots()               ◦   ◦   ◦   •   ◦   ◦   ◦   ◦   •   ◦   ◦
  25. Outline: 1. Introduction, 2. Problem, 3. Solution, 4. Experiments, 5. Conclusion
  26. Future work • Extensions: feedback mechanism; additional experiments with stop-words and weighting • Applications: two-way Javadoc integration; other applications with more text
  27. Conclusion • Fully functional prototype • Clustering helpful but needs more work • Synonymy is handled automatically; polysemy is only partly addressed
