Slides for presentation of "A reuse repository with automated synonym support and cluster generation"

A reuse repository with automated synonym support and cluster generation

Laust Rud Jensen
Århus University, 2004

Outline

1. Introduction
2. Problem
3. Solution
4. Experiments
5. Conclusion

1. Introduction

• Constructed a reuse support system
• Fully functional prototype with performance enabling interactive use
• Usable for other applications, as it is a general system

Outline

1. Introduction
2. Problem
3. Solution
4. Experiments
5. Conclusion

2. Problem

• Reuse is a Good Thing, but not done enough
• Code reuse repository available, but needs a search function

Java

• Platform independent
• Easy to use
• Well documented using Javadoc

Javadoc problems

• Browsing for relevant components is insufficient:
  • Assumes existing knowledge
  • Information overload

Simple: keyword search

• Exact word matching
• Too literal, but stemming can help
• Vocabulary mismatch: two people pick the same word for the same thing less than 20% of the time [Furnas, 1987]
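The difference between exact and stemmed keyword matching can be sketched as follows. This is an illustration only: `naive_stem` is a hypothetical suffix stripper standing in for a real stemmer (the slides do not say which stemmer the prototype uses), and the component descriptions are made up.

```python
def naive_stem(word: str) -> str:
    """Crude suffix stripping; a hypothetical stand-in for a real stemmer."""
    word = word.lower()
    for suffix, replacement in (("ies", "y"), ("es", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word


def keyword_search(query, docs, normalise=str.lower):
    """Return ids of documents sharing at least one normalised word with the query."""
    wanted = {normalise(w) for w in query.split()}
    return [doc_id for doc_id, text in docs.items()
            if wanted & {normalise(w) for w in text.split()}]


# Made-up component descriptions, for illustration only.
docs = {
    "mkdir": "creates the directory named by this pathname",
    "listRoots": "lists the available filesystem roots",
    "listFiles": "returns the files in all directories",
}

exact = keyword_search("directories", docs)                          # literal match only
stemmed = keyword_search("directories", docs, normalise=naive_stem)  # also matches "directory"
synonym = keyword_search("folder", docs, normalise=naive_stem)       # no shared word at all
```

Exact matching finds only `listFiles`, stemming also finds `mkdir`, and the query `folder` still finds nothing. That last case is the vocabulary mismatch that stemming alone does not address.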

Outline

1. Introduction
2. Problem
3. Solution
4. Experiments
5. Conclusion

3. Solution

• Create a search engine
• Use automated indexing methods
• Automatic synonym handling
• Grouping search results to assist in information discernment

Search engine technology

• Information Retrieval method
• Vector Space Model
• Latent Semantic Indexing
• Clustering done by existing Open Source system

Vector Space Model

X =
            d1  d2  d3  d4  d5  d6
cosmonaut    1   0   1   0   0   0
astronaut    0   1   0   0   0   0
moon         1   1   0   0   0   0
car          1   0   0   1   1   0
truck        0   0   0   1   0   1

Figure 3.1: Example document contents with simple binary term weightings.
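A minimal sketch of how the Vector Space Model compares documents, using the binary matrix from this slide. The function and variable names are illustrative and not taken from the prototype.

```python
import math

TERMS = ["cosmonaut", "astronaut", "moon", "car", "truck"]
# Rows follow TERMS, columns are documents d1..d6 (binary weights from the slide).
X = [
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
]


def column(matrix, j):
    """Extract document j as a term vector."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine of the angle between two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


d1, d2, d3 = column(X, 0), column(X, 1), column(X, 2)
sim_d2_d3 = cosine(d2, d3)  # d2 = {astronaut, moon}, d3 = {cosmonaut}: no shared term
sim_d1_d2 = cosine(d1, d2)  # the two documents share only "moon"
```

Documents d2 and d3 get similarity 0 because they share no literal term, even though astronaut and cosmonaut are near synonyms. Plain term vectors cannot see that relation.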

Latent Semantic Indexing

$$
T_0 = \begin{pmatrix}
0.44 & 0.30 & -0.57 & 0.58 & 0.25 \\
0.13 & 0.33 & 0.59 & 0.00 & 0.73 \\
0.48 & 0.51 & 0.37 & 0.00 & -0.61 \\
0.70 & -0.35 & -0.15 & -0.58 & 0.16 \\
0.26 & -0.65 & 0.41 & 0.58 & -0.09
\end{pmatrix} \tag{3.15}
$$

Figure 3.4: The example T0 matrix resulting from SVD being performed on X from figure 3.1. Rows follow the term order of X.

$$
S_0 = \begin{pmatrix}
2.16 & 0.00 & 0.00 & 0.00 & 0.00 \\
0.00 & 1.59 & 0.00 & 0.00 & 0.00 \\
0.00 & 0.00 & 1.28 & 0.00 & 0.00 \\
0.00 & 0.00 & 0.00 & 1.00 & 0.00 \\
0.00 & 0.00 & 0.00 & 0.00 & 0.39
\end{pmatrix} \tag{3.16}
$$

Figure 3.5: The example S0 matrix. The diagonal contains the singular values of X.

$$
D_0 = \begin{pmatrix}
0.75 & 0.29 & -0.28 & -0.00 & -0.53 \\
0.28 & 0.53 & 0.75 & 0.00 & 0.29 \\
0.20 & 0.19 & -0.45 & 0.58 & 0.63 \\
0.45 & -0.63 & 0.20 & -0.00 & 0.19 \\
0.33 & -0.22 & -0.12 & -0.58 & 0.41 \\
0.12 & -0.41 & 0.33 & 0.58 & -0.22
\end{pmatrix} \tag{3.17}
$$

Figure 3.6: The example D0 matrix, and the shaded part (the first k columns) is the D matrix. Rows are the documents d1 to d6.

k = 2

X̂ =
              d1     d2     d3     d4     d5     d6
cosmonaut    0.85   0.52   0.28   0.13   0.21  -0.08
astronaut    0.36   0.36   0.16  -0.21  -0.03  -0.18
moon         1.00   0.71   0.36  -0.05   0.16  -0.21
car          0.98   0.13   0.21   1.03   0.62   0.41
truck        0.13  -0.39  -0.08   0.90   0.41   0.49
(3.18)
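For reference, the decomposition and the rank-k approximation behind these matrices can be written as below. This is the standard LSI formulation, stated here as an assumption about the notation: T, S and D denote the first k columns of T0, S0 and D0. With k = 2, the cosmonaut/d1 entry of X̂ serves as a quick check, using the cosmonaut values (0.44, 0.30) from T0, the two largest singular values, and the d1 values (0.75, 0.29) from D0.

```latex
X = T_0 S_0 D_0^{T}, \qquad \hat{X} = T S D^{T}

\hat{X}_{\text{cosmonaut},\,d_1}
  = 0.44 \cdot 2.16 \cdot 0.75 + 0.30 \cdot 1.59 \cdot 0.29
  \approx 0.71 + 0.14 = 0.85
```

The result matches the first entry of (3.18).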

Matching a query

• Query: "moon astronaut"
• Term order: [cosmonaut, astronaut, moon, car, truck]

$$
X_q = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \end{pmatrix}^{T}
$$

$$
D_q = X_q^{T} \, T \, S^{-1}
= \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \end{pmatrix}
\begin{pmatrix}
0.44 & 0.30 \\
0.13 & 0.33 \\
0.48 & 0.51 \\
0.70 & -0.35 \\
0.26 & -0.65
\end{pmatrix}
\begin{pmatrix}
2.16 & 0.00 \\
0.00 & 1.59
\end{pmatrix}^{-1}
= \begin{pmatrix} 0.28 & 0.53 \end{pmatrix}
$$

Figure 3.8: Forming Xq and performing the calculations leading to the vector for the query document, Dq.

       d1     d2     d3     d4     d5     d6
X     0.41   1.00   0.00   0.00   0.00   0.00
X̂     0.75   1.00   0.94  -0.45  -0.11  -0.71

Table 3.2: Similarity between query document and original documents.
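The calculation in Figure 3.8 and the X̂ row of Table 3.2 can be reproduced in plain Python. This is a sketch under assumptions: T and D are taken to be the first two columns (k = 2) of the T0 and D0 values from the Latent Semantic Indexing slide, with rows of T in the term order above, and similarity is taken to be the cosine between Dq and each row of D. All names are illustrative.

```python
import math

# First two columns of T0 (rows: cosmonaut, astronaut, moon, car, truck).
T = [(0.44, 0.30), (0.13, 0.33), (0.48, 0.51), (0.70, -0.35), (0.26, -0.65)]
# The two largest singular values, i.e. the diagonal of S.
S = (2.16, 1.59)
# First two columns of D0 (rows: d1..d6).
D = [(0.75, 0.29), (0.28, 0.53), (0.20, 0.19),
     (0.45, -0.63), (0.33, -0.22), (0.12, -0.41)]


def fold_in(x_q):
    """Project a term vector into the k-dimensional space: Xq^T T S^-1."""
    return tuple(
        sum(x * row[k] for x, row in zip(x_q, T)) / S[k]
        for k in range(len(S))
    )


def cosine(u, v):
    """Cosine similarity between two 2-dimensional vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


x_q = (0, 1, 1, 0, 0)  # "moon astronaut" in the term order above
d_q = fold_in(x_q)     # rounds to (0.28, 0.53)
sims = [cosine(d_q, d) for d in D]
# Document indices (0 = d1) ordered from most to least similar.
ranking = sorted(range(len(D)), key=lambda i: sims[i], reverse=True)
```

Because the inputs are rounded to two decimals, the similarities agree with the X̂ row of Table 3.2 only to within about 0.02. The ranking puts d2 first and d3 second, although d3 contains only the term cosmonaut.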

Clustering: Carrot

Figure 5.4: Data flow when processing in Carrot from input to output. From the manual, available from the Carrot homepage.

Outline

1. Introduction
2. Problem
3. Solution
4. Experiments
5. Conclusion

4. Experiments

• Performance measurement
• Tuning representation to data
• Evaluating clusters

Precision and recall

Precision and recall are the traditional measurements for gauging the performance of an IR system. Precision is the proportion of retrieved material which is actually relevant to a given query, and recall is the proportion of relevant material actually retrieved:

$$
\text{precision} = \frac{\#\text{relevant retrieved}}{\#\text{total retrieved}}
\qquad
\text{recall} = \frac{\#\text{relevant retrieved}}{\#\text{total relevant in collection}}
$$

As the two measurements are defined in terms of each other, interpolated precision at recall levels of 10%, 20%, ..., 100% is plotted as a graph.
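The two definitions, and one way of computing interpolated precision, can be sketched as follows. The interpolation rule is an assumption: the slides do not define it, so the common convention is used here, taking the highest precision observed at any recall at or above each level. The ranking and relevance judgements are made up.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against a set of relevant ids."""
    relevant = set(relevant)
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def interpolated_precision(ranking, relevant, levels=10):
    """Interpolated precision at recall 1/levels, 2/levels, ..., 1 (assumed rule)."""
    relevant = set(relevant)
    if not relevant:
        return [0.0] * levels
    points = []  # (recall, precision) after each rank position
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        hits += doc in relevant
        points.append((hits / len(relevant), hits / rank))
    result = []
    for i in range(1, levels + 1):
        level = i / levels
        reachable = [p for r, p in points if r >= level]
        result.append(max(reachable, default=0.0))
    return result


# Made-up ranking with 2 relevant methods among 5 results.
ranking = ["mkdir", "isFile", "mkdirs", "delete", "list"]
relevant = {"mkdir", "mkdirs"}
p, r = precision_recall(ranking, relevant)         # precision 2/5, recall 2/2
curve = interpolated_precision(ranking, relevant)  # one value per 10% recall level
```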

Performance

[Figure: average precision (0 to 1) plotted against number of factors (0 to 400), with one curve for normal and one for stemmed terms.]
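The slides do not say how average precision was computed. One common definition, assumed here purely for illustration, is the mean of the precision values at each rank where a relevant item appears. The ranking below is made up.

```python
def average_precision(ranking, relevant):
    """Mean precision over the ranks at which relevant items occur.

    Relevant items that are never retrieved contribute zero (assumed convention).
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


ranking = ["mkdir", "isFile", "mkdirs", "delete", "list"]
ap = average_precision(ranking, {"mkdir", "mkdirs"})  # (1/1 + 2/3) / 2
```

Averaging this value over a set of test queries for each number of factors would give one point per setting on a plot like the one above.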

Precision/Recall

[Figure: precision (0 to 1) plotted against recall (0 to 1), with curves for interpolated recall, unstemmed and stemmed.]

Average precision

[Figure: precision (0 to 0.2) plotted against documents retrieved (10 to 1000), with curves for unstemmed and stemmed.]

Evaluating clusters

Cluster  Elements  Cluster title
Lingo
L1       6         Denoted by this Abstract Pathname
L2       3         Parent Directory
L3       2         File Objects
L4       2         Attributes
L5       2         Value
L6       5         Array of Strings
L7       5         (Other)
7        23        Total listed
STC
S1       16        files, directories, pathname

 #  Method                   Rel.  L1  L2  L3  L4  L5  L6  L7  S1  S2  S3
 1  mkdir()                   •    •   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   •
 2  mkdirs()                  •    ◦   •   ◦   ◦   ◦   ◦   ◦   •   ◦   •
 3  createSubcontext()        ◦    ◦   ◦   ◦   •   ◦   ◦   ◦   ◦   ◦   •
 4  isDirectory()             ◦    •   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   ◦
 5  setCacheDirectory()       ◦    ◦   ◦   ◦   ◦   •   ◦   ◦   •   •   •
 6  isFile()                  ◦    ◦   ◦   ◦   ◦   ◦   ◦   •   •   ◦   •
 7  getCanonicalPath()        ◦    ◦   ◦   ◦   ◦   ◦   ◦   •   •   ◦   ◦
 8  delete()                  ◦    •   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   ◦
 9  createSubcontext()        ◦    ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦
10  createNewFolder()         ◦    ◦   ◦   ◦   •   ◦   ◦   •   ◦   •   •
11  create environment()      ◦    ◦   ◦   •   ◦   ◦   ◦   ◦   ◦   ◦   •
12  listFiles()               ◦    •   ◦   ◦   ◦   ◦   •   ◦   •   •   ◦
13  getParentDirectory()      ◦    ◦   •   •   ◦   ◦   ◦   ◦   •   ◦   ◦
14  list()                    ◦    •   ◦   ◦   ◦   ◦   •   ◦   •   •   ◦
15  createTempFile()          ◦    ◦   ◦   ◦   ◦   ◦   ◦   •   •   •   •
16  setCurrentDirectory()     ◦    ◦   •   ◦   ◦   ◦   ◦   ◦   •   •   ◦
17  length()                  ◦    •   ◦   ◦   ◦   •   ◦   ◦   ◦   ◦   ◦
18  createFileSystemRoot()    ◦    ◦   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   •
19  createTempFile()          ◦    ◦   ◦   ◦   ◦   ◦   ◦   •   •   •   •
20  listRoots()               ◦    ◦   ◦   •   ◦   ◦   ◦   ◦   •   ◦   ◦

Outline

1. Introduction
2. Problem
3. Solution
4. Experiments
5. Conclusion

Future work

• Extensions
  • Feedback mechanism
  • Additional experiments: stop-words, weighting
• Applications:
  • Two-way Javadoc integration
  • Other applications; more text

Conclusion

• Fully functional prototype
• Clustering helpful but needs more work
• what else? synonymy vs. polysemy?
