A reuse repository was constructed with an automated search engine and synonym support. The system uses information retrieval methods like vector space modeling and latent semantic indexing to index components. Clustering is done by an open source system to group similar search results. Experiments were conducted to measure performance, tune the representation to the data, and evaluate the generated clusters.
3. 1. Introduction
• Constructed a reuse support system
• Fully functional prototype with performance enabling interactive use
• Usable for other applications, as it is a general system
12. 3. Solution
• Create a search engine
• Use automated indexing methods
• Automatic synonym handling
• Grouping search results to assist in information discernment
13. Search engine technology
• Information Retrieval method
• Vector Space Model
• Latent Semantic Indexing
• Clustering done by existing Open Source system
14. Vector Space Model
X =
            d1  d2  d3  d4  d5  d6
cosmonaut    1   0   1   0   0   0
astronaut    0   1   0   0   0   0
moon         1   1   0   0   0   0
car          1   0   0   1   1   0
truck        0   0   0   1   0   1
Figure 3.1: Example document contents with simple binary term weightings.
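The matrix above can be rebuilt from the document contents, and documents can then be compared by the cosine of the angle between their column vectors. The following is a minimal sketch of that idea, not the system's actual indexing code: the `DOCS` dictionary is read off the columns of X, and the helper names `binary_vector` and `cosine` are illustrative assumptions.

```python
import math

# Terms (rows) and documents (columns) of the example matrix X.
TERMS = ["cosmonaut", "astronaut", "moon", "car", "truck"]

# Assumed document contents, read off the columns of X.
DOCS = {
    "d1": {"cosmonaut", "moon", "car"},
    "d2": {"astronaut", "moon"},
    "d3": {"cosmonaut"},
    "d4": {"car", "truck"},
    "d5": {"car"},
    "d6": {"truck"},
}


def binary_vector(words):
    """Binary term weighting: 1 if the term occurs in the document, else 0."""
    return [1 if term in words else 0 for term in TERMS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = {name: binary_vector(words) for name, words in DOCS.items()}

# d1 and d2 share only "moon"; d5 and d6 share no term at all.
print(round(cosine(vectors["d1"], vectors["d2"]), 2))
print(cosine(vectors["d5"], vectors["d6"]))
```

In the plain vector space model, two documents about the same topic that use different words (such as d5 and d6) get a similarity of zero, which is the limitation latent semantic indexing is meant to address.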
15. Latent Semantic Indexing
k = 2
\[
T_0 = \begin{bmatrix}
0.44 & 0.30 & -0.57 & 0.58 & 0.25 \\
0.13 & 0.33 & 0.59 & 0.00 & 0.73 \\
0.48 & 0.51 & 0.37 & 0.00 & -0.61 \\
0.70 & -0.35 & -0.15 & -0.58 & 0.16 \\
0.26 & -0.65 & 0.41 & 0.58 & -0.09
\end{bmatrix} \tag{3.15}
\]
Figure 3.4: The example T0 matrix resulting from SVD being performed on X from figure 3.1 on page 15. Rows correspond to the terms cosmonaut, astronaut, moon, car and truck.
\[
S_0 = \begin{bmatrix}
2.16 & 0.00 & 0.00 & 0.00 & 0.00 \\
0.00 & 1.59 & 0.00 & 0.00 & 0.00 \\
0.00 & 0.00 & 1.28 & 0.00 & 0.00 \\
0.00 & 0.00 & 0.00 & 1.00 & 0.00 \\
0.00 & 0.00 & 0.00 & 0.00 & 0.39
\end{bmatrix} \tag{3.16}
\]
Figure 3.5: The example S0 matrix. The diagonal contains the singular values of X.
\[
D_0 = \begin{bmatrix}
0.75 & 0.29 & -0.28 & -0.00 & -0.53 \\
0.28 & 0.53 & 0.75 & 0.00 & 0.29 \\
0.20 & 0.19 & -0.45 & 0.58 & 0.63 \\
0.45 & -0.63 & 0.20 & -0.00 & 0.19 \\
0.33 & -0.22 & -0.12 & -0.58 & 0.41 \\
0.12 & -0.41 & 0.33 & 0.58 & -0.22
\end{bmatrix} \tag{3.17}
\]
Figure 3.6: The example D0 matrix. Rows correspond to documents d1 to d6; the first k = 2 columns (shaded in the original figure) form the D matrix.
16. Latent Semantic Indexing
The documents seem to cover two different topics, namely space and [...]; of course the choice of words within the similar documents differs. There also seems to be some overlap between documents. Compare documents d5 and d6. Apparently these documents have nothing in common [...]
X =
            d1  d2  d3  d4  d5  d6
cosmonaut    1   0   1   0   0   0
astronaut    0   1   0   0   0   0
moon         1   1   0   0   0   0
car          1   0   0   1   1   0
truck        0   0   0   1   0   1
Figure 3.1: Example document contents with simple binary term weightings.
X̂ =                                                  (3.18)
              d1     d2     d3     d4     d5     d6
cosmonaut    0.85   0.52   0.28   0.13   0.21  −0.08
astronaut    0.36   0.36   0.16  −0.21  −0.03  −0.18
moon         1.00   0.71   0.36  −0.05   0.16  −0.21
car          0.98   0.13   0.21   1.03   0.62   0.41
truck        0.13  −0.39  −0.08   0.90   0.41   0.49
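The reduced matrix X̂ can be recomputed from the truncated SVD factors as X̂ = T S Dᵀ with k = 2. The sketch below is illustrative only: the two-column factors are copied from the example matrices under the assumption that each row of T belongs to one term and each row of D to one document, and the function name `reconstruct` is hypothetical. Because the inputs are rounded to two decimals, the results agree with (3.18) only up to rounding.

```python
TERMS = ["cosmonaut", "astronaut", "moon", "car", "truck"]

# First k = 2 columns of T0 (assumed: one row per term).
T = [
    [0.44, 0.30],
    [0.13, 0.33],
    [0.48, 0.51],
    [0.70, -0.35],
    [0.26, -0.65],
]

# The two largest singular values (diagonal of S).
S = [2.16, 1.59]

# First k = 2 columns of D0 (assumed: one row per document d1..d6).
D = [
    [0.75, 0.29],
    [0.28, 0.53],
    [0.20, 0.19],
    [0.45, -0.63],
    [0.33, -0.22],
    [0.12, -0.41],
]


def reconstruct(t_rows, singular_values, d_rows):
    """Rank-k approximation: x_hat[i][j] = sum_k T[i][k] * S[k] * D[j][k]."""
    k = len(singular_values)
    return [
        [sum(t[f] * singular_values[f] * d[f] for f in range(k)) for d in d_rows]
        for t in t_rows
    ]


x_hat = reconstruct(T, S, D)
for term, row in zip(TERMS, x_hat):
    print(f"{term:<10}" + " ".join(f"{value:6.2f}" for value in row))
```

Note how the entries for car and truck in d5 and d6 are both clearly positive in X̂, although the two documents share no term in X.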
17. Matching a query
• “moon astronaut”
• [cosmonaut, astronaut, moon, car, truck]
\[
X_q = \begin{bmatrix} 0 & 1 & 1 & 0 & 0 \end{bmatrix}^T
\]
\[
D_q = X_q^T \, T \, S^{-1}
= \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}^T
\begin{bmatrix}
0.44 & 0.30 \\
0.13 & 0.33 \\
0.48 & 0.51 \\
0.70 & -0.35 \\
0.26 & -0.65
\end{bmatrix}
\begin{bmatrix}
2.16 & 0.00 \\
0.00 & 1.59
\end{bmatrix}^{-1}
= \begin{bmatrix} 0.28 & 0.53 \end{bmatrix}
\]
Figure 3.8: Forming Xq and performing the calculations leading to the vector for the query document, Dq.
        d1     d2     d3     d4     d5     d6
X      0.41   1.00   0.00   0.00   0.00   0.00
X̂      0.75   1.00   0.94  −0.45  −0.11  −0.71
Table 3.2: Similarity between query document and original documents.
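The query calculation can be sketched in a few lines: the query is turned into a binary term vector, folded into the reduced space as Dq = Xqᵀ T S⁻¹, and compared with each document row of D by cosine similarity. This is a hedged illustration rather than the system's implementation. The factor values are the rounded two-column matrices from the example (rows of T assumed to be terms, rows of D assumed to be documents), the names `fold_in` and `cosine` are hypothetical, and the similarities are computed directly on the rows of D without rescaling by S, which is an assumption that reproduces the X̂ row of Table 3.2 to within rounding.

```python
import math

TERMS = ["cosmonaut", "astronaut", "moon", "car", "truck"]

# Rank-2 factors copied from the example (rounded to two decimals).
T = [[0.44, 0.30], [0.13, 0.33], [0.48, 0.51], [0.70, -0.35], [0.26, -0.65]]
S = [2.16, 1.59]
D = [[0.75, 0.29], [0.28, 0.53], [0.20, 0.19],
     [0.45, -0.63], [0.33, -0.22], [0.12, -0.41]]


def fold_in(query_words):
    """Project a query into the reduced space: Dq = Xq^T T S^-1."""
    xq = [1 if term in query_words else 0 for term in TERMS]
    return [
        sum(x * row[f] for x, row in zip(xq, T)) / S[f]
        for f in range(len(S))
    ]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


dq = fold_in({"moon", "astronaut"})
similarities = [cosine(dq, doc) for doc in D]

print([round(value, 2) for value in dq])
for name, sim in zip(["d1", "d2", "d3", "d4", "d5", "d6"], similarities):
    print(name, round(sim, 2))
```

The query matches d3 strongly in the reduced space even though d3 contains neither "moon" nor "astronaut", which is the synonym effect the slide illustrates.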
18. Clustering: Carrot
Figure 5.4: Data flow when processing in Carrot from input to output. From the manual, available from the Carrot homepage.
22. Precision and recall
Precision and recall are the traditional measurements for gauging the performance of an IR system. Precision is the proportion of retrieved material which is actually relevant to a given query, and recall is the proportion of relevant material actually retrieved:
\[
\text{precision} = \frac{\#\,\text{relevant retrieved}}{\#\,\text{total retrieved}}
\]
\[
\text{recall} = \frac{\#\,\text{relevant retrieved}}{\#\,\text{total relevant in collection}}
\]
As the two measurements are defined in terms of each other, precision is interpolated at recall levels of 10%, 20%, ..., 100%, and these values are then plotted as a graph. Another measurement is [...]
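The two definitions translate directly into code. The sketch below is an illustration under stated assumptions, not the evaluation code used in the experiments: the function names are hypothetical, and interpolated precision at a recall level is taken to be the highest precision observed at any rank whose recall is at least that level, which is the common convention but is not spelled out on the slide.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def interpolated_precision(ranking, relevant, levels=None):
    """Interpolated precision at the given recall levels (default 10%..100%).

    Assumed convention: the value at level r is the maximum precision found
    at any rank position whose recall is >= r, or 0.0 if r is never reached.
    """
    relevant = set(relevant)
    if levels is None:
        levels = [i / 10 for i in range(1, 11)]
    if not relevant:
        return [0.0 for _ in levels]

    # (recall, precision) after each position in the ranked result list.
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))

    result = []
    for level in levels:
        candidates = [p for r, p in points if r >= level - 1e-12]
        result.append(max(candidates) if candidates else 0.0)
    return result


# Hypothetical ranked result list; "a" and "b" are the relevant documents.
ranking = ["a", "x", "b", "y"]
relevant = {"a", "b"}
print(precision_recall(ranking[:2], relevant))
print(interpolated_precision(ranking, relevant))
```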
23. Performance
[Figure: average precision (y-axis, 0 to 1) plotted against number of factors (x-axis, 0 to 400), with two series: average precision normal and average precision stemmed.]