A reuse repository with automated synonym support and cluster generation

Laust Rud Jensen
Århus University, 2004

Outline

1. Introduction
2. Problem
3. Solution
4. Experiments
5. Conclusion


1. Introduction

• Constructed a reuse support system
• Fully functional prototype with performance
  enabling interactive use

• Usable for other applications, as it is a general
  system



2. Problem


• Reuse is a Good Thing, but not done enough
• Code reuse repository available, but needs
  search function




Java


• Platform independent
• Easy to use
• Well documented using Javadoc


Javadoc problems

• Browsing for relevant components is
  insufficient:

  • Assumes existing knowledge
  • Information overload

Simple: keyword search

• Exact word matching
• Too literal, but stemming can help
• Vocabulary mismatch, <20% agreement
  [Furnas, 1987]
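
The limits of keyword search listed above can be illustrated with a small sketch. It is not taken from the slides: `naive_stem` is a toy suffix stripper standing in for a real stemmer, and the sample text and queries are made up for illustration.

```python
def naive_stem(word: str) -> str:
    """Toy stemmer (an assumption, not a real algorithm such as Porter's)."""
    if word.endswith("ies"):
        return word[:-3] + "y"      # directories -> directory
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]            # creates -> create
    return word


def matches(query: str, text: str, stem: bool = False) -> bool:
    """Return True if every query word occurs in the text."""
    normalize = naive_stem if stem else (lambda w: w)
    text_terms = {normalize(w) for w in text.lower().split()}
    return all(normalize(w) in text_terms for w in query.lower().split())


doc = "creates the directories named by this pathname"

# Exact word matching is too literal: "create" does not match "creates".
print(matches("create directory", doc))             # False
# Stemming conflates the word forms, so the same query now matches.
print(matches("create directory", doc, stem=True))  # True
# Vocabulary mismatch remains: a synonym query still finds nothing.
print(matches("make folder", doc, stem=True))       # False
```

The last call shows the case that stemming cannot repair, which is where synonym handling is needed.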



3. Solution

• Create a search engine
• Use automated indexing methods
• Automatic synonym handling
• Grouping search results to assist in
  information discernment



Search engine technology

• Information Retrieval method
• Vector Space Model
• Latent Semantic Indexing
• Clustering done by existing Open Source
  system



Vector Space Model

\[
X =
\begin{array}{l|cccccc}
 & d_1 & d_2 & d_3 & d_4 & d_5 & d_6 \\ \hline
\text{cosmonaut} & 1 & 0 & 1 & 0 & 0 & 0 \\
\text{astronaut} & 0 & 1 & 0 & 0 & 0 & 0 \\
\text{moon}      & 1 & 1 & 0 & 0 & 0 & 0 \\
\text{car}       & 1 & 0 & 0 & 1 & 1 & 0 \\
\text{truck}     & 0 & 0 & 0 & 1 & 0 & 1
\end{array}
\]

Figure 3.1: Example document contents with simple binary term weightings.

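
As a minimal sketch of how documents are compared in the plain vector space model, the following uses the binary matrix from the slide and cosine similarity between document columns. The helper names `column` and `cosine` are illustrative and not part of the presented system.

```python
import math

TERMS = ["cosmonaut", "astronaut", "moon", "car", "truck"]

# Term-document matrix X: one row per term, one column per document d1..d6.
X = [
    [1, 0, 1, 0, 0, 0],  # cosmonaut
    [0, 1, 0, 0, 0, 0],  # astronaut
    [1, 1, 0, 0, 0, 0],  # moon
    [1, 0, 0, 1, 1, 0],  # car
    [0, 0, 0, 1, 0, 1],  # truck
]


def column(matrix, j):
    """Return document j as a vector over the terms."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


d1, d2, d5, d6 = (column(X, j) for j in (0, 1, 4, 5))

print(round(cosine(d1, d2), 2))  # 0.41: the documents share the term "moon"
print(round(cosine(d5, d6), 2))  # 0.0: "car" and "truck" never co-occur
```

Documents d5 and d6 score zero because they share no terms, even though both concern vehicles; this is the kind of mismatch that Latent Semantic Indexing is meant to reduce.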
Latent Semantic Indexing

\[
T_0 =
\begin{bmatrix}
0.44 & 0.30 & -0.57 & 0.58 & 0.25 \\
0.13 & 0.33 & 0.59 & 0.00 & 0.73 \\
0.48 & 0.51 & 0.37 & 0.00 & -0.61 \\
0.70 & -0.35 & -0.15 & -0.58 & 0.16 \\
0.26 & -0.65 & 0.41 & 0.58 & -0.09
\end{bmatrix}
\tag{3.15}
\]

Figure 3.4: The example T0 matrix resulting from SVD being performed on X from figure 3.1 on page 15. Rows correspond to the terms cosmonaut, astronaut, moon, car and truck; columns correspond to the factors.

\[
S_0 =
\begin{bmatrix}
2.16 & 0.00 & 0.00 & 0.00 & 0.00 \\
0.00 & 1.59 & 0.00 & 0.00 & 0.00 \\
0.00 & 0.00 & 1.28 & 0.00 & 0.00 \\
0.00 & 0.00 & 0.00 & 1.00 & 0.00 \\
0.00 & 0.00 & 0.00 & 0.00 & 0.39
\end{bmatrix}
\tag{3.16}
\]

Figure 3.5: The example S0 matrix. The diagonal contains the singular values of X.

k = 2

Latent Semantic Indexing

[...] documents seem to cover two different topics, namely space and [...] of course the choice of words within the similar documents differ. [...] also seems to be some overlap between documents. Compare documents d5 and d6. Apparently these documents have nothing in common [...]

\[
D_0 =
\begin{bmatrix}
0.75 & 0.29 & -0.28 & -0.00 & -0.53 \\
0.28 & 0.53 & 0.75 & 0.00 & 0.29 \\
0.20 & 0.19 & -0.45 & 0.58 & 0.63 \\
0.45 & -0.63 & 0.20 & -0.00 & 0.19 \\
0.33 & -0.22 & -0.12 & -0.58 & 0.41 \\
0.12 & -0.41 & 0.33 & 0.58 & -0.22
\end{bmatrix}
\tag{3.17}
\]

Figure 3.6: The example D0 matrix, and the shaded part is the D matrix.

\[
\hat{X} =
\begin{array}{l|cccccc}
 & d_1 & d_2 & d_3 & d_4 & d_5 & d_6 \\ \hline
\text{cosmonaut} & 0.85 & 0.52 & 0.28 & 0.13 & 0.21 & -0.08 \\
\text{astronaut} & 0.36 & 0.36 & 0.16 & -0.21 & -0.03 & -0.18 \\
\text{moon}      & 1.00 & 0.71 & 0.36 & -0.05 & 0.16 & -0.21 \\
\text{car}       & 0.98 & 0.13 & 0.21 & 1.03 & 0.62 & 0.41 \\
\text{truck}     & 0.13 & -0.39 & -0.08 & 0.90 & 0.41 & 0.49
\end{array}
\tag{3.18}
\]

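
The reduced matrix can be reproduced from the factors as a rough sketch, assuming the first two factors (k = 2) are kept. The values below are the two-decimal figures from the slides, with `T` holding one row per term and `D` one row per document, so results agree with the printed matrix only up to rounding. The function names are illustrative.

```python
import math

TERMS = ["cosmonaut", "astronaut", "moon", "car", "truck"]

# Rank-2 factors: the first two columns of T0 and D0, and the two largest
# singular values from S0 (all rounded to two decimals on the slides).
T = [[0.44, 0.30], [0.13, 0.33], [0.48, 0.51], [0.70, -0.35], [0.26, -0.65]]
S = [2.16, 1.59]
D = [[0.75, 0.29], [0.28, 0.53], [0.20, 0.19],
     [0.45, -0.63], [0.33, -0.22], [0.12, -0.41]]


def reconstruct(T, S, D):
    """Compute the rank-k approximation T * diag(S) * D^T."""
    return [[sum(t[k] * S[k] * d[k] for k in range(len(S))) for d in D]
            for t in T]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


X_hat = reconstruct(T, S, D)
for term, row in zip(TERMS, X_hat):
    print(f"{term:10s}", " ".join(f"{value:6.2f}" for value in row))

# Documents d5 and d6 share no terms in X, yet their columns in the
# reduced matrix point in a similar direction.
d5 = [row[4] for row in X_hat]
d6 = [row[5] for row in X_hat]
print("cosine(d5, d6) in reduced space:", round(cosine(d5, d6), 2))
```

In the original binary matrix the cosine between d5 and d6 is zero; after the reduction it is clearly positive, which is the synonym effect the method relies on.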
Matching a query

• “moon astronaut”
• [cosmonaut, astronaut, moon, car, truck]

\[
X_q = \begin{bmatrix} 0 & 1 & 1 & 0 & 0 \end{bmatrix}^T
\]

\[
D_q = \begin{bmatrix} 0.28 & 0.53 \end{bmatrix}
= \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}^T
\begin{bmatrix}
0.44 & 0.30 \\
0.13 & 0.33 \\
0.48 & 0.51 \\
0.70 & -0.35 \\
0.26 & -0.65
\end{bmatrix}
\begin{bmatrix}
2.16 & 0.00 \\
0.00 & 1.59
\end{bmatrix}^{-1}
\]

Figure 3.8: Forming Xq and performing the calculations leading to the vector for the query document, Dq.

Table 3.2: Similarity between query document and original documents.

      d1     d2     d3     d4     d5     d6
X     0.41   1.00   0.00   0.00   0.00   0.00
X̂     0.75   1.00   0.94  −0.45  −0.11  −0.71

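
The query calculation can be sketched as follows, assuming the rank-2 factors from the example (rounded to two decimals) and cosine similarity against the document rows of D. The function names `fold_in` and `cosine` are illustrative, and the similarities agree with Table 3.2 only up to rounding.

```python
import math

TERMS = ["cosmonaut", "astronaut", "moon", "car", "truck"]
T = [[0.44, 0.30], [0.13, 0.33], [0.48, 0.51], [0.70, -0.35], [0.26, -0.65]]
S = [2.16, 1.59]
D = [[0.75, 0.29], [0.28, 0.53], [0.20, 0.19],
     [0.45, -0.63], [0.33, -0.22], [0.12, -0.41]]


def fold_in(query):
    """Map a query to the reduced space: Dq = Xq^T * T * S^-1."""
    words = set(query.lower().split())
    x_q = [1 if term in words else 0 for term in TERMS]
    return [sum(x * row[k] for x, row in zip(x_q, T)) / S[k]
            for k in range(len(S))]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


d_q = fold_in("moon astronaut")
print([round(value, 2) for value in d_q])  # [0.28, 0.53]

# Similarity between the query and each document d1..d6 in the reduced space.
similarities = [cosine(d_q, d) for d in D]
for name, sim in zip(["d1", "d2", "d3", "d4", "d5", "d6"], similarities):
    print(name, f"{sim:5.2f}")
```

Document d3 contains only "cosmonaut", so it scores 0.00 against the query in the original space, but it ranks close to d2 once the query is folded into the reduced space.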
Clustering: Carrot

[Figure 5.4: Data flow when processing in Carrot from input to output. From the manual, available from the Carrot homepage.]

4. Experiments


• Performance measurement
• Tuning representation to data
• Evaluating clusters


Precision and recall

Precision and recall are the traditional measurements for gauging the performance of an IR system. Precision is the proportion of retrieved material which is actually relevant to a given query, and recall is the proportion of relevant material actually retrieved:

\[
\text{precision} = \frac{\#\text{relevant retrieved}}{\#\text{total retrieved}}
\]

\[
\text{recall} = \frac{\#\text{relevant retrieved}}{\#\text{total relevant in collection}}
\]

Since the two measurements are defined in terms of each other, what is reported is the interpolated precision at recall levels of 10%, 20%, ..., 100%; these values are then plotted as a graph.

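
The two definitions translate directly into code. The sketch below is illustrative only: the retrieved list and the relevance judgments are made up and are not results from the experiments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: items returned by the search engine
    relevant:  items judged relevant in the whole collection
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant AND retrieved
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical judgments for a query about creating directories.
retrieved = ["mkdir()", "mkdirs()", "isDirectory()", "delete()"]
relevant = ["mkdir()", "mkdirs()", "createNewFolder()"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}")  # 2 of 4 retrieved are relevant: 0.50
print(f"recall    = {recall:.2f}")     # 2 of 3 relevant were retrieved: 0.67
```

Retrieving more items can only keep or raise recall, while precision usually drops, which is why the two are reported together.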
Performance

[Figure: average precision (0 to 1) against number of factors (0 to 400). Series: average precision normal; average precision stemmed.]

Precision/Recall

[Figure: precision (0 to 1) against recall (0 to 1). Series: interpolated recall, unstemmed; interpolated recall, stemmed.]

Average precision

[Figure: precision (0 to 0.2) against documents retrieved (axis ticks at 10, 100 and 1000). Series: unstemmed; stemmed.]

Evaluating clusters

 Cluster   Elements  Cluster title
 Lingo
 L1               6  Denoted by this Abstract Pathname
 L2               3  Parent Directory
 L3               2  File Objects
 L4               2  Attributes
 L5               2  Value
 L6               5  Array of Strings
 L7               5  (Other)
 7               23  Total listed
 STC
 S1              16  files, directories, pathname

                       #    Method                   Rel.   L1  L2  L3  L4  L5  L6  L7  S1  S2  S3
                       1    mkdir()                   •     •   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   •
                       2    mkdirs()                  •     ◦   •   ◦   ◦   ◦   ◦   ◦   •   ◦   •
                       3    createSubcontext()        ◦     ◦   ◦   ◦   •   ◦   ◦   ◦   ◦   ◦   •
                       4    isDirectory()             ◦     •   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   ◦
                       5    setCacheDirectory()       ◦     ◦   ◦   ◦   ◦   •   ◦   ◦   •   •   •
                       6    isFile()                  ◦     ◦   ◦   ◦   ◦   ◦   ◦   •   •   ◦   •
                       7    getCanonicalPath()        ◦     ◦   ◦   ◦   ◦   ◦   ◦   •   •   ◦   ◦
                       8    delete()                  ◦     •   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   ◦
                       9    createSubcontext()        ◦     ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦
                       10   createNewFolder()         ◦     ◦   ◦   ◦   •   ◦   ◦   •   ◦   •   •
                       11   create environment()      ◦     ◦   ◦   •   ◦   ◦   ◦   ◦   ◦   ◦   •
                       12   listFiles()               ◦     •   ◦   ◦   ◦   ◦   •   ◦   •   •   ◦
                       13   getParentDirectory()      ◦     ◦   •   •   ◦   ◦   ◦   ◦   •   ◦   ◦
                       14   list()                    ◦     •   ◦   ◦   ◦   ◦   •   ◦   •   •   ◦
                       15   createTempFile()          ◦     ◦   ◦   ◦   ◦   ◦   ◦   •   •   •   •
                       16   setCurrentDirectory()     ◦     ◦   •   ◦   ◦   ◦   ◦   ◦   •   •   ◦
                       17   length()                  ◦     •   ◦   ◦   ◦   •   ◦   ◦   ◦   ◦   ◦
                       18   createFileSystemRoot()    ◦     ◦   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   •
                       19   createTempFile()          ◦     ◦   ◦   ◦   ◦   ◦   ◦   •   •   •   •
                       20   listRoots()               ◦     ◦   ◦   •   ◦   ◦   ◦   ◦   •   ◦   ◦
Future work
• Extensions
  • Feedback mechanism
  • Additional experiments: stop-words,
    weighting

• Applications:
  • Two-way Javadoc integration
  • Other applications; more text
Conclusion


• Fully functional prototype
• Clustering helpful but needs more work
• What else? Synonymy vs. polysemy?


Slides for presentation of "A reuse repository with automated synonym support and cluster generation"

  • 1. A reuse repository with automated synonym support and cluster generation Laust Rud Jensen Århus University, 2004
  • 2. Outline 1. Introduction 2. Problem 3. Solution 4. Experiments 5. Conclusion 2
  • 3. 1. Introduction • Constructed a reuse support system • Fully functional prototype with performance enabling interactive use • Usable for other applications, as it is a general system 3
  • 4. Outline 1. Introduction 2. Problem 3. Solution 4. Experiments 5. Conclusion 4
  • 5. 2. Problem • Reuse is a Good Thing, but not done enough • Code reuse repository available, but needs search function 5
  • 6. Java • Platform independent • Easy to use • Well documented using Javadoc 6
  • 7.
  • 8.
  • 9. Javadoc problems • Browsing for relevant components is insufficient: • Assumes existing knowledge • Information overload 9
  • 10. Simple: keyword search • Exact word matching • Too literal, but stemming can help • Vocabulary mismatch, <20% agreement [Furnas, 1987] 10
  • 11. Outline 1. Introduction 2. Problem 3. Solution 4. Experiments 5. Conclusion 11
  • 12. 3. Solution • Create a search engine • Use automated indexing methods • Automatic synonym handling • Grouping search results to assist in information discernment 12
  • 13. Search engine technology • Information Retrieval method • Vector Space Model • Latent Semantic Indexing • Clustering done by existing Open Source system 13
  • 14. ourse the choice of words within the similar doc eems to be some overlap between documents. C Vector Space Model nd d6 . Apparently these documents have nothin   d1 d2 d3 d4 d5 d6  cosmonaut 1 0 1 0 0 0     astronaut 0 1 0 0 0 0  X=  moon   1 1 0 0 0 0    car 1 0 0 1 1 0  truck 0 0 0 1 0 1 Example document contents with simple binary ter 14
  • 15. he example T0 matrix resulting from SVD being performed on X   om figure 3.1 on page 15. 0.44 0.13 0.48 0.70 0.26 Latent Semantic  0.30 0.33  T0 =  −0.57 0.59  0.51 0.37 −0.35 −0.65  −0.15  0.41    CHAPTER 3.  0.58 0.00  TECHNOLOGY 0.00 −0.58 0.58   Indexing 2.16 0.00 0.00 0.00 0.00  0.00 1.59 0.00 0.00 0.00  S0  0.00 0.00 1.28 0.00 0.00   =  0.25 0.73 −0.61 (3.16) 0.16 −0.09  Figure 3.4: The 0.26 example T0 matrix resulting from SVD being p  0.00 0.13 0.44 0.00 0.48 1.00 0.00  0.00 0.70  0.30 0.33 from figure 3.1 on page 15.  0.51 −0.35 0.39 0.00 0.00 0.00 0.00 −0.65  T0 =  −0.57 0.59  0.37 −0.15 0.41  (3.15)  0.58 0.00 0.00 −0.58  0.58the   he example S0 matrix. The diagonal contains singular values 0.00 0.00 0.00 X. 0.25 0.73 −0.61 0.16 −0.09  2.16 0.00   0.00 1.59 0.00 0.00 0.00  S0 =  0.00 0.00 1.28 0.00 0.00    he example T0 matrix resulting from SVD being performed 0.00 0.00 1.00 0.00  on X    0.00 om figure 0.75on page 15. 3.1 0.29 −0.28 −0.00 −0.53 0.00 0.00 0.00 0.00 0.39  0.28 0.53 0.75 0.00 0.29     0.20 0.19 −0.45 3.5: The 0.63  S matrix. The diagonal contains the D0 =   Figure 0.58 example 0   (3.17)  0.45 −0.63 0.20 −0.000.000.19  2.16 0.00 0.00 0.00 of X.    0.00 1.59 0.00 −0.58  0.33 −0.22 −0.12 0.00 0.000.41      0.00 0.00 1.28 0.00 0.00  S0 =0.12 −0.41  0.33 0.58 −0.22  (3.16) k = 2  0.00 0.00 0.00 1.00 0.00   0.75 0.29 −0.28 −0.00 −0.53 0.00 0.00 0.00 0.00 0.3915  0.28 0.53 0.75 0.00 0.29 
  • 16. Latent Semantic Indexing
    The documents seem to cover two different topics, namely space and … of course the choice of words within the similar documents differs. There also seems to be some overlap between documents. Compare documents d_5 and d_6. Apparently these documents have nothing in common …
    \[
    D_0 =
    \begin{pmatrix}
    0.75 &  0.29 & -0.28 & -0.00 & -0.53 \\
    0.28 &  0.53 &  0.75 &  0.00 &  0.29 \\
    0.20 &  0.19 & -0.45 &  0.58 &  0.63 \\
    0.45 & -0.63 &  0.20 & -0.00 &  0.19 \\
    0.33 & -0.22 & -0.12 & -0.58 &  0.41 \\
    0.12 & -0.41 &  0.33 &  0.58 & -0.22
    \end{pmatrix}
    \tag{3.17}
    \]
    Figure 3.6: The example D_0 matrix, and the shaded part is the D matrix.
    \[
    X =
    \begin{array}{l|cccccc}
                     & d_1 & d_2 & d_3 & d_4 & d_5 & d_6 \\ \hline
    \text{cosmonaut} & 1 & 0 & 1 & 0 & 0 & 0 \\
    \text{astronaut} & 0 & 1 & 0 & 0 & 0 & 0 \\
    \text{moon}      & 1 & 1 & 0 & 0 & 0 & 0 \\
    \text{car}       & 1 & 0 & 0 & 1 & 1 & 0 \\
    \text{truck}     & 0 & 0 & 0 & 1 & 0 & 1
    \end{array}
    \]
    Figure 3.1: Example document contents with simple binary term weightings.
    \[
    \hat{X} =
    \begin{array}{l|cccccc}
                     & d_1 & d_2 & d_3 & d_4 & d_5 & d_6 \\ \hline
    \text{cosmonaut} & 0.85 &  0.52 &  0.28 &  0.13 &  0.21 & -0.08 \\
    \text{astronaut} & 0.36 &  0.36 &  0.16 & -0.21 & -0.03 & -0.18 \\
    \text{moon}      & 1.00 &  0.71 &  0.36 & -0.05 &  0.16 & -0.21 \\
    \text{car}       & 0.98 &  0.13 &  0.21 &  1.03 &  0.62 &  0.41 \\
    \text{truck}     & 0.13 & -0.39 & -0.08 &  0.90 &  0.41 &  0.49
    \end{array}
    \tag{3.18}
    \]
  • 17. Matching a query
    • Query: "moon astronaut"
    • Term order: [cosmonaut, astronaut, moon, car, truck]
    \[
    X_q = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \end{pmatrix}^T
    \]
    \[
    D_q = X_q^T \, T \, S^{-1}
        = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \end{pmatrix}
          \begin{pmatrix}
          0.44 &  0.30 \\
          0.13 &  0.33 \\
          0.48 &  0.51 \\
          0.70 & -0.35 \\
          0.26 & -0.65
          \end{pmatrix}
          \begin{pmatrix}
          2.16 & 0.00 \\
          0.00 & 1.59
          \end{pmatrix}^{-1}
        = \begin{pmatrix} 0.28 & 0.53 \end{pmatrix}
    \]
    Figure 3.8: Forming X_q and performing the calculations leading to the vector D_q for the query document.
    \[
    \begin{array}{l|cccccc}
            & d_1  & d_2  & d_3  & d_4   & d_5   & d_6   \\ \hline
    X       & 0.41 & 1.00 & 0.00 &  0.00 &  0.00 &  0.00 \\
    \hat{X} & 0.75 & 1.00 & 0.94 & -0.45 & -0.11 & -0.71
    \end{array}
    \]
    Table 3.2: Similarity between query document and original documents.
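    The query-matching step can be sketched as follows. This is an illustration only: it assumes T holds the first two columns of T_0 with one row per term, that D holds the first two columns of D_0, and that similarity is the cosine between D_q and each row of D. The numbers are the two-decimal values from the slides, so results agree with Table 3.2 only up to rounding. The names `fold_in` and `cosine` are hypothetical.

```python
import math

TERMS = ["cosmonaut", "astronaut", "moon", "car", "truck"]

# Rank k = 2 factors, rounded to two decimals as shown on the slides.
T = [(0.44, 0.30), (0.13, 0.33), (0.48, 0.51), (0.70, -0.35), (0.26, -0.65)]
S = (2.16, 1.59)  # the two largest singular values
D = [(0.75, 0.29), (0.28, 0.53), (0.20, 0.19),
     (0.45, -0.63), (0.33, -0.22), (0.12, -0.41)]  # rows are d1..d6


def fold_in(words):
    """Project a query into the reduced space: D_q = X_q^T T S^{-1}."""
    xq = [1 if term in words else 0 for term in TERMS]
    return tuple(
        sum(x * row[k] for x, row in zip(xq, T)) / S[k]
        for k in range(len(S))
    )


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    dq = fold_in({"moon", "astronaut"})
    print([round(x, 2) for x in dq])
    for name, doc in zip(["d1", "d2", "d3", "d4", "d5", "d6"], D):
        print(name, round(cosine(dq, doc), 2))
```

    In the reduced space d_3, which contains only "cosmonaut", scores close to the query even though it shares no term with it. This is the automated synonym handling the deck refers to.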
  • 18. Clustering: Carrot
    Figure 5.4: Data flow when processing in Carrot from input to output. From the manual, available from the Carrot homepage.
  • 19.
  • 20. Outline 1. Introduction 2. Problem 3. Solution 4. Experiments 5. Conclusion
  • 21. 4. Experiments • Performance measurement • Tuning representation to data • Evaluating clusters
  • 22. Precision and recall
    Precision and recall are the traditional measurements for gauging the performance of an IR system. Precision is the proportion of retrieved material which is actually relevant to a given query, and recall is the proportion of relevant material actually retrieved:
    \[
    \text{precision} = \frac{\#\,\text{relevant retrieved}}{\#\,\text{total retrieved}}
    \qquad
    \text{recall} = \frac{\#\,\text{relevant retrieved}}{\#\,\text{total relevant in collection}}
    \]
    The two measurements are defined in terms of each other, so interpolated precision is given at recall levels of 10%, 20%, ..., 100%, and the values are then plotted as a graph. Another measurement is …
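    The two ratios, and the interpolated precision at fixed recall levels, can be sketched as below. The slide does not define the interpolation rule, so this assumes the conventional one: the interpolated precision at recall level r is the highest precision observed at any recall of at least r. The function names and the example ranking are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved list and a relevant set."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & set(relevant))
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def interpolated_precision(ranking, relevant, levels=None):
    """Interpolated precision at recall levels 10%, 20%, ..., 100%.

    Assumed rule: take the maximum precision at any recall >= the level.
    """
    if levels is None:
        levels = [i / 10 for i in range(1, 11)]
    points = []  # (recall, precision) after each relevant hit in the ranking
    hits = 0
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / position))
    return [
        max((p for r, p in points if r >= level - 1e-12), default=0.0)
        for level in levels
    ]


if __name__ == "__main__":
    # Hypothetical ranked result list and relevance judgement.
    ranking = ["mkdir", "isFile", "mkdirs", "delete"]
    relevant = {"mkdir", "mkdirs"}
    print(precision_recall(ranking[:2], relevant))
    print(interpolated_precision(ranking, relevant))
```

    Averaging these curves over a set of queries gives one precision value per recall level, which is what a precision/recall graph plots.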
  • 23. Performance
    [Plot: average precision (0 to 1) against number of factors (0 to 400). Series: average precision normal, average precision stemmed.]
  • 24. Precision/Recall
    [Plot: precision (0 to 1) against recall (0 to 1). Series: interpolated recall, unstemmed; interpolated recall, stemmed.]
  • 25. Average precision
    [Plot: precision (0 to 0.2) against documents retrieved (axis ticks at 10, 100, 1000). Series: unstemmed, stemmed.]
  • 26. Evaluating clusters
      Cluster   Elements   Cluster title
      Lingo
      L1        6          Denoted by this Abstract Pathname
      L2        3          Parent Directory
      L3        2          File Objects
      L4        2          Attributes
      L5        2          Value
      L6        5          Array of Strings
      L7        5          (Other)
      7         23         Total listed
      STC
      S1        16         files, directories, pathname
  • 27. Cluster membership per method (• = yes, ◦ = no)
       #  Method                   Rel.  L1 L2 L3 L4 L5 L6 L7  S1 S2 S3
       1  mkdir()                  •     •  ◦  ◦  ◦  ◦  ◦  ◦   •  ◦  •
       2  mkdirs()                 •     ◦  •  ◦  ◦  ◦  ◦  ◦   •  ◦  •
       3  createSubcontext()       ◦     ◦  ◦  ◦  •  ◦  ◦  ◦   ◦  ◦  •
       4  isDirectory()            ◦     •  ◦  ◦  ◦  ◦  ◦  ◦   •  ◦  ◦
       5  setCacheDirectory()      ◦     ◦  ◦  ◦  ◦  •  ◦  ◦   •  •  •
       6  isFile()                 ◦     ◦  ◦  ◦  ◦  ◦  ◦  •   •  ◦  •
       7  getCanonicalPath()       ◦     ◦  ◦  ◦  ◦  ◦  ◦  •   •  ◦  ◦
       8  delete()                 ◦     •  ◦  ◦  ◦  ◦  ◦  ◦   •  ◦  ◦
       9  createSubcontext()       ◦     ◦  ◦  ◦  ◦  ◦  ◦  ◦   ◦  ◦  ◦
      10  createNewFolder()        ◦     ◦  ◦  ◦  •  ◦  ◦  •   ◦  •  •
      11  create environment()     ◦     ◦  ◦  •  ◦  ◦  ◦  ◦   ◦  ◦  •
      12  listFiles()              ◦     •  ◦  ◦  ◦  ◦  •  ◦   •  •  ◦
      13  getParentDirectory()     ◦     ◦  •  •  ◦  ◦  ◦  ◦   •  ◦  ◦
      14  list()                   ◦     •  ◦  ◦  ◦  ◦  •  ◦   •  •  ◦
      15  createTempFile()         ◦     ◦  ◦  ◦  ◦  ◦  ◦  •   •  •  •
      16  setCurrentDirectory()    ◦     ◦  •  ◦  ◦  ◦  ◦  ◦   •  •  ◦
      17  length()                 ◦     •  ◦  ◦  ◦  •  ◦  ◦   ◦  ◦  ◦
      18  createFileSystemRoot()   ◦     ◦  ◦  ◦  ◦  ◦  ◦  ◦   •  ◦  •
      19  createTempFile()         ◦     ◦  ◦  ◦  ◦  ◦  ◦  •   •  •  •
      20  listRoots()              ◦     ◦  ◦  •  ◦  ◦  ◦  ◦   •  ◦  ◦
  • 28.
  • 29. Outline 1. Introduction 2. Problem 3. Solution 4. Experiments 5. Conclusion
  • 30. Future work • Extensions • Feedback mechanism • Additional experiments: stop-words, weighting • Applications: • Two-way Javadoc integration • Other applications; more text
  • 31. Conclusion • Fully functional prototype • Clustering helpful but needs more work • what else? synonymy vs. polysemy?