Multilayer Collection Selection and Search of Topically Organized Patents


We present a federated patent search system that explores three issues: (a) topical organization of patents based on their IPC codes, (b) collection selection over topically organized patent collections, and (c) integration of collection selection tools into patent search systems.

Published in: Technology, Education

Multilayer Collection Selection and Search of Topically Organized Patents

  1. Multilayer Collection Selection and Search of Topically Organized Patents. Michail Salampasis (Vienna University of Technology), Anastasia Giahanou (University of Macedonia), Giorgos Paltoglou (University of Wolverhampton)
  2. Contents Overview
     • Aim and objectives of this work
     • Distributed Information Retrieval (DIR) / Federated Search
     • Topically organised patents
     • Integration of DIR in patent search: multilayer source selection
     • Experiment setup
     • Results
     • Conclusions
  3. Aim of this work: to explore the thematic organization of patent documents using the subdivision of patent data by International Patent Classification (IPC) codes, and whether this organization can be used to build search tools that improve patent search effectiveness using Distributed Information Retrieval (DIR) methods.
  4. Which search tools should be integrated, and how? It is a mistake to think that the search tools to be integrated into patent search systems depend only on existing IR or text processing technologies. The choice probably has more to do with the attitude with which a patent search is conducted. It is also very important to understand the search process deeply, and how a specific tool can attain a specific objective of that process and thereby increase its efficiency.
  5. If these parameters are not carefully considered:
     • Professional searchers will be skeptical and take a very conservative attitude towards adopting search methods, tools and technologies beyond the ones that have dominated their domain.
     • A typical example is patent search, where professional search experts typically use Boolean search syntax and quite complex intellectual classification schemes.
  6. Understanding Patent Search Processes* (* Taken from Mihai Lupu and Allan Hanbury, Review Patent Retrieval)
  7. Objectives
     • The improvement offered by our method relates to a very fundamental step in professional patent search (step 3 in the use case presented by Lupu and Hanbury): “defining a text query, potentially by Boolean operators and specific field filters”.
     • In prior art search, probably the most important filter is the one based on the IPC (now CPC) classification.
  8. Objectives
     • The method and tool presented in this paper support this step by automatically selecting IPCs for a given query and then running a filtered search based on the query and the automatically selected IPCs.
     • The tool can also be used for classification search, as a starting point for identifying and examining more closely the technical concepts expressed in the IPCs to which a patent could be related.
  9. Distributed IR: elements composing a Distributed Information Retrieval system
     • (1) Source Representation (Collection 1, Collection 2, Collection 3, Collection 4, ..., Collection N)
     • (2) Source Selection
     • (3) Results Merging (results returned to the User)
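The three elements on this slide can be sketched as a toy pipeline. This is only an illustration under stated assumptions: the function names (`represent`, `select`, `search`, `merge`), the document-frequency scoring and the term-overlap retrieval are all hypothetical stand-ins, not the methods used in the experiments.

```python
from collections import Counter


def represent(collections):
    """(1) Source representation: per-collection document frequencies."""
    reps = {}
    for name, docs in collections.items():
        df = Counter()
        for text in docs.values():
            df.update(set(text.lower().split()))  # count each term once per doc
        reps[name] = {"df": df, "n_docs": len(docs)}
    return reps


def select(reps, query, k):
    """(2) Source selection: keep the k collections with the highest toy score.

    The score (summed document frequency of the query terms divided by the
    collection size) is an assumption made for illustration only.
    """
    terms = query.lower().split()
    scores = {
        name: sum(rep["df"][t] for t in terms) / max(rep["n_docs"], 1)
        for name, rep in reps.items()
    }
    return sorted(scores, key=lambda n: (-scores[n], n))[:k]


def search(docs, query):
    """Toy retrieval inside one collection: score = number of shared terms."""
    terms = set(query.lower().split())
    hits = []
    for doc_id, text in docs.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap:
            hits.append((doc_id, overlap))
    return hits


def merge(result_lists):
    """(3) Results merging: one ranked list, keeping the best score per document.

    Deduplication matters because sub-collections may overlap.
    """
    best = {}
    for hits in result_lists:
        for doc_id, score in hits:
            best[doc_id] = max(score, best.get(doc_id, 0))
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


collections = {
    "A61K": {"p1": "drug delivery compound", "p2": "pharmaceutical compound tablet"},
    "H04L": {"p3": "network packet routing", "p4": "wireless network protocol"},
    "G06F": {"p5": "data processing network", "p4": "wireless network protocol"},
}
query = "network routing"
selected = select(represent(collections), query, k=2)
merged = merge(search(collections[name], query) for name in selected)
```

In this toy data the document `p4` sits in two sub-collections, so the merging step has to deduplicate it.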
  10. Topically Organised Patents based on the IPC taxonomy. IPC is a standard taxonomy for classifying patents. It currently has about 71,000 nodes, organized into a five-level hierarchy that is further extended to finer levels of granularity. Patent documents produced worldwide carry manually assigned classification codes, which in our experiments are used to topically organize, distribute and index patents across hundreds or thousands of sub-collections.
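To make the five-level hierarchy concrete, the sketch below splits an IPC symbol such as `H04L 12/56` into section, class, subclass, main group and subgroup. The regular expression and the function name `ipc_levels` are assumptions for illustration. The sketch handles only symbols written in this common form and does not validate them against the actual IPC scheme.

```python
import re

# Section letter (A-H), two-digit class, subclass letter, main group, subgroup.
IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def ipc_levels(symbol):
    """Return the symbol's ancestors at the five IPC levels, coarsest first."""
    match = IPC_PATTERN.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised IPC symbol: {symbol!r}")
    section, cls, subclass, group, subgroup = match.groups()
    return [
        section,                                         # level 1: section
        section + cls,                                   # level 2: class
        section + cls + subclass,                        # level 3: subclass
        f"{section}{cls}{subclass} {group}/00",          # level 4: main group
        f"{section}{cls}{subclass} {group}/{subgroup}",  # level 5: subgroup
    ]


levels = ipc_levels("H04L 12/56")
```

Each prefix in the returned list could serve as the key of a sub-collection at the corresponding level.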
  11. Topically Organised Patents
  12. Topically Organised Patents. Patents have three IPC codes on average. In the experiments reported here, we allocated a patent to every sub-collection specified by at least one of its IPC codes, so a sub-collection may overlap with others in terms of the patents it contains. IPC codes are assigned by humans in a very detailed and purposeful process. This is very different from creating sub-collections with automated clustering algorithms or with the naive division by chronological or source order that has been used extensively in past DIR research.
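The allocation rule described on this slide can be sketched as follows. This is a minimal illustration, assuming IPC codes are written like `H04L 12/56` and that levels 3, 4 and 5 correspond to subclass, main group and subgroup. The names `subcollection_key` and `allocate` and the sample patent identifiers are hypothetical.

```python
from collections import defaultdict


def subcollection_key(ipc_code, level):
    """Truncate an IPC code to the sub-collection key for the given level."""
    code = " ".join(ipc_code.upper().split())  # normalise case and whitespace
    if level == 3:
        return code[:4]             # subclass, e.g. "H04L"
    if level == 4:
        return code.split("/")[0]   # main group, e.g. "H04L 12"
    if level == 5:
        return code                 # subgroup, e.g. "H04L 12/56"
    raise ValueError("this sketch only handles levels 3, 4 and 5")


def allocate(patents, level):
    """Place each patent in every sub-collection named by one of its IPC codes."""
    subcollections = defaultdict(set)
    for patent_id, ipc_codes in patents.items():
        for code in ipc_codes:
            subcollections[subcollection_key(code, level)].add(patent_id)
    return dict(subcollections)


patents = {
    "EP1": ["H04L 12/56", "H04W 4/00", "G06F 17/30"],
    "EP2": ["G06F 17/30", "G06F 17/27"],
    "EP3": ["H04L 29/06"],
}
by_subclass = allocate(patents, level=3)
by_main_group = allocate(patents, level=4)
```

Because `EP1` carries codes from three subclasses, it appears in three level-3 sub-collections, which is the overlap the slide refers to.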
  13. Topically Organised Patents
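  14. Analysis of IPC distribution of topics and their relevant documents

      | Set      | IPC level | # of topics | # relevant docs per topic (a) | # of IPC classes of each topic (b) | # of IPC classes of relevant docs (c) | # of common IPC classes between (b) and (c) |
      |----------|-----------|-------------|-------------------------------|------------------------------------|---------------------------------------|---------------------------------------------|
      | Training | Split 3   | 300         | 8.22                          | 2.08                               | 4.8                                   | 1.76                                        |
      | Training | Split 4   | 300         | 8.22                          | 3.1                                | 8.76                                  | 2.34                                        |
      | Training | Split 5   | 300         | 8.22                          | 5.82                               | 19.84                                 | 3.63                                        |
      | Testing  | Split 3   | 300         | 8.57                          | 2.09                               | 5.15                                  | 1.75                                        |
      | Testing  | Split 4   | 300         | 8.57                          | 2.95                               | 9.02                                  | 2.21                                        |
      | Testing  | Split 5   | 300         | 8.57                          | 5.58                               | 20.56                                 | 3.73                                        |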
  15. Experiment Setup. We indexed the collection with the Lemur toolkit. The indexed fields are: title, abstract, description (first 500 words), claims, inventor, applicant and IPC class information. Patent documents were pre-processed to produce a single (virtual) document representing each patent. Pre-processing also includes stop-word removal and stemming with the Porter stemmer. In the experiments reported here we use Lemur's implementation of the Inquery algorithm.
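The pre-processing described on this slide might look roughly like the sketch below. It is not the authors' pipeline and does not use the Lemur API. The field names in the input dictionary, the tiny stop-word list and the function names are assumptions, and Porter stemming is deliberately left out to keep the example free of external dependencies.

```python
import re

# Tiny illustrative stop-word list; a real list would be far longer.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "for", "in", "is", "to", "with"})
TEXT_FIELDS = ("title", "abstract", "description", "claims")
DESCRIPTION_WORD_LIMIT = 500  # only the first 500 words of the description


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_virtual_document(patent):
    """Combine a patent's fields into one virtual document ready for indexing."""
    tokens = []
    for field in TEXT_FIELDS:
        words = patent.get(field, "").split()
        if field == "description":
            words = words[:DESCRIPTION_WORD_LIMIT]
        tokens.extend(t for t in tokenize(" ".join(words)) if t not in STOP_WORDS)
    return {
        "text": " ".join(tokens),  # stemming would be applied to these tokens
        "inventor": patent.get("inventor", ""),
        "applicant": patent.get("applicant", ""),
        "ipc": list(patent.get("ipc", [])),
    }


sample_patent = {
    "title": "A Method for Routing",
    "abstract": "The method routes packets.",
    "description": " ".join(["word"] * 600),
    "claims": "1. A method of routing.",
    "inventor": "Jane Doe",
    "applicant": "Example Corp",
    "ipc": ["H04L 12/56"],
}
virtual_doc = build_virtual_document(sample_patent)
```

The metadata fields are kept separate from the free text so that an index can still filter on them, for example by IPC class.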
  16. Two different types of source selection algorithms were used:
     • Hyper-document approach (CORI): the main characteristic of CORI, probably the most widely used and tested source selection method, is that it creates a hyper-document representing all the member documents of a sub-collection.
     • Source selection as voting: this shifts the focus from estimating the relevance of each remote collection to explicitly estimating the number of relevant documents in each.
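The two families can be contrasted with a short sketch. `cori_score` follows the commonly published CORI belief formula with its usual default constants (50, 150 and b = 0.4). `borda_vote` is a generic Borda-count style vote over a ranked list of sampled documents. The input dictionary layout and all names are assumptions, and neither function is claimed to match the exact configuration used in these experiments.

```python
import math


def cori_score(query_terms, collection, all_collections, b=0.4):
    """Average CORI belief for a collection treated as one hyper-document.

    Each collection is a dict with "cw" (number of words) and "df"
    (term -> number of documents in the collection containing the term).
    """
    n_colls = len(all_collections)
    avg_cw = sum(c["cw"] for c in all_collections) / n_colls
    beliefs = []
    for term in query_terms:
        df = collection["df"].get(term, 0)
        # cf: number of collections that contain the term at all
        cf = sum(1 for c in all_collections if c["df"].get(term, 0) > 0)
        if cf == 0:
            beliefs.append(b)  # term unseen everywhere: default belief only
            continue
        t = df / (df + 50 + 150 * collection["cw"] / avg_cw)
        i = math.log((n_colls + 0.5) / cf) / math.log(n_colls + 1.0)
        beliefs.append(b + (1 - b) * t * i)
    return sum(beliefs) / len(beliefs) if beliefs else b


def borda_vote(ranked_sample_docs, doc_to_collection):
    """Voting: each sampled document votes for its source collection.

    A document at 0-based rank r in a list of n documents contributes n - r points.
    """
    n = len(ranked_sample_docs)
    votes = {}
    for rank, doc_id in enumerate(ranked_sample_docs):
        source = doc_to_collection[doc_id]
        votes[source] = votes.get(source, 0) + (n - rank)
    return votes


colls = [
    {"name": "H04L", "cw": 1000, "df": {"network": 80, "routing": 40}},
    {"name": "A61K", "cw": 1000, "df": {"network": 1}},
    {"name": "G06F", "cw": 1000, "df": {}},
]
scores = {c["name"]: cori_score(["network", "routing"], c, colls) for c in colls}
votes = borda_vote(
    ["d1", "d2", "d3", "d4"],
    {"d1": "H04L", "d2": "G06F", "d3": "H04L", "d4": "A61K"},
)
```

The first function scores a collection from aggregate term statistics alone, while the second needs a ranked sample of individual documents, which is the shift of focus the slide describes.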
  17. Source Selection Results (level 3)
  18. Source Selection Results (level 4)
  19. Source Selection Results (level 5)
  20. Discussion
     • The superiority of CORI as a source selection method is unquestionable.
     • The best runs are those requesting fewer sub-collections (10 or 20) and more documents from each selected sub-collection.
     • This is probably the result of the small number of relevant documents that exist for each topic.
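  21. Retrieval Results

      SPLIT4

      | Run           | Pres@100 (10 collections selected) | MAP@100 (10 collections selected) | Pres@100 (20 collections selected) | MAP@100 (20 collections selected) |
      |---------------|------------------------------------|-----------------------------------|------------------------------------|-----------------------------------|
      | Optimal       | 0.313                              | 0.128                             | 0.313                              | 0.128                             |
      | Centralised   | 0.257                              | 0.105                             | 0.257                              | 0.105                             |
      | CORI-CORI     | 0.203                              | 0.081                             | 0.213                              | 0.086                             |
      | CORI-SSL      | 0.221                              | 0.091                             | 0.231                              | 0.097                             |
      | BordaFuse-SSL | 0.077                              | 0.035                             | 0.087                              | 0.039                             |
      | Multilayer    | 0.256                              | 0.105                             | 0.261                              | 0.105                             |

      SPLIT5

      | Run           | Pres@100 (10 collections selected) | MAP@100 (10 collections selected) | Pres@100 (20 collections selected) | MAP@100 (20 collections selected) |
      |---------------|------------------------------------|-----------------------------------|------------------------------------|-----------------------------------|
      | Optimal       | 0.346                              | 0.146                             | 0.351                              | 0.148                             |
      | Centralised   | 0.257                              | 0.105                             | 0.257                              | 0.105                             |
      | CORI-CORI     | 0.267                              | 0.107                             | 0.259                              | 0.105                             |
      | CORI-SSL      | 0.27                               | 0.11                              | 0.263                              | 0.107                             |
      | BordaFuse-SSL | 0.03                               | 0.02                              | 0.04                               | 0.028                             |
      | Multilayer    | 0.269                              | 0.106                             | 0.267                              | 0.102                             |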
  22. Conclusions. DIR approaches managed to perform better than the centralized index approaches, with 9 DIR combinations scoring better than the best centralized approach. Much more work is required:
     • We plan to explore this line of work further by modifying state-of-the-art DIR methods that did not perform well enough in this set of experiments.
     • We would also like to experiment with larger distribution levels based on IPC (subgroup level). We plan to report the runs using split-5 in a future paper.
  23. Thank you…