Patent Search: An important new test bed for IR


Patent Search: An important new test bed for IR
presented at the 9th Dutch-Belgian Information Retrieval Workshop (DIR 2009)
Enschede, The Netherlands


  1. Patent Search: An important new test bed for IR. J. Tait, M. Lupu (Information Retrieval Facility, Vienna, Austria); H. Berger, G. Roda, M. Dittenbach, A. Pesenhofer (Matrixware, Vienna, Austria); E. Graf, K. van Rijsbergen (University of Glasgow, Dept. of Computing Science, Glasgow, UK). DIR 2009, Feb. 2-3, 2009.
  2. Patent Search. Patent search is a highly specialized form of information search. It is characterized by its: target data; types of information needs; legal and economic implications.
  3. Target data. Data for patent retrieval comes mainly from: patent databases from patent authorities (EPO, USPTO, JPO, SIPO, WIPO, etc.); scientific publications; prior art databases. A new acronym: SIPO, the State Intellectual Property Office of the People's Republic of China.
  4. Target data. Characteristics of patent documents: multilingual and written in "legalese"; non-uniform formats; some are OCR'd; contain figures, images, chemical formulas, DNA sequences; include references to patent and non-patent literature. A new acronym: NPL, Non-Patent Literature.
  5. Information Needs. K.H. Atkinson, Towards a more rational patent search paradigm: depending on what group is doing the asking, the types of patent search requested may include simple patentability, clearance to market a product, validity, opposition to a patent being sought by another, infringement watch, creating IP landscapes for business development or R&D, infringement defense, litigation, prosecution support, and creation of portfolios for assignments, investments, mergers and acquisitions [...]
  6. Legal and economic implications. Patents are legal documents; patent portfolios are assets for enterprises; a single patent search can be worth several days of work. High-recall searches: missing even a single relevant document can have severe financial and economic impact, for example when a granted patent is invalidated because of a document omitted at application time.
  7. Outline: Introduction; Patent Search; A modern IR test bed; Promoting take up of research; Conclusion. We have characterized the patent search problem by describing its target data, types of information needs, and legal and economic implications. Next: evaluating IR techniques in the patent domain; previous initiatives in the area of patent retrieval; the CLEF-IP and TREC-Chem initiatives; promoting take-up of research.
  8. Test collections. Test collections in Information Retrieval play a pivotal role in the evaluation of retrieval models. Domain-specific test collections already exist for: Web pages; news stories; legal documents; blogs; genomics; patents.
  9. Pioneering work in patent retrieval. Patent retrieval task at the NTCIR Workshop since 2001. Produced test collections primarily targeting Japanese patents. Retrieval tasks: ad-hoc (goal: find patents on a given topic); invalidity search (goal: find patents invalidating a given claim); patent classification according to the F-term system. Two new acronyms: F-term (abbreviation of File-forming term) is the classification system used in Japan as a complement to IPC (International Patent Classification).
  10. Evaluation tracks. The IRF has engaged in two pilot evaluation tracks on patent retrieval: CLEF-IP and TREC-Chem.
  11. CLEF-Intellectual Property Initiative. CLEF-IP: coordinated by the IRF; part of the Cross-Language Evaluation Forum; will focus on the task of prior art search; European patents as target data; automatic extraction of relevance assessments. Prior art search consists in identifying all information (including NPL) that might be relevant to a patent's claim of novelty.
  12. Prior art search. The most common type of patent search. Performed at various stages of the patent life-cycle and with different intentions: before filing an application (novelty search or patentability search), to determine whether the invention fulfills the requirements of novelty and inventive step; before grant, when results go into a search report attached to the patent; invalidity search, a post-grant search used to unveil prior art that invalidates a patent's claims of originality.
  13. Target data. The CLEF-IP evaluation track will restrict target data to patents. Target data: 16 years of EPO patents (filing dates between 1985 and 2000); 1.9 million patent documents corresponding to 1 million patents; 75 GB, in XML format; documents are in English, German, and French.
  14. Automatic extraction of relevance assessments. The data resulting from prior art searches is saved in the EPO or USPTO databases as: citations in patent applications; citations in search reports; citations in the legal files of opposition procedures. The CLEF-IP track is going to extract this information (as much as possible) automatically in order to form a large set of topics.
  15. Prior art from opposition procedures. Under European patent law, a granted patent may be opposed. The opponent often provides new prior art that invalidates the invention's claim of originality. Patents cited in opposition procedures are highly relevant prior art documents: they are the result of a very thorough invalidity search.
  16. Crowdsourcing extraction of relevance assessments. We need to extract citations from documents arising from opposition procedures. These documents are only available as scanned images. Crowdsourcing will be used to extract these citations. A new word from business jargon: crowdsourcing.
  17. Relevance and evaluation measures. Labels used in search reports, and what each says about the cited document: X, relevant when taken alone; Y, relevant in combination with other documents; A, relevant but not prejudicial to novelty or inventive step. How can these labels be used to define new evaluation measures?
  18. Challenges. As a result of the CLEF-IP track we expect to obtain new insights on: how to represent the information need given by a patent; query reformulation; evaluation metrics for patent retrieval; using machine translation for improving retrieval effectiveness.
  19. TREC Chemistry track. Ad-hoc search. Target data: academic papers (Royal Society of Chemistry); chemical patent documents (class C in the IPC). Will use automatic extraction of citations for relevance assessments. Challenges: chemical names and structures; chemical interactions, relations, transformations, properties.
  20. Conclusion. The IRF is contributing to the creation of new patent test collections by organizing two tracks within the CLEF and TREC evaluation campaigns. In addition to the TREC and CLEF contributions, the IRF, together with Matrixware, is promoting several initiatives aimed at facilitating and improving the patent retrieval process.
  21. Promoting take up of research. Next: presentation of the IRF and Matrixware; promoting take up of research (the IRF symposium, the PaIR workshop); providing the tools; funding research in the area of patent retrieval.
  22. IRF: the Information Retrieval Facility. A new international not-for-profit foundation, based in Vienna. Its mission: to bridge the gap between the needs of industry and academic know-how; to promote and facilitate research in large-scale information retrieval; to maintain a facility that enables large-scale information retrieval and in-depth data processing.
  23. Matrixware. Founded in 2005 in Vienna; 80 employees; more than 15 academic partners worldwide; implements solutions for access to patent information.
  24. Promoting research. Matrixware and the IRF have engaged in several initiatives aimed at promoting research and raising awareness in the area of patent retrieval: the Information Retrieval Facility Symposium, an annual symposium held in Vienna to foster knowledge exchange between IR experts and IP professionals; the PaIR workshop, a workshop on Patent Information Retrieval hosted by the CIKM conference.
  25. Providing the tools. Successful IR research conventionally depends on three elements: (1) the availability of test collections; (2) access to suitable software systems on which to run experiments; (3) access to sufficiently powerful hardware. The IRF, supported by Matrixware, is providing all three of these.
  26. Current University Projects. Accessibility of Information (Glasgow); Large Scale Logical Retrieval (Glasgow); Semantic Analysis of Patent Data (Sheffield and Nijmegen); Language Modeling for Patent Retrieval (UMass Amherst); OCR for patents (UMass Amherst).
  27. Concluding remarks. Patent retrieval is an interesting and important open challenge for IR researchers. The IRF and Matrixware have engaged in several projects aimed at promoting research in this area.
  28. Invitation. You are invited to: join one of the evaluation tracks (CLEF-IP, TREC-Chem); participate in the PaIR workshop; participate in the Information Retrieval Facility Symposium.
  29. Thank you for your attention.
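
Slides 14 and 17 describe deriving relevance assessments from citations and ask how the X/Y/A search report labels could feed an evaluation measure. The following is a minimal sketch of one possible answer, not the method used by CLEF-IP: it maps each label to a graded gain, builds per-topic judgments from citation triples, and scores a ranked list with recall and nDCG. The gain values, the function names, and the patent identifiers are all assumptions made for illustration; the slides prescribe none of them.

```python
import math
from collections import defaultdict

# ASSUMPTION: illustrative gain values for the search report labels.
# The slides leave the weighting of X, Y and A as an open question.
LABEL_GAIN = {"X": 3, "Y": 2, "A": 1}


def build_qrels(citations):
    """Turn (topic_patent, cited_doc, label) triples into graded judgments.

    If a document is cited more than once for a topic, the highest gain wins.
    Unknown labels contribute no judgment.
    """
    qrels = defaultdict(dict)
    for topic, doc, label in citations:
        gain = LABEL_GAIN.get(label, 0)
        if gain > qrels[topic].get(doc, 0):
            qrels[topic][doc] = gain
    return dict(qrels)


def recall_at_k(ranking, judged, k):
    """Fraction of the judged documents that appear in the top k results."""
    if not judged:
        return 0.0
    found = sum(1 for doc in ranking[:k] if doc in judged)
    return found / len(judged)


def ndcg_at_k(ranking, judged, k):
    """Normalized discounted cumulative gain over the top k results."""
    dcg = sum(
        judged.get(doc, 0) / math.log2(rank + 2)
        for rank, doc in enumerate(ranking[:k])
    )
    ideal_gains = sorted(judged.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical citation data: one topic patent with three cited documents.
citations = [
    ("TOPIC-1", "DOC-A", "X"),
    ("TOPIC-1", "DOC-B", "A"),
    ("TOPIC-1", "DOC-C", "Y"),
    ("TOPIC-1", "DOC-B", "Y"),  # same document cited again with a stronger label
]
qrels = build_qrels(citations)

# Hypothetical system output for TOPIC-1 (DOC-Z is not a cited document).
ranking = ["DOC-B", "DOC-Z", "DOC-A"]
print("recall@3:", recall_at_k(ranking, qrels["TOPIC-1"], 3))
print("nDCG@3:", ndcg_at_k(ranking, qrels["TOPIC-1"], 3))
```

Recall is included because slide 6 stresses that missing a single relevant document can be costly; nDCG is only one of several graded measures that could consume the labels, and a different gain assignment would change the scores.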