II-SDV 2014 The Challenges of Managing “Big Data” in the Patent Field: Patents for business (Olivier Huc, Minesoft, UK)


Published in: Software, Technology

  1. 1. The Challenges of Managing “Big Data” in the Patent Field 14-15 April 2014, Nice Olivier Huc
  2. 2. Specialists in Patent Information Building Intelligent Patent Information Solutions since 1996 What we do Trusted by IP experts Worldwide Corporations, National Patent Offices, Patent Attorneys and Patent Search Firms worldwide International Customer Support Global client base With Offices and Support across Europe North America, and Asia
  3. 3. Patent Families Analytics Quality Control Fast Search Legal Status Review Alerts • 23 Full Text Collections • 48 Million Families • 103 Issuing Authorities • IPC, CPC US and JP classes • Quality Controlled content • Normalised data
  4. 4. 3 Patent Data Myths • Myth #1: Patent data is just another type of “Big Data” • Myth #2: Patent Data is handled automatically • Myth #3: Patent Data is consistent worldwide
  5. 5. • Patent data volume may be smaller, but the data is more complex (languages, text, fields) • Patent data is not retrieved on the fly; it is hosted, indexed and optimized • There are multiple sources with overlap • Data quality is a major issue • Users have a low tolerance for errors The reality
  6. 6. • Total data volume exceeds 35 TB • 49 million families and 103 publishing bodies • 95 million publications • 47 million full-texts including over 23 million non-Latin into English machine translations • 54 million clipped images and 45 million complete sets of drawings Database Facts
  7. 7. • Minesoft and RWS host their own data center, located just outside of London • Control • Confidentiality • Reactivity • Speed • Distributed search engine • Continuous data update and indexing => no need to interrupt or restart the online services, + new data immediately searchable Hardware & Search Engine
  8. 8. • Multiple data sources: • DOCDB weekly feeds (EPO) • National Patent Offices • Commercial collections • External information (such as National Registers) • Despite the complexity, having multiple sources for the same country is a great advantage: • Complementarity • Improved quality • Security • Speed Sources
  9. 9. • We perform stringent quality checks • Human • Programmatic • Manual checks on some source data collections as they arrive: e.g. Indian (IN), Thai (TH) and Philippine (PH) • Errors in data are identified programmatically against strict pre-set parameters and are then manually corrected by our data team • e.g. IC8=AO1G1/00 (the letter O in place of the digit 0) • Although we follow the EPO’s INPADOC rules for (extended) families, we recreate all our families to ensure consistency Data Quality
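As a rough illustration of the programmatic check described on this slide, the sketch below validates classification symbols against a pattern and collects failures for manual correction. The regular expression is an assumed approximation of the IPC layout (section letter, two-digit class, subclass letter, main group, slash, subgroup), and the function names are hypothetical; the presentation does not describe the actual rules used.

```python
import re

# Assumed layout of an IPC symbol such as "A01G1/00":
# section A-H, two-digit class, subclass letter, main group, "/", subgroup.
IPC_PATTERN = re.compile(r"[A-H][0-9]{2}[A-Z][0-9]{1,4}/[0-9]{2,6}")


def is_valid_ipc(symbol: str) -> bool:
    """Return True when the symbol matches the assumed IPC layout."""
    return IPC_PATTERN.fullmatch(symbol.strip()) is not None


def flag_for_review(symbols):
    """Collect symbols that fail the check so a person can correct them."""
    return [s for s in symbols if not is_valid_ipc(s)]
```

With this pattern, the slide's example `AO1G1/00` is flagged because the second character is a letter where a digit is expected, while `A01G1/00` passes.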
  10. 10. Adding extra value to PatBase data: • Families are automatically reviewed and then, if necessary, rebuilt when we receive new and/or corrected information (e.g. priority) • Examples, paragraphs and claims are tagged to make it easier to search specific sections of text • Machine translation: when a family gets new text, the family is reassessed to see whether a machine translation needs to be added, replaced or deleted. Data Quality
  11. 11. TW AN/PR inputs TW AN/PR outputs 083303675 Emperor year conversion & Type of application TW19940303675F 092128911 TW20030128911 092128911 TW20040201682U US AN/PR inputs US AN/PR outputs US29/356,858 20100303 Type of application & Year US20100356858F 1301618611 A US20110016186 AT AN/PR inputs AT AN/PR outputs A 709/95 Type of application & Year AT19950000709 GM647/96 AT19960000647U Standardisation of patent data Formatting application and priority information
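The Taiwanese and Austrian examples on this slide can be reproduced with a small normaliser. This is only a sketch inferred from the four input/output pairs shown: the type-digit mapping for Taiwan, the `U`/`F` suffix meanings, the seven-digit serial padding and the assumption that two-digit Austrian years fall in the 1900s are all guesses drawn from those examples, and the function names are hypothetical.

```python
import re

# Assumed mapping of the Taiwanese type digit to an output suffix:
# 1 = invention (no suffix), 2 = utility model (U), 3 = design (F).
TW_TYPE_SUFFIX = {"1": "", "2": "U", "3": "F"}


def normalise_tw(raw: str) -> str:
    """Convert e.g. '083303675' (year 83 of the ROC calendar) to 'TW19940303675F'."""
    digits = re.sub(r"\D", "", raw)
    roc_year, type_digit, serial = digits[:3], digits[3], digits[3:]
    year = int(roc_year) + 1911  # ROC year 1 corresponds to 1912
    return f"TW{year}{serial.zfill(7)}{TW_TYPE_SUFFIX.get(type_digit, '')}"


def normalise_at(raw: str) -> str:
    """Convert e.g. 'A 709/95' to 'AT19950000709' and 'GM647/96' to 'AT19960000647U'."""
    match = re.fullmatch(r"\s*(A|GM)\s*(\d+)/(\d{2})\s*", raw)
    if match is None:
        raise ValueError(f"unrecognised Austrian number: {raw!r}")
    prefix, serial, yy = match.groups()
    year = 1900 + int(yy)  # assumption: two-digit years are 19xx, as on the slide
    suffix = "U" if prefix == "GM" else ""  # GM = utility model (Gebrauchsmuster)
    return f"AT{year}{serial.zfill(7)}{suffix}"
```

Numbers that do not fit these patterns raise an error or pass through without a suffix, which mirrors the idea of routing unexpected input to manual review.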
  12. 12. • Formatting patent numbers and kind codes • Formatting dates Thailand uses Buddhist years (Gregorian calendar year plus 543) US date format - 2011/09/02 (9 February 2011) European date format – 2011/09/02 (2 September 2011) 2007 Standardisation of patent data
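The two date issues on this slide can be sketched in a few lines: subtracting the 543-year offset from a Buddhist-era year, and parsing a date string only with an explicitly declared field order, since the same string yields different dates under different conventions. The function names are hypothetical and the sketch makes no claim about which offices use which order.

```python
from datetime import date, datetime

BUDDHIST_ERA_OFFSET = 543  # Buddhist year = Gregorian year + 543


def thai_year_to_gregorian(buddhist_year: int) -> int:
    """Convert a Thai Buddhist-era year to the Gregorian calendar year."""
    return buddhist_year - BUDDHIST_ERA_OFFSET


def parse_with_declared_order(text: str, fmt: str) -> date:
    """Parse a date using the field order declared for its source, never a guess."""
    return datetime.strptime(text, fmt).date()


# The same string read under two different declared orders.
as_month_first = parse_with_declared_order("2011/09/02", "%Y/%m/%d")  # 2 September 2011
as_day_first = parse_with_declared_order("2011/09/02", "%Y/%d/%m")    # 9 February 2011
```

Storing the format alongside each source, rather than inferring it per record, avoids silently swapping day and month.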
  13. 13. The EPO standardises names to assist searching. PatBase contains both standard and non-standard names. The standard name is assigned by the EPO; the non-standard name consists of whatever is filed or published on the patent. Standard name: PIRELLI IND. Non-standard names: PIRELI SPA; PIRELLA SPA; PIRELLE S P A; PIRELLI DPA; PIRELLI S p A; PIRELLI S A; PIRELLI S P A; PIRELLI S P A FIRMA; PIRELLI S P A IT; PIRELLI S P CA; PIRELLI SPA IT; PIRELLI SPP; PIRELLU SPA; PIRELLY SPA. This is a small sample of the non-standard names to which the EPO assigns the standard name ‘Pirelli’. There are currently 188 non-standard names for the standard name ‘Pirelli’. Standardisation of patent data
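Keeping both the filed name and the standard name searchable can be pictured as a lookup table from non-standard variants to the standard name. The sketch below uses a few of the variants listed on the slide; the table structure, the whitespace and case normalisation, and the function names are assumptions for illustration, not a description of how the EPO or PatBase store this data.

```python
# A few variants taken from the slide, all mapped to the same standard name.
STANDARD_NAMES = {
    "PIRELI SPA": "PIRELLI IND",
    "PIRELLI S P A": "PIRELLI IND",
    "PIRELLI SPA IT": "PIRELLI IND",
    "PIRELLY SPA": "PIRELLI IND",
}


def _key(name: str) -> str:
    """Upper-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.upper().split())


def standard_name(filed_name: str):
    """Return the standard name for a filed name, or None when it is unknown."""
    return STANDARD_NAMES.get(_key(filed_name))


def names_to_index(filed_name: str):
    """Return every name a record should be searchable under."""
    names = [_key(filed_name)]
    standard = standard_name(filed_name)
    if standard is not None and standard not in names:
        names.append(standard)
    return names
```

A search for the standard name then retrieves records filed under any of the misspelt variants, while the original spelling remains available for exact matching.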
  14. 14. • Date Formats • All fields, e.g. patent classifications, assignees, text etc., have set parameters. Where these are not matched, data errors are identified for manual editing. • If a text is illegible (we have programmatic systems in place to measure this), it is not allowed into the database and is flagged as requiring manual attention (often manual typing). • Character conversions We have thousands of symbol / letter conversions in our programs: • & is replaced by and • œ is replaced by oe • ß is replaced by ss Data Improvements
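The character conversions listed on this slide amount to a translation table applied to incoming text. The sketch below covers only the three conversions shown, and it assumes the last one refers to the German sharp s (ß); the real table is said to contain thousands of entries, and the function name is hypothetical.

```python
# Three conversions from the slide; the sharp s (ß) is an assumed reading.
CONVERSIONS = str.maketrans({
    "&": "and",
    "œ": "oe",
    "ß": "ss",
})


def convert_characters(text: str) -> str:
    """Replace each mapped symbol or letter with its plain-text equivalent."""
    return text.translate(CONVERSIONS)
```

For example, `convert_characters("Straße & cœur")` gives `"Strasse and coeur"`.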
  15. 15. Insertion of paragraph breaks and paragraph numbers Data Improvements Output in PatBase Source text
  16. 16. • Errors appear in source data so manual checks are essential • Example – Granted patent information from the Indian Patent Office Journal. Three different inventions have incorrectly been given the same publication number Manual checks IN000008
  17. 17. Data quality issues On the Thai patent office website - the same publication number is used for two different applications Patent copy for TH48405 A In PatBase Application number: TH19981004295 Publication number: TH48406 A Application number: TH19981002185 Publication number: TH48405 A Wrong number Correct number Manual checks
  18. 18. • Acquiring data from multiple sources enables us to supplement records, but also alerts us to errors thus ensuring accuracy KR20010012826 A – Glial Cell Line- Derived Neurotrophic Factor Receptors KR20010112826 A – Single phase six pole DC brushless axial fan motor of transistor type Source EPO – Error in information This EPO record is a combination of two inventions. The publication number does not match with the invention. Identifying data errors
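One way to picture how a second source exposes this kind of error is to compare the titles held for the same publication number and flag records whose titles share too few words. This is a hypothetical sketch: the word-overlap measure, the threshold and the function names are assumptions, and which source holds which title in the example is chosen purely for illustration.

```python
def title_overlap(title_a: str, title_b: str) -> float:
    """Jaccard overlap of the lower-cased word sets of two titles (0.0 to 1.0)."""
    words_a, words_b = set(title_a.lower().split()), set(title_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def find_conflicts(source_a: dict, source_b: dict, threshold: float = 0.2):
    """Return publication numbers whose titles disagree between two sources."""
    shared_numbers = source_a.keys() & source_b.keys()
    return sorted(
        number
        for number in shared_numbers
        if title_overlap(source_a[number], source_b[number]) < threshold
    )


# Illustrative data using the two titles quoted on the slide.
office_feed = {"KR20010012826": "Glial Cell Line-Derived Neurotrophic Factor Receptors"}
other_feed = {
    "KR20010012826": "Single phase six pole DC brushless axial fan motor of transistor type"
}
conflicts = find_conflicts(office_feed, other_feed)
```

Flagged numbers would still need a person to decide which source is right, in line with the manual checks described earlier in the talk.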
  19. 19. Incorrect data received from source In cases such as these we correct the error in PatBase and inform the EPO NULL values were supplied in the EPO’s DOCDB file as Applicants Identifying data errors
  20. 20. Example of an incorrect assignment from the USPTO PatBase family 41683901 Excerpt from USPTO assignments database Identifying data errors
  21. 21. Translations • Principle: the English text of an equivalent is always better than the machine translation • All non-Latin texts are machine translated into English and indexed when added to PatBase • On a rolling basis we re-translate texts to benefit from the continuous improvements of translation engines
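The principle above suggests a simple decision rule for whether a family's machine translation should be added, replaced, deleted or left alone. The sketch below is one possible reading, not the actual PatBase logic: the fields of `FamilyText`, the `MtAction` values and the ordering of the checks are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class MtAction(Enum):
    KEEP = "keep"
    ADD = "add"
    REPLACE = "replace"
    DELETE = "delete"


@dataclass
class FamilyText:
    has_english_equivalent: bool   # an English text exists in the family
    has_non_latin_text: bool       # a non-Latin full text exists in the family
    has_machine_translation: bool  # a machine translation is currently stored
    source_text_changed: bool = False  # the non-Latin text was replaced or corrected


def reassess(family: FamilyText) -> MtAction:
    """Decide what to do with the machine translation of a family."""
    # An English equivalent always wins, so any stored translation is redundant.
    if family.has_english_equivalent or not family.has_non_latin_text:
        return MtAction.DELETE if family.has_machine_translation else MtAction.KEEP
    if not family.has_machine_translation:
        return MtAction.ADD
    return MtAction.REPLACE if family.source_text_changed else MtAction.KEEP
```

The rolling re-translation mentioned on the slide could be modelled by treating an engine upgrade the same way as a changed source text.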
  22. 22. Machine translation • Machine translations are made as data is added, removed / rebuilt. This is all done before indexing. • We run a rolling re-translate and re-index program to optimize the quality of our machine translated full-text Original translation, Thai into English Re-translation, Thai into English Original translation, Thai into English Re-translation, Thai into English Translations
  23. 23. Re-translation, Korean into English Original translation, Korean into English Translations
  24. 24. Assignee translations • Non-Latin assignees are indexed • Non-Latin assignees are also translated • The first 10,000 CN and JP assignees have been manually translated by RWS • All others are machine translated until an “official” Latin name appears in the family
  25. 25. Cross-lingual Tool • Initially developed by WIPO, CLIR (Cross Lingual Information Retrieval) allows our users to generate multilingual searches • Using an advanced statistical text analysis system based on the PCT corpus, the cross-lingual search tool identifies variants in multiple languages for search terms entered by the user. => Better translation – translated words originate from PCT applications
  26. 26. • Source: INPADOC • All legal status events are categorised with a PRS code • Challenge: 2628 different PRS codes, some no longer in use • Solution: Grouping similar legal events together: Reassignment; Deemed Withdrawn/Abandoned; Examined; Renewal Fees Paid; Granted; Lapsed/Expired/Ceased/Dead; Licence; Non-Entry into National Phase; National Phase Entry; Opposition Filed/Request for revocation; Published; Restored/Reinstated/Amended; Revoked/Rejected/Annulled/Invalid; Withdrawn/Abandoned/Terminated/Void Legal Status
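Grouping thousands of event codes into a handful of categories is essentially a many-to-one lookup. The sketch below shows the shape of such a mapping using three of the group names from the slide; the codes are deliberately made-up placeholders, not real INPADOC PRS codes, and the function names are hypothetical.

```python
from collections import Counter

# Placeholder codes for illustration only: these are NOT real PRS codes.
CODE_TO_GROUP = {
    "XX-GRANT-1": "Granted",
    "XX-GRANT-2": "Granted",
    "XX-FEE-1": "Renewal Fees Paid",
    "XX-LAPSE-1": "Lapsed/Expired/Ceased/Dead",
}
UNGROUPED = "Ungrouped"


def group_for(code: str) -> str:
    """Return the legal status group for a code, or a marker for unmapped codes."""
    return CODE_TO_GROUP.get(code.strip().upper(), UNGROUPED)


def summarise(event_codes):
    """Count how many events of a record fall into each group."""
    return Counter(group_for(code) for code in event_codes)
```

Routing unmapped codes to an explicit marker, rather than dropping them, makes it visible when new or retired codes appear in the feed.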
  27. 27. Legal Status Timeline
  28. 28. • Most patent databases are structured and optimized for patent searching, not for analytics • At Minesoft, we developed a special database with proprietary meta tags dedicated to analytics • Coverage is important – beware of data gaps • Importance of a web service (API) • Importance of incorporating your own custom data or legal status information in your analysis Analytics
  29. 29. Thank you PatBase celebrates its 10th anniversary Olivier Huc – olivier@minesoft.com