Arts - Text matching to measure patent similarity

Parallel session 1, Monday 19 September

Published in: Data & Analytics
  1. Text Matching to Measure Patent Similarity
     Sam Arts, Faculty of Business and Economics, KU Leuven, sam.arts@kuleuven.be
     Bruno Cassiman, IESE Business School, KU Leuven, bcassiman@iese.edu
     Juan Carlos Gomez, University of Guanajuato, jc.gomez@ugto.mx
     OECD Blue Sky Conference 2016
  2. The United States Patent Classification System (USPC)
     • Prior and current research relies on patent classification (USPC)
       – To identify similar patents (counterfactual control)
         – e.g., Jaffe, Trajtenberg, and Henderson, 1993; Almeida, 1996; Agrawal, Cockburn, and Rosell, 2010
       – To measure similarity between patents and patent portfolios
         – e.g., Argyres, 1996; Ahuja, 2000; Rosenkopf and Almeida, 2003; Makri, Hitt, and Lane, 2010
     • USPC
       – Too broad
       – Changes over time (patents are reclassified)
       – Manually assigned
       – e.g., Thompson and Fox-Kean, 2005; Belenzon and Schankerman, 2013; …
  3. The United States Patent Classification System (USPC)
     • Unclear what the bias is
       – Type I: false positive (dissimilar patents, same USPC)
       – Type II: false negative (similar patents, different USPC)
     • No alternatives
       – Using subclasses instead of classes
         – e.g., Thompson and Fox-Kean, 2005
       – Using all classes instead of primary
         – e.g., Benner and Waldfogel, 2008
     • Unclear how alternatives affect Type I or Type II bias
  4. Text-based measure of similarity
     • Titles and abstracts from all US utility patents granted between 1976 and 2013 (4.4 million)
     • Concatenate title and abstract, lowercase, and eliminate stop words (SMART system, >600 words), words shorter than 2 characters, numbers, and words which appear only once
     • Each patent becomes a collection of unique keywords
     • 526,561 keywords; average of 37 per patent
     • Drop patents with fewer than 10 keywords (0.3% of sample)
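The cleaning steps on this slide can be sketched as follows. This is a minimal illustration, not the authors' Java code: the tiny `STOP_WORDS` set stands in for the SMART list, the tokenizer pattern is an assumption, and "words which appear only once" is read here as words with a single occurrence in the whole corpus, which is also an assumption. The function names `tokenize` and `build_keywords` are hypothetical.

```python
import re
from collections import Counter

# Placeholder stop list (assumption); the slides use the SMART list (>600 words).
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "to", "is", "in", "or"}

# Assumed tokenizer: lowercase alphanumeric runs, keeping internal hyphens
# so that a term like "non-complementary" stays a single token.
TOKEN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(title, abstract):
    """Concatenate title and abstract, lowercase, and apply the per-word filters."""
    text = f"{title} {abstract}".lower()
    return [
        t for t in TOKEN.findall(text)
        if t not in STOP_WORDS      # drop stop words
        and len(t) >= 2             # drop words shorter than 2 characters
        and not t.isdigit()         # drop pure numbers
    ]


def build_keywords(patents, min_keywords=10):
    """patents maps patent id -> (title, abstract); returns id -> set of keywords."""
    tokens = {pid: tokenize(title, abstract)
              for pid, (title, abstract) in patents.items()}
    # Corpus-wide occurrence counts, used to drop words that appear only once.
    counts = Counter(t for toks in tokens.values() for t in toks)
    keywords = {pid: {t for t in toks if counts[t] > 1}
                for pid, toks in tokens.items()}
    # Drop patents with fewer than min_keywords unique keywords.
    return {pid: kws for pid, kws in keywords.items() if len(kws) >= min_keywords}
```

On a toy corpus the threshold would be lowered, for example `build_keywords(patents, min_keywords=3)`; the default of 10 mirrors the cutoff stated on the slide.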
  5. Text matching (instead of USPC)
     • Simple Jaccard index
       – Range 0-1
     • For each of the 4.4 million patents, select the closest text-matched patent within the same year (cf. Jaffe et al., 1993)
       – Minimum Jaccard of 0.05 (0.5% of patents dropped)
       – More patents are dropped when matching on USPC!
     • Average Jaccard 0.24
       – 14 common keywords for 2 patents with 37 keywords each
     • As a baseline, select a distant text-matched patent within the same year (Jaccard = 0, closest filing date)
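A minimal sketch of the Jaccard index and the same-year closest match described on this slide. The brute-force loop, the `year` mapping, and the names `jaccard` and `closest_match` are illustrative assumptions; matching 4.4 million patents would in practice need something more efficient, such as an inverted index over keywords.

```python
def jaccard(a, b):
    """Jaccard index of two keyword sets: |A ∩ B| / |A ∪ B|, in the range 0-1."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def closest_match(pid, keywords, year, min_jaccard=0.05):
    """Return (best_id, score) for the most similar patent from the same year.

    keywords maps patent id -> set of keywords; year maps patent id -> year
    (hypothetical inputs). Returns (None, score) when the best score falls
    below min_jaccard, mirroring the 0.05 cutoff on the slide.
    """
    best_id, best_score = None, 0.0
    for other, kws in keywords.items():
        if other == pid or year[other] != year[pid]:
            continue
        score = jaccard(keywords[pid], kws)
        if score > best_score:
            best_id, best_score = other, score
    if best_score < min_jaccard:
        return None, best_score
    return best_id, best_score
```

As a check on the slide's example, two patents with 37 keywords each and 14 in common give 14 / (37 + 37 - 14) = 14 / 60 ≈ 0.23, close to the reported average of 0.24.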
  6. Validation: closest text-matched patents in same year
     Patent pairs with a larger Jaccard index are more likely to share the same patent family (DOCDB), inventor(s), and assignee(s), and are more likely to cite each other
  7. Validation: expert assessment
     • 5 independent R&D scientists
       – Semiconductor devices, chemical engineering, power plants, genetics, and optical inspection systems
     • For each expert
       – Randomly select 10 baseline patents
       – For each baseline patent, one random patent with Jaccard in each of the ranges
         – 0.00
         – 0.05-0.25
         – 0.25-0.50
         – 0.50-0.75
         – 0.75 onwards
       – Randomize order and ask experts to rate similarity 1-7
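The sampling design for the expert assessment could be sketched as below: one random comparison patent per Jaccard range for a given baseline patent, presented in random order. How the range boundaries are treated (half-open intervals here), and the names `jaccard_bin` and `sample_comparisons`, are assumptions rather than details given on the slide.

```python
import random


def jaccard_bin(score):
    """Map a Jaccard score to one of the five ranges on the slide (or None)."""
    if score == 0.0:
        return "0.00"
    if 0.05 <= score < 0.25:
        return "0.05-0.25"
    if 0.25 <= score < 0.50:
        return "0.25-0.50"
    if 0.50 <= score < 0.75:
        return "0.50-0.75"
    if score >= 0.75:
        return "0.75 onwards"
    return None  # scores strictly between 0 and 0.05 fall outside every range


def sample_comparisons(scores, rng):
    """Pick one random candidate per range and shuffle the presentation order.

    scores maps candidate patent id -> Jaccard score against one baseline patent.
    """
    by_bin = {}
    for pid, score in scores.items():
        label = jaccard_bin(score)
        if label is not None:
            by_bin.setdefault(label, []).append(pid)
    picks = [rng.choice(sorted(ids)) for ids in by_bin.values()]
    rng.shuffle(picks)  # randomize order before showing pairs to the expert
    return picks
```

Passing a seeded generator, for example `sample_comparisons(scores, random.Random(0))`, keeps the draw reproducible.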
  8. Validation: expert assessment
  9. Estimate bias related to USPC
     • For each of the 4.4 million patents, select three USPC-matched patents
     • Three common ways of matching, approximate filing date and …
       – Primary class
         – e.g., Jaffe et al., 1993
         – No match for 2% of patents
       – Primary class and subclass (nested)
         – e.g., Almeida, 1996
         – No match for 20% of patents
       – All classes and subclasses
         – Jaccard overlap in subclasses
         – e.g., Agrawal et al., 2010
         – No match for 4% of patents
  10. Type I error – false positive matches
     • Dissimilar patents, same USPC
     • Low similarity
       – Primary class: 0.054
       – Primary class and subclass (nested): 0.092
       – All classes and subclasses: 0.097
     • Lower bound: % USPC matches with Jaccard = 0
       – Primary class: 12%
       – Primary class and subclass (nested): 4.3%
       – All classes and subclasses: 4.0%
  11. Type II error – false negative matches
     • Similar patents, different USPC
     • Lower bound: % different USPC among patents with Jaccard index of 1
       – Primary class: 22.4%
       – Primary class and subclass (nested): 52.3%
       – All classes and subclasses: 20.0%
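The two lower bounds (the share of USPC matches with Jaccard = 0 for Type I, and the share of pairs with a Jaccard index of 1 that differ in USPC for Type II) reduce to simple proportions. The sketch below assumes the pair scores and USPC agreement flags have already been computed; the function names and input shapes are hypothetical.

```python
def type1_lower_bound(uspc_match_scores):
    """Share of USPC-matched pairs whose text similarity (Jaccard) is exactly 0.

    uspc_match_scores: list of Jaccard scores, one per USPC-matched pair.
    """
    if not uspc_match_scores:
        raise ValueError("no USPC-matched pairs supplied")
    zero = sum(1 for score in uspc_match_scores if score == 0.0)
    return zero / len(uspc_match_scores)


def type2_lower_bound(pairs):
    """Share of textually identical pairs (Jaccard = 1) that differ in USPC.

    pairs: list of (jaccard_score, same_uspc) tuples, where same_uspc is a
    bool under whichever USPC matching rule is being evaluated.
    """
    identical = [same for score, same in pairs if score == 1.0]
    if not identical:
        raise ValueError("no pairs with a Jaccard index of 1")
    return sum(1 for same in identical if not same) / len(identical)
```

Running each function once per matching rule (primary class, nested subclass, all classes and subclasses) would give one Type I and one Type II figure per rule, in the shape of the percentages listed on the slides.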
  12. Validation: superiority of text matching over USPC
     Text-matched patents are more likely to share the same patent family (DOCDB), inventor(s), and assignee(s), and are more likely to cite each other
  13. Validation: superiority of text matching over USPC
  14. Conclusions
     • Text mining
       – To measure patent similarity and select counterfactual control patents
       – Outperforms USPC
         • Fine-grained
         • Does not rely on human classification
         • No changes over time
       – To measure similarity between portfolios, aggregate keywords at portfolio level
     • Bias related to USPC
       – Matching on primary subclass instead of class reduces Type I but increases Type II
       – Matching on all subclasses instead of primary reduces both Type I and Type II
       – An unexpectedly large share of Type I and particularly Type II errors remains
     • Code and data publicly available
       – Java standard libraries, CSV files with cleaned words and 200 closest matches
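One way to read "aggregate keywords at portfolio level" is to take the union of the keyword sets of all patents in a portfolio and compare portfolios with the same Jaccard index. The slides do not specify the aggregation, so the union used here is an assumption, as are the function names.

```python
def jaccard(a, b):
    """Jaccard index of two keyword sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def portfolio_keywords(patent_ids, keywords):
    """Union of the keyword sets of all patents in a portfolio (assumed aggregation)."""
    aggregated = set()
    for pid in patent_ids:
        aggregated |= keywords.get(pid, set())
    return aggregated


def portfolio_similarity(portfolio_a, portfolio_b, keywords):
    """Jaccard similarity between two portfolios' aggregated keyword sets."""
    return jaccard(portfolio_keywords(portfolio_a, keywords),
                   portfolio_keywords(portfolio_b, keywords))
```

A count-based aggregation (keeping keyword frequencies per portfolio) would be an alternative design; the set union simply keeps the portfolio measure on the same 0-1 scale as the patent-level index.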
  15. Text mining
     • Develop new measure of patent similarity based on text
     • Validate new measure
       – Same patent family, assignee, inventors, cite each other
       – Expert assessments
     • Estimate bias related to USPC
     • Validate superiority over USPC
       – Patent family, assignee, inventors, cite each other
       – Expert assessments
  16. Text-based measure of similarity
  17. Text-based measure of similarity
     • Title + abstract: Process for amplifying, detecting, and/or-cloning nucleic acid sequences, The present invention is directed to a process for amplifying and detecting any target nucleic acid sequence contained in a nucleic acid or mixture thereof. The process comprises treating separate complementary strands of the nucleic acid with a molar excess of two oligonucleotide primers, extending the primers to form complementary primer extension products which act as templates for synthesizing the desired nucleic acid sequence, and detecting the sequence so amplified. The steps of the reaction may be carried out stepwise or simultaneously and can be repeated as often as desired. In addition, a specific nucleic acid sequence may be cloned into a vector by using primers to amplify the sequence, which contain restriction sites on their non-complementary ends, and a nucleic acid fragment may be prepared from an existing shorter fragment using the amplification process
     • 52 unique keywords: acid act addition amplification amplified amplify amplifying carried cloned complementary comprises contained desired detecting directed ends excess existing extending extension form fragment invention mixture molar non-complementary nucleic oligonucleotide prepared present primer primers process products reaction repeated restriction separate sequence sequencesthe shorter simultaneously sites specific steps stepwise strands synthesizing target templates treating vector
  18. Validation: superiority of text matching over USPC
