Matching Crunchbase to PATSTAT
Gianluca Tarasconi, ICRIOS DBA
rawpatentdata.blogspot.com
Carlo Menon – OECD, Science, Technology, and Innovation Directorate
In short:
- We present a match between PATSTAT and Crunchbase entities (companies and staff).
- Name matching also benefits from the use of other information (staff vs. inventors).
- The resulting database can be used in a number of different domains, e.g. analysis of the role of IP assets in securing venture capital; characterization of the IP portfolio of high-growth patenting start-ups, of start-ups developing radical or breakthrough innovations, and of inclusive start-ups.

First data source: PATSTAT
- PATSTAT is the short name for the EPO worldwide PATent STATistical Database.
- A single database covering 100 million patents from 90 patent authorities.
- Developed by the European Patent Office (EPO) in cooperation with WIPO, OECD and Eurostat.

Second source: Crunchbase
- Crunchbase is presented on its website as "the premier destination for discovering industry trends, investments, and news about hundreds of thousands of companies globally."
- In the version used for this note (January 2017), the database contains information on more than 490,000 distinct entities (companies and VC investors) located in 199 different countries.

Crunchbase tables
TABLE_NAME                 TABLE_ROWS  DESCRIPTION
acquisitions                    34667  Detail for each acquisition in the dataset
awards                            676
category_groups                   736  Mapping between categories and category groups
competitors                    519237  List of competitors for each organization
customers                      300303  List of customers for each organization
event_relationships            121271  Detail for each event participant in the dataset
events                          32277  Detail for each event in the dataset
funding_rounds                 152470  Detail for each funding round in the dataset
funds                            5519
investment_partners             44258  Partners who are responsible for their firm's investments
investments                    235868  Mapping between investors and investments
investors                       49935  Active investors including organizations and individuals
ipos                            11807  Detail for each IPO in the dataset
jobs                           991323  List of all job and advisory roles
org_parents                      6847  Parent-child mapping for each organization
organization_descriptions      306891  Long descriptions for organizations
organizations                  492960
people                         578694  All people in Crunchbase
people_descriptions            306481
school                          10893

Crunchbase data (I)-(III)
[Figure slides; no text content recoverable.]

Population selected for the match
- PATSTAT: applications at the IP5 offices (EP, US, CN, KR, JP) with priority year >= 2000.
- Crunchbase: all entities, excluding VC investors.

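As a rough illustration of the selection rule above, the sketch below filters a few made-up application records by office and priority year. The record layout (`appln_id`, `office`, `priority_year`) is an assumption for this example and does not reflect actual PATSTAT table or column names; `KR` is assumed to be the code for the Korean office.

```python
from dataclasses import dataclass

# IP5 offices (KR assumed for the Korean office).
IP5_OFFICES = {"EP", "US", "CN", "KR", "JP"}
MIN_PRIORITY_YEAR = 2000


@dataclass(frozen=True)
class Application:
    appln_id: int        # hypothetical identifier
    office: str          # hypothetical field: filing office code
    priority_year: int   # hypothetical field: earliest priority year


def in_population(app: Application) -> bool:
    """Keep IP5 filings with priority year 2000 or later."""
    return app.office in IP5_OFFICES and app.priority_year >= MIN_PRIORITY_YEAR


# Made-up records, for illustration only.
applications = [
    Application(1, "EP", 2003),
    Application(2, "US", 1998),
    Application(3, "BR", 2010),
    Application(4, "KR", 2000),
]
selected = [a.appln_id for a in applications if in_population(a)]
print(selected)
```
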
PATSTAT match issues
- (A) Lack of comprehensive information about applicants: only address information is available, and it is not standardized and often partial or missing.
- (B) Lack of entity disambiguation: the same entity may have several separate database entries (different spellings of a single organization, or name changes over time).
- (C) The distribution of the number of patents per assignee is skewed: a small number of applicants hold thousands of patents, while the large majority hold fewer than five.

Dealing with issue (A): missing address (I)
- 30% of PATSTAT and 25% of CB records had no valid country code.
- For PATSTAT:
  - Find a homonym in the same patent family.
  - If more than one country code is found, the country of the applicant with the highest number of patents (over the full PATSTAT database) is assigned.
  - If no homonym is found and the applicant belongs to a patent family of only one patent (singleton), the country of the patent office is assigned (this helps in particular with applicants filing only at SIPO or JPO).
- The algorithm leaves fewer than 1% of cases unsolved.

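The PATSTAT imputation rules above can be sketched as a small decision function. This is only an illustration under stated assumptions: the inputs (`homonyms`, `family_size`, `office_country`) are hypothetical, pre-computed values, not PATSTAT fields, and the actual procedure may differ in detail.

```python
from typing import Optional


def impute_country(
    homonyms: list[tuple[str, int]],
    family_size: int,
    office_country: Optional[str],
) -> Optional[str]:
    """Return an imputed country code, or None if the case stays unsolved.

    homonyms: (country_code, total_patent_count) for each same-name applicant
        with a valid country code found in the same patent family.
    family_size: number of patents in the applicant's patent family.
    office_country: country of the patent office that received the filing.
    """
    if homonyms:
        # Several candidate codes: keep the one belonging to the applicant
        # holding the most patents overall.
        return max(homonyms, key=lambda h: h[1])[0]
    if family_size == 1:
        # Singleton family: fall back to the patent office's country.
        return office_country
    return None


print(impute_country([("DE", 120), ("US", 15)], 3, "US"))
print(impute_country([], 1, "JP"))
print(impute_country([], 4, "CN"))
```
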
Dealing with issue (A): missing address (II)
- For Crunchbase:
  - The modal country code of the people reported to work for the company.
  - The telephone country code, whenever available and unambiguous.

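A minimal sketch of the two Crunchbase fallbacks follows. Several points are assumptions made for this example and are not stated on the slide: the staff mode is tried before the phone prefix, ties in the staff mode count as unresolved, and the prefix table is a tiny made-up sample.

```python
from collections import Counter
from typing import Optional

# Tiny illustrative prefix table. "+1" is shared by several countries,
# so it is treated as ambiguous and mapped to None.
PHONE_PREFIX_TO_COUNTRY = {"+33": "FR", "+39": "IT", "+49": "DE", "+1": None}


def modal_country(staff_countries: list[str]) -> Optional[str]:
    """Most frequent staff country code, or None on an empty list or a tie."""
    counts = Counter(c for c in staff_countries if c).most_common(2)
    if not counts:
        return None
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return None  # assumed rule: a tie is not a usable mode
    return counts[0][0]


def phone_country(phone: Optional[str]) -> Optional[str]:
    """Country implied by the phone prefix, if known and unambiguous."""
    if not phone:
        return None
    for prefix, country in PHONE_PREFIX_TO_COUNTRY.items():
        if phone.startswith(prefix):
            return country
    return None


def impute_company_country(
    staff_countries: list[str], phone: Optional[str] = None
) -> Optional[str]:
    # Assumed order: staff mode first, phone prefix as fallback.
    return modal_country(staff_countries) or phone_country(phone)


print(impute_company_country(["IT", "IT", "FR"]))
print(impute_company_country([], "+33 1 23 45 67 89"))
```
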
Dealing with issue (B): entity disambiguation (I)
- (a) Standardized names are taken from the EEE-PPAT database, now included in PATSTAT itself.
- (b) Non-ASCII characters are latinized.
- (c) Names are processed further by removing the remaining noise and most of the legal designations.
- Steps (b) and (c) are also applied to CB to ensure compatibility in the match phase.

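Steps (b) and (c) could look roughly like the sketch below. It is not the cleaning routine actually used: the list of legal designations is a short made-up sample, and latinization is approximated with Unicode decomposition, which simply drops characters that have no ASCII base letter.

```python
import re
import unicodedata

# Illustrative, far from exhaustive list of legal designations.
LEGAL_DESIGNATIONS = {"INC", "LTD", "LLC", "CORP", "GMBH", "SA", "SPA", "SRL", "AG", "BV"}


def latinize(name: str) -> str:
    """Strip diacritics: decompose characters, then drop non-ASCII code points."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def clean_name(name: str) -> str:
    """Upper-case, latinize, drop punctuation and trailing legal designations."""
    text = latinize(name).upper()
    text = text.replace(".", "")               # "S.A." -> "SA"
    text = re.sub(r"[^A-Z0-9 ]+", " ", text)   # other punctuation -> space
    tokens = text.split()
    while tokens and tokens[-1] in LEGAL_DESIGNATIONS:
        tokens.pop()
    return " ".join(tokens)


for raw in ["Société Générale S.A.", "Müller GmbH", "Acme, Inc."]:
    print(raw, "->", clean_name(raw))
```
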
Match
- In the name match phase, four criteria are combined, listed below from the strictest to the most permissive:
  1. Perfect match: the names, after removing legal designations, are exactly the same.
  2. Alphanumeric match: the names, keeping only [A-Z] and [0-9], are the same (e.g. I.B.M. = IBM = I B M).
  3. Jaro-Winkler distance: names are broken into tokens and the similarity score is computed from the number of tokens in common, weighted by the inverse of their frequency.
  4. Levenshtein distance (edit distance).
- Criterion 4 was dropped because it added a high number of false positives.

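The first three criteria can be sketched as a cascade. For criterion 3 the code implements the token-level description given above (tokens in common, weighted by inverse frequency) as a weighted Jaccard score; it is not a character-level Jaro-Winkler implementation. The log weighting and the 0.8 threshold are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def alnum_key(name: str) -> str:
    """Keep only A-Z and 0-9, so 'I.B.M.', 'IBM' and 'I B M' collide."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())


def token_weights(names: list[str]) -> dict[str, float]:
    """Weight each token by the log inverse of its frequency across names."""
    doc_freq = Counter(tok for n in names for tok in set(n.upper().split()))
    total = len(names)
    return {tok: math.log(1 + total / count) for tok, count in doc_freq.items()}


def weighted_overlap(a: str, b: str, weights: dict[str, float]) -> float:
    """Weighted share of tokens in common (weighted Jaccard on token sets)."""
    tokens_a, tokens_b = set(a.upper().split()), set(b.upper().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    common = sum(weights.get(t, 1.0) for t in tokens_a & tokens_b)
    return common / sum(weights.get(t, 1.0) for t in union)


def match_level(a: str, b: str, weights: dict[str, float], threshold: float = 0.8) -> str:
    """Return the first (strictest) criterion satisfied by the two names."""
    if a == b:
        return "perfect"
    if alnum_key(a) == alnum_key(b):
        return "alphanumeric"
    if weighted_overlap(a, b, weights) >= threshold:
        return "token"
    return "none"


# Made-up, already cleaned names used to estimate token frequencies.
names = ["ACME ROBOTICS", "ACME SYSTEMS", "I B M", "GLOBEX ROBOTICS SYSTEMS"]
weights = token_weights(names)
print(match_level("I.B.M.", "I B M", weights))
print(match_level("GLOBEX ROBOTICS SYSTEMS", "GLOBEX SYSTEMS ROBOTICS", weights))
print(match_level("ACME ROBOTICS", "ACME SYSTEMS", weights))
```

Rare tokens such as GLOBEX weigh more than frequent ones such as SYSTEMS, so two names sharing only a common word score low.
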
Benchmark against BvD Orbis
- The comparison is based on a small overlapping sample of 7,569 companies that match exactly and unambiguously on company name and country code.
- The benchmark is also used for fine-tuning the threshold of the Jaro-Winkler match.
- First result: 89% precision, 87% recall.

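For reference, precision and recall against a benchmark of known links can be computed as below. The pair identifiers are made up, and representing a match as a (Crunchbase ID, PATSTAT ID) tuple is an assumption for this sketch.

```python
def precision_recall(predicted: set, benchmark: set) -> tuple[float, float]:
    """Precision = correct / predicted; recall = correct / benchmark."""
    true_positives = len(predicted & benchmark)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(benchmark) if benchmark else 0.0
    return precision, recall


# Made-up pairs, for illustration only.
benchmark = {("cb1", "p1"), ("cb2", "p2"), ("cb3", "p3"), ("cb4", "p4"), ("cb5", "p5")}
predicted = {("cb1", "p1"), ("cb2", "p2"), ("cb3", "p9"), ("cb4", "p4")}

precision, recall = precision_recall(predicted, benchmark)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Raising the similarity threshold typically trades recall for precision, which is why a benchmark is useful for choosing it.
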
Filtering and fine-tuning (I)
- The benchmark is also used for fine-tuning the threshold of the Jaro-Winkler match.
- The match is also improved by adding information on inventors matched to the staff of CB companies.

Filtering and fine-tuning (II)
- Inventor-staff match steps:
  1) Name clustering based on string matching (bigrams in common): 300 million pairs, corresponding to 9.8 million PATSTAT person IDs.
  2) Entity disambiguation of PATSTAT inventors (criteria: at least one applicant in common; at least one common IPC4 tag; having one applicant with fewer than 50 inventors; at least one co-inventor in common; and being at most three degrees of distance apart in patenting).

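Step 1 relies on bigrams in common. One common way to turn that into a score is the Dice coefficient over character bigrams, sketched below. The choice of Dice, and of ignoring whitespace and case, is an assumption for this example; the slide does not say which bigram measure was used.

```python
def bigrams(name: str) -> set[str]:
    """Set of character bigrams, ignoring whitespace and case."""
    text = "".join(name.upper().split())
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient: 2 * |common bigrams| / (|bigrams a| + |bigrams b|)."""
    bigrams_a, bigrams_b = bigrams(a), bigrams(b)
    if not bigrams_a or not bigrams_b:
        return 0.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))


print(dice_similarity("John Smith", "Jon Smith"))
print(dice_similarity("John Smith", "Maria Rossi"))
```
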
Filtering and fine-tuning (III)
- Disambiguation of PATSTAT inventors produces 14.9 million possible matches between the Crunchbase sample and the sample of disambiguated PATSTAT inventors.
- 3) Matches are filtered based on: at least one applicant in common; at least one common IPC4 tag; and having one applicant with fewer than 50 inventors.

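The filter in step 3 can be sketched as three boolean tests on a candidate pair. The data layout is hypothetical, and two readings are assumed here rather than taken from the slide: all three conditions must hold together, and the size condition is checked on the inventor's applicants.

```python
from dataclasses import dataclass

MAX_INVENTORS = 50  # "fewer than 50 inventors"


@dataclass
class Candidate:
    # All fields are hypothetical, pre-computed inputs.
    company_applicants: set[str]      # applicants linked to the staff member's company
    inventor_applicants: set[str]     # applicants on the inventor's patents
    company_ipc4: set[str]            # IPC4 tags of the company's patents
    inventor_ipc4: set[str]           # IPC4 tags of the inventor's patents
    applicant_inventor_counts: dict[str, int]  # inventors per applicant


def keep(c: Candidate) -> bool:
    shared_applicant = bool(c.company_applicants & c.inventor_applicants)
    shared_ipc4 = bool(c.company_ipc4 & c.inventor_ipc4)
    # Applicants with an unknown inventor count are treated as too large.
    small_applicant = any(
        c.applicant_inventor_counts.get(a, MAX_INVENTORS) < MAX_INVENTORS
        for a in c.inventor_applicants
    )
    return shared_applicant and shared_ipc4 and small_applicant


good = Candidate({"A1"}, {"A1", "A2"}, {"G06F"}, {"G06F", "H04L"}, {"A1": 12, "A2": 900})
large = Candidate({"A2"}, {"A2"}, {"G06F"}, {"G06F"}, {"A2": 900})
print(keep(good), keep(large))
```
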
Filtering and fine-tuning (IV)
- Matched inventor-staff pairs help to resolve double matches and to fine-tune the name match criteria.
- Final results vs. benchmark: 93% precision, 92% recall.

Final statistics
- Almost 50 thousand companies, out of the 447 thousand listed in Crunchbase (excluding venture capital companies), are found to own one or more patents, for a total of around 12 million patents. Around 220 thousand of those have been applied for by companies created after 2005. The share of patentees among US companies is 15%, but the share doubles for companies reporting at least one funding round.
- Regarding individuals, out of the 578 thousand professionals listed in Crunchbase who could be potential patent inventors, around 25 thousand are found to have a correspondent in PATSTAT. These inventors account for 2.2 million patent applications.
