Matching Crunchbase to PATSTAT
Gianluca Tarasconi, ICRIOS DBA
rawpatentdata.blogspot.com
Carlo Menon – OECD, Science, Technology, and Innovation Directorate
In short:
- We present a match between PATSTAT and Crunchbase entities (companies and staff);
- Name matching also benefits from the use of other information (staff vs. inventors);
- The resulting database can be used in a number of different domains (e.g. analysis of the role of IP assets in securing venture capital; the characterization of the IP portfolio of high-growth patenting start-ups, of start-ups developing radical or breakthrough innovations, and of inclusive start-ups).
First data source: PATSTAT
- PATSTAT is the short name for the EPO worldwide PATent STATistical Database;
- a single database covering 100 million patents from 90 patent authorities;
- developed by the European Patent Office (EPO) in cooperation with WIPO, OECD and Eurostat.
Second source: Crunchbase
- Crunchbase is presented on its website as "the premier destination for discovering industry trends, investments, and news about hundreds of thousands of companies globally."
- In the version used for this note (January 2017), the database contains information on more than 490,000 distinct entities (companies and VC investors) located in 199 different countries.
Crunchbase tables
TABLE_NAME                  TABLE_ROWS  DESCRIPTION
acquisitions                     34667  Detail for each acquisition in the dataset
awards                             676
category_groups                    736  Mapping between categories and category groups
competitors                     519237  List of competitors for each organization
customers                       300303  List of customers for each organization
event_relationships             121271  Detail for each event participant in the dataset
events                           32277  Detail for each event in the dataset
funding_rounds                  152470  Detail for each funding round in the dataset
funds                             5519
investment_partners              44258  Partners who are responsible for their firm's investments
investments                     235868  Mapping between investors and investments
investors                        49935  Active investors including organizations and individuals
ipos                             11807  Detail for each IPO in the dataset
jobs                            991323  List of all job and advisory roles
org_parents                       6847  Parent-child mapping for each organization
organization_descriptions       306891  Long descriptions for organizations
organizations                   492960
people                          578694  All people in Crunchbase
people_descriptions             306481
school                           10893
Crunchbase data (I)
Crunchbase data (II)
Crunchbase data (III)
Population selected for the match
- PATSTAT: IP5 offices (EP, US, CN, KR, JP), priority year >= 2000
- Crunchbase: all entities, excluding VC investors
PATSTAT match issues
- (A) Lack of comprehensive information about applicants: only address information is available, and it is not standardized and often partial or missing.
- (B) Lack of entity disambiguation: the same entity may have several separate database entries (different spellings of a single organization, or name changes over time).
- (C) The distribution of the number of patents per assignee is skewed: a small number of applicants hold thousands of patents, while the large majority hold fewer than five.
Dealing with issue (A): missing address (I)
- 30% of PATSTAT and 25% of CB records had no valid country code
- For PATSTAT:
  - Find a homonym in the same patent family and take its country code.
  - If more than one country code is found, the country of the applicant with the highest number of patents (over the full PATSTAT database) is assigned.
  - If no homonym is found and the applicant belongs to a patent family of only one patent (singleton), the country of the patent office is assigned (this helps in particular with SIPO-only and JPO-only applicants).
- The algorithm leaves fewer than 1% of cases unresolved.
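The PATSTAT fallback chain can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the record layout (dicts with `name`, `country`, `n_patents`) and the function name `impute_country` are hypothetical.

```python
from __future__ import annotations

from typing import Optional


def impute_country(applicant: dict, family_members: list[dict],
                   family_size: int, office_country: str) -> Optional[str]:
    """Guess a country code for an applicant whose own code is missing.

    All field names are hypothetical and chosen for illustration only.
    """
    # Step 1: homonyms (same name) with a known country in the same family.
    homonyms = [m for m in family_members
                if m["name"] == applicant["name"] and m.get("country")]
    if homonyms:
        # Step 2: if several codes are found, keep the country of the
        # homonym with the most patents over the whole database.
        return max(homonyms, key=lambda m: m["n_patents"])["country"]
    # Step 3: no homonym and a singleton family -> country of the patent office.
    if family_size == 1:
        return office_country
    return None  # left unresolved
```

Returning `None` for the remaining cases makes it easy to count what share of applicants stays unresolved.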
Dealing with issue (A): missing address (II)
- For Crunchbase, the missing country code is taken from:
  - the modal country code of the people reported to work for the company;
  - the telephone country code, whenever available and unambiguous.
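A minimal sketch of the Crunchbase fallback is given below. The slide only names the two sources; trying the staff mode first, rejecting tied modes, and the function name `modal_country` are assumptions made for illustration.

```python
from collections import Counter
from typing import Optional


def modal_country(staff_countries: list,
                  phone_country: Optional[str] = None) -> Optional[str]:
    """Pick a country code for a company with no valid code of its own."""
    known = [c for c in staff_countries if c]  # ignore missing values
    if known:
        top = Counter(known).most_common(2)
        # Accept the mode only when it is unique (assumption: ties are rejected).
        if len(top) == 1 or top[0][1] > top[1][1]:
            return top[0][0]
    # Otherwise fall back to the telephone country code, if any.
    return phone_country
```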
Dealing with issue (B): entity disambiguation (I)
- (a) Standardized names are taken from the EEE-PPAT database, now included in PATSTAT itself;
- (b) Non-ASCII characters are latinized;
- (c) Further processing removes the remaining noise and most of the legal designations;
- Steps (b) and (c) are also applied to CB to ensure compatibility in the match phase.
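Steps (b) and (c) might look like the sketch below. The list of legal designations is a short, hypothetical sample, and the exact cleaning rules used for the match are not given in the slides, so everything here is an assumption.

```python
import re
import unicodedata

# Illustrative, deliberately short list of legal designations (an assumption;
# a real cleaning step would need a longer, country-specific list).
LEGAL_FORMS = {"INC", "LTD", "LLC", "CORP", "CO", "GMBH",
               "SA", "SPA", "SRL", "AG", "BV", "PLC"}


def standardize(name: str) -> str:
    # (b) Latinize: decompose accented characters and drop whatever
    # has no ASCII form.
    ascii_name = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore").decode("ascii"))
    # (c) Remove noise: uppercase, join dotted abbreviations (S.A. -> SA),
    # turn other punctuation into spaces, then drop legal designations.
    cleaned = re.sub(r"[^A-Z0-9 ]+", " ", ascii_name.upper().replace(".", ""))
    return " ".join(t for t in cleaned.split() if t not in LEGAL_FORMS)
```

Applying the same function to both PATSTAT and Crunchbase names keeps the two sides comparable in the match phase.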
Match
- In the name match phase, four criteria are combined, listed below from the strictest to the most permissive:
  - 1. Perfect match: the names, after removing legal designations, are exactly the same.
  - 2. Alphanumeric match: the names, keeping only [A-Z] and [0-9], are the same (e.g. I.B.M. = IBM = I B M).
  - 3. Jaro-Winkler distance: names are broken into tokens and the similarity score is computed from the number of tokens in common, weighted by the inverse of their frequency.
  - 4. Levenshtein distance (edit distance).
- (Criterion 4 was dropped because it proved to add a high number of false positives.)
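Criteria 2 and 3 can be illustrated with the sketch below. Note that the slide labels criterion 3 "Jaro-Winkler" but describes a token-based score; the code follows the description, not the Jaro-Winkler string metric. The weighted-Jaccard form and the logarithmic inverse-frequency weight are assumptions, as are all function names.

```python
import math
import re
from collections import Counter


def alnum_key(name: str) -> str:
    """Criterion 2: keep only A-Z and 0-9."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())


def token_weights(names: list) -> dict:
    """Inverse-frequency weight per token: rare tokens count more."""
    doc_freq = Counter(tok for n in names for tok in set(n.upper().split()))
    return {tok: math.log(1 + len(names) / cnt) for tok, cnt in doc_freq.items()}


def token_similarity(a: str, b: str, weights: dict) -> float:
    """Weighted share of tokens in common (weighted Jaccard), in [0, 1]."""
    tokens_a, tokens_b = set(a.upper().split()), set(b.upper().split())

    def weight(tokens):
        # Tokens never seen in the reference list get a neutral weight of 1.
        return sum(weights.get(t, 1.0) for t in tokens)

    union = weight(tokens_a | tokens_b)
    return weight(tokens_a & tokens_b) / union if union else 0.0
```

With weights of this kind, sharing a rare token such as a distinctive brand name raises the score more than sharing a common one, and a pair is accepted when the score passes a threshold tuned on a benchmark.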
Benchmark against BvD Orbis
- The comparison is based on a small overlapping sample of 7,569 companies that match exactly and unambiguously on company name and country code.
- The benchmark is also used to fine-tune the threshold of the Jaro-Winkler match.
- First result: 89% precision, 87% recall.
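For reference, precision and recall against a benchmark can be computed as in this short sketch. Representing each match as a (Crunchbase ID, PATSTAT ID) pair is an assumption made for illustration.

```python
def precision_recall(predicted: set, truth: set) -> tuple:
    """Compare predicted match pairs with benchmark pairs.

    precision = correct predicted pairs / all predicted pairs
    recall    = correct predicted pairs / all benchmark pairs
    """
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

Running this for a range of similarity thresholds is one way to pick the threshold that balances the two measures.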
Filtering and fine-tuning (I)
- The benchmark is also used to fine-tune the threshold of the Jaro-Winkler match;
- The match is further improved by adding information on inventors matched to the staff of CB companies.
Filtering and fine-tuning (II)
- Inventor-staff match steps:
  - 1) Name clustering based on string matching [bigrams in common]: 300 million pairs, corresponding to 9.8 million PATSTAT person IDs.
  - 2) Entity disambiguation of PATSTAT inventors (three criteria: at least one applicant in common; at least one common IPC4 tag; having one applicant with fewer than 50 inventors; at least one co-inventor in common; and being at most three degrees of distance apart in patenting).
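Step 1 relies on bigrams in common. A minimal sketch of such a comparison is shown below; the slide does not say how the shared bigrams are turned into a score, so the Dice coefficient used here is an assumption.

```python
def bigrams(name: str) -> set:
    """Set of character bigrams of an uppercased, whitespace-normalized name."""
    s = " ".join(name.upper().split())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient on character bigrams: 2*|A & B| / (|A| + |B|)."""
    bigrams_a, bigrams_b = bigrams(a), bigrams(b)
    if not bigrams_a or not bigrams_b:
        return 0.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))
```

A score of this kind tolerates small spelling differences, such as "John Smith" against "Jon Smith", while keeping unrelated names far apart.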
Filtering and fine-tuning (III)
- Disambiguation of PATSTAT inventors produces 14.9 million possible matches between the sample of Crunchbase staff and the sample of disambiguated PATSTAT inventors.
- 3) Matches are filtered based on: at least one applicant in common; at least one common IPC4 tag; and having one applicant with fewer than 50 inventors.
Filtering and fine-tuning (IV)
- The matched inventors and staff help to resolve double matches and to fine-tune the name match criteria.
- Final results vs. benchmark: 93% precision, 92% recall.
Final statistics
- Almost 50 thousand companies, out of the 447 thousand listed in Crunchbase (excluding venture capital companies), are found to own one or more patents, for a total of around 12 million patents. Around 220 thousand of those patents were applied for by companies created after 2005. The share of patentees among US companies is 15%, but the share doubles for companies reporting at least one funding round.
- Regarding individuals, out of the 578 thousand professionals listed in Crunchbase who could be potential patent inventors, around 25 thousand are found to have a counterpart in PATSTAT. These inventors account for 2.2 million patent applications.

Matching PATSTAT to Crunchbase

  • 1. Matching Crunchbase to PATSTAT Gianluca Tarasconi, ICRIOS DBA rawpatentdata.blogspot.com Carlo Menon – OECD, Science, Technology, and Innovation Directorate
  • 2. In short: 1 - We present here a match between PATSTAT and Crunchbase entities (companies and staff); - Names matching also benefits for the use of other information (staff vs inventors); - The resulting database can be used in a number of different domains (fi analysis of the role of IP assets in securing venture capital; the characterization of the IP portfolio of high-growth patenting start-ups, of start-ups developing radical or breakthrough innovations, and of inclusive start-ups)
  • 3. First data source: PATSTAT  PATSTAT is the short name for _EPO worldwide PATent STATistical Database  a single database covering 100 million patents from 90 patent authorities  developed by European patent Office (EPO) in cooperation with WIPO, OECD and Eurostat. 10
  • 4. Second source: Crunchbase  CrunchBase is presented in its website as “the premier destination for discovering industry trends, investments, and news about hundreds of thousands of companies globally."  In the version used for this note, (January 2017), the database contains information on more than 490.000 distinct entities (companies and vc investors) located in 199 different countries; 10
  • 5. Crunchbase tables TABLE_NAME TABLE_ROWS AVG_ROW_LENGTH 'acquisitions' 34667 Detail for each acquisition in the dataset 'awards' 676 'category_groups' 736 Mapping between categories and category groups 'competitors' 519237 List of competitors for each organization 'customers' 300303 List of customers for each organization 'event_relationships' 121271 Detail for each event participant in the dataset 'events' 32277 Detail for each event in the dataset 'funding_rounds' 152470 Detail for each funding round in the dataset 'funds' 5519 'investment_partners' 44258 Partners who are responsible for their firm's investments 'investments' 235868 Mapping between investors and investments 'investors' 49935 Active investors including organizations and individuals 'ipos' 11807 Detail for each IPO in the dataset 'jobs' 991323 List of all job and advisory roles 'org_parents' 6847 Parent-child mapping for each organization 'organization_descriptions' 306891 Long descriptions for organizations 'organizations' 492960 'people' 578694 All people in Crunchbase 'people_descriptions' 306481 'school' 10893 10
  • 9. Population selected for the match  PATSTAT: IP5 (EP US CN R JP) priority year >=2000  Crunchbase: all entities excluded VC 10
  • 10. PATSTAT match issues  (A) Lack of comprehensive information about applicants (only address information is available, not standardized and often partial or missing).  (B) Lack of entities disambiguation = the same entity may have several separate database entries (different spellings of a single organization or name changes over time).  (C) The distribution of the number of patents per assignee is skewed; a small number of applicants hold thousands of patents, the large majority less than five patents. 10
  • 11. Dealing with issues (A) address missing (I)
      - 30% of PATSTAT and 25% of CB entries had no valid country code.
      - For PATSTAT:
          - Find a homonym in the same patent family.
          - If more than one country code is found, assign the country of the applicant with the higher number of patents (over the full PATSTAT database).
          - If no homonym is found and the applicant belongs to a patent family of only one patent (singleton), assign the nationality of the patent office (this helps in particular with SIPO-only and JPO-only applicants).
      - The algorithm leaves < 1% unsolved.
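The PATSTAT fallback rules above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the record layout (dicts with `person_id`, `name`, `country`, `family_id`, `office`), the function name `impute_country`, and the two lookup tables `patent_counts` and `family_sizes` are all hypothetical and do not correspond to actual PATSTAT table or column names.

```python
from collections import defaultdict

def impute_country(applicants, patent_counts, family_sizes):
    """Return {person_id: country or None} for records with a missing country."""
    # Index records with a known country by (family, name) to find homonyms.
    known = defaultdict(list)
    for a in applicants:
        if a["country"]:
            known[(a["family_id"], a["name"])].append(a)

    imputed = {}
    for a in applicants:
        if a["country"]:
            continue
        homonyms = known.get((a["family_id"], a["name"]), [])
        if homonyms:
            # Several candidates: keep the homonym with the most patents overall.
            best = max(homonyms, key=lambda h: patent_counts.get(h["person_id"], 0))
            imputed[a["person_id"]] = best["country"]
        elif family_sizes.get(a["family_id"], 0) == 1:
            # Singleton family: fall back to the country of the patent office.
            imputed[a["person_id"]] = a["office"]
        else:
            imputed[a["person_id"]] = None  # left unresolved
    return imputed


applicants = [
    {"person_id": 1, "name": "ACME", "country": "US", "family_id": 10, "office": "US"},
    {"person_id": 2, "name": "ACME", "country": "DE", "family_id": 10, "office": "US"},
    {"person_id": 3, "name": "ACME", "country": None, "family_id": 10, "office": "US"},
    {"person_id": 4, "name": "SOLO", "country": None, "family_id": 11, "office": "JP"},
    {"person_id": 5, "name": "LOST", "country": None, "family_id": 12, "office": "CN"},
]
result = impute_country(applicants, {1: 5, 2: 50}, {10: 3, 11: 1, 12: 2})
```

In this toy input, record 3 takes the country of the homonym with the larger portfolio, record 4 falls back to its patent office, and record 5 stays unresolved.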
  • 12. Dealing with issues (A) address missing (II)
      - For Crunchbase:
          - the modal country code of the people reported to work for the company;
          - the telephone country code, whenever available and unambiguous.
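A possible reading of the two Crunchbase rules, as a short sketch. Several points are assumptions made for illustration: the rules are applied in sequence (staff first, phone second), ties in the staff countries are treated as ambiguous, and `DIAL_CODES` is a tiny hypothetical lookup rather than a complete dialling-code table.

```python
from collections import Counter

# Illustrative subset only; "+1" is shared by several countries, hence ambiguous.
DIAL_CODES = {"+39": ["IT"], "+33": ["FR"], "+1": ["US", "CA"]}

def modal_country(staff_countries):
    """Most frequent country code among staff, or None if empty or tied."""
    top = Counter(c for c in staff_countries if c).most_common(2)
    if not top:
        return None
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # tie: treated as ambiguous (assumption)
    return top[0][0]

def phone_country(phone):
    """Country implied by the dialling prefix, only when it is unambiguous."""
    if not phone:
        return None
    for prefix, countries in DIAL_CODES.items():
        if phone.startswith(prefix):
            return countries[0] if len(countries) == 1 else None
    return None

def crunchbase_country(staff_countries, phone=None):
    """Try the staff-based rule first, then the phone-based rule."""
    return modal_country(staff_countries) or phone_country(phone)
```

For example, `crunchbase_country(["US", "US", "IT"])` resolves from the staff list alone, while a company with no staff countries and a `+1` number stays unresolved.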
  • 13. Dealing with issues (B) entity disambiguation (I)
      - (a) Standardized names are taken from the EEE-PPAT database, now included in PATSTAT itself;
      - (b) Non-ASCII characters are latinized;
      - (c) Names are further processed by removing the remaining noise and most of the legal designations.
      - Steps (b) and (c) are also applied to CB to ensure compatibility in the match phase.
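Steps (b) and (c) could look something like the sketch below. The function name `clean_name` and the `LEGAL_FORMS` set are hypothetical; the list of legal designations is deliberately short and is not the one used for the actual match.

```python
import re
import unicodedata

# Illustrative, incomplete list of legal designations (assumption).
LEGAL_FORMS = {"INC", "LTD", "LLC", "CORP", "GMBH", "SA", "SPA", "SRL", "AG", "BV", "CO"}

def clean_name(name):
    # (b) Latinize: decompose accented characters, then drop non-ASCII marks.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    # Remove dots first so that "S.A." collapses to "SA" instead of "S A".
    upper = ascii_name.upper().replace(".", "")
    # (c) Treat any remaining punctuation as a separator.
    tokens = re.sub(r"[^A-Z0-9]+", " ", upper).split()
    # (c) Drop legal designations.
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)
```

Applying the same function to both PATSTAT and Crunchbase names keeps the two sides comparable, which is the point made on the slide.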
  • 14. Match
      - In the name match phase, four criteria are combined, listed below in order of decreasing match accuracy:
          1. Perfect match: the names, after removing legal designations, are exactly the same.
          2. Alphanumeric match: the names, keeping only [A-Z] and [0-9], are the same (e.g.: I.B.M. = IBM = I B M).
          3. Jaro-Winkler distance: names are broken into tokens and the similarity score is computed from the number of tokens in common, weighted by the inverse of their frequency.
          4. Levenshtein distance (edit distance).
      - (Criterion 4 was dropped since it proved to add a high number of false positives.)
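Criteria 2 and 3 can be illustrated with a small sketch. It follows the slide's own description of criterion 3 (tokens in common, weighted by inverse frequency) and is not the textbook character-based Jaro-Winkler measure. The logarithmic weights and the weighted-Jaccard form of the score are assumptions, as are all function names.

```python
import math
import re
from collections import Counter

def alnum_key(name):
    """Keep only A-Z and 0-9, so 'I.B.M.', 'IBM' and 'I B M' coincide."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())

def token_weights(names):
    """Inverse-frequency weight per token: rare tokens count more."""
    doc_freq = Counter(tok for n in names for tok in set(n.split()))
    total = len(names)
    return {tok: math.log(1 + total / cnt) for tok, cnt in doc_freq.items()}

def token_similarity(a, b, weights):
    """Weighted share of tokens in common (weighted Jaccard), in [0, 1]."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    if not union:
        return 0.0
    def weigh(tokens):
        # Tokens unseen in the corpus get a neutral weight of 1.0.
        return sum(weights.get(t, 1.0) for t in tokens)
    return weigh(ta & tb) / weigh(union)


corpus = ["ALPHA TECHNOLOGIES", "BETA TECHNOLOGIES", "GAMMA TECHNOLOGIES", "ALPHA SYSTEMS"]
weights = token_weights(corpus)
rare_shared = token_similarity("ALPHA TECHNOLOGIES", "ALPHA SYSTEMS", weights)
common_shared = token_similarity("ALPHA TECHNOLOGIES", "BETA TECHNOLOGIES", weights)
```

Sharing the rarer token ALPHA yields a higher score than sharing the common token TECHNOLOGIES, which is the effect the inverse-frequency weighting is meant to produce.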
  • 15. Benchmark against BVD Orbis
      - The comparison is based on a small overlapping sample of 7,569 companies that match exactly and unambiguously by company name and country code.
      - The benchmark is also used for fine-tuning the threshold of the Jaro-Winkler match.
      - First result: 89% precision, 87% recall.
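For reference, precision and recall against a benchmark set of known links, and a simple threshold sweep, can be computed as below. The pair representation, the function names, and the use of F1 to pick the threshold are assumptions for illustration; the slides do not say which selection rule was used.

```python
def precision_recall(predicted, benchmark):
    """Precision and recall of predicted links against benchmark links."""
    predicted, benchmark = set(predicted), set(benchmark)
    true_pos = len(predicted & benchmark)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(benchmark) if benchmark else 0.0
    return precision, recall

def best_threshold(scored_pairs, benchmark, thresholds):
    """Pick the threshold with the highest F1 (selection rule is an assumption)."""
    def f1(threshold):
        kept = {pair for pair, score in scored_pairs if score >= threshold}
        p, r = precision_recall(kept, benchmark)
        return 2 * p * r / (p + r) if p + r else 0.0
    return max(thresholds, key=f1)


benchmark = {("a", "1"), ("b", "2"), ("c", "3")}
scored = [(("a", "1"), 0.95), (("b", "2"), 0.85), (("c", "9"), 0.80), (("c", "3"), 0.70)]
chosen = best_threshold(scored, benchmark, [0.9, 0.85, 0.8, 0.7])
```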
  • 16. Filtering and fine-tuning (I)
      - The benchmark is also used for fine-tuning the threshold of the Jaro-Winkler match.
      - The match is further improved by adding information on inventors matched to the staff of CB companies.
  • 17. Filtering and fine-tuning (II)
      - Inventors-staff match steps:
          1) Name clustering based on string matching [bigrams in common]: 300 million pairs, corresponding to 9.8 million PATSTAT person IDs.
          2) Entity disambiguation of PATSTAT inventors (criteria: at least one applicant in common; at least one common IPC4 tag; having one applicant with fewer than 50 inventors; at least one co-inventor in common; and being at most three degrees of distance apart in patenting).
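The slide does not specify how "bigrams in common" is turned into a score. One common choice, shown here purely as an assumption, is the Dice coefficient over sets of character bigrams; the function names are hypothetical.

```python
def bigrams(name):
    """Set of character bigrams of a name, ignoring case and spaces."""
    s = name.upper().replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bigram_similarity(a, b):
    """Dice coefficient on character bigram sets, in [0, 1]."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))
```

For instance, "JON SMITH" and "JOHN SMITH" share 6 bigrams out of 7 and 8 respectively, so their score is 12/15 = 0.8; pairs above a chosen cut-off would be grouped as candidate duplicates.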
  • 18. Filtering and fine-tuning (III)
      - Disambiguation of PATSTAT inventors produces 14.9 million possible matches between the sample of Crunchbase and the sample of disambiguated PATSTAT inventors.
          3) Matches are filtered based on: at least one applicant in common; at least one common IPC4 tag; and having one applicant with fewer than 50 inventors.
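The three filters in step 3 can be written as a single predicate. This is a sketch under stated assumptions: each side of a candidate match is represented as a dict with `applicants` and `ipc4` sets, the three conditions are combined with AND, and the "fewer than 50 inventors" condition is applied to a shared applicant. None of these details is confirmed by the slides, and the names are hypothetical.

```python
def keep_match(left, right, inventors_per_applicant, max_inventors=50):
    """True if a candidate pair passes all three filters (assumed conjunctive)."""
    shared_applicants = left["applicants"] & right["applicants"]
    if not shared_applicants:
        return False  # no applicant in common
    if not (left["ipc4"] & right["ipc4"]):
        return False  # no IPC4 tag in common
    # At least one shared applicant must be small; unknown sizes do not pass.
    return any(
        inventors_per_applicant.get(a, max_inventors) < max_inventors
        for a in shared_applicants
    )


staff = {"applicants": {"ACME"}, "ipc4": {"G06F"}}
inventor = {"applicants": {"ACME", "BIGCO"}, "ipc4": {"G06F", "H04L"}}
small_firm = keep_match(staff, inventor, {"ACME": 12, "BIGCO": 4000})
large_firm = keep_match(staff, inventor, {"ACME": 4000, "BIGCO": 4000})
```

The size condition reflects the intuition that a shared name at a small applicant is much less likely to be a coincidence than at an applicant with thousands of inventors.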
  • 19. Filtering and fine-tuning (IV)
      - Matched inventors-staff pairs help to resolve double matches and to fine-tune the name match criteria.
      - Final results vs. benchmark: 93% precision, 92% recall.
  • 20. Final statistics
      - Almost 50 thousand companies, out of the 447 thousand listed in CrunchBase (excluding venture capital companies), are found to own one or more patents, for a total of around 12 million patents. Around 220 thousand of those have been applied for by companies created after 2005. The share of patentees among US companies is 15%, but the share doubles for companies reporting at least one funding round.
      - Regarding individuals, out of the 578 thousand professionals listed in CrunchBase who could be potential patent inventors, around 25 thousand are found to have a counterpart in PATSTAT. These inventors account for 2.2 million patent applications.