2. In short:
- We present here a match between PATSTAT
and Crunchbase entities (companies and staff);
- Name matching also benefits from the use of
other information (staff vs. inventors);
- The resulting database can be used in a
number of different domains (e.g. analysis of the role
of IP assets in securing venture capital; the
characterization of the IP portfolio of high-growth
patenting start-ups, of start-ups developing radical or
breakthrough innovations, and of inclusive start-ups).
3. First data source: PATSTAT
PATSTAT is the short name for the EPO
Worldwide PATent STATistical Database,
a single database covering 100 million
patents from 90 patent authorities,
developed by the European Patent Office
(EPO) in cooperation with WIPO, OECD
and Eurostat.
4. Second source: Crunchbase
Crunchbase is presented on its website as "the
premier destination for discovering industry
trends, investments, and news about
hundreds of thousands of companies
globally."
In the version used for this note (January
2017), the database contains information on
more than 490,000 distinct entities
(companies and VC investors) located in 199
different countries.
5. Crunchbase tables
TABLE_NAME                   TABLE_ROWS  DESCRIPTION
'acquisitions'               34667       Detail for each acquisition in the dataset
'awards'                     676
'category_groups'            736         Mapping between categories and category groups
'competitors'                519237      List of competitors for each organization
'customers'                  300303      List of customers for each organization
'event_relationships'        121271      Detail for each event participant in the dataset
'events'                     32277       Detail for each event in the dataset
'funding_rounds'             152470      Detail for each funding round in the dataset
'funds'                      5519
'investment_partners'        44258       Partners who are responsible for their firm's investments
'investments'                235868      Mapping between investors and investments
'investors'                  49935       Active investors including organizations and individuals
'ipos'                       11807       Detail for each IPO in the dataset
'jobs'                       991323      List of all job and advisory roles
'org_parents'                6847        Parent-child mapping for each organization
'organization_descriptions'  306891      Long descriptions for organizations
'organizations'              492960
'people'                     578694      All people in Crunchbase
'people_descriptions'        306481
'school'                     10893
9. Population selected for the match
PATSTAT: IP5 offices (EP, US, CN, KR, JP), priority
year >= 2000
Crunchbase: all entities excluding VC investors
10. PATSTAT match issues
(A) Lack of comprehensive information about
applicants (only address information is available, not
standardized and often partial or missing).
(B) Lack of entity disambiguation: the same entity
may have several separate database entries (different
spellings of a single organization or name changes
over time).
(C) The distribution of the number of patents per
assignee is skewed: a small number of applicants
hold thousands of patents, while the large majority hold
fewer than five.
11. Dealing with issues (A) address missing (I)
30% of PATSTAT and 25% of CB records had no
valid country code.
For PATSTAT:
- Find a homonym in the same patent family.
- If more than one country code is found, the country of the
applicant with the higher number of patents (over the full
PATSTAT database) is assigned.
- If no homonym is found and the applicant belongs to a patent
family of only one patent (singleton), the country of the patent
office is assigned (this helps in particular to resolve cases of
SIPO-only and JPO-only applicants).
The algorithm leaves fewer than 1% of cases unresolved.
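The fallback rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout (`name`, `country`, `office`, `n_patents`) and the function name `impute_country` are hypothetical.

```python
from typing import Optional


def impute_country(applicant: dict, family_members: list, family_size: int) -> Optional[str]:
    """Return a country code for an applicant lacking one, or None if unresolved.

    applicant      -- dict with hypothetical keys "name" and "office"
    family_members -- other applicant records found in the same patent family,
                      each with "name", "country" and "n_patents"
    family_size    -- number of patents in the family
    """
    # Rule 1: look for homonyms in the same family that carry a country code.
    homonyms = [m for m in family_members
                if m["name"] == applicant["name"] and m.get("country")]
    if homonyms:
        # Rule 2: with several candidate codes, keep that of the homonym
        # holding the most patents over the full database.
        return max(homonyms, key=lambda m: m["n_patents"])["country"]
    # Rule 3: no homonym and a singleton family -> country of the patent office.
    if family_size == 1:
        return applicant["office"]
    return None
```

Records for which all three rules fail stay unresolved, which corresponds to the residual share reported above.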
12. Dealing with issues (A) address missing (II)
For Crunchbase:
- the modal country code of the people
reported to work for the company.
- telephone country code, whenever
available and unambiguous.
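A minimal sketch of the two Crunchbase rules, assuming that the staff-based rule is tried first (the slide does not state the order). The prefix table and the function name `company_country` are hypothetical and deliberately tiny.

```python
from collections import Counter
from typing import Optional

# Hypothetical lookup restricted to unambiguous prefixes
# ("+1" is shared by several countries, so it is left out).
PHONE_PREFIX_TO_COUNTRY = {"+33": "FR", "+39": "IT", "+49": "DE", "+81": "JP"}


def company_country(staff_countries: list, phone: Optional[str]) -> Optional[str]:
    """Guess a company's country from its staff, then from its phone number."""
    # Rule 1: modal (most frequent) country code among the reported staff.
    known = [c for c in staff_countries if c]
    if known:
        return Counter(known).most_common(1)[0][0]
    # Rule 2: telephone country code, only when it maps to a single country.
    if phone:
        for prefix, country in PHONE_PREFIX_TO_COUNTRY.items():
            if phone.startswith(prefix):
                return country
    return None
```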
13. Dealing with issues (B) entities disambiguation (I)
(a) Standardized names from the EEE-PPAT
database, now included in PATSTAT itself;
(b) Non-ASCII characters latinized;
(c) Further processing to remove the
remaining noise and most of the legal
designations.
Steps (b) and (c) are also applied to CB to
ensure compatibility in the match phase.
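Steps (b) and (c) could look roughly like the sketch below. The list of legal designations is an illustrative, incomplete assumption, and `normalize_name` is a hypothetical helper, not the cleaning routine actually used.

```python
import re
import unicodedata

# Illustrative and incomplete list of legal designations (an assumption).
LEGAL_DESIGNATIONS = {"INC", "LTD", "LLC", "CORP", "GMBH", "SA", "SRL", "SPA", "AG", "BV", "CO"}


def normalize_name(name: str) -> str:
    # Step (b): decompose accented characters and drop the combining marks.
    # Characters with no ASCII decomposition are simply dropped here.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    # Step (c): upper-case, glue dotted abbreviations ("S.A." -> "SA"),
    # turn the remaining punctuation into spaces, drop legal designations.
    cleaned = ascii_name.upper().replace(".", "")
    tokens = re.sub(r"[^A-Z0-9 ]", " ", cleaned).split()
    return " ".join(t for t in tokens if t not in LEGAL_DESIGNATIONS)
```

Applying the same function to both PATSTAT and Crunchbase names keeps the two sides comparable in the match phase.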
14. Match
In the name match phase, four criteria are combined, listed below
in order of decreasing match accuracy:
1. Perfect match: where the names, after removing legal designations, are
exactly the same.
2. Alphanumeric match: where the names, keeping only [A-Z]
and [0-9] are the same (e.g.: I.B.M. = IBM = I B M).
3. Jaro-Winkler distance: names are broken into tokens and the
similarity score is computed by the number of tokens in common,
weighted on the inverse of frequency.
4. Levenshtein distance (edit distance).
(Criterion 4 was dropped because it added a high number of false
positives.)
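The first three criteria can be sketched as a cascade. Note that criterion 3 is implemented here as it is described on the slide (tokens in common, weighted by inverse frequency), which is a token-overlap score rather than the character-based textbook Jaro-Winkler measure. The weighting formula, the function names and the 0.8 threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import Optional


def alnum_key(name: str) -> str:
    """Criterion 2: keep only A-Z and 0-9."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())


def idf_weights(names: list) -> dict:
    """Inverse-frequency weight per token: rare tokens weigh more."""
    doc_freq = Counter(tok for n in names for tok in set(n.upper().split()))
    total = len(names)
    return {tok: math.log(1 + total / df) for tok, df in doc_freq.items()}


def token_similarity(a: str, b: str, weights: dict) -> float:
    """Weighted share of tokens in common (weighted Jaccard), between 0 and 1."""
    ta, tb = set(a.upper().split()), set(b.upper().split())
    union = ta | tb
    if not union:
        return 0.0
    common = sum(weights.get(t, 1.0) for t in ta & tb)
    return common / sum(weights.get(t, 1.0) for t in union)


def match_level(a: str, b: str, weights: dict, threshold: float = 0.8) -> Optional[str]:
    """Return the first criterion that accepts the pair, or None."""
    if a == b:
        return "perfect"
    if alnum_key(a) == alnum_key(b):
        return "alphanumeric"
    if token_similarity(a, b, weights) >= threshold:
        return "token"
    return None
```

With weights computed over the whole name list, a shared frequent token such as HOLDING contributes little, while a shared rare token contributes a lot.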
15. Benchmark against BVD Orbis
The comparison is based on a small
overlapping sample of 7,569 companies
that match exactly and unambiguously
on company name and country code.
The benchmark is also used for fine-tuning
the threshold of the Jaro-Winkler match.
First result: 89% precision, 87% recall
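For reference, precision and recall against a benchmark can be computed as below. The sketch assumes that both the algorithm's output and the benchmark are available as sets of (company, applicant) identifier pairs; the function name is hypothetical.

```python
def precision_recall(predicted: set, truth: set) -> tuple:
    """Precision and recall of predicted match pairs against benchmark pairs."""
    true_pos = len(predicted & truth)
    # Precision: share of proposed matches that the benchmark confirms.
    precision = true_pos / len(predicted) if predicted else 0.0
    # Recall: share of benchmark matches that the algorithm recovers.
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall
```

Raising the similarity threshold typically trades recall for precision, which is why a benchmark is useful for choosing it.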
16. Filtering and finetuning (I)
The benchmark is also used for fine-tuning
the threshold of the Jaro-Winkler match.
The match is further improved by adding information
on inventors matched to the staff of CB
companies.
17. Filtering and finetuning (II)
Inventor-staff match steps:
1) name clustering based on string
matching [bigrams in common]: 300 million
pairs, corresponding to 9.8 million PATSTAT person IDs.
2) entity disambiguation of
PATSTAT inventors (criteria: at least one
applicant in common; at least one common IPC4 tag;
having one applicant with fewer than 50 inventors; at
least one co-inventor in common; and being at
most three degrees of distance apart in patenting)
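The "bigrams in common" comparison of step 1 can be illustrated with a Dice coefficient over character bigrams. The choice of the Dice coefficient is an assumption: the slide only says that names are compared through the bigrams they share.

```python
def bigrams(name: str) -> set:
    """Set of two-character substrings of the upper-cased name."""
    s = name.upper()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient on bigram sets: 1.0 for identical sets, 0.0 for disjoint ones."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))
```

Pairs scoring above some cut-off would then be kept as candidate clusters; spelling variants such as "JOHN SMITH" and "JON SMITH" score high, unrelated names score low.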
18. Filtering and finetuning (III)
Disambiguation of PATSTAT inventors
produces 14.9 million possible matches
between the sample of Crunchbase and
the sample of disambiguated PATSTAT
inventors
3) Matches are filtered based on: at least
one applicant in common; at least one
common IPC4 tag; and having one
applicant with fewer than 50 inventors.
19. Filtering and finetuning (IV)
Matched inventor-staff pairs help to resolve
double matches and to fine-tune the name match
criteria.
Final results vs. benchmark: 93%
precision, 92% recall
20. Final statistics
Almost 50 thousand companies, out of the 447 thousand listed in
CrunchBase (excluding venture capital companies), are found to
own one or more patents, for a total of around 12 million patents.
Around 220 thousand of those have been applied for by
companies created after 2005. The share of patentees for US
companies is 15%, but the share doubles for companies
reporting at least one funding round.
Regarding individuals, out of the 578 thousand professionals
listed in CrunchBase who could be potential patent inventors,
around 25 thousand are found to have a counterpart in
PATSTAT. These inventors account for 2.2 million patent
applications.