PATSTAT users 7 sins

PATSTAT users 7 deadly sins
Gianluca Tarasconi, ICRIOS DBA
rawpatentdata.blogspot.com
Leuven 20/9/2017

In short
- This presentation shows 7 common errors users may make when they use PATSTAT.
- It is ideally the continuation of ‘PATSTAT 7 deadly sins’ from 2013.
- Nevertheless, there is only one sin users have to avoid when using patent data:
  - … SLOTH …

Inventors / applicants are not always listed (I)
Some applications lack inventor and/or applicant data
SELECT
Sum(If(b.APPLN_ID IS NULL, 1, 0)) AS noperson,
Count(c.APPLN_ID) AS n_APPLN_ID
FROM
patstat.tls207_pers_appln b
RIGHT JOIN patstat.tls201_appln c ON b.APPLN_ID = c.APPLN_ID
WHERE
c.APPLN_KIND <> 'D2'
n appln_id no person %
221.595.818 18.202.821 9%
Autumn 2016 data

Inventors / applicants are not always listed (II)
- Limited to A and W applications and to offices with more than 10.000 applications
appln_auth appln_kind noperson n_APPLN_ID perc
LU A 57057 88522 64%
BE A 453348 784265 58%
NL A 382777 681266 56%
SE A 552912 1345982 41%
AT A 154041 751803 20%
CH A 341522 1839496 19%
FR A 793741 4501015 18%
DD A 99003 651159 15%
EA A 17272 118772 15%
GT A 2423 17932 14%
CA A 928350 7300631 13%
CS A 46955 381949 12%
GB A 579052 4924739 12%
DK A 68952 589986 12%

Person_id is not an entity id (I)
- A person_id in PATSTAT does not identify an entity but a distinct name-address-country combination
  - Same entity → several person_ids
  - Same person_id → several entities

Person_id is not an entity id: top inventors
SELECT
a.PERSON_NAME, a.PERSON_ADDRESS, a.PERSON_CTRY_CODE,
Count(c.APPLN_ID) AS Count_APPLN_ID,
Min(c.EARLIEST_FILING_YEAR) AS Min_EARLIEST_FILING_YEAR,
Max(c.EARLIEST_FILING_YEAR) AS Max_EARLIEST_FILING_YEAR
FROM
patstat.tls207_pers_appln b
INNER JOIN patstat.tls206_person a ON a.PERSON_ID = b.person_id
INNER JOIN patstat.tls201_appln c ON b.APPLN_ID = c.APPLN_ID
WHERE b.invt_seq_nr > 0 and c.EARLIEST_FILING_YEAR < 9999
GROUP BY a.PERSON_NAME, a.PERSON_ADDRESS,
a.PERSON_CTRY_CODE
ORDER BY Count_APPLN_ID DESC

Person_id is not an entity id : top inventors (II)
person_name ctry_code person_id n_app minyear maxyear
THE INVENTOR HAS WAIVED THE RIGHT TO BE MENTIONED 19584860 38067 2002 2015
KVASENKOV OLEG IVANOVICH RU 34298480 29682 2003 2015
WANG WEI 15786453 23156 1985 2015
ZHANG WEI 14837632 21771 1985 2015
NAME NOT GIVEN 13592151 17722 1964 2002
LI WEI 13615436 17298 1985 2015
VERZICHT DES ERFINDERS AUF NENNUNG 21108740 17260 1964 1993
WANG JUN 18500497 15755 1985 2015
LIU WEI 18697297 15319 1985 2015
LI JUN 18510590 14854 1985 2015
WANG LEI 18754169 14710 1986 2015
ZHANG LEI 18557049 14244 1987 2015
ZHANG JUN 18719351 12815 1985 2015
WANG JIAN 13113349 11936 1986 2015
WANG YONG 12656416 11844 1985 2016
ZHANG JIAN 14914085 11837 1985 2015
CHEN WEI 14837625 11706 1985 2015
WANG HUI 18663499 11452 1987 2015
LIU YANG 13930482 11126 1985 2015
LIU JUN 18710534 10927 1985 2015
LI LI 13632985 9958 1985 2015
AKTIENGESELLSCHAFT I. G. FARBENINDUSTRIE DE 17443080 9958 1897 1942
WANG TAO 18331978 9856 1985 2015
ZHANG YONG 18712075 9795 1985 2015
ZHANG LI 18704857 9716 1985 2015

Person_id is not an entity id: network analysis
SELECT
a.person_id, Count(DISTINCT b.person_id) AS n_coinv,
t6.PERSON_NAME, t6.PERSON_ADDRESS, t6.PERSON_CTRY_CODE
FROM
patstat.tls207_pers_appln a
INNER JOIN patstat.tls207_pers_appln b ON a.APPLN_ID = b.APPLN_ID
INNER JOIN patstat.tls206_person t6 ON t6.PERSON_ID = a.person_id
WHERE a.invt_seq_nr > 0 AND b.invt_seq_nr > 0
GROUP BY a.person_id, t6.PERSON_NAME, t6.PERSON_ADDRESS,
t6.PERSON_CTRY_CODE
ORDER BY n_coinv DESC

Person_id is not an entity id: network analysis
person_id n coinv name address
15786453 32384 WANG WEI
14837632 27602 ZHANG WEI
13615436 25550 LI WEI
18697297 21915 LIU WEI
18754169 21237 WANG LEI
18557049 20629 ZHANG LEI
18500497 20562 WANG JUN
18510590 19789 LI JUN
13113349 17270 WANG JIAN
13930482 16618 LIU YANG
18719351 16576 ZHANG JUN
14914085 16464 ZHANG JIAN
12656416 16208 WANG YONG
18663499 15686 WANG HUI
18704857 15224 ZHANG LI
14837625 15027 CHEN WEI
13632985 14882 LI LI
18331978 14780 WANG TAO
12656569 14656 LI YAN
18712075 14616 ZHANG YONG
Wang Wei and Zhang Wei have 120 SIPO patents in common; the top 3 each have a network of about 900K inventors within 3 degrees of distance (DoD).
person_id name 3 DoD
15786453 WANG WEI 943.562
14837632 ZHANG WEI 925.099
13615436 LI WEI 916.268

Person_id is not an entity id: possible solution
- At analysis level, the pair (person_id, appln_id) certainly identifies one entity
- Starting from this level of disaggregation, entities should be disambiguated further by other means (for instance, appln 1 & 2 from the same applicant)
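As a minimal sketch of the idea above, the snippet below keeps every (person_id, appln_id) pair as the unit of observation and only groups pairs that share an applicant, following the "same applicant" hint. The rows, the applicant ids and the helper name `group_pairs_by_applicant` are hypothetical; a real disambiguation would need more evidence than this single rule.

```python
from collections import defaultdict

# Hypothetical toy rows: (inventor person_id, appln_id, applicant person_id).
# Only the inventor person_id is taken from the slides; the rest is invented.
ROWS = [
    (15786453, 1, 900),
    (15786453, 2, 900),
    (15786453, 3, 901),
]


def group_pairs_by_applicant(rows):
    """Group (person_id, appln_id) pairs that share the same applicant.

    Pairs with the same person_id but different applicants stay apart,
    because a person_id alone does not identify an entity.
    """
    groups = defaultdict(list)
    for person_id, appln_id, applicant_id in rows:
        groups[(person_id, applicant_id)].append((person_id, appln_id))
    return dict(groups)


groups = group_pairs_by_applicant(ROWS)
for key, pairs in groups.items():
    print(key, pairs)
```

Here the same person_id ends up in two candidate entities, one per applicant, which is the starting point for further checks rather than a final answer.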

CPC codes coverage is incomplete (I)
- The Cooperative Patent Classification (CPC) was initiated as a joint partnership between the USPTO and the EPO;
- It covers a more complete set of technologies (e.g. green energy, nanotech);
- It started in 2011, it does not apply to all types of patents (e.g. utility models) and its backward data have to be rebuilt.

CPC codes coverage is incomplete (II)
- Coverage of CPC across PATSTAT is far from good and much smaller than IPC coverage
appln kind n app n with cpc cpc rate ipc rate
'A' 66.750.533 39.505.860 0.5918 0.8413
'U' 13.503.902 1.140.172 0.0844 0.9115
'W' 3.012.030 2.990.252 0.9928 0.9900

CPC coverage (type A)
APPLN_KIND APPLN_AUTH Count_APPLN_ID count_app_with_cpc ratio after Y2K
A AR 143884 103372 72% 80%
A AT 587486 174977 30% 71%
A AU 1374657 1114774 81% 80%
A BE 646320 551552 85% 94%
A BR 547104 374724 68% 79%
A CA 3209303 1269659 40% 25%
A CH 1048915 571085 54% 68%
A CN 6343484 2155452 34% 32%
A DE 4617268 3861583 84% 93%
A DK 319177 119062 37% 6%
A EP 3227647 3113078 96% 95%
A ES 423071 202677 48% 88%
A FI 251054 112028 45% 22%
A FR 3098874 2387891 77% 98%
A GB 3384892 2116655 63% 43%
A GR 69272 24607 36% 80%
A HK 133738 119890 90% 86%
A HU 131491 73025 56% 51%
A IE 91782 43044 47% 32%
A IL 216193 122462 57% 77%
A IN 106610 46024 43% 43%
A IT 605707 326251 54% 58%
A JP 13944907 4355789 31% 32%
A KR 2831385 1425304 50% 54%
A LU 68712 59814 87% 93%
A MX 262534 236276 90% 94%
A MY 50974 40612 80% 74%
A NL 595393 528493 89% 90%
A NO 222376 171392 77% 83%
A NZ 141064 110223 78% 81%
A PL 246209 79640 32% 32%
A RU 658280 199365 30% 32%
A SE 858651 330375 38% 17%
A SG 102679 90508 88% 90%
A SU 1363419 100573 7% 53%
A TW 737206 497644 68% 66%
A UA 55255 18206 33% 31%
A US 12700957 11612249 91% 98%
A ZA 293611 191492 65% 67%
SELECT a.APPLN_KIND, a.APPLN_AUTH,
Count(distinct a.APPLN_ID) AS Count_APPLN_ID, count(distinct
b.appln_id) count_app_with_cpc, count(distinct
b.appln_id)/Count(distinct a.APPLN_ID) as ratio
FROM
patstat.tls201_appln a LEFT JOIN patstat.tls224_appln_cpc b
ON a.APPLN_ID = b.appln_id
WHERE a.APPLN_KIND in ('A','W', 'U')
GROUP BY a.APPLN_KIND, a.APPLN_AUTH
The situation is not homogeneous across offices
After Y2K things improve a bit

CPC coverage type U, W
APPLN_KIND APPLN_AUTH Count_APPLN_ID count_app_with_cpc ratio after Y2K
U BR 103233 5179 5% 10%
U CN 5894022 251879 4% 4%
U DE 1406011 618249 44% 43%
U ES 327087 32007 10% 15%
U IT 139608 12912 9% 14%
U JP 4289887 113890 3% 7%
U KR 506761 44226 9% 16%
U RU 166613 5567 3% 4%
U TW 407155 30996 8% 7%
U UA 103880 2037 2% 2%
W CN 160005 155010 97% 97%
W DE 65673 65433 100% 100%
W EP 462944 461118 100% 100%
W FR 82356 81494 99% 98%
W GB 114614 114257 100% 100%
W IB 134635 133070 99% 99%
W JP 503441 497961 99% 99%
W KR 119158 118141 99% 99%
W SE 53444 53264 100% 100%
W US 1002525 1000291 100% 100%
Counts are for offices with > 50K patents
PCT data coverage is almost full
Utility models cannot really be used.

Missing data for PCT equivalent
- EP data originating from the regional phase of a PCT application can be partial
- At least abstracts and citations may be missing and have to be extracted from the PCT equivalent (column INTERNAT_APPLN_ID in tls201)
APPLN_ID APPLN_AUTH APPLN_NR APPLN_KIND IPR_TYPE INTERNAT_APPLN_ID int_phase reg_phase nat_phase GRANTED
347305 EP 99931561 A PI 30241523 Y Y N 1
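The fallback described above could look roughly like the following sketch, which uses an in-memory SQLite database so that it runs without a PATSTAT installation. The two appln_id values come from the example row; the abstract text, the 'IB' office on the PCT row and the reduced table layouts are placeholders, and the query would need adapting to the MySQL schema used elsewhere in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tls201_appln (
    appln_id INTEGER PRIMARY KEY,
    appln_auth TEXT,
    appln_kind TEXT,
    internat_appln_id INTEGER
);
CREATE TABLE tls203_appln_abstr (
    appln_id INTEGER PRIMARY KEY,
    appln_abstract TEXT
);
-- EP application from the example row, pointing to its PCT equivalent.
INSERT INTO tls201_appln VALUES (347305, 'EP', 'A', 30241523);
-- Placeholder row for the PCT application (office code is an assumption).
INSERT INTO tls201_appln VALUES (30241523, 'IB', 'W', 0);
-- Toy abstract: only the PCT application carries one.
INSERT INTO tls203_appln_abstr VALUES (30241523, 'toy abstract stored on the PCT application');
""")

# Take the EP application's own abstract when present,
# otherwise fall back to the abstract of the PCT equivalent.
QUERY = """
SELECT a.appln_id,
       COALESCE(own.appln_abstract, pct.appln_abstract) AS abstract,
       CASE WHEN own.appln_id IS NULL AND pct.appln_id IS NOT NULL
            THEN 1 ELSE 0 END AS from_pct
FROM tls201_appln a
LEFT JOIN tls203_appln_abstr own ON own.appln_id = a.appln_id
LEFT JOIN tls203_appln_abstr pct ON pct.appln_id = a.internat_appln_id
WHERE a.appln_auth = 'EP' AND a.appln_kind = 'A'
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

The `from_pct` flag keeps track of which abstracts were borrowed, so they can be treated separately if the research question requires it.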

Missing abstracts
APPLN_KIND Count_APPLN_ID Abstracts ratio
A (ep) 3227647 1849737 57%
W (pct) 3012030 2992978 99%
SELECT
a.APPLN_KIND,
Count(a.APPLN_ID) AS Count_APPLN_ID,
Count(b.APPLN_ID) AS Abstracts,
Count(b.APPLN_ID) / Count(a.APPLN_ID) AS ratio
FROM
patstat.tls201_appln a
LEFT JOIN patstat.tls203_appln_abstr b
ON a.APPLN_ID = b.APPLN_ID
WHERE
(a.APPLN_AUTH = 'EP' AND a.APPLN_KIND = 'A') OR
a.APPLN_KIND = 'W'
GROUP BY a.APPLN_KIND
About 40% of the abstracts for EPO applications should be extracted from the PCT equivalent

Missing citations
- Euro-PCT applications:
  - Citations of the WO publication are not repeated in the later EP publication. Instead, an NPL citation with the text “See also references of WO xxxxxxx” is included.
  - There are more citations in a Euro-PCT application than is obvious.
  - In 2016, NPL citations that had the value “none” or “see also references...” were removed from the data, but the related citations have not been replenished…

Missing PCT citations (II)
- Example: EP1103560, equivalent to WO0006594
- From the citations table we would conclude it has only 2 NPL citations (and one of them is "SEE ALSO REFERENCES OF WO0006594")
APPLN_ID PUBLN_AUTH+NR PUBLN_ID NPL_CITN_SEQ_NR NPL_PUBLN_ID NPL_BIBLIO
347305 EP1103560 511640 1 950236893 No further relevant documents disclosed
347305 EP1103560 511640 3 950236894 See also references of WO 0006594A1

Missing PCT citations (III)
- As a matter of fact, looking up the corresponding WO in Espacenet we find:
http://worldwide.espacenet.com/publicationDetails/citedDocuments?CC=WO&NR=0006594A1&KC=A1&FT=D&ND=4&date=20000210&DB=EPODOC&locale=en_EP

Data transmission gaps from national offices to EPO (I)
- PATSTAT covers about 100 patent authorities, but with unequal coverage and publication lags.
- Good coverage and short lags for EU countries; less good and less regular for national patent authorities outside the EU (except big players, e.g. US, JP…)

Data transmission gaps from national offices to EPO (II)
- Data coverage for DOCDB is available at:
  https://www.epo.org/searching-for-patents/helpful-resources/data/tables/weekly.html
- Nevertheless, the file is difficult to use
EDATE CC KC YEAR NB_DOC MIN_PN MAX_PN FIRST_DATE LAST_DATE LAST_ADDED LAST_EXCH GAPS
02/09/2017 AM A2 2001 1 949 949 10/06/2001 10/06/2001 08/10/2015 15/10/2015
02/09/2017 AM A2 2004 1 1402 1402 17/03/2004 17/03/2004 15/02/2011 24/02/2011 1011
02/09/2017 AM A2 2006 1 1813 1813 15/09/2006 15/09/2006 05/01/2017 13/04/2017 912
02/09/2017 AM U 2009 1 170 170 26/10/2009 26/10/2009 01/08/2012 09/08/2012 0
02/09/2017 AM U 2010 1 194 194 26/04/2010 26/04/2010 17/08/2017 24/08/2017 182
Add a GAPS column, computed within the same office and type of publication
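The GAPS column is not defined on the slide. The sketch below assumes it is the number of days between the LAST_DATE of one row and the FIRST_DATE of the next row of the same office (CC) and kind (KC), with 0 for the first row of each group; under that assumption it reproduces the values 1011, 912 and 182 shown for the AM rows. The function name `add_gaps` and the reduced row layout are hypothetical.

```python
from datetime import datetime

# (CC, KC, YEAR, FIRST_DATE, LAST_DATE) taken from the AM sample rows.
ROWS = [
    ("AM", "A2", 2001, "10/06/2001", "10/06/2001"),
    ("AM", "A2", 2004, "17/03/2004", "17/03/2004"),
    ("AM", "A2", 2006, "15/09/2006", "15/09/2006"),
    ("AM", "U", 2009, "26/10/2009", "26/10/2009"),
    ("AM", "U", 2010, "26/04/2010", "26/04/2010"),
]


def parse(day_first):
    """Parse a dd/mm/yyyy string into a date."""
    return datetime.strptime(day_first, "%d/%m/%Y").date()


def add_gaps(rows):
    """Return (CC, KC, YEAR, gap_in_days) per row.

    The gap is measured from the previous row's LAST_DATE to this row's
    FIRST_DATE within the same (CC, KC) group; the first row of a group
    gets 0 (an assumption about how the slide's column was built).
    """
    ordered = sorted(rows, key=lambda r: (r[0], r[1], parse(r[3])))
    result = []
    prev_key, prev_last = None, None
    for cc, kc, year, first, last in ordered:
        key = (cc, kc)
        gap = (parse(first) - prev_last).days if key == prev_key else 0
        result.append((cc, kc, year, max(gap, 0)))
        prev_key, prev_last = key, parse(last)
    return result


gaps = add_gaps(ROWS)
for row in gaps:
    print(row)
```

Large gaps then flag office and kind combinations that deserve a closer look before any counting is done.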

Data transmission gaps from national offices to EPO (III)
A B U A B U
AT 496 2596 IS 912 1653
AU 2256 IT 4344
BA 803 JO 6193
BG 758 429 JP 376 0 451
BR 3183 0 3241 KR 476 874
BY 2678 KZ 2826 6488
CA 0 2394 LT 640 764
CH 66 0 LV 3087 867
CL 1533 MC 3085
CN 189 6370 6370 MD 860 3472
CR 1668 3062 MX 1114 1343
CY 614 MY 2575 0
DD 217 NL 3135 353
DE 0 196 NZ 1013
DK 1032 962 OA 2224
DO 3348 2057 PH 543
EC 1049 2513 PT 613 842 2083
EE 1360 RO 0 1009
EG 1291 RS 322 500 500
ES 336 RU 424
FI 63 402 SE 3100 4025
FR 0 SG 1335
GB 267 267 SI 1314 1287
GC 3084 SV 2487
GE 1554 1672 2344 TH 5597
GR 1136 586 6225 TJ 1470 2101 2030
GT 590 5556 TR 48 0 1225
HK 5061 TW 224 975 1642
HN 421 5510 UA 1449 1816
HU 1130 4996 839 US 70 336
ID 177 1127 UY 251 566
IL 1146 UZ 1641 1673
IN 1035 2153 YU 1229 1465 715
ZA 471
We see gaps for some countries and for some types of patents.
Orange / red: very problematic cases. In any case, a single application can interrupt a gap, giving misleading results…

Data transmission gaps from national offices to EPO (IV)
02/09/2017 AU A A 2005 3 1475702 3432402 17/03/2005 25/08/2005 17/09/2005 02/03/2017 420
02/09/2017 AU A A 2010 1 6326480 6326480 29/04/2010 29/04/2010 12/05/2010 20/05/2010 1708
02/09/2017 IN B 2010 10 237550 264673 01/01/2010 17/12/2010 01/04/2016 17/08/2017 77
02/09/2017 IN B 2011 6 239400 247731 07/01/2011 13/05/2011 16/02/2016 08/12/2016 21
02/09/2017 IN B 2012 1 253973 253973 14/09/2012 14/09/2012 31/03/2016 07/04/2016 490
Australia: we have a problem
India: we have a problem bigger than expected
Authorities should be examined case by case, also using counts by year, benchmarked with previous

Data transmission gaps from national offices to EPO (V)
- Two possible errors:
  - different transmission timeframes (the decay of the patent count in BR starts before GB);
  - partial data transmission: counts differ from the official data of the patent office
year BR GB IN
1990 10851 30055 2209
1991 10122 29991 2002
1992 9103 30089 1958
1993 10272 29901 2032
1994 10992 29560 2529
1995 13557 29909 2554
1996 15580 30448 1679
1997 18589 31219 1383
1998 19032 32828 1026
1999 21019 35222 750
2000 20725 36996 690
2001 20626 36884 705
2002 19265 36318 757
2003 20909 35452 1049
2004 22816 33794 1113
2005 23973 31066 1691
2006 23472 30495 1973
2007 16078 30848 2215
2008 10088 28816 2541
2009 8843 27103 2507
2010 5028 25363 2988
2011 539 24010 872
2012 7 7955 28

Citations double counts (I)
- Citations in PATSTAT are stored publication to publication, by origin.
- Simple citation counts on TLS212 can lead to misleading results.
- Appln_id to appln_id citations help to clarify
SELECT Sum(Count_CITED_PAT_PUBLN_ID) AS n_pub_cited, Sum(count_distinct_appln_cited) AS n_distinct_appln_cited
FROM
(SELECT
t11.APPLN_ID, Count(t12.CITED_PAT_PUBLN_ID) AS Count_CITED_PAT_PUBLN_ID,
Count(DISTINCT t11b.APPLN_ID) AS count_distinct_appln_cited
FROM
patstat.tls212_citation t12
INNER JOIN patstat.tls211_pat_publn t11 ON t11.PAT_PUBLN_ID = t12.PAT_PUBLN_ID
INNER JOIN patstat.tls211_pat_publn t11b ON t12.CITED_PAT_PUBLN_ID = t11b.PAT_PUBLN_ID
WHERE t12.CITED_PAT_PUBLN_ID > 0
GROUP BY t11.APPLN_ID) t

Citations double counts (II)
Case 1: appln_id 1 has 2 publications showing exactly the same citations
APPLN_ID PAT_PUBLN_ID CITED_PAT_PUBLN_ID CITED_APPLN_ID
1 293253293 306927614 16980819
1 293253293 301830017 17000979
1 293253293 298485954 13388690
1 387522680 306927614 16980819
1 387522680 301830017 17000979
1 387522680 298485954 13388690
Case 2: appln_id 3 has 2 publications showing 1 common and 2 different citations
3 291964096 300128315 49123163
3 291964096 295303503 13538355
3 387535649 300128315 49123163
3 387535649 296195755 53888801
3 387535649 306928379 52488529

Citations double counts (III)
Case 3: the same citation shows up with different origins
APPLN_ID PAT_PUBLN_ID CITED_PAT_PUBLN_ID CITED_APPLN_ID origin
23 289129312 305684503 16736817
23 289129312 293787435 15702748
23 289129312 308462347 48996652 APP
23 289129312 293787435 15702748
23 289129312 305684503 16736817
23 289129312 308462347 48996652 SEA
23 289129312 327902045 50318244
23 289129312 296433878 20546518
23 289129312 297350607 24023115
23 289129312 296449840 47357637
Case 4: the same citation appears 4 times, with the same citing publication and the same origin [data error, could be systematic with multiple priorities from some offices]
705 306929092 309035661 22852464 SEA
705 306929092 309035659 22771241 SEA
705 306929092 307833757 50586872 SEA
705 306929092 385933558 9632978 SEA
705 316022028 337326981 16587695 APP
705 316022028 310119518 16447723 APP
705 316022028 310119519 48241555 APP
705 316022028 314809416 50308201 APP
705 316022028 314809413 9718355 APP
705 316022028 314809416 50308201 APP
705 316022028 314809416 50308201 APP
705 316022028 314809413 9718355 APP
705 316022028 314809416 50308201 APP

Citations double counts (IV)
How does it perform all over PATSTAT?
n pub cited n app cited ratio
225.858.262 189.409.321 0,83862
Focus on offices with ratio < 0.9:
publn_auth n pub cited n app cited ratio
GB 6174382 4038130 0,65
US 250838242 1,86E+08 0,74
DE 9428710 7927305 0,84
AT 331095 287720 0,86
JP 29539409 26474356 0,89

Citations double counts (V)
- 47K cases of self citations…
APPLN_ID PAT_PUBLN_ID CITED_APPLN_ID CITED_PAT_PUBLN_ID
53803383 55765553 53803383 278711582
US7478445 US2008052829
But it is the same patent…

Citations double counts (VI)
- Solution: count distinct citations by citing and cited appln_id;
- Move citation origin data to a separate table;
- Also use the number of citing DOCDB families (provided in TLS201).
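A minimal sketch of the first point, run on an in-memory SQLite database: the citation rows reuse Case 1 (appln_id 1) and the self-citation example, while the reduced table layouts are an assumption and the query is not the author's own. It compares the raw publication-level count with the count of distinct cited applications and drops citations where citing and cited appln_id coincide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tls211_pat_publn (pat_publn_id INTEGER PRIMARY KEY, appln_id INTEGER);
CREATE TABLE tls212_citation (pat_publn_id INTEGER, citn_id INTEGER, cited_pat_publn_id INTEGER);
""")

# Publication to application mapping, taken from the sample rows.
publications = [
    (293253293, 1), (387522680, 1),
    (306927614, 16980819), (301830017, 17000979), (298485954, 13388690),
    (55765553, 53803383), (278711582, 53803383),
]
conn.executemany("INSERT INTO tls211_pat_publn VALUES (?, ?)", publications)

# Case 1: both publications of appln_id 1 cite the same three documents.
citations = []
for citing in (293253293, 387522680):
    for citn_id, cited in enumerate((306927614, 301830017, 298485954), start=1):
        citations.append((citing, citn_id, cited))
# Self-citation example: citing and cited publication share one appln_id.
citations.append((55765553, 1, 278711582))
conn.executemany("INSERT INTO tls212_citation VALUES (?, ?, ?)", citations)

QUERY = """
SELECT citing.appln_id,
       COUNT(*) AS n_pub_cited,
       COUNT(DISTINCT cited.appln_id) AS n_app_cited
FROM tls212_citation c
JOIN tls211_pat_publn citing ON citing.pat_publn_id = c.pat_publn_id
JOIN tls211_pat_publn cited ON cited.pat_publn_id = c.cited_pat_publn_id
WHERE c.cited_pat_publn_id > 0
  AND cited.appln_id <> citing.appln_id   -- drop self citations
GROUP BY citing.appln_id
"""

result = conn.execute(QUERY).fetchall()
print(result)
```

For appln_id 1 the six publication-level rows collapse to three distinct cited applications, and the self-citing application disappears from the result.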

Counting correctly number of claims
- The number of claims is often used as an indicator of value
- US data: relate to granted patents only (A documents until 2000, B1 or B2 documents afterwards) which were published on or after 1975-01-01
- EP data: relate to both published applications (kind code "A") from 1978 and granted patents (kind code "B") from 1980.
- The number of claims will be "0" for all EP A documents originating from a PCT published in English, French or German (so called "Euro-PCTs").

Counting correctly number of claims (II)
- The number of claims changes over time: select the publication phase most relevant to your research question; the language may also change the number of claims (but PATSTAT keeps the higher number)
PAT_PUBLN_ID PUBLN_AUTH PUBLN_NR PUBLN_KIND PUBLN_DATE PUBLN_CLAIMS change
311822768 'EP' '1878578' 'A1' '2008-01-16' '11'
311822783 'EP' '1878578' 'B1' '2009-09-30' '22' 100% more…
311822763 'EP' '0000034' 'A1' '1978-12-20' '14'
311822766 'EP' '0000034' 'B1' '1984-05-23' '7' 50% less

Counting correctly number of claims (IV)
- Average change in number of claims
SELECT PUBLN_AUTH,
Sum(Min_PUBLN_CLAIMS) AS min_claims,
Sum(Max_PUBLN_CLAIMS) AS max_claims
FROM
(SELECT
b.PUBLN_AUTH,
b.appln_id,
Max(Cast(b.PUBLN_CLAIMS AS unsigned)) AS Max_PUBLN_CLAIMS,
Min(Cast(b.PUBLN_CLAIMS AS unsigned)) AS Min_PUBLN_CLAIMS
FROM
patstat.tls211_pat_publn b
WHERE
(b.PUBLN_AUTH = 'EP' OR b.PUBLN_AUTH = 'US') AND
Cast(b.PUBLN_CLAIMS AS unsigned) > 0
GROUP BY
b.PUBLN_AUTH, b.appln_id) b
GROUP BY PUBLN_AUTH
PUBLN_AUTH min_claims max_claims ratio
EP 29.005.016 31.407.694 0,9235
US 87.474.259 87.474.617 0,999996
Average number of claims can be a good proxy
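To make the two choices concrete (pick one publication phase, or compare minimum and maximum claims per application), here is a small sketch on the two EP examples shown earlier. It groups by publication number as a stand-in for appln_id, which is an assumption that only holds when the A and B documents share a number; the helper names are hypothetical.

```python
from collections import defaultdict

# (PUBLN_AUTH, PUBLN_NR, PUBLN_KIND, PUBLN_CLAIMS) from the EP sample rows.
PUBLICATIONS = [
    ("EP", "1878578", "A1", 11),
    ("EP", "1878578", "B1", 22),
    ("EP", "0000034", "A1", 14),
    ("EP", "0000034", "B1", 7),
]


def claims_at_phase(publications, kind_prefix):
    """Claims per publication number for one phase ('A' or 'B'), skipping zeros."""
    return {
        nr: claims
        for _, nr, kind, claims in publications
        if kind.startswith(kind_prefix) and claims > 0
    }


def min_max_ratio(publications):
    """Sum of per-document minimum claims divided by sum of maximum claims."""
    per_doc = defaultdict(list)
    for auth, nr, _, claims in publications:
        if claims > 0:  # zero means 'not recorded', e.g. Euro-PCT A documents
            per_doc[(auth, nr)].append(claims)
    total_min = sum(min(values) for values in per_doc.values())
    total_max = sum(max(values) for values in per_doc.values())
    return total_min / total_max


granted = claims_at_phase(PUBLICATIONS, "B")
ratio = min_max_ratio(PUBLICATIONS)
print(granted, ratio)
```

On these two documents the ratio is far lower than the EP-wide 0,9235, a reminder that individual applications can move a lot even when the aggregate looks stable.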

Conclusions
- PATSTAT is a great source of data but cannot be taken ‘as is’.
- Data collection is examiner centered, thus all ‘accessory’ data need validation.
- Looking for data gaps ex ante can save a lot of work ex post

Conclusions (II)
- A saint has a past; a sinner has a future (Lord Illingworth).
- Both sentences mean a lot of work to do when using patent data! (myself).
- A saint is a sinner who never gave up (Yogananda)
