Transcript of "PATSTAT and PATSTAT-related resources for patent data analysis"
By Gianluca Tarasconi – Kites Univ. Bocconi / O.S.T.
About the speaker
Background in Management Engineering @ Politecnico of Milan
Database Architect @ KITeS (previously CESPRI) since 2002
Project manager for data production in the EU projects STI-NET, TENIA, AEGIS and the EU tenders ICT network impact, INNOVA, Highly Cited Patents, and Measurement and analysis of knowledge and R&D exploitation flows, assessed by patent and licensing data
Collaborations on database projects with: MIT, LSE, Danish Board of Technology, Bonn Graduate School of Economics, Universität Mainz, BETA …
Editor of the blog rawpatentdata.blogspot.com
What is PATSTAT
PATSTAT is a snapshot of the EPO database covering about 70 million applications from more than 80 application authorities, containing bibliographic data, citations and family links. The data must be loaded into the customer's own database.
+ low cost of ownership
- costs of implementation
Data sources for PATSTAT
The source for EP data is DOCDB (the EPO master documentation database).
The sources for other offices are files provided by the other patent authorities.
+ Good coverage for US, EU states, JP, EPO, WIPO
- For other authorities, gaps and missing data are not easy to identify
Implementing the DB (I)
Over 20 tables in a relational DB, with the application ID as the main key
EPO adds / improves data with each edition
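As a rough illustration of the relational layout described above, the sketch below builds a tiny in-memory SQLite database with three tables linked through the application ID. The table and column names (tls201_appln, tls206_person, tls207_pers_appln, appln_id, person_id) follow the usual PATSTAT naming, but only a handful of columns are kept and every row is made up for the example; the helper function name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Heavily simplified subset of three PATSTAT tables (assumption: real tables
# carry many more columns).
conn.executescript("""
CREATE TABLE tls201_appln (
    appln_id INTEGER PRIMARY KEY, appln_auth TEXT, appln_filing_date TEXT);
CREATE TABLE tls206_person (
    person_id INTEGER PRIMARY KEY, person_name TEXT, person_ctry_code TEXT);
CREATE TABLE tls207_pers_appln (
    person_id INTEGER, appln_id INTEGER,
    applt_seq_nr INTEGER, invt_seq_nr INTEGER);
""")

# Invented sample rows, for illustration only.
conn.executemany("INSERT INTO tls201_appln VALUES (?, ?, ?)", [
    (1, "EP", "2005-03-01"), (2, "EP", "2006-07-15"), (3, "US", "2006-01-20")])
conn.executemany("INSERT INTO tls206_person VALUES (?, ?, ?)", [
    (10, "ACME SPA", "IT"), (11, "ROSSI MARIO", "IT"), (12, "WIDGET INC", "US")])
conn.executemany("INSERT INTO tls207_pers_appln VALUES (?, ?, ?, ?)", [
    (10, 1, 1, 0), (11, 1, 0, 1), (10, 2, 1, 0),
    (12, 2, 2, 0), (12, 3, 1, 0), (11, 3, 0, 1)])


def applicants_per_authority(connection):
    """Count distinct applicants per application authority.

    Every join goes through appln_id; a row with applt_seq_nr > 0 marks
    the person as an applicant on that application.
    """
    query = """
        SELECT a.appln_auth, COUNT(DISTINCT pa.person_id)
        FROM tls201_appln a
        JOIN tls207_pers_appln pa ON pa.appln_id = a.appln_id
        WHERE pa.applt_seq_nr > 0
        GROUP BY a.appln_auth
        ORDER BY a.appln_auth
    """
    return connection.execute(query).fetchall()


print(applicants_per_authority(conn))
```

The same query shape should carry over to MySQL or Oracle with little change, since it only uses plain joins and aggregation.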
Implementing the DB (II)
+ standard scripts and a growing community to exchange procedures etc. (example)
- needs a person who has both DB and patent data knowledge
Plug & play extensions
Datasets that can be added with no effort:
Regpat: OECD dataset giving the NUTS3 region for each applicant / inventor (EP only)
Han: OECD Harmonized Applicant Names dataset (EP only)
eee_ppat: KUL/Eurostat standard names and sector allocation (all PATSTAT)
TLS221: EPO legal data table, allowing changes of ownership, oppositions... to be included (example)
ape-inv: inventor disambiguation tools and academic inventors
Note: all tables except TLS221 are free of charge
Some papers using Kites-PatstatDB
Lissoni, F., Llerena, P., McKelvey, M., and B. Sanditov, "Academic Patenting in Europe: New Evidence from the KEINS Database," Research Evaluation, 17(2): 87-102.
Bacchiocchi E., Montobbio F. (2009); Knowledge Diffusion from University and Public Research. A Comparison between US Japan and Europe using Patent Citations. Journal of Technology Transfer, vol. 34 (2), pp. 169-181.
Breschi S., Lissoni F., Montobbio F. (2008). University patenting and scientific productivity. A quantitative study of Italian academic inventors. European Management Review. The Journal of the European Academy of Management 5(2): 91-109.
Corrocher N., Malerba F., Montobbio F. (2007); Schumpeterian Patterns of Innovative Activity in the ICT Field. Research Policy, vol. 36, pp. 418-432.
Breschi S., Lissoni F., Montobbio F. (2007). The Scientific Productivity Of Academic Inventors: New Evidence From Italian Data. Economics of Innovation and New Technology, Vol. 16, Issue 2, pp. 101-118.
Della Malva A., Breschi S., Lissoni F., Montobbio F. (2007). L'attività brevettuale dei docenti universitari: L'Italia in un confronto internazionale. Economia e Politica Industriale, v. 2, pp. 43-70.
Montobbio F. (2008); Patenting Activity in Latin American and Caribbean Countries. In World Intellectual Property Organization (WIPO) - Economic Commission for Latin America and the Caribbean (ECLAC) - "Study on Intellectual Property Management in Open Economies: A Strategic Vision for Latin America". Forthcoming.
Frazzoni S., Mancusi M., Rotondi Z., Sobrero M., Vezzulli A. (2011), “Relationship with banks and access to credit for innovation and internationalization in SMEs”, L’EUROPA E OLTRE. Banche e imprese nella nuova globalizzazione, XVI Rapporto sul sistema finanziario italiano, Edibank, 2011. ISBN 978-88-449-0495-1.
V. Sterzi: Patent quality and ownership: An analysis of UK faculty patenting, Research Policy, 2012 (forthcoming).
Some advanced applications
OST patent applicants data quality procedure and match with ORBIS
OST common identifier among PATSTAT, WoS and Framework programs DBs
Applicants data quality procedure and match with ORBIS (I)
The goal of the procedure is to clean and standardize (C&S) patent applicant names (e.g. removing the company type, common misspellings etc.).
After the C&S of names, a procedure was developed that applies 5 different match algorithms to find the best matches with ORBIS company names.
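To give a feel for what a clean-and-standardize step followed by a match step can look like, here is a minimal sketch. It is not the OST procedure: the list of legal forms, the single similarity measure (Python's difflib ratio, standing in for the five algorithms mentioned above) and the 0.9 threshold are all assumptions chosen for illustration.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Hypothetical, far-from-complete list of company-type tokens to strip.
LEGAL_FORMS = {"SPA", "SRL", "GMBH", "AG", "KG", "INC", "LTD", "LLC",
               "SA", "CORP", "CO", "BV", "NV"}


def clean_name(raw):
    """Uppercase, strip accents and punctuation, drop legal-form tokens."""
    text = unicodedata.normalize("NFKD", raw.upper())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove dots first so that "S.P.A." collapses to "SPA".
    text = text.replace(".", "")
    # Any other non-alphanumeric character becomes a separator.
    text = re.sub(r"[^A-Z0-9 ]+", " ", text)
    tokens = [tok for tok in text.split() if tok not in LEGAL_FORMS]
    return " ".join(tokens)


def best_match(applicant, companies, threshold=0.9):
    """Return the company whose cleaned name is most similar, or None."""
    target = clean_name(applicant)
    best, best_score = None, 0.0
    for company in companies:
        score = SequenceMatcher(None, target, clean_name(company)).ratio()
        if score > best_score:
            best, best_score = company, score
    return best if best_score >= threshold else None


print(clean_name("Müller GmbH & Co. KG"))
print(best_match("Acme S.p.A.", ["ACME SPA", "Widget Inc"]))
```

A real procedure would likely combine several measures and block candidates by country or first letters before comparing, since comparing every applicant with every company does not scale to millions of names.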
Applicants data quality procedure and match with ORBIS (II)
Data quality procedure developed using portable queries and tables (see Tarasconi, "Sharing names/address cleaning patterns for Patstat", from PATSTAT user day 2011)
Match procedure developed to be multipurpose (i.e. it has already been used to match TM vs patent applicants @ KITeS)
Code and tables available for MySQL and Oracle.
http://documents.epo.org/projects/babylon/eponet.nsf/0/92ab5eb34ff406d1c125795d0050bbcc/$FILE/PATSTAT_user_day_2011_presentations.zip
Applicants data quality procedure and match with ORBIS (III)
C&S step results: from 12,280,000 patent applicants to about 3,800,000 companies
Match against: 353,294 ORBIS companies in NACE 2540, 2630, 2651, 2910, 3030, 3011, 8422 (defense)
Results: 94,726 patent applicants matched against 66,256 ORBIS companies
Benchmark: against a 1% sample, validation returned a precision of 91% and a recall of 95%
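For readers less familiar with the benchmark figures, the sketch below shows how precision and recall are usually computed for a set of proposed matches against a manually validated sample. The applicant and company identifiers are invented and the function name is hypothetical; the numbers it prints have nothing to do with the 91% / 95% reported above.

```python
def precision_recall(predicted, truth):
    """Precision = correct matches / proposed matches;
    recall = correct matches / matches that should have been found."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Invented (applicant, company) pairs, for illustration only.
predicted_pairs = {("A1", "C1"), ("A2", "C2"), ("A3", "C9"), ("A4", "C4")}
validated_pairs = {("A1", "C1"), ("A2", "C2"), ("A4", "C4"),
                   ("A5", "C5"), ("A6", "C6")}

precision, recall = precision_recall(predicted_pairs, validated_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```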
OST Common identifier (I)
Data categories existing across patent, scientific publications and Framework programs data:
                     PATSTAT                         FPS                     WOS
Geographic data      inventors/applicants addresses  participants addresses  affiliations addresses
Individuals          inventors, applicants           contacts                authors
Companies            applicants                      participants            affiliations
Sci/tech taxonomies  IPC                             TPs                     subject categories
OST Common identifier (II)
1) DEFINE ATOMIC ENTITIES AND NON-AMBIGUOUS JOINS
Even when they concern similar entities, datasets differ in the granularity of their data (i.e. in WOS affiliations may be by lab / dept, while in patents they may be by IP office: different size).
The bridge dataset should use an entity size that allows unique data matches across the different sets. This might also require some changes in existing databases.
The bridge dataset should also allow a hierarchical structure of entities, so that joins to the main datasets can be made at different levels.
OST Common identifier (IV)
2) TIMESERIES
2a) DATASET ASYNCHRONIES
Data may enter the database on a different time frame depending on the dataset (i.e. PATSTAT is a full update, so a snapshot at the moment of data creation, while WOS is an incremental update; name changes / M&A could therefore make the same entity look different in the 2 datasets; note also that geographic entities change over time: counties, countries…). Bridge tables must have a time-related dimension.
2b) DATA TRANSFORMATIONS
Data change over time (i.e. companies may merge, split [the most critical case], change name, change owner…). Bridge tables must have a continuation dimension that allows the transformation of entities to be followed.
OST Common identifier (V)
Timeseries examples
Sarajevo changed from YU to BS in 1992
BEFORE
  Sarajevo  YU  BS
AFTER
  Sarajevo  YU  1800  1991
  Sarajevo  BS  1992  9999
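The AFTER layout can be queried "as of" a given year. The sketch below loads the two Sarajevo rows into an in-memory SQLite table and looks up the country valid in a chosen year. The table name, column names and function name are assumptions; the country codes are copied as written on the slide (the standard two-letter code for Bosnia and Herzegovina is BA).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical bridge table: one row per validity period of a city/country pair.
conn.execute("""
    CREATE TABLE city_country (
        city TEXT, country TEXT, datefrom INTEGER, dateto INTEGER)
""")
conn.executemany("INSERT INTO city_country VALUES (?, ?, ?, ?)", [
    ("Sarajevo", "YU", 1800, 1991),
    ("Sarajevo", "BS", 1992, 9999),   # 9999 = still valid
])


def country_in_year(connection, city, year):
    """Return the country code valid for `city` in `year`, or None."""
    row = connection.execute(
        "SELECT country FROM city_country "
        "WHERE city = ? AND ? BETWEEN datefrom AND dateto",
        (city, year),
    ).fetchone()
    return row[0] if row else None


print(country_in_year(conn, "Sarajevo", 1985))
print(country_in_year(conn, "Sarajevo", 2005))
```

Joining a patent's filing year against datefrom / dateto in this way is what lets the same address string resolve to different countries depending on when the record was created.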
OST Common identifier (V)
OBJECT / PROPERTIES DATA STRUCTURE
The proposed data structure should be a TEMPORAL DATABASE (1), allowing PROPERTIES / STATUS / EVENTS to be stored; for instance, it would contain the following fields:
PROPERTY NAME (i.e. ownership, affiliation…)
PROPERTYVALUE (i.e. new owner, new affiliation)
DATEFROM
DATETO
CHGREASON (if blank, the record is still valid)
VALUE1…N (i.e. type of acquisition, % ownership…)
Along with the properties, it must also be defined how properties are inherited among entities (i.e. CNRS Bordeaux inherits ownership from CNRS, probably sector of activity… )
(1) See Richard T. Snodgrass, "TSQL2 Temporal Query Language", www.cs.arizona.edu, Computer Science Department of the University of Arizona
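One possible way to read the field list and the inheritance rule above is sketched below: each property is a dated record, and a lookup that finds nothing on an entity falls back to its parent. The class, the function, the parent mapping and all property values and dates are hypothetical and only meant to illustrate the idea; the VALUE1…N fields are left out for brevity.

```python
from dataclasses import dataclass

OPEN_END = 9999  # convention used in the slides for a still-valid record


@dataclass
class PropertyRecord:
    entity: str
    property_name: str
    property_value: str
    date_from: int
    date_to: int = OPEN_END
    chg_reason: str = ""   # blank while the record is still valid


# Hypothetical hierarchy: child entity -> parent entity.
PARENTS = {"CNRS Bordeaux": "CNRS"}

# Invented records, for illustration only.
RECORDS = [
    PropertyRecord("CNRS", "ownership", "French State", 1939),
    PropertyRecord("CNRS Bordeaux", "location", "Bordeaux", 1950),
]


def lookup(entity, property_name, year):
    """Return the property value valid in `year`, inheriting from parents."""
    current = entity
    while current is not None:
        for rec in RECORDS:
            if (rec.entity == current
                    and rec.property_name == property_name
                    and rec.date_from <= year <= rec.date_to):
                return rec.property_value
        current = PARENTS.get(current)   # climb one level, or stop at the root
    return None


print(lookup("CNRS Bordeaux", "ownership", 2000))   # inherited from CNRS
print(lookup("CNRS Bordeaux", "location", 2000))    # defined on the entity
```

In this reading a child's own record always wins over an inherited one; whether that is the right rule for every property is a design decision the bridge dataset would have to state explicitly.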
APPENDIX: Temporal database Example (I)
NOVARTIS
Novartis Pharma originated from the merger of CIBA (1884), GEIGY (1758) and Sandoz (1876).
Until 1970 they were 3 separate entities.
LEGPCODE  LEGPNAME
1         CIBA
2         GEIGY
3         SANDOZ
4         CIBA SUB 1…N
5         GEIGY SUB 1…N
6         SANDOZ SUB 1…N
LEGPCODE  PROPNAME   PROPVALUE  STATUSCODE2  STATUSTEXT  STATUSPERC  DATEFROM  DATETO  CHGREASON
1         OWNERSHIP  FULLOWN    1                        100         1884      9999
2         OWNERSHIP  FULLOWN    2                        100         1758      9999
3         OWNERSHIP  FULLOWN    3                        100         1876      9999
4         OWNERSHIP  FULLOWN    1                        100         1884      9999
5         OWNERSHIP  FULLOWN    2                        100         1758      9999
6         OWNERSHIP  FULLOWN    3                        100         1876      9999
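The continuation dimension for this example could be sketched as a list of dated links from an old entity to its successor. Codes 1 to 3 are the LEGPCODE values from the table above; codes 7 and 8, the link layout and the function name are hypothetical. The merger years used (1970 for Ciba-Geigy, 1996 for Novartis) come from general company history rather than from the slide.

```python
# (old_code, new_code, year, reason)
# Codes 7 and 8 are hypothetical additions to the slide's entity table.
CONTINUATIONS = [
    (1, 7, 1970, "MERGER"),   # CIBA   -> CIBA-GEIGY
    (2, 7, 1970, "MERGER"),   # GEIGY  -> CIBA-GEIGY
    (7, 8, 1996, "MERGER"),   # CIBA-GEIGY -> NOVARTIS
    (3, 8, 1996, "MERGER"),   # SANDOZ -> NOVARTIS
]

NAMES = {1: "CIBA", 2: "GEIGY", 3: "SANDOZ", 7: "CIBA-GEIGY", 8: "NOVARTIS"}


def entity_in_year(code, year):
    """Follow every continuation link that had taken effect by `year`."""
    current = code
    moved = True
    while moved:
        moved = False
        for old, new, link_year, _reason in CONTINUATIONS:
            if old == current and link_year <= year:
                current = new
                moved = True
                break
    return current


for year in (1960, 1980, 2000):
    print(year, NAMES[entity_in_year(1, year)])
```

This sketch only handles mergers and renames, where each entity has a single successor. Splits, which the slides call the most critical case, would need the lookup to return several successors, possibly with a share attached to each.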