Tracking applicants' history and structure using a temporal database
1. Tracking applicants' history and structure using a temporal database
Gianluca Tarasconi – Kites Univ. Bocconi / OST
Blog: http://rawpatentdata.blogspot.com
2. OST Common identifier (I)
This presentation covers some preliminary results from the OST Common Identifier project.
The Observatoire des Sciences et des Techniques (OST) designs and produces R&D indicators.
By providing both indicators and expertise, OST serves the actors of the research system.
To carry out its missions, OST maintains a database on international research, built from multiple sources (mainly CORDIS, PATSTAT and ISI-WOS).
The Common Identifier project aims to create a bridge among all kinds of entities that exist across the 3 main datasets.
3. OST Common identifier (II)
Data categories existing across patent, scientific publication and Framework Programme data:
                     PATSTAT                    CORDIS                  WOS
Geographic data      inventors/appl. addresses  participants addresses  affiliations addresses
Individuals          inventors, applicants      contacts                authors
Companies            applicants                 participants            affiliations
Sci/tech taxonomies  IPC                        thematic priorities     subject categories
4. OST Common identifier (III)
The pilot for project feasibility is building a data structure that can hold data for legal persons, starting with PATSTAT applicants (and soon WOS affiliations).
[Diagram: PATSTAT applicants and ISI WoS affiliations feed the C.I. data structure]
5. Data constraints (I)
1) DEFINE ATOMIC ENTITIES AND UNAMBIGUOUS JOINS
• The data involved may contain the same legal entity at different levels (e.g. for ABB there are the ABB holding, IP departments, divisions, JVs…)
• The C.I. dataset should use an entity size that allows a unique data match across the different sets.
• The C.I. dataset should also support a hierarchical structure of entities, allowing joins to the main datasets at different levels.
6. Data constraints (II)
FROM
001 THE REGENTS OF THE UNIVERSITY OF CALIFORNIA, BERKELEY
002 UNIVERSITY OF CALIFORNIA, BERKELEY OFFICE OF TECHNOLOGY LICENSING
003 MATHEMATICAL SCIENCE PUBLISHERS DEPARTMENT OF MATHEMATICS UNIVERSITY OF CALIFORNIA, BERKELEY
004 UNIVERSITY OF CALIFORNIA AT BERKELEY
005 UNIV OF CALIFORNIA BERKELEY
TO
001 father to 002
001 father to 003
002 brother to 003
Using a mix of data mining techniques to disambiguate applicants and to rebuild the relationships among them (keyword lists);
The choice of atomic entity also relies on sector allocation, developed from the KUL algorithm with further refinements.
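The father/brother relationships in this example can be held as a simple parent map, with brothers derived rather than stored. What follows is only a minimal sketch: the names `PARENT`, `siblings` and `root` are hypothetical, the codes are those of the Berkeley example, and nothing here is taken from the project's actual implementation.

```python
# Hypothetical parent map for the Berkeley example: child code -> father code.
PARENT = {"002": "001", "003": "001"}

def siblings(code, parent=PARENT):
    """Return the codes that share a father with `code` (its 'brothers')."""
    father = parent.get(code)
    if father is None:
        return set()
    return {c for c, p in parent.items() if p == father and c != code}

def root(code, parent=PARENT):
    """Walk up the hierarchy to the top-level entity, for joins at a coarser level."""
    while code in parent:
        code = parent[code]
    return code
```

Storing only the father link keeps the hierarchy unambiguous: `siblings("002")` yields 003, and `root("003")` yields 001, the level at which a coarser join to the main datasets could be made.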
7. Data constraints (III)
• 2) TIME SERIES
• 2a) DATASET ASYNCHRONIES
• Data may enter the database on a different time frame depending on the dataset.
• (E.g. PATSTAT is a snapshot at the moment of data creation, while WOS is updated incrementally; name changes and M&A can therefore make the same entity look different in the 2 datasets. Note also that geographic entities change over time: counties, countries…)
• C.I. tables must have a time-related dimension.
8. Data constraints (IV)
• 2b) DATA TRANSFORMATIONS
• Data change over time.
(E.g. companies may merge, split [the most critical case], change name, change ownership…)
• C.I. tables must have a dimension that allows the transformation of entities to be followed.
9. Data constraints (V)
C.I. TABLE
10012  CNRS
10013  CNRS Bordeaux
10014  Lab 3 CNRS Bordeaux

               Person_id  Patstat ed.
CNRS  PATSTAT  2010000    201210
CNRS  PATSTAT  302452     201304

datasource
10012  2010000  201210
10012  302452   201304

Sarajevo changed from YU to BS in 1992.
In other words, we must have validity dates and data source data:
NO:  Sarajevo YU BS
YES: Sarajevo YU 1800 1991
     Sarajevo BS 1992 9999
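The validity-date layout of the Sarajevo example makes a point-in-time lookup straightforward. The sketch below is illustrative only: `ROWS` and `country_in` are hypothetical names, and the rows are exactly those shown in the YES case, with 9999 as the open-ended marker.

```python
# Validity-dated rows from the Sarajevo example: (place, country, date_from, date_to).
# 9999 is the open-ended "still valid" marker used on the slide.
ROWS = [
    ("Sarajevo", "YU", 1800, 1991),
    ("Sarajevo", "BS", 1992, 9999),
]

def country_in(place, year, rows=ROWS):
    """Return the country code valid for `place` in `year`, or None if no row covers it."""
    for name, country, date_from, date_to in rows:
        if name == place and date_from <= year <= date_to:
            return country
    return None
```

With the NO layout (one row holding both YU and BS) there would be no way to tell which code applies to a patent filed in a given year; with the ranges, the same place resolves to a different country before and after 1992.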
10. How to deal with:
• PROPERTIES / EVENTS BASED DATA STRUCTURE
• The proposed data structure should be a TEMPORAL DATABASE (1), storing the PROPERTIES of our ENTITIES (i.e. applicants) by date and data source, together with the EVENTS that can change those properties
• PROPERTY NAME (e.g. ownership, affiliation…)
• PROPERTY VALUE (e.g. new owner, new affiliation)
• DATEFROM, DATETO (validity dates from the analyst's point of view)
• DATASOURCE (e.g. PATSTAT edition: the validity date from the DB point of view)
• Events change the value of a property and may, by default, let the other properties be inherited by the entity from that moment onward (e.g. the acquisition of C by B, which is owned by A, makes C inherit B's ownership)
• The main events that may occur are: BIRTH, MERGE, DEMERGE, SPINOFF, ACQUISITION BY, DEATH.
• Along with the properties, it must also be defined how properties are inherited among entities (e.g. CNRS Bordeaux inherits its ownership from CNRS)
• (1) See Richard T. Snodgrass, "TSQL2 Temporal Query Language", www.cs.arizona.edu, Computer Science Department of the University of Arizona
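One possible reading of the property/event structure is sketched below: each property is a validity-dated row, and an event closes the row valid at that date and opens a new one. All names (`PropertyRow`, `apply_event`, `ultimate_owner`) are hypothetical, entities A, B and C are those of the acquisition example, and the inheritance rule shown (follow ownership up the chain) is an assumption, not the project's documented behaviour.

```python
from dataclasses import dataclass, replace

OPEN_END = 9999  # open-ended validity marker, as used elsewhere in the slides

@dataclass(frozen=True)
class PropertyRow:
    entity: str
    name: str        # PROPERTY NAME, e.g. "ownership"
    value: str       # PROPERTY VALUE, e.g. the owner
    date_from: int   # validity from the analyst's point of view
    date_to: int
    datasource: str  # e.g. a PATSTAT edition

def apply_event(rows, entity, name, new_value, year, datasource):
    """Close the row valid in `year` and open a new one holding `new_value`."""
    out = []
    for r in rows:
        if r.entity == entity and r.name == name and r.date_from <= year <= r.date_to:
            # Close the old row at year - 1; a row starting in `year` is superseded.
            if r.date_from < year:
                out.append(replace(r, date_to=year - 1))
        else:
            out.append(r)
    out.append(PropertyRow(entity, name, new_value, year, OPEN_END, datasource))
    return out

def ultimate_owner(rows, entity, year):
    """Follow the ownership rows valid in `year` up to the top of the chain."""
    seen = set()
    while entity not in seen:
        seen.add(entity)
        owner = next((r.value for r in rows
                      if r.entity == entity and r.name == "ownership"
                      and r.date_from <= year <= r.date_to), entity)
        if owner == entity:
            return entity
        entity = owner
    return entity
```

For instance, if B is owned by A and C owns itself, an ACQUISITION BY event recorded as `apply_event(rows, "C", "ownership", "B", 2000, "201304")` leaves C independent for any year before 2000 and resolves its ultimate owner to A from 2000 onward. The years and the edition string in that call are made up for illustration.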
11. Example (I):
• NOVARTIS
• Novartis Pharma originated from the merger of CIBA (1884), GEIGY (1758) and Sandoz (1876)
• Until 1970 they are 3 separate entities
PCODE  PNAME
1      CIBA
2      GEIGY
3      SANDOZ
4      CIBA SUB 1…N
5      GEIGY SUB 1…N
6      SANDOZ SUB 1…N
PCODE  PROPNAME   PROPVALUE  STATUSCODE2  STATUSTEXT  STATUSPERC  DATEFROM  DATETO  CHGREASON
1      OWNERSHIP  FULLOWN    1                        100         1884      9999
2      OWNERSHIP  FULLOWN    2                        100         1758      9999
3      OWNERSHIP  FULLOWN    3                        100         1876      9999
4      OWNERSHIP  FULLOWN    1                        100         1884      9999
5      OWNERSHIP  FULLOWN    2                        100         1758      9999
6      OWNERSHIP  FULLOWN    3                        100         1876      9999
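As an illustration of how the ownership rows could be queried, the sketch below lists the entities owned by a given PCODE in a given year. It rests on an assumption: the number following FULLOWN in each row is read as the owner's PCODE, which the slide does not state explicitly. The names `OWNERSHIP`, `PNAME` and `owned_by` are hypothetical.

```python
# Names from the PCODE/PNAME table.
PNAME = {1: "CIBA", 2: "GEIGY", 3: "SANDOZ",
         4: "CIBA SUB 1..N", 5: "GEIGY SUB 1..N", 6: "SANDOZ SUB 1..N"}

# (pcode, owner_pcode, percent, date_from, date_to).
# Assumption: the value after FULLOWN in each slide row is the owner's PCODE.
OWNERSHIP = [
    (1, 1, 100, 1884, 9999),
    (2, 2, 100, 1758, 9999),
    (3, 3, 100, 1876, 9999),
    (4, 1, 100, 1884, 9999),
    (5, 2, 100, 1758, 9999),
    (6, 3, 100, 1876, 9999),
]

def owned_by(owner, year, rows=OWNERSHIP):
    """PCODEs, other than `owner` itself, whose ownership row points to `owner` in `year`."""
    return sorted(p for p, o, _pct, d_from, d_to in rows
                  if o == owner and p != owner and d_from <= year <= d_to)
```

Under that reading, `owned_by(1, 1900)` returns the CIBA subsidiaries (PCODE 4), while a year before an entity's DATEFROM returns nothing.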
13. Populating the datasets:
• Building relationships and changes cannot rely on data mining alone
• Many free resources are available
• SEC: http://www.sec.gov/edgar/searchedgar/companysearch.html
• NASDAQ: http://www.nasdaq.com/screening/companies-by-name.aspx?letter=A
• FULL WW UNIV LIST: http://www.unibo.it/Portale/Guida/Universita+nel+mondo/Unimondo.htm
• INDIAN NAME CHANGES: http://www.smctradeonline.com/change-of-name.aspx
• SELECTION OF OTHER DBS: http://www.anderson.ucla.edu/x14519.xml#company
• Another great resource for name changes is TLS221, which lists more than 500,000 events with a change of applicant name (about 85,000 net names after cleaning)
14. Next steps:
• Development is in progress, with the goal of classifying 85% of applications for EPO, USPTO, INPI and PCT by June 2013;
• WoS and CORDIS data should be integrated soon after; phase 2 should then start, aiming to classify 95% of applications
• Comments, ideas and proposals for collaboration are welcome