Tracking applicants' history and structure using a temporal database


    Gianluca Tarasconi – Kites Univ. Bocconi / OST
                           Blog: http://rawpatentdata.blogspot.com
OST Common identifier (I)
This presentation reports some preliminary results from the OST Common
  Identifier project.

The Observatoire des Sciences et des Techniques (OST) designs and
  produces R&D indicators.

By providing both indicators and expertise, OST serves the actors of
  the research system.

To carry out its mission, OST maintains a database on international
  research, built from multiple sources (mainly CORDIS,
  PATSTAT and ISI-WOS).

The Common Identifier project aims to build a bridge among all
 kinds of entities that exist across the three main datasets.
OST Common identifier (II)
Data categories that exist across patent, scientific
 publication and Framework Programme data:

                       PATSTAT                        CORDIS                  WOS
 Geographic data       inventor/applicant addresses   participant addresses   affiliation addresses
 Individuals           inventors, applicants          contacts                authors
 Companies             applicants                     participants            affiliations
 Sci/tech taxonomies   IPC                            thematic priorities     subject categories
OST Common identifier (III)
The pilot for project feasibility is building a data
 structure that can hold data for legal
 persons, starting with PATSTAT
 applicants (and soon WOS affiliations).

    [PATSTAT applicants] -- [C.I. data structure] -- [ISI WoS affiliations]
Data constraints (I)
1) DEFINE ATOMIC ENTITIES AND NON-AMBIGUOUS JOINS
• The data involved may contain the same legal entity at different
levels (e.g. for ABB there are the ABB holding, IP departments,
divisions, JVs…)

• The C.I. dataset should use an entity size that allows a unique data
match across the different sets.

• The C.I. dataset should also support a hierarchical structure
of entities, allowing joins to the main datasets at different levels.
Data constraints (II)
 FROM
   001   THE REGENTS OF THE UNIVERSITY OF CALIFORNIA, BERKELEY
   002   UNIVERSITY OF CALIFORNIA, BERKELEY OFFICE OF TECHNOLOGY LICENSING
   003   MATHEMATICAL SCIENCE PUBLISHERS DEPARTMENT OF MATHEMATICS UNIVERSITY OF CALIFORNIA, BERKELEY
   004   UNIVERSITY OF CALIFORNIA AT BERKELEY
   005   UNIV OF CALIFORNIA BERKELEY

 TO
   001    father to         002
   001    father to         003
   002    brother to        003

 A mix of data-mining techniques (keyword lists) is used to disambiguate
 applicants and to rebuild the relationships among them.
 The choice of atomic entity also relies on sector allocation, developed
 starting from the KUL algorithm with further developments.
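
As a minimal sketch of the relationship list above, the "father to" rows can be held as a child-to-parent map, from which "brother to" pairs and the top-level entity can be derived. The names PARENT_OF, siblings and root are hypothetical and not part of the C.I. project; this only illustrates joining at different levels of the hierarchy.

```python
from collections import defaultdict

# Child -> parent links, copied from the "father to" rows on the slide.
PARENT_OF = {"002": "001", "003": "001"}


def siblings(parent_of):
    """Return unordered 'brother' pairs: entities that share the same parent."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)
    pairs = set()
    for kids in children.values():
        kids = sorted(kids)
        for i, first in enumerate(kids):
            for second in kids[i + 1:]:
                pairs.add((first, second))
    return pairs


def root(entity, parent_of):
    """Walk up the hierarchy to the top-level entity (the coarsest join level)."""
    while entity in parent_of:
        entity = parent_of[entity]
    return entity
```

Storing only parent links keeps the table small; sibling relationships such as "002 brother to 003" then follow from the data instead of being entered by hand.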
Data constraints (III)
2) TIME SERIES
2a) DATASET ASYNCHRONIES
• Data may enter the database with a different time
  frame depending on the dataset.
• (E.g. PATSTAT is a snapshot taken at the moment of data
  creation, while WOS is an incremental update, so name
  changes or M&A could make the same entity look different in
  the two datasets. Note also that geographic entities change
  over time: counties, countries…)

• C.I. tables must have a time-related dimension.
Data constraints (IV)
2b) DATA TRANSFORMATIONS
• Data change over time.
(E.g. companies may merge, split [the most critical case], change
name, change ownership…)

• C.I. tables must have a dimension that allows following the
  transformation of entities.
Data constraints (V)
 C.I. TABLE
   10012   CNRS
   10013   CNRS Bordeaux
   10014   Lab 3 CNRS Bordeaux

 PATSTAT
   Name   Source     Person_id   Patstat ed.
   CNRS   PATSTAT    2010000     201210
   CNRS   PATSTAT    302452      201304

 Link table
   C.I. id   Person_id   datasource
   10012     2010000     201210
   10012     302452      201304

 In other words, we must have validity dates and data source data.

 Example: Sarajevo changed from YU to BS in 1992

   NO:    Sarajevo   YU BS
   YES:   Sarajevo   YU   1800   1991
          Sarajevo   BS   1992   9999
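
The "YES" layout above can be queried by year. The following is a minimal sketch, assuming inclusive year bounds and 9999 as the "still valid" marker; the names Validity, ROWS and country_in are hypothetical, and the country codes are kept as written on the slide.

```python
from typing import NamedTuple, Optional


class Validity(NamedTuple):
    city: str
    country: str
    date_from: int  # first year in which the row is valid
    date_to: int    # last year in which the row is valid (9999 = open-ended)


# The two "YES" rows from the slide.
ROWS = [
    Validity("Sarajevo", "YU", 1800, 1991),
    Validity("Sarajevo", "BS", 1992, 9999),
]


def country_in(city: str, year: int, rows=ROWS) -> Optional[str]:
    """Return the country code valid for the city in the given year, if any."""
    for row in rows:
        if row.city == city and row.date_from <= year <= row.date_to:
            return row.country
    return None
```

With the "NO" layout the same question cannot be answered, because nothing says when each code applies.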
How to deal with:
• PROPERTIES / EVENTS BASED DATA STRUCTURE
•   The proposed data structure should be a TEMPORAL DATABASE (1), which stores the PROPERTIES of our
    ENTITIES (e.g. applicants) by date and data source, together with the EVENTS that can change those properties

•   PROPERTYNAME            (e.g. ownership, affiliation…)
•   PROPERTYVALUE           (e.g. new owner, new affiliation)
•   DATEFROM, DATETO        (validity dates from the analyst's point of view)
•   DATASOURCE              (e.g. PATSTAT edition: the validity date from the DB point of view)

•   Events change the value of a property and may, by default, cause other properties to be inherited by the
    entity from that moment onward (e.g. the acquisition of C by B, which is owned by A, makes C inherit
    ownership from B)

•   The main events that may occur are: BIRTH, MERGE, DEMERGE, SPINOFF, ACQUISITION BY, DEATH.

•   Along with the properties, it must also be defined how properties are inherited among entities (e.g. CNRS
    Bordeaux inherits ownership from CNRS)

•   (1) See Richard T. Snodgrass, "TSQL2 Temporal Query Language", www.cs.arizona.edu, Computer Science Department of
    the University of Arizona
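
A minimal sketch of such a property record and an "as of" lookup follows. It assumes year-level dates with inclusive bounds and treats DATASOURCE as a plain edition string. The class and function names, and the records for entity 42, are made up for illustration and are not the project's schema or data.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

OPEN_END = 9999  # marker used on the slides for "still valid"


@dataclass(frozen=True)
class PropertyRecord:
    pcode: int        # entity identifier
    prop_name: str    # PROPERTYNAME, e.g. "AFFILIATION"
    prop_value: str   # PROPERTYVALUE
    date_from: int    # DATEFROM: validity start from the analyst's point of view
    date_to: int      # DATETO: validity end from the analyst's point of view
    datasource: str   # DATASOURCE: validity from the DB point of view (edition)


def value_as_of(records: Iterable[PropertyRecord], pcode: int, prop_name: str,
                year: int, datasource: str) -> Optional[str]:
    """Return the property value valid in `year` according to one data source."""
    for rec in records:
        if (rec.pcode == pcode and rec.prop_name == prop_name
                and rec.datasource == datasource
                and rec.date_from <= year <= rec.date_to):
            return rec.prop_value
    return None


# Made-up records: two editions disagree about when the affiliation changed.
RECORDS = [
    PropertyRecord(42, "AFFILIATION", "LAB A", 1990, 1999, "201210"),
    PropertyRecord(42, "AFFILIATION", "LAB B", 2000, OPEN_END, "201210"),
    PropertyRecord(42, "AFFILIATION", "LAB B", 1998, OPEN_END, "201304"),
]
```

Keeping the data source next to the validity dates lets the same question ("what was the affiliation in 1998?") return a different answer per edition, which is the point of the two time dimensions.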
Example (I):
• NOVARTIS
• Novartis Pharma originates from the merger of CIBA (1884),
  GEIGY (1758) and Sandoz (1876)
• Until 1970 they are three separate entities

PCODE  PNAME
    1  CIBA
    2  GEIGY
    3  SANDOZ
    4  CIBA SUB 1..N
    5  GEIGY SUB 1..N
    6  SANDOZ SUB 1..N

PCODE  PROPNAME   PROPVALUE  STATUSCODE2  STATUSTEXT  STATUSPERC  DATEFROM  DATETO  CHGREASON
    1  OWNERSHIP  FULLOWN              1                     100      1884    9999
    2  OWNERSHIP  FULLOWN              2                     100      1758    9999
    3  OWNERSHIP  FULLOWN              3                     100      1876    9999
    4  OWNERSHIP  FULLOWN              1                     100      1884    9999
    5  OWNERSHIP  FULLOWN              2                     100      1758    9999
    6  OWNERSHIP  FULLOWN              3                     100      1876    9999
Example (II):
• 1970, first EVENT: merger CIBA + GEIGY = CIBA GEIGY LTD

PCODE  PNAME
    1  CIBA
    2  GEIGY
    3  SANDOZ
    4  CIBA SUB 1..N
    5  GEIGY SUB 1..N
    6  SANDOZ SUB 1..N
    7  CIBA GEIGY LTD.

Merge event occurs…

PCODE  EVENT  EVENTCODE2  EVENTTEXT  EVENTPERC  DATE
    1  MERGE           7                    50  1970
    2  MERGE           7                    50  1970

The DATETO of each row closed by the event changes from 9999 to 1969:

PCODE  PROPNAME   PROPVALUE  STATUSCODE2  STATUSTEXT  STATUSPERC  DATEFROM  DATETO  CHGREASON
    1  OWNERSHIP  FULLOWN              1                     100      1884    1969  MERGE
    2  OWNERSHIP  FULLOWN              2                     100      1758    1969  MERGE
    3  OWNERSHIP  FULLOWN              3                     100      1876    9999
    4  OWNERSHIP  FULLOWN              1                     100      1884    1969  INHERIT
    5  OWNERSHIP  FULLOWN              2                     100      1758    1969  INHERIT
    6  OWNERSHIP  FULLOWN              3                     100      1876    9999
    7  OWNERSHIP  FULLOWN              7                     100      1970    9999
    4  OWNERSHIP  FULLOWN              7                     100      1970    9999  INHERIT
    5  OWNERSHIP  FULLOWN              7                     100      1970    9999  INHERIT

1996, second merger: CIBA GEIGY + Sandoz = Novartis, which can be described in the same way.
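
The 1970 merge can be replayed in code. This is a minimal sketch, assuming year-level dates and keeping only the ownership columns; the Ownership class and apply_merge function are hypothetical names, and the rule "subsidiaries follow their parent" is an assumption read off the INHERIT rows on the slide.

```python
from dataclasses import dataclass, replace
from typing import List, Set

OPEN_END = 9999  # marker used on the slides for "still valid"


@dataclass(frozen=True)
class Ownership:
    pcode: int            # entity
    owner: int            # STATUSCODE2 on the slide: the owning entity
    date_from: int
    date_to: int = OPEN_END
    chg_reason: str = ""


def apply_merge(rows: List[Ownership], merged: Set[int],
                new_pcode: int, year: int) -> List[Ownership]:
    """Close open rows owned by the merged entities and open rows for the new one."""
    closed_or_kept = []
    inherited = []
    for row in rows:
        if row.date_to == OPEN_END and row.owner in merged:
            reason = "MERGE" if row.pcode in merged else "INHERIT"
            closed_or_kept.append(replace(row, date_to=year - 1, chg_reason=reason))
            if row.pcode not in merged:
                # Assumption: subsidiaries follow their parent into the new entity.
                inherited.append(Ownership(row.pcode, new_pcode, year, OPEN_END, "INHERIT"))
        else:
            closed_or_kept.append(row)
    new_entity = Ownership(new_pcode, new_pcode, year)
    return closed_or_kept + [new_entity] + inherited


# State before 1970 (Example I), then the merge of entities 1 and 2 into 7.
BEFORE = [
    Ownership(1, 1, 1884), Ownership(2, 2, 1758), Ownership(3, 3, 1876),
    Ownership(4, 1, 1884), Ownership(5, 2, 1758), Ownership(6, 3, 1876),
]
AFTER = apply_merge(BEFORE, {1, 2}, 7, 1970)
```

Under these assumptions AFTER holds nine rows in the same order as the property table of Example (II). Rows are closed, never deleted, so the pre-1970 structure stays queryable.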
Populating the datasets:
• Building relationships and changes cannot rely on
  data mining alone

• Many free resources are available
• SEC: http://www.sec.gov/edgar/searchedgar/companysearch.html
• NASDAQ: http://www.nasdaq.com/screening/companies-by-name.aspx?letter=A
• FULL WW UNIV LIST:
  http://www.unibo.it/Portale/Guida/Universita+nel+mondo/Unimondo.htm
• INDIAN NAME CHANGES: http://www.smctradeonline.com/change-of-name.aspx
• SELECTION OF OTHER DBS:
  http://www.anderson.ucla.edu/x14519.xml#company

• Another great resource for name changes is the PATSTAT table TLS221,
  which lists more than 500,000 applicant name-change events
  (about 85,000 net names after cleaning)
Next steps:
• Development is in progress, with the goal of classifying 85% of
  applications for EPO, USPTO, INPI and PCT by June 2013;

• WoS and CORDIS data should be integrated soon after; then
  phase 2 should start, aiming to classify 95% of applications

• Comments and ideas, as well as proposals for
  collaboration, are welcome
