1. Big Data for Research
for "Science, Technology and Entrepreneurship: Class 14"
2017/1/11 16:40-18:20 19:00
GRIPS SciREX Center/Specialist
NISTEP/Visiting Researcher
Hitotsubashi University MIC/Affiliate Researcher
Toyo University/Lecturer
Yasushi HARA
ya-hara@grips.ac.jp
twitter: harayasushi
facebook: yasushi.hara
2. Contents
• Introduction
• Database List
• J-global/KAKEN database
• RPA System
• Good Design Prize Database
• Scopus
• Web of Science
• PATSTAT
• Showcases
• Researcher network of Dr. Ryoji Noyori
• Identifying Core Paper for Dr. Satoshi Omura
• Knowledge flow analysis for Blue LED.
• Conclusion
3. Introduction:
Data(base) for Research.
• Research Question List
• Who is the star scientist? e.g. (Zucker and Darby 1997)
• What is the impact of the open journal movement?
• Do papers written by PhD holders receive more citations than papers by non-PhD holders? (Shimizu and Hara 2011)
• Does a high number of backward citations lead to a high number of forward citations?
• What is the role of self-citation?
• Can a technological trajectory be traced via data?
• Ranking of authors/organizations in a given scientific field.
• Etc.
4. Framework of Innovation Indicators
(Pakes and Griliches 1984)
[Diagram labels]
• Inputs to Innovation: R&D, designing, marketing research, etc.
• Intangible knowledge
• Patenting Propensity
• Patent
• Paper
• Knowhow and First Mover Advantage
• Non-Knowledge Factors of Production
• Other Economic Factors
• Output: Productivity, Firm’s Value
5. Data; from micro, meso to macro
[Diagram labels]
Levels of analysis:
• Macro: National/Global level
• Meso: Industry/firm level (University/Company)
• Micro: Individual level (Scientist/Inventor)
Data by type:
• PATENT: Inventor, Assignee, Patent Number, IPC, Patent Family, Non Patent Literature
• PAPER: Author, Organization, Category, Acknowledgement
• DESIGN: No., Designer Name
• FUND: No., Tied Patent/Paper No.
• Science Linkage
Other sources:
• Economic Census
• Innovation Survey (NISTEP)
• Input-Output Table (I/O)
• Macro Economic Model
• Funding Database
• Press Release
• Survey of Research and Development (Statistics Japan)
• SNA (System of National Accounts; GDP)
6. The Dataset in GIST/SciREX Centre
Data Source | Type | Duration | Source | Availability | Accessibility
Scopus (Web Interface) | Publication | 1970-2017 | Elsevier | Researcher/Student/Staff in GRIPS | Web from GRIPS network
Scopus (XML data; 1.4TB) | Publication | 1998-2015 | Elsevier | Researcher/Student/Staff in SciREX Center/GIST | Workstation (accessible via grips_faculty)
Thomson Innovation | PATENT | 1850-2016 | Thomson Reuters | 1 user account license | Web from GRIPS network
PATSTAT | PATENT | 1960-2016 | EPO | Researcher/Staff in SciREX Center | Workstation (accessible via grips_faculty)
Web of Science (Web Interface) | Publication | 1993-2016 | Thomson Reuters | Researcher/Student/Staff in GRIPS | Web from GRIPS network
Web of Science (XML data) | Publication | 1980-2016 | Thomson Reuters | Researcher/Student/Staff in GRIPS (max. 5 users) | Workstation (accessible via grips_faculty)
J-global | Publication/PATENT | 1964-2016 | JST | Researcher/Staff in SciREX Center | Workstation (internal)
FMDB database | FUNDING | 1980-2016 | JST | Researcher/Staff in SciREX Center [only for Arimoto/Kuroda/Ikeuchi/Michel and HARA] | Workstation (internal)
NIKKEI Press Release Database | Press Release | 2003-2014 | NIKKEI | Researcher/Staff in SciREX Center/GIST | Workstation (internal)
NIKKEI Article (Japanese) Database | Newspaper article | 1995-2015 | NIKKEI | Researcher/Staff in SciREX Center/GIST | Workstation (internal)
Input-Output Table | I/O table | 1995-2011 | Applied | Researcher/Staff in SciREX Center | Workstation (internal)
Good Design Award Database | Product Database | 1959-2016 | JDP | Creative Commons | Workstation / Web
7. Data; from micro to macro
[Diagram labels]
Levels of analysis:
• Macro: National/Global level
• Meso: Industry/firm level (University/Company)
• Micro: Individual level (Scientist/Inventor)
Data by type:
• PATENT: Inventor, Assignee, Patent Number, IPC, Patent Family, Non Patent Literature
• PAPER: Author, Organization, Category, Acknowledgement
• DESIGN: No., Designer Name
• FUND: No., Tied Patent/Paper
• Science Linkage
Databases:
• IIP Patent [JPO]
• PATSTAT [EPO]
• PatentsView [USPTO]
• Web of Science
• J-global
• DWPI
• Scopus
• NISTEP Design Patent DB
• KAKEN DB
• Good Design DB
Other sources:
• Economic Census
• Innovation Survey (NISTEP)
• Input-Output Table
• Macro Economic Model
• Funding Database
• Press Release
• Survey of Research and Development (Statistics Japan)
• SNA (System of National Accounts; GDP)
8. But,
Very Unfortunately….
• There is no “exhaustively complete” database.
• Disambiguation is always needed.
• Variation in organization names
• Variation in author names
• Variation in country names
• Variation in paper categories
• When using a scientific paper database for your own research, you will need to (1) choose a database appropriate to the research question, and (2) clean the data to remove these variations from every single entry.
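As a rough illustration of the clean-up step, the sketch below groups spelling variants of an organization name under one matching key. The abbreviation table, the stop-word list and the sample names are hypothetical, and the rule itself (lowercase, expand abbreviations, sort the words) is only one simple heuristic; it can merge names that are genuinely different, so the resulting groups still need a manual check.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical abbreviation table and stop words; extend them for your own data.
ABBREVIATIONS = {"univ": "university", "inst": "institute", "tech": "technology"}
STOP_WORDS = {"of", "the"}

def normalize_org(name: str) -> str:
    """Return a crude matching key for an organization name."""
    # Fold full-width characters and strip accents.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, then replace everything except letters, digits and spaces.
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]
    # Drop stop words and sort so that word order does not matter.
    return " ".join(sorted(w for w in words if w not in STOP_WORDS))

def group_variants(names):
    """Map each matching key to the raw spellings that share it."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_org(name)].append(name)
    return dict(groups)

raw_names = ["Nagoya Univ.", "University of Nagoya", "NAGOYA UNIVERSITY", "Toyo Univ"]
for key, variants in group_variants(raw_names).items():
    print(key, "<-", variants)
```

The same idea applies to author names, countries and paper categories, each with its own normalization rules.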
10. Dataset & Software
• Dataset: J-global and KAKEN database
• Managed by JST Information Planning Department
• J-global
• PATENT/Scientific Paper Database
• Covers scientific papers written in Japanese
• KAKEN-DB
• Database of KAKEN funding (Grants-in-Aid for Scientific Research)
11. Data Structure of J-global
・Researcher Information
・Bibliographic Information
・Patent Information
Forward Citation
Backward Citation
・Institutional Information
12. Data Structure of KAKEN DB
・Researcher Information
・Institution Information
・Scientific Paper tied with KAKEN
・PATENT tied with KAKEN
・Funding Information (KAKEN)
16. Value of the database?
• Suitable for analyzing science, technology and innovation activity in Japan
• For global comparison, it is better to use Scopus and/or Thomson Innovation and PATSTAT.
• Use with KAKEN-DB
• Treated group vs. control group analysis
• Connect with Web of Science and Scopus based on ID
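A minimal sketch of the treated-versus-control idea, assuming papers have already been matched by ID between a citation database and KAKEN-DB. The paper IDs, citation counts and the set of funded IDs are made up for illustration; a real analysis would also need to control for field, year and other confounders rather than compare raw means.

```python
from statistics import mean

# Hypothetical records: paper ID -> citation count, as if exported from a citation database.
citations = {"P1": 12, "P2": 3, "P3": 7, "P4": 0, "P5": 20}
# Hypothetical set of paper IDs linked to a KAKEN grant.
funded_ids = {"P1", "P3", "P5"}

def compare_groups(citations, funded_ids):
    """Return (mean citations of funded papers, mean citations of the rest)."""
    treated = [c for pid, c in citations.items() if pid in funded_ids]
    control = [c for pid, c in citations.items() if pid not in funded_ids]
    return mean(treated), mean(control)

treated_mean, control_mean = compare_groups(citations, funded_ids)
print(f"funded: {treated_mean:.1f}, not funded: {control_mean:.1f}")
```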
17. 2. Creating Your Own Database
based on an RPA (Robotic Process Automation) system
18. Create Database: BizRobo
• Automatic system for cloning data from Web/PDF
• Web scraping through a GUI
• Web scraping: data scraping used for extracting data from websites.
• Fetches data from Web and/or PDF documents and builds a database sheet in CSV/Excel/SQL format.
• Based on Kapow Katalyst.
• https://www.youtube.com/watch?v=EHsOI51dPo0
Source: https://bizrobo.com/case/20161019_564.html
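For readers without access to a GUI tool, the scrape-to-CSV step can be sketched with the Python standard library alone. This is not how BizRobo works internally; it only illustrates the idea of extracting a table from HTML and writing it as CSV. The sample HTML and the `TableParser` and `html_table_to_csv` names are invented for this example.

```python
import csv
import io
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def html_table_to_csv(html: str) -> str:
    """Parse the table rows in an HTML string and return them as CSV text."""
    parser = TableParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()

# Hypothetical page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Product</th></tr>
  <tr><td>2015</td><td>Chair A</td></tr>
  <tr><td>2016</td><td>Lamp, desk</td></tr>
</table>
"""
print(html_table_to_csv(SAMPLE_HTML))
```

The `csv` module quotes cells that contain commas, so the output can be loaded directly into Excel or a SQL import tool.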
21. Data Description
• Principal Implementing Business
• Category
• Company
• Outline
• Producer
• Director
• Designer
• Based on Creative Commons
22. Value of the database?
• Connecting the dots
• Role of the design itself and the designer in R&D process.
• In-house R&D or Collaborative Research in Design?
• Harmonization between design and technology
• Linking the data between patents and design patents.
• Role of Designer in R&D process.
• Availability
• Creative Commons License
• Freely usable via SQL schema (coming soon)
24. Web of Science
• Supplied by Clarivate Analytics, Inc.
• Variable Set available in Web of Science
Database
• Author’s Name
• Title
• Journal Title
• Backward Citation
• Forward Citation
• Affiliation and its address
• Data Coverage
• 1900-2016 (in Hitotsubashi)
• 1993-2016 (in GRIPS)
25. Information embedded in Web of Science
Abstract
Title
Author Name
Journal Name and Pages
Publish Year
Keyword
Organization Name and Address
Publisher
Classification
Type of Article and Language
Backward and Forward Citation
26. Backward Citation and Forward Citation
time: t
Backward Citation / Forward Citation
References shown in the diagram:
• "An Approach to the Study of Entrepreneurship," The Tasks of Economic History (Supplemental Issue of The Journal of Economic History), VI (1946), 1-15.
• Oscar Lange, "A Note on Innovations," Review of Economic Statistics, XXV (1943), 19-25.
• F. W. Taussig, Inventors and Money-Makers (New York: The Macmillan Company, 1915).
• Fritz Redlich, The Molding of American Banking: Men and Ideas (New York: Hafner Publishing Company, 1947).
• Robert A. Gordon, Business Leadership in the Large Corporation (Washington, D.C.: The Brookings Institution, 1945).
• F. J. Marquis and S. J. Chapman on the managerial stratum of the Lancashire cotton industry, Journal of the Royal Statistical Society, LXXV, Pt. III (1912), 293-306.
・What backward citations reveal
-- How does the author use prior knowledge?
-- When did that prior knowledge emerge?
-- What is the core knowledge behind a scientific discovery?
・What backward citations cannot reveal
-- Critical scientific discoveries that the paper did not cite
-- Knowledge that was used but not listed on the References page
・What forward citations reveal
-- The importance of the paper itself
-- “Standing on the shoulders of giants”
-- Knowledge diffusion
-- Is the paper still new (still being cited)?
・What forward citations cannot reveal
-- A number is just a number. Does the cumulative number of forward citations reflect the importance of a paper? (This is itself a research question.)
-- 1-tier citations do not cover the core prior knowledge.
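To make the two directions concrete, the sketch below builds backward and forward citation sets from a list of (citing, cited) pairs. The pairs are hypothetical placeholders; in practice they would come from the reference lists in a database such as Web of Science or Scopus.

```python
from collections import defaultdict

# Hypothetical citation pairs: (citing paper, cited paper).
CITATIONS = [("B", "A"), ("C", "A"), ("C", "B"), ("D", "C")]

def citation_index(pairs):
    """Return (backward, forward) maps built from citing/cited pairs."""
    backward = defaultdict(set)  # paper -> papers it cites
    forward = defaultdict(set)   # paper -> papers that cite it
    for citing, cited in pairs:
        backward[citing].add(cited)
        forward[cited].add(citing)
    return backward, forward

backward, forward = citation_index(CITATIONS)
for paper in sorted({p for pair in CITATIONS for p in pair}):
    print(paper, "backward:", len(backward[paper]), "forward:", len(forward[paper]))
```

These are 1-tier counts only, which is exactly the limitation noted above: a paper's forward count says nothing about knowledge that reached it through intermediate papers.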
27. Research Example: Dr. Ryoji Noyori’s citation network
Ryoji Noyori
• Emeritus professor of Nagoya University
• Nobel Prize in Chemistry (2001)
• Born: 3 Sep. 1938, Japan
• Affiliation at the time of the award: Nagoya
University, Japan
• Prize motivation: “for their work on chirally
catalysed hydrogenation reactions”
• Field: Industrial chemistry, organic chemistry
By using co-author network analysis, we try to clarify
(1) the trajectory of his research,
(2) his relationship with the co-recipient Laureates, Dr. William S. Knowles and Dr. K. Barry Sharpless.
Source: http://www.nobelprize.org/nobel_prizes/chemistry/laureates/2001/
28. Case 1: Dr. Ryoji Noyori
• 1-tier co-authorship
• Dr. Knowles and Dr. Noyori
• No direct co-authorship
• Dr. Sharpless and Dr. Noyori
• This is likely closely related to the laboratory system of the scientific community, especially in chemistry, and captures the spillover process between scientists.
Co-authorship between Dr. Noyori and Dr. Sharpless
Data: Web of Knowledge / Visualization: VantagePoint
29. Case 1: Dr. Ryoji Noyori
Co-authorship of Dr. Noyori (1976-1980 ; in organization)
Co-authorship of Dr. Noyori (1981-1990 ; in organization)
• Continuous collaboration with pharmaceutical companies and with other sectors (e.g. Nippon Steel Chem.)
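A period-by-period co-authorship network like the ones above can be approximated by counting pairs of names that appear on the same paper. The sketch below uses invented records with placeholder author names; replacing the author lists with affiliation lists gives the organization-level view. It is a simplified stand-in and not the procedure used in VantagePoint.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: (publication year, list of author names).
PAPERS = [
    (1978, ["Author A", "Author B"]),
    (1979, ["Author A", "Author B", "Author C"]),
    (1985, ["Author A", "Author D"]),
]

def coauthor_edges(papers, start, end):
    """Count co-authored papers per author pair for years in [start, end]."""
    edges = Counter()
    for year, authors in papers:
        if start <= year <= end:
            # Sort so that each pair is always stored in the same order.
            for a, b in combinations(sorted(set(authors)), 2):
                edges[(a, b)] += 1
    return edges

early = coauthor_edges(PAPERS, 1976, 1980)
late = coauthor_edges(PAPERS, 1981, 1990)
print("1976-1980:", dict(early))
print("1981-1990:", dict(late))
```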
31. Scopus
• Supplied by Elsevier, Inc.
• Variable Set available in Scopus Database
• Author’s Name
• Title
• Journal Title
• Backward Citation
• Forward Citation
• Affiliation and its address
• Funding
• Impact Factor
+
・Corresponding Author
・Scientific Ordering (sequence number of authorship).
・EID (Scopus Paper ID)
・Author ID
・Affiliation ID
• Data Set: 1998-2015
33. Value of the database?
• Very Comprehensive dataset for
scientific paper
• Disambiguation for author name
and organization name.
• Availability
• Web : Accessible via GRIPS Network
• XML : Through Workstation
(ID/password required)
• SQL : Coming Soon! (Based on
MySQL)
• Comparison between Scopus and
Web of Science
• The lists of covered journals differ.
See (Del Bo 2016)
34. Suppl.
Data Coverage Comparison between J-global and Scopus
Source: http://researchmap.jp/outline/sympo2015/rmapsympo2015_lecture04.pdf
Totals:
• Scopus: 20,246 journals (Foreign: 19,829; Japanese: 417)
• J-global: 14,525 journals (Japanese: 9,514; Foreign: 5,011)
Overlap (Venn diagram regions; the assignment of each number follows from the totals):
• Japanese Journal (国内誌): J-global only 9,162; both 352; Scopus only 65
• Foreign Journal (海外誌): J-global only 1,168; both 3,843; Scopus only 15,986
Analyzed in December 2014 by JST.
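The six bare numbers in the Venn diagram can be checked against the stated totals. The sketch below assumes the assignment that makes the totals add up (for example 9,162 + 352 = 9,514 Japanese journals in J-global); this labeling is inferred from the arithmetic, not read off the original figure.

```python
# Journal counts per Venn-diagram region (JST analysis, December 2014).
# The region labels are inferred from the stated totals, not read off the figure.
REGIONS = {
    "japanese": {"jglobal_only": 9162, "both": 352, "scopus_only": 65},
    "foreign": {"jglobal_only": 1168, "both": 3843, "scopus_only": 15986},
}

def totals(regions):
    """Recompute database totals, the overlap and the union from the regions."""
    jglobal = sum(r["jglobal_only"] + r["both"] for r in regions.values())
    scopus = sum(r["scopus_only"] + r["both"] for r in regions.values())
    both = sum(r["both"] for r in regions.values())
    return {"jglobal": jglobal, "scopus": scopus, "both": both,
            "union": jglobal + scopus - both}

summary = totals(REGIONS)
print(summary)

jp = REGIONS["japanese"]
share = jp["both"] / (jp["jglobal_only"] + jp["both"])
print(f"Japanese J-global journals also covered by Scopus: {share:.1%}")
```

The small overlap for Japanese journals is the reason the two databases complement each other for research on Japan.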
39. PATSTAT
• Patent Database by EPO
• Data Coverage
• Application
• Inventors
• Priorities
• Patents
• IPC Classes
• Citations
• Non Patent Literature
40. Data Structure of PATSTAT
http://documents.epo.org/projects/babylon/eponet.nsf/0/95da6bccf12e54a1c1257aa1002e2d1d/$FILE/patstat_data%20elements_v1.1.pdf
43. Pros and Cons of the database
• Pros
• Covers the EPO/USPTO/JPO patent databases.
• Well-structured database
• You’ll need knowledge of the SQL query language.
• Cons
• Needs a supplemental database for JPO patents.
• IIP patent database (free) or patR.
• PATENT family control
• Use Thomson Innovation
• Accessibility
• Through SQL server.
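Since access is through SQL, a small query may help show what working with such a database looks like. The sketch below runs against an in-memory SQLite database with a toy schema. The table and column names are modeled loosely on PATSTAT's naming and should be treated as assumptions to check against the official data catalog; the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy schema and data; names imitate PATSTAT but are simplified assumptions.
conn.executescript("""
CREATE TABLE tls201_appln (appln_id INTEGER PRIMARY KEY, appln_auth TEXT, appln_filing_year INTEGER);
CREATE TABLE tls206_person (person_id INTEGER PRIMARY KEY, person_name TEXT);
CREATE TABLE tls207_pers_appln (person_id INTEGER, appln_id INTEGER, invt_seq_nr INTEGER);
INSERT INTO tls201_appln VALUES (1, 'JP', 1991), (2, 'US', 1993), (3, 'EP', 1993);
INSERT INTO tls206_person VALUES (10, 'INVENTOR A'), (11, 'INVENTOR B');
INSERT INTO tls207_pers_appln VALUES (10, 1, 1), (10, 2, 1), (11, 2, 2), (11, 3, 1);
""")

# Count applications per inventor for one filing year.
QUERY = """
SELECT p.person_name, COUNT(DISTINCT a.appln_id) AS n_applications
FROM tls206_person AS p
JOIN tls207_pers_appln AS pa ON pa.person_id = p.person_id
JOIN tls201_appln AS a ON a.appln_id = pa.appln_id
WHERE pa.invt_seq_nr > 0 AND a.appln_filing_year = ?
GROUP BY p.person_name
ORDER BY p.person_name
"""
rows = conn.execute(QUERY, (1993,)).fetchall()
for name, n_applications in rows:
    print(name, n_applications)
```

The same join pattern (person, person-application link, application) extends to citations and IPC classes once the corresponding tables are added.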
44. Research Example 3: Shuji Nakamura
• Created Efficient Blue LEDs
• Lighting plays a major role in our quality of life. The development of light-emitting diodes (LEDs) has made more efficient light sources possible. Creating white light that can be used for lighting requires a combination of red, green, and blue light.
• Blue LEDs proved to be much more difficult to create than red and green diodes. During the 1980s and 1990s Isamu Akasaki, Hiroshi Amano, and Shuji Nakamura successfully used the difficult-to-handle semiconductor gallium nitride to create efficient blue LEDs.
Source: http://www.nobelprize.org/nobel_prizes/physics/laureates/2014/nakamura-facts.html
45. Micro-level analysis
Aim
Capture the knowledge flow from basic science to disruptive innovation.
Methods
Citation Analysis
Starting from the blue LED patent issued to Shuji Nakamura in 2007, trace backward citations to reveal the knowledge flow among inventors/researchers and organizations.
This study is based on a feasibility study under a JST/RISTEX funding program by Dr. Hiroshi Shimizu, associate professor at the Institute of Innovation Research, Hitotsubashi University.
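Tracing backward citations over several tiers amounts to a breadth-first walk of the citation graph from the focal patent. The sketch below shows that walk on a made-up graph; the document labels are placeholders, not the actual citations of the patent analyzed in the study.

```python
from collections import deque

# Hypothetical backward-citation map: document -> documents it cites.
CITES = {
    "patent_2007": ["patent_1995", "paper_1992"],
    "patent_1995": ["paper_1989"],
    "paper_1992": ["paper_1989", "paper_1986"],
    "paper_1989": [],
    "paper_1986": [],
}

def trace_backward(cites, start, max_tier):
    """Return {document: tier} for everything reachable within max_tier steps."""
    tiers = {start: 0}
    queue = deque([start])
    while queue:
        doc = queue.popleft()
        if tiers[doc] == max_tier:
            continue  # do not expand beyond the requested tier
        for cited in cites.get(doc, []):
            if cited not in tiers:  # keep the shortest distance only
                tiers[cited] = tiers[doc] + 1
                queue.append(cited)
    return tiers

for doc, tier in trace_backward(CITES, "patent_2007", max_tier=2).items():
    print(tier, doc)
```

Joining each traced document to its inventors, authors and organizations then gives the knowledge flow at the level of people and institutions.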
51. How to Use the Database?
• Use Freely via Web interface
• Web of Science(Web), Scopus (Web)
• Accessible via GRIPS Network
• Input-Output Table
• Contact: ya-hara@grips.ac.jp
• Accessible via Workstation Server
• Web of Science(XML), Scopus(XML -> SQL), PATSTAT(SQL)
• IDs and passwords are assigned
• Accessible via grips_faculty network
• Contact: ya-hara@grips.ac.jp
52. How to Use the Database? (2)
• Managed via ID
• Thomson Innovation
• Ask ya-hara@grips.ac.jp
• Database with User restriction
• NIKKEI Press Release (CSV), KAKEN DB (SQL), J-global Database (SQL)
• Based on Collaborative Agreement with JST and NIKKEI
• Current Member: Prof. Arimoto, Prof. Kuroda, Ikeuchi, HARA, and Post. Doc Micheal
• Based on Creative Commons
• Good Design Award Database
• Accessible via SQL database.
53. Conclusion
• Conclusion
• Data analysis alone is in vain. The research question comes first.
• Then find an appropriate database and fetch the data for quantitative analysis.
• Data cleaning and disambiguation are mandatory.
• Advanced
• Better to learn SQL/NoSQL schemas.