Predicting the
“Next Big Thing”
in Science
ADRIAN MLADENIĆ GROBELNIK
ADRIAN.GROBELNIK@GMAIL.COM
BRITISH INTERNATIONAL SCHOOL LJUBLJANA
LJUBLJANA, SLOVENIA
#scichallenge2017
What is this research project about?
 The aim is to make a C++ program for predicting which scientific topics will
become important in the future
 To predict the future of science, I used Machine Learning algorithms
to learn how science behaved in the past, and then used the resulting model
to predict future trends in science
 To analyse how science evolved in the past, I used the data from the
recently released “Microsoft Academic Graph” which includes 125 million
scientific articles from the year 1800 to the present
Research Hypothesis
 My research hypothesis is that the scientific topics which will
become important in the future already exist in today’s scientific
articles
 …they are just not visible yet,
 …but it is possible to identify them with Machine Learning
 The task is to find early indicators suggesting which scientific topics
in today’s literature will likely become important in the future
Context: How does science evolve?
 The main element of science is an invention
 Inventions always happen at the beginning of a scientific process
 After an invention happens, there is a period of scientific
exploration, to prove the invention is useful
 Some inventions prove themselves, and some do not
 If an invention proves itself, new products are developed and new
research is done using ideas from the invention
 …less useful inventions usually get forgotten
Context: How to detect scientific
inventions and concepts?
 Scientists are typically strict and consistent when naming things
 In the same way, inventions and other scientific concepts get names which
are then used in scientific articles
 In this project I have used the names from the titles of scientific articles
to track how particular scientific topics evolve through time
 We can spot when a scientific topic appears for the first time, count
how frequently it appears, and spot when it stops being used
 …this is my basis for predicting the “next big thing” in science
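The tracking idea above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not the project's actual code: the `Article` record, the `normalize` helper and `track_topic` are hypothetical names, and matching is done on lower-cased ASCII titles with punctuation treated as spaces.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Hypothetical record: only the two fields this sketch needs.
struct Article {
    int year;
    std::string title;
};

// Lower-case the text, turn every run of non-alphanumeric characters into a
// single space, and pad with one space on each side. After this, matching a
// whole phrase is a plain substring search for " phrase ".
std::string normalize(const std::string& text) {
    std::string out = " ";
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            out += static_cast<char>(std::tolower(c));
        } else if (out.back() != ' ') {
            out += ' ';
        }
    }
    if (out.back() != ' ') out += ' ';
    return out;
}

struct Timeline {
    std::map<int, int> count_by_year;  // year -> number of titles with the phrase

    int first_year() const {
        assert(!count_by_year.empty());
        return count_by_year.begin()->first;
    }
    int last_year() const {
        assert(!count_by_year.empty());
        return count_by_year.rbegin()->first;
    }
};

// Count, per publication year, the titles that contain the phrase.
Timeline track_topic(const std::vector<Article>& articles,
                     const std::string& phrase) {
    const std::string needle = normalize(phrase);
    Timeline timeline;
    for (const Article& article : articles) {
        if (normalize(article.title).find(needle) != std::string::npos) {
            ++timeline.count_by_year[article.year];
        }
    }
    return timeline;
}
```

With such a timeline, `first_year()` gives the first appearance of a topic, the per-year counts give its frequency, and a long gap after `last_year()` suggests the topic has stopped being used.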
What data do we have available?
 There are many databases of scientific articles in the world, but only some are
open and available for research.
 The biggest open database of scientific articles is “Microsoft Academic Graph”
which was released for research use in 2016
 The database size is 130 Gigabytes
 It includes references to 125 million scientific articles from the year 1800 to the present
from all areas of science
 Each scientific article in the database is described by: (a) title, (b) authors and their (c)
institutions, (d) journal/conference where it was published, and (e) the year of publication
 Data available from: https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/
The task to be solved
 The core task in this project is to use the data from over 200 years of
science to extract the early signs that a scientific topic is
becoming successful
 With Machine Learning algorithms I trained a statistical model to
distinguish scientific topics which became successful from those which didn’t
 I apply the trained model to the current data (after 2010) to
predict which topics will be hot and relevant in the near future (in
the early 2020s)
Description of the experiment (1/2)
 From 125 million article titles I extracted 2.5 million candidate topics
 …each topic is described by a phrase of 1 to 5 words
 …the phrase must appear at least 100 times in the database of article titles
 Each topic is represented by a set of features (attributes) describing the
first 10 years after its appearance
 …features include the frequency and trend (slope from linear regression) of
the topic’s appearances within institutions, journals and conferences
 …each topic is described by approx. 55,000 features, represented in a feature
vector
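The two steps on this slide, extracting candidate phrases and computing a trend feature, can be sketched as follows. This is a simplified illustration and not the project's code: the function names are hypothetical, titles are assumed to be already lower-cased with punctuation removed, and the slope is computed over a plain list of yearly counts (in the project, each institution, journal and conference would presumably contribute its own counts).

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Count every phrase of 1..max_words consecutive words over all titles.
std::unordered_map<std::string, int> count_phrases(
        const std::vector<std::string>& titles, std::size_t max_words = 5) {
    std::unordered_map<std::string, int> counts;
    for (const std::string& title : titles) {
        std::istringstream in(title);
        std::vector<std::string> words;
        for (std::string w; in >> w;) words.push_back(w);
        for (std::size_t start = 0; start < words.size(); ++start) {
            std::string phrase;
            for (std::size_t len = 1;
                 len <= max_words && start + len <= words.size(); ++len) {
                if (len > 1) phrase += ' ';
                phrase += words[start + len - 1];
                ++counts[phrase];
            }
        }
    }
    return counts;
}

// Keep only phrases seen at least min_count times (the slides use 100).
std::vector<std::string> candidate_topics(
        const std::unordered_map<std::string, int>& counts, int min_count) {
    std::vector<std::string> kept;
    for (const auto& entry : counts) {
        if (entry.second >= min_count) kept.push_back(entry.first);
    }
    return kept;
}

// Least-squares slope of the counts over x = 0, 1, ..., n-1 (one x per year).
double trend_slope(const std::vector<double>& yearly_counts) {
    const std::size_t n = yearly_counts.size();
    assert(n >= 2);  // a slope needs at least two points
    const double mean_x = (n - 1) / 2.0;
    double mean_y = 0.0;
    for (double y : yearly_counts) mean_y += y;
    mean_y /= n;
    double sxy = 0.0, sxx = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double dx = i - mean_x;
        sxy += dx * (yearly_counts[i] - mean_y);
        sxx += dx * dx;
    }
    return sxy / sxx;
}
```

A positive slope over the first 10 years means the topic is appearing more often each year, and a slope near zero means its use is flat.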
Description of the experiment (2/2)
 Each topic is classified either as:
 Positive, if it became popular in the past (its use increased by a factor of 2 after the first 10 years
following the topic’s first appearance), or as
 Negative, if the topic didn’t attract much attention
 We split the topics into a training set (70%) and a test set (30%)
 …where the training set is used to train the model and the test set is used to test it
 For machine learning I used the Perceptron algorithm which is relatively easy to
implement (https://en.wikipedia.org/wiki/Perceptron)
 …I used an improved version of the Perceptron (MaxMargin)
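The labelling, the split and the training loop described on this slide can be sketched as below. This is a hedged illustration and not the author's implementation: the slides do not specify the MaxMargin variant, so the sketch substitutes a plain perceptron with a margin (it updates whenever an example is classified with too little margin). All names, the sparse feature layout and the deterministic split are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sparse feature vector: (feature index, value) pairs.
using Features = std::vector<std::pair<std::size_t, double>>;

struct Example {
    Features x;
    int label;  // +1 = positive (became popular), -1 = negative
};

// Label rule from the slide: positive when usage at least doubled.
// Which two usage counts are compared is an assumption of this sketch.
int label_topic(double early_usage, double later_usage) {
    return later_usage >= 2.0 * early_usage ? +1 : -1;
}

struct Perceptron {
    std::vector<double> w;
    double bias = 0.0;

    explicit Perceptron(std::size_t num_features) : w(num_features, 0.0) {}

    double score(const Features& x) const {
        double s = bias;
        for (const auto& f : x) {
            assert(f.first < w.size());
            s += w[f.first] * f.second;
        }
        return s;
    }

    int predict(const Features& x) const { return score(x) >= 0.0 ? +1 : -1; }

    // Perceptron with margin: update whenever label * score <= margin,
    // not only on outright mistakes.
    void train(const std::vector<Example>& data, int epochs,
               double margin, double rate) {
        for (int e = 0; e < epochs; ++e) {
            for (const Example& ex : data) {
                if (ex.label * score(ex.x) <= margin) {
                    for (const auto& f : ex.x) {
                        w[f.first] += rate * ex.label * f.second;
                    }
                    bias += rate * ex.label;
                }
            }
        }
    }
};

// Deterministic split: the first train_percent of the examples are used for
// training, the rest for testing. A real run would shuffle the examples first.
std::pair<std::vector<Example>, std::vector<Example>> split_train_test(
        const std::vector<Example>& all, int train_percent) {
    assert(train_percent >= 0 && train_percent <= 100);
    const std::size_t cut =
        all.size() * static_cast<std::size_t>(train_percent) / 100;
    const auto middle = all.begin() + static_cast<std::ptrdiff_t>(cut);
    return {std::vector<Example>(all.begin(), middle),
            std::vector<Example>(middle, all.end())};
}
```

Calling `split_train_test(examples, 70)` gives the 70%/30% split from the slide. The sparse representation is a design choice for this sketch: with tens of thousands of possible features per topic, most values are zero and need not be stored.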
Key statistical results
 The statistical model, trained with the MaxMargin Perceptron
algorithm, produced the following results on the test data:
 Precision: 74%
 Recall: 72%
 F1 (the harmonic mean of both): 73%
 …this means that approx. 74% of the topics the model predicts as
successful really became successful, and the model finds approx. 72% of
all topics that became successful
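For reference, the three measures can be computed from predicted and actual labels as in the sketch below. The names are hypothetical, and labels are assumed to be +1 (successful) and -1 (not successful). F1 is the harmonic mean of precision and recall, so 74% and 72% combine to 2 · 0.74 · 0.72 / (0.74 + 0.72) ≈ 0.73.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Metrics {
    double precision;
    double recall;
    double f1;
};

// Harmonic mean of precision and recall (0 when both are 0).
double f1_score(double precision, double recall) {
    const double sum = precision + recall;
    return sum > 0.0 ? 2.0 * precision * recall / sum : 0.0;
}

// Labels are +1 (successful) or -1 (not successful).
Metrics evaluate(const std::vector<int>& predicted,
                 const std::vector<int>& actual) {
    assert(predicted.size() == actual.size());
    int tp = 0, fp = 0, fn = 0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        if (predicted[i] == +1 && actual[i] == +1) ++tp;  // correct hit
        if (predicted[i] == +1 && actual[i] == -1) ++fp;  // false alarm
        if (predicted[i] == -1 && actual[i] == +1) ++fn;  // missed topic
    }
    Metrics m;
    m.precision = (tp + fp) > 0 ? static_cast<double>(tp) / (tp + fp) : 0.0;
    m.recall = (tp + fn) > 0 ? static_cast<double>(tp) / (tp + fn) : 0.0;
    m.f1 = f1_score(m.precision, m.recall);
    return m;
}
```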
Key descriptive results
 Looking at the resulting statistical model we can see:
 If a scientific topic gets increasingly used by important research
institutions (universities and research institutes)
 …and is getting published by important journals and conferences
 …within 10 years from the invention (when the initial mention is
spotted)
 …then, we can expect increased use of the topic (by a factor of
two or more) by science and industry in the next 5 years
Examples of best topics and features
 Example Best Topics (as predicted by the model):
 Collisions, efficient, proton proton collisions, higgs boson, system, quark,
particles, hadron, mobile augmented reality, variable quantum,
advanced network, molecular dynamics simulations
 Example Best Features (as identified by the Perceptron training):
 CERN, Journal of Proteomics & Bioinformatics, Industrial Research Limited,
Circulation-cardiovascular Imaging, Molecular BioSystems, Metamaterials,
Atw-international Journal for Nuclear Power
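One plausible way to obtain such a list from a trained linear model is to rank the features by their learned weight. The sketch below assumes that "best" means "largest positive weight", which the slides do not state explicitly. The function name and the parallel `names` / `weights` vectors are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using NamedWeight = std::pair<std::string, double>;

// Return the k features with the largest weights, highest first.
std::vector<NamedWeight> top_features(const std::vector<std::string>& names,
                                      const std::vector<double>& weights,
                                      std::size_t k) {
    assert(names.size() == weights.size());
    std::vector<NamedWeight> ranked;
    ranked.reserve(names.size());
    for (std::size_t i = 0; i < names.size(); ++i) {
        ranked.emplace_back(names[i], weights[i]);
    }
    k = std::min(k, ranked.size());
    // Only the first k positions need to be in sorted order.
    std::partial_sort(
        ranked.begin(), ranked.begin() + static_cast<std::ptrdiff_t>(k),
        ranked.end(),
        [](const NamedWeight& a, const NamedWeight& b) {
            return a.second > b.second;
        });
    ranked.resize(k);
    return ranked;
}
```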
Summary
 In this research project I analyzed 125 million articles from “Microsoft
Academic Graph” from over 200 years of science
 I made a program in C++ to process 130 Gigabytes of data and to
build a machine learning model to predict which scientific topics will
become important in the future
 The resulting model reaches an F1 score of 73% when predicting which
scientific topics became important in the history of science
 C++ code and detailed results are available from: https://goo.gl/8luSwz

Predicting the “Next Big Thing” in Science - #scichallenge2017

  • 1. Predicting the “Next Big Thing” in Science
      • Adrian Mladenić Grobelnik (adrian.grobelnik@gmail.com)
      • British International School Ljubljana, Ljubljana, Slovenia
      • #scichallenge2017
  • 2. What is this research project about?
      • The aim is to write a C++ program that predicts which scientific topics will become important in the future
      • To predict the future of science, I used Machine Learning algorithms to learn how science behaved in the past, and then used the resulting model to predict future trends in science
      • To analyse how science evolved in the past, I used data from the recently released “Microsoft Academic Graph”, which includes 125 million scientific articles from the year 1800 to the present
  • 3. Research Hypothesis
      • My research hypothesis is that the science topics which will become important in the future already exist in today’s scientific articles
      • …they are just not visible yet,
      • …but it is possible to identify them with Machine Learning
      • The task is to find early indicators suggesting which scientific topics in today’s literature will likely become important in the future
  • 4. Context: How does science evolve?
      • The main element of science is an invention
      • Inventions always happen at the beginning of a scientific process
      • After an invention happens, there is a period of scientific exploration to prove that the invention is useful
      • Some inventions prove themselves, and some do not
      • If an invention proves itself, new products and new research build on its ideas
      • …less useful inventions usually get forgotten
  • 5. Context: How to detect scientific inventions and concepts?
      • Scientists are typically strict and consistent when naming things
      • In the same way, inventions and other scientific concepts get names which are then used in scientific articles
      • In this project I used the names from the titles of scientific articles to track how particular scientific topics evolve through time
      • We can spot when a scientific topic appears for the first time, count how frequently it appears, and spot when it stops being used
      • …this is my basis for predicting the “next big thing” in science
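The tracking idea on slide 5 (first appearance, frequency, last use) can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the `Article` struct, the `TopicTimeline` struct and the `trackTopic` function are hypothetical names, not taken from the author's code, and matching is done by plain case-sensitive substring search on titles that are assumed to be lower-cased already.

```cpp
#include <map>
#include <string>
#include <vector>

// One article record: publication year and title.
// Hypothetical struct, not the Microsoft Academic Graph schema.
struct Article {
    int year;
    std::string title;
};

// Per-year usage counts of one topic phrase.
struct TopicTimeline {
    std::map<int, int> countByYear;  // year -> number of titles containing the phrase

    bool empty() const { return countByYear.empty(); }
    // Call firstYear() and lastYear() only when !empty().
    int firstYear() const { return countByYear.begin()->first; }
    int lastYear() const { return countByYear.rbegin()->first; }
    int total() const {
        int sum = 0;
        for (const auto& kv : countByYear) sum += kv.second;
        return sum;
    }
};

// Count, year by year, how many titles contain the phrase.
// Assumption: simple substring matching, so "ion" would also match "collision".
TopicTimeline trackTopic(const std::vector<Article>& articles,
                         const std::string& phrase) {
    TopicTimeline timeline;
    for (const Article& article : articles) {
        if (article.title.find(phrase) != std::string::npos) {
            ++timeline.countByYear[article.year];
        }
    }
    return timeline;
}
```

Calling `trackTopic(articles, "higgs boson")` on a list of records yields the year of first mention, the year of last mention and the per-year counts. A real run over 125 million titles would more likely match on word boundaries and process all phrases in one pass.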
  • 6. What data do we have available?
      • There are many databases of scientific articles in the world, but only some are open and available for research
      • The biggest open database of scientific articles is “Microsoft Academic Graph”, which was released for research use in 2016
      • The database size is 130 Gigabytes
      • It includes references to 125 million scientific articles from the year 1800 to the present, from all areas of science
      • Each scientific article in the database is described by: (a) title, (b) authors and their (c) institutions, (d) journal/conference where it was published, and (e) the year of publication
      • Data available from: https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/
  • 7. The task to be solved
      • The core task in this project is to use the data from over 200 years of science to extract the early signs of a scientific topic becoming successful
      • With Machine Learning algorithms I trained a statistical model to distinguish scientific topics which became successful from those which didn’t
      • I apply the trained model to current data (after 2010) to predict which topics will be hot and relevant in the near future (in the early 2020s)
  • 8. Description of the experiment (1/2)
      • From 125 million article titles I extracted 2.5 million candidate topics
      • …each topic is described by a phrase of 1 to 5 words
      • …the phrase must appear at least 100 times in the database of article titles
      • Each topic is represented by a set of features (attributes) describing the first 10 years after its appearance
      • …features include the frequency and the trend (slope from linear regression) of the topic’s appearances within institutions, journals and conferences
      • …each topic is described by approx. 55,000 features, represented in a feature vector
  • 9. Description of the experiment (2/2)
      • Each topic is classified either as:
      • Positive, if it became popular in the past (its use increased by a factor of 2 after the first 10 years from the topic’s first appearance), or as
      • Negative, if the topic didn’t attract much attention
      • We split the topics into a training set (70%) and a test set (30%)
      • …where the training set is used to train the model and the test set is used to test it
      • For machine learning I used the Perceptron algorithm, which is relatively easy to implement (https://en.wikipedia.org/wiki/Perceptron)
      • …I used an improved version of the Perceptron (MaxMargin)
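The slides do not spell out the MaxMargin variant, so the following is only a stand-in: a standard perceptron with a margin, which updates the weights whenever an example is misclassified or scored within a fixed margin of the decision boundary. The class name `MarginPerceptron`, the sparse vector layout and the default margin and learning rate are all assumptions made for this illustration. A sparse representation is used because each topic has approx. 55,000 features, most of which are likely zero.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sparse feature vector: (feature index, value) pairs.
using SparseVec = std::vector<std::pair<std::size_t, double>>;

struct Example {
    SparseVec features;
    int label;  // +1 = topic became popular, -1 = it did not
};

// Perceptron with a margin (an assumed stand-in for the "MaxMargin"
// variant named on the slide, whose details are not given).
class MarginPerceptron {
public:
    explicit MarginPerceptron(std::size_t numFeatures, double margin = 1.0,
                              double learningRate = 1.0)
        : weights_(numFeatures, 0.0), margin_(margin), rate_(learningRate) {}

    // Linear score: bias + dot(weights, x).
    double score(const SparseVec& x) const {
        double s = bias_;
        for (const auto& f : x) s += weights_[f.first] * f.second;
        return s;
    }

    int predict(const SparseVec& x) const { return score(x) >= 0.0 ? +1 : -1; }

    // Runs up to maxEpochs passes over the data. Returns the number of
    // updates made in the last pass (0 means every example cleared the margin).
    std::size_t train(const std::vector<Example>& data, std::size_t maxEpochs) {
        std::size_t updates = 0;
        for (std::size_t epoch = 0; epoch < maxEpochs; ++epoch) {
            updates = 0;
            for (const Example& ex : data) {
                if (ex.label * score(ex.features) <= margin_) {
                    for (const auto& f : ex.features) {
                        weights_[f.first] += rate_ * ex.label * f.second;
                    }
                    bias_ += rate_ * ex.label;
                    ++updates;
                }
            }
            if (updates == 0) break;
        }
        return updates;
    }

private:
    std::vector<double> weights_;
    double bias_ = 0.0;
    double margin_;
    double rate_;
};
```

In the setting of slide 9, the model would be trained on the 70% training split and `predict` would then be called on each topic in the 30% test split. After training, the largest positive weights point to the features most associated with successful topics.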
  • 10. Key statistical results
      • The statistical model, trained with the MaxMargin Perceptron algorithm, produced the following results on the test data:
      • Precision: 74%
      • Recall: 72%
      • F1 (the harmonic mean of the two): 73%
      • …this means that about 74% of the topics the model predicts as successful really became successful, and that the model finds about 72% of all successful topics
  • 11. Key descriptive results
      • Looking at the resulting statistical model we can see:
      • If a scientific topic is increasingly used by important research institutions (universities and research institutes)
      • …and is published in important journals and conferences
      • …within 10 years from the invention (when the initial mention is spotted)
      • …then we can expect increased use of the topic (by a factor of two or more) by science and industry in the next 5 years
  • 12. Examples of best topics and features
      • Example Best Topics (as predicted by the model):
      • Collisions, efficient, proton proton collisions, higgs boson, system, quark, particles, hadron, mobile augmented reality, variable quantum, advanced network, molecular dynamics simulations
      • Example Best Features (as identified by the Perceptron training):
      • CERN, Journal of Proteomics & Bioinformatics, Industrial Research Limited, Circulation-cardiovascular Imaging, Molecular BioSystems, Metamaterials, Atw-international Journal for Nuclear Power
  • 13. Summary
      • In this research project I analyzed 125 million articles from “Microsoft Academic Graph”, covering over 200 years of science
      • I wrote a program in C++ to process 130 Gigabytes of data and to build a machine learning model that predicts which scientific topics will become important in the future
      • The resulting model reaches an F1 score of 73% when predicting which scientific topics became important in the history of science
      • C++ code and detailed results are available from: https://goo.gl/8luSwz