
Predicting the “Next Big Thing” in Science - #scichallenge2017



We built a system to predict which scientific topics will become important in the future. We used Machine Learning algorithms to learn how science behaved in the past, then applied the resulting model to predict future trends in science.

Published in: Data & Analytics


  2. What is this research project about?
     - The aim is to make a C++ program for predicting which scientific topics will become important in the future
     - To predict the future of science, I used Machine Learning algorithms to learn how science behaved in the past, and then used the resulting model to predict future trends in science
     - To analyse how science evolved in the past, I used the data from the recently released “Microsoft Academic Graph”, which includes 125 million scientific articles from the year 1800 to the present
  3. Research Hypothesis
     - My research hypothesis is that the science topics which will become important in the future already exist in today’s scientific articles
     - …they are just not visible yet,
     - …but it is possible to identify them with Machine Learning
     - The task is to find early indicators suggesting which scientific topics in today’s literature will likely become important in the future
  4. Context: How does science evolve?
     - The main element of science is an invention
     - Inventions always happen at the beginning of a scientific process
     - After an invention happens, there is a period of scientific exploration to prove that the invention is useful
     - Some inventions prove themselves, and some do not
     - If an invention proves itself, new products and new research build on its ideas
     - …less useful inventions usually get forgotten
  5. Context: How to detect scientific inventions and concepts?
     - Scientists are typically strict and consistent when naming things
     - In the same way, inventions and other scientific concepts get names which are then used in scientific articles
     - In this project I used the names from the titles of scientific articles to track how particular scientific topics evolve through time
     - We can spot when a scientific topic appears for the first time, count how frequently it appears, and spot when it stops being used
     - …this is my base for predicting the “next big thing” in science
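The tracking idea on this slide can be sketched in a few lines of C++. This is only a minimal illustration, not the project's actual code: the `Article` struct, the function names, and the whitespace-only tokenizer are assumptions made for the example, and the default phrase length of 5 words follows the limit given in the description of the experiment.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified article record: only the fields this sketch needs.
struct Article {
    int year;
    std::string title;
};

// Occurrence counts of one phrase per year; std::map keeps the years sorted.
using Timeline = std::map<int, int>;

// Split a title into lowercase words on whitespace (punctuation is not handled).
std::vector<std::string> tokenize(const std::string& title) {
    std::vector<std::string> words;
    std::istringstream in(title);
    std::string w;
    while (in >> w) {
        for (char& c : w) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        words.push_back(w);
    }
    return words;
}

// Count every occurrence of every phrase of 1..maxWords consecutive words,
// grouped by the publication year of the article.
std::map<std::string, Timeline> buildTimelines(const std::vector<Article>& articles,
                                               std::size_t maxWords = 5) {
    std::map<std::string, Timeline> timelines;
    for (const Article& a : articles) {
        const std::vector<std::string> words = tokenize(a.title);
        for (std::size_t start = 0; start < words.size(); ++start) {
            std::string phrase;
            for (std::size_t len = 1;
                 len <= maxWords && start + len <= words.size(); ++len) {
                if (len > 1) phrase += ' ';
                phrase += words[start + len - 1];
                ++timelines[phrase][a.year];
            }
        }
    }
    return timelines;
}

// First and last year in which a phrase was seen.
int firstYear(const Timeline& t) { assert(!t.empty()); return t.begin()->first; }
int lastYear(const Timeline& t)  { assert(!t.empty()); return t.rbegin()->first; }
```

Calling `buildTimelines` on a list of articles and looking up a phrase such as "higgs boson" gives its yearly counts, from which `firstYear` and `lastYear` report when the topic appeared and when it was last used. A real run over 125 million titles would need a more memory-efficient structure than nested `std::map`s.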
  6. What data do we have available?
     - There are many databases of scientific articles in the world, but only some are open and available for research.
     - The biggest open database of scientific articles is “Microsoft Academic Graph”, which was released for research use in 2016
     - The database size is 130 Gigabytes
     - It includes references to 125 million scientific articles from the year 1800 to the present from all areas of science
     - Each scientific article in the database is described by: (a) title, (b) authors and their (c) institutions, (d) journal/conference where it was published, and (e) the year of publication
     - Data available from: academic-graph/
  7. The task to be solved
     - The core task in this project is to use the data from over 200 years of science to extract the early signs of a scientific topic becoming successful
     - With Machine Learning algorithms I trained a statistical model to distinguish scientific topics which became successful from those which didn’t
     - I apply the trained model to current data (after 2010) to predict which topics will be hot and relevant in the near future (in the early 2020s)
  8. Description of the experiment (1/2)
     - From 125 million article titles I extracted 2.5 million candidate topics
     - …each topic is described by a phrase of 1 to 5 words
     - …the phrase must appear at least 100 times in the database of article titles
     - Each topic is represented by a set of features (attributes) describing the first 10 years after its appearance
     - …features include the frequency and trend (slope from linear regression) of the topic’s appearances within institutions, journals and conferences
     - …each topic is described by approx. 55,000 features, represented in a feature vector
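The two kinds of features named above, frequency and trend, could be computed roughly as follows. This is a sketch under assumptions: it takes the yearly counts of one topic within one source (for example one journal) over the first 10 years as input, and the names `trendSlope`, `TopicFeatures` and `summarize` are hypothetical. The slope is the ordinary least-squares slope with the years numbered 0, 1, 2, and so on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Least-squares slope of y against x = 0, 1, ..., n-1.
// A positive slope means the counts grow over the years.
double trendSlope(const std::vector<double>& y) {
    const std::size_t n = y.size();
    assert(n >= 2);  // a slope needs at least two points
    const double meanX = (static_cast<double>(n) - 1.0) / 2.0;
    double meanY = 0.0;
    for (double v : y) meanY += v;
    meanY /= static_cast<double>(n);

    double num = 0.0;  // sum of (x - meanX) * (y - meanY)
    double den = 0.0;  // sum of (x - meanX)^2
    for (std::size_t i = 0; i < n; ++i) {
        const double dx = static_cast<double>(i) - meanX;
        num += dx * (y[i] - meanY);
        den += dx * dx;
    }
    return num / den;
}

// Hypothetical pair of features for one topic within one source.
struct TopicFeatures {
    double frequency;  // total number of appearances in the window
    double trend;      // least-squares slope of the yearly counts
};

// Summarize the yearly counts of a topic (one entry per year of the window).
TopicFeatures summarize(const std::vector<double>& yearlyCounts) {
    double total = 0.0;
    for (double v : yearlyCounts) total += v;
    return {total, trendSlope(yearlyCounts)};
}
```

Repeating `summarize` for every institution, journal and conference and concatenating the results would give one long feature vector per topic, which is one plausible way to arrive at a vector with tens of thousands of entries.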
  9. Description of the experiment (2/2)
     - Each topic is classified either as:
       - Positive, if it became popular in the past (its use increased by a factor of 2 after the first 10 years from the topic’s first appearance), or as
       - Negative, if the topic didn’t attract much attention
     - We split the topics into a training set (70%) and a test set (30%)
     - …where the training set is used to train the model and the test set is used to test it
     - For machine learning I used the Perceptron algorithm, which is relatively easy to implement
     - …I used an improved version of the Perceptron (MaxMargin)
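To give an idea of how a Perceptron with a margin works, here is a minimal sketch. It is an assumption, not the project's MaxMargin implementation: it uses one common variant in which the weights are updated whenever an example is misclassified or falls within a fixed margin of the decision boundary. The `Example` and `Perceptron` types, the dense feature vectors, and the parameters `margin`, `rate` and `maxEpochs` are all illustrative choices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One labelled topic: a feature vector and a label (+1 positive, -1 negative).
struct Example {
    std::vector<double> x;
    int y;
};

struct Perceptron {
    std::vector<double> w;  // one weight per feature
    double bias = 0.0;

    explicit Perceptron(std::size_t dim) : w(dim, 0.0) {}

    // Signed score: positive means "predicted to become successful".
    double score(const std::vector<double>& x) const {
        assert(x.size() == w.size());
        double s = bias;
        for (std::size_t i = 0; i < w.size(); ++i) s += w[i] * x[i];
        return s;
    }

    int predict(const std::vector<double>& x) const {
        return score(x) > 0.0 ? 1 : -1;
    }

    // Margin variant of the Perceptron rule: update on every example whose
    // signed score y * score(x) does not exceed `margin`.
    // Returns the number of updates made in the last epoch that was run
    // (0 means every training example satisfied the margin).
    std::size_t train(const std::vector<Example>& data, double margin,
                      double rate, int maxEpochs) {
        std::size_t updates = 0;
        for (int epoch = 0; epoch < maxEpochs; ++epoch) {
            updates = 0;
            for (const Example& e : data) {
                if (e.y * score(e.x) <= margin) {
                    for (std::size_t i = 0; i < w.size(); ++i) {
                        w[i] += rate * e.y * e.x[i];
                    }
                    bias += rate * e.y;
                    ++updates;
                }
            }
            if (updates == 0) break;  // all examples are beyond the margin
        }
        return updates;
    }
};
```

After training, the largest positive weights point to the features that most strongly suggest a successful topic. With about 55,000 mostly empty features per topic, a sparse representation would likely be preferable to the dense vectors used here.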
  10. Key statistical results
     - The statistical model, trained with the MaxMargin Perceptron algorithm, produced the following results on the testing data:
       - Precision: 74%
       - Recall: 72%
       - F1 (the harmonic mean of both): 73%
     - …this means that about 74% of the topics the model predicted as successful really became successful, and the model found about 72% of all topics that became successful
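For readers unfamiliar with these measures, the sketch below shows how precision, recall and F1 are computed from true and predicted labels. The `Metrics` struct and the `evaluate` function are hypothetical names, and labels are assumed to be +1 for positive and -1 for negative. As a check on the reported numbers, the harmonic mean of 74% and 72% is 2 · 0.74 · 0.72 / (0.74 + 0.72) ≈ 0.73.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Metrics {
    double precision;  // share of predicted positives that are truly positive
    double recall;     // share of true positives that were found
    double f1;         // harmonic mean of precision and recall
};

// Compare true and predicted labels (+1 = positive, anything else = negative).
Metrics evaluate(const std::vector<int>& truth, const std::vector<int>& predicted) {
    assert(truth.size() == predicted.size());
    std::size_t tp = 0, fp = 0, fn = 0;
    for (std::size_t i = 0; i < truth.size(); ++i) {
        const bool isPos = truth[i] == 1;
        const bool saidPos = predicted[i] == 1;
        if (saidPos && isPos) ++tp;        // true positive
        else if (saidPos && !isPos) ++fp;  // false positive
        else if (!saidPos && isPos) ++fn;  // false negative
    }
    Metrics m{0.0, 0.0, 0.0};
    if (tp + fp > 0) m.precision = static_cast<double>(tp) / static_cast<double>(tp + fp);
    if (tp + fn > 0) m.recall = static_cast<double>(tp) / static_cast<double>(tp + fn);
    if (m.precision + m.recall > 0.0) {
        m.f1 = 2.0 * m.precision * m.recall / (m.precision + m.recall);
    }
    return m;
}
```

Note that true negatives do not enter any of the three measures, so F1 is not the same as the share of all topics classified correctly.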
  11. Key descriptive results
     - Looking at the resulting statistical model we can see:
     - If a scientific topic gets increasingly used by important research institutions (universities and research institutes)
     - …and is getting published by important journals and conferences
     - …within 10 years from the invention (when the initial mention is spotted)
     - …then we can expect increased use of the topic (by a factor of two or more) by science and industry in the next 5 years
  12. Examples of best topics and features
     - Example Best Topics (as predicted by the model):
       - Collisions, efficient, proton proton collisions, higgs boson, system, quark, particles, hadron, mobile augmented reality, variable quantum, advanced network, molecular dynamics simulations
     - Example Best Features (as identified by the Perceptron training):
       - CERN, Journal of Proteomics & Bioinformatics, Industrial Research Limited, Circulation-cardiovascular Imaging, Molecular BioSystems, Metamaterials, Atw-international Journal for Nuclear Power
  13. Summary
     - In this research project I analysed 125 million articles from “Microsoft Academic Graph” from over 200 years of science
     - I made a program in C++ to process 130 Gigabytes of data and to build a machine learning model to predict which scientific topics will become important in the future
     - On the test data, the resulting model reaches an F1 score of 73% in predicting which scientific topics became important in the history of science
     - C++ code and detailed results are available from: