
Presentation on Data4Impact methodology & results in the workshop on the use of big data technologies for advanced research assessment


The workshop on the use of big data technologies for advanced research assessment was part of a two-day event co-organised by OpenAIRE and Data4Impact, with the support of Science Europe. The event explored mechanisms for research policy monitoring and indicators, and how to link these to infrastructure and services. The first day focused on open science indicators as they emerge from national and EU initiatives, while the second day explored more advanced aspects of indicators for innovation and societal impact.

The presentation from the second workshop day introduces Data4Impact, presents our conceptual framework, and discusses the development of a series of indicators on the performance and societal impact of 40+ research programmes in the health domain.

Published in: Data & Analytics

  1. 1. Open Science and Big data in support of measuring R&I Indicators Ghent, 28 May 2019
  2. 2. Introduction to Data4Impact
  3. 3. Data4Impact: the basics
     • Call: CO-CREATION-08-2016-2017: Better integration of evidence on the impact of research and innovation in policy making
     • Expected impacts:
       - Improved monitoring of R&I activities: new indicators for assessing research and innovation performance, including the impact of research and innovation policies
       - Prove value to society: determining the societal impact of research and innovation funding in order to better justify research and innovation spending
     Data4Impact addresses key challenges and expected impacts of CO-CREATION-08-2016-2017 through a data-driven approach
  4. 4. What is big data?
     Definition of Big Data: "Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation."
     Key properties of Big Data:
     • Volume, i.e. no sampling is generally applied
     • Variety, i.e. structured and unstructured data from various sources, in different formats
     • Velocity, i.e. real-time/rapid data
     • Veracity, i.e. variations in data quality, cleaning, processing, etc.
     Non-intrusiveness -> Big Data is a byproduct of digital interaction and communication
     Key objective: make Big Data small!
  5. 5. Big data versus traditional methods: pros and cons
     Pros:
     • No sampling, bottom-up, scalable
     • Low administrative burden
     • Short/no data lags
     • New data and indicators
     Cons:
     • Risk of misidentification
     • Data veracity
     • Lack of persistent identifiers
     • Data mishandling, ethics
  6. 6. Where? Start with an individual
     • Individual level: Who participated in the programme? Who were members of the extended team?
     • Organisation/team level: Research teams in universities & research centres; small companies and large enterprises
     • Project/programme level: Data aggregated at project or programme level
     • Analytical dimensions: Within researchers themselves; between researchers; between researchers and organisations; between organisations; between projects; between programmes
     Key questions:
     - Whom exactly did the programme attract?
     - What happened during and after the projects?
     - What was the impact?
  7. 7. How? Build a Knowledge Graph, Integrate Data
  8. 8. Why/what? Answer questions that matter to funders without ever asking a beneficiary
     1. Outputs, products and interventions
        - Outputs, products and interventions
        - Collaborations
        - Scientific publications
        - Intellectual Property Rights
        - Scientific prizes
     2. Outcome-level indicators
        - Innovations
        - Dissemination activities
        - Further funding/investment
        - Next destinations
        - Effects on the company/private sector
        - New companies/organizations created
     3. Impact-level indicators
        - Impact on health and welfare/Health and environmental impacts
        - Impacts on creativity, culture & society/Social, economic, capability and cultural impact
        - Influence on policy making/political impact
  9. 9. Ask less, know more Evaluating Planning Storytelling
  10. 10. Tracking individual researchers Organisation news/public relations
  11. 11. Tracking organisations
  12. 12. Tracking organisations
  13. 13. Tracking projects
  14. 14. Key facts about Data4Impact (project dimension: coverage)
      • Levels of data collection: Organisation; Project (for EU FP programmes only); Programme
      • Programmes covered: Over 40 health funders in Europe + EU FPs
      • Data collection: Yes (strong effort)
      • Data integration: Yes (moderate effort)
      • Machine learning, NLP, entity recognition: Yes (strong effort)
      • Topic modelling: Yes (strong effort)
      • Project duration & budget: 2 years, EUR 1.5 million
  15. 15. Key facts about Data4Impact
      • Input data: EC monitoring data (Health & SC1 projects, health related); PubMed data
      • Data sources, output level indicators: EC monitoring data (Cordis); OpenAIRE; Europe PMC (incl. full text data); PATSTAT (incl. abstracts & full texts) data
      • Data sources, result level indicators: Company websites; Social media (Twitter); Clinical guidelines repositories
      • Data sources, impact level indicators: EC monitoring data; EMA data on human medicinal products & orphan medicines; DrugBank data; Company websites; Social media (Twitter); News/media sites
  16. 16. Key objectives and results Data4Impact
  17. 17. Overview of D4I and TRR projects
      • Input data: EC monitoring data (Health & SC1 projects, health related); PubMed data
      • Data sources, output level indicators: EC monitoring data (Cordis); OpenAIRE; Europe PMC (incl. full text data); PATSTAT (incl. abstracts & full texts) data
      • Data sources, result level indicators: Company websites; Social media (Twitter); Clinical guidelines repositories
      • Data sources, impact level indicators: EC monitoring data; EMA data on human medicinal products & orphan medicines; DrugBank data; Company websites; Social media (Twitter); News/media sites
  18. 18. Data4Impact: objectives Objective 1: define, develop, analyse new indicators for assessing the performance of EU and national R&I systems.
  19. 19. Data4Impact: objectives Objectives 2+3: gather data at input, throughput, output and impact levels, derive facts and understand impact on health-related challenges Objectives 4+5: perform community-driven validation and develop user- centered tools
  20. 20. Key achievements
      • Data4Impact offers unique coverage of data sources, with an aim to link them through specific entities
      • Data4Impact covers all key stages of the R&I lifecycle in the health domain, i.e. basic research -> translational & applied research -> innovation & uptake on the market -> clinical practice & public health
      • New indicators and line of thinking investigated on academic impact
        - The funder and society perspective: funding timely and relevant research? Do the ‘right thing’ by funding rare topics?
        - If a funder enters an area where few others invest, does this imply stronger impact?
        - How does this interact with the researcher/organization perspective?
      • Data4Impact was first/one of the first to track data to medium- and long-term economic and societal/health impacts, i.e. link previous project activities to events that happened recently
  21. 21. Data4Impact framework
  22. 22. Conceptual framework
  23. 23. What is societal impact? Societal Impact as a... "demonstrable contribution that excellent research makes to society and the economy. This can involve academic impact, economic and societal impact [...]" (Economic and Social Research Council, ESRC)
  24. 24. General requirements
      • Take the temporality, multicausality, and multifacetedness of societal impact into account
      • Consider the academic, economic, and societal impact dimensions and illustrate the respective impact generation paths
      • Take advantage of all the data/data sources covered by Data4Impact
      • For analytical purposes we distinguish between:
        1) Academic Impact (effects on academic system and scientific practice)
        2) Economic Impact (effects on the economy and value creation)
        3) Societal Impact (effects on policy, society, and culture as well as on individual behaviour, subjective wellbeing, and life satisfaction)
  25. 25. Initial position: Simplified linear logic (Input -> Throughput -> Output -> Impact)
      • Input: Processes before any R&I activity starts as well as the resources that are needed.
      • Throughput: Intermediate results of R&I activities, i.e. documented knowledge.
      • Output: Further processing of knowledge generated during R&I activities
      • Impact: Demonstrable contribution to academia, economy, and society
  26. 26. Simplified conceptual framework
  27. 27. Key indicators: An overview I (analytical stage: indicators)
      • Input indicators: Funding Volume
      • Throughput and output indicators: Publication; Patents; Innovation outputs produced by projects; New companies; Innovation outputs produced by companies; Innovation activities carried out by companies
  28. 28. Key indicators: An overview II (analytical stage: indicators)
      • Academic impact: Funding priorities; Timeliness of research; Funding exclusivity; Technological value/significance
      • Economic impact: Economic and innovation performance of companies; Continuity of innovation activities
      • Societal/health impact: Health (service) impact; Societal awareness/relevance of research; Congruence of research funding with societal priorities; Newly launched medicines and medicinal products
  29. 29. Tracking Innovation Activities
  30. 30. Input Throughput Output Impact Keeping Track of the Whole Process - automated - granular - scalable - applicable to other settings Tracking Innovation Activities
  31. 31. Input Throughput Output Impact Keeping Track of the Whole Process - automated - granular - scalable - applicable to other settings Tracking Innovation Activities Funding
  32. 32. Input - FP7/H2020 Projects
      Number of research projects of the EU Framework Programme:
      • FP7: 998 (Core Set); 8332 (Extended Set)
      • H2020: 669 (Core Set); 2253 (Extended Set)
      [eHealth] [SC 1] [> 20% PubMed]
  33. 33. Input - FP7/H2020 Projects – DATA CORDIS ● Call document ● Project description ● Final or periodic project reports (project summary) ● Scholarly publications deriving from the project ● Patents ● Results in Brief – Expected Impact automatic extraction of pertinent info from associated documents (AI) and metadata
  34. 34. Input - FP7/H2020 Projects EC Contribution for the FP7-Core and H2020-Core Projects
  35. 35. Topics in the Health Sector ICD - 10 Chapters • International statistical Classification of Diseases and related health problems • international standard for reporting diseases and health conditions • diagnostic standard for all clinical and research purposes • ICD classes associated with every project
  36. 36. Topics in the Health Sector ICD-10 Example: Neoplasms
      • C00-C75: Malignant neoplasms, stated or presumed to be primary, of specified sites, except of lymphoid, haematopoietic and related tissue
      • C76-C80: Malignant neoplasms of ill-defined, secondary and unspecified sites
      • C81-C96: Malignant neoplasms, stated or presumed to be primary, of lymphoid, haematopoietic and related tissue
      • C97-C97: Malignant neoplasms of independent (primary) multiple sites
      • D00-D09: In situ neoplasms
      • D10-D36: Benign neoplasms
      • D37-D48: Neoplasms of uncertain or unknown behaviour
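The block ranges listed on this slide can be turned into a small lookup. The following is a minimal sketch, not the project's actual tagging code: the function name `neoplasm_block` is hypothetical, codes are assumed to arrive as strings such as "C61" or "C61.9", and only the neoplasm blocks shown on the slide are covered.

```python
# Blocks of the ICD-10 neoplasms chapter, as listed on the slide.
NEOPLASM_BLOCKS = [
    ("C00", "C75", "Malignant neoplasms, stated or presumed to be primary, of specified sites"),
    ("C76", "C80", "Malignant neoplasms of ill-defined, secondary and unspecified sites"),
    ("C81", "C96", "Malignant neoplasms of lymphoid, haematopoietic and related tissue"),
    ("C97", "C97", "Malignant neoplasms of independent (primary) multiple sites"),
    ("D00", "D09", "In situ neoplasms"),
    ("D10", "D36", "Benign neoplasms"),
    ("D37", "D48", "Neoplasms of uncertain or unknown behaviour"),
]


def neoplasm_block(code):
    """Return the block title for an ICD-10 code, or None if it is outside the listed blocks."""
    category = code.strip().upper()[:3]  # "C61.9" -> "C61"
    if len(category) != 3 or not category[0].isalpha() or not category[1:].isdigit():
        return None
    for start, end, title in NEOPLASM_BLOCKS:
        # Letter-plus-two-digit categories compare correctly as plain strings.
        if start <= category <= end:
            return title
    return None
```

For example, `neoplasm_block("C61")` falls in the C00-C75 block, while a circulatory code such as "I21" returns `None` because it belongs to a different chapter.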
  37. 37. Input (Funding) & Topics H2020-CoreFP7-Core
  38. 38. Input Project-level data Funding allocation by - Organizations: type (private, public), geographic location - Funder - ICD chapters Comparisons over time.
  39. 39. Input Throughput Output Impact Keeping Track of the Whole Process - automated - granular - scalable - applicable to other settings Tracking Innovation Activities Publications Patents cited in PP Other Innovations
  40. 40. Input - FP7/H2020 Projects – DATA CORDIS ● Call document ● Project description ● Final or periodic project reports (project summary) ● Scholarly publications deriving from the project ● Patents ● Results in Brief – Expected Impact automatic extraction of pertinent info from associated documents (AI) and metadata
  41. 41. • The PROMARK project had the following KeyTerms: • Objective: “ is the most common in males in Europe, causing over 87,000 deaths in 2006. Early diagnosis and treatment are key factors in determining survival but based on the commonly-used have low specificity and result in excessive treatment of localized lesions that might never progress to symptomatic cancer. that help determine which of the early stage tumors will remain confined to the prostate and which will progress to an invasive, aggressive form of the disease are urgently needed. Using , we identified 4 distinct common that increase the risk of ...” Example: project PROMARK– id: 202059 Data Analysis – KeyTerms
  42. 42. • The PROMARK project focuses on the following Disease(s):  Prostate Cancer (ICD11: 2C82 - Malignant neoplasms of prostate) Example: project PROMARK– id: 202059 Data Analysis – Disease Tagging
  43. 43. The PROMARK project filed the following Patents: Patent applications have been filed for all cancer risk variants and PSA variants identified through PROMARK. The following patent applications have been filed. 1. PCT/IS2010/050002 and EP 10772098.9: Three variants conferring risk of prostate cancer 2. PCT/IS2008/000021 and EP 08854482.0: Prostate cancer risk locus on 11q13 3. PCT/IS2012/000006: New risk variant on 8q24 cancer 4. PCT/IS2011/050012 and EP 11821224.0: Three variants that associate with levels of PSA 5. PCT/IS2012/050013: Variant that associates with increased risk of prostate cancer, glioma and basal cell carcinoma. Example: project PROMARK– id: 202059 Data Analysis – Insight Extraction
  44. 44. • The PROMARK project developed the following Biorepository: 1. Establishment of comprehensive collections of biological specimens and clinical data for prostate cancer biomarker research in four European populations. Main results: Samples and clinical information from close to 5 500 prostate cancer cases and over 7 000 controls from different parts of Europe were collected. The first biospecimen repository for prostate cancer in Romania was established. Example: project PROMARK– id: 202059 Data Analysis – Insight Extraction
  45. 45. • The PROMARK project performed the following Study: To search for sequence variants that associate with PSA levels, we performed a genome-wide association study and follow-up analysis using PSA information from 15 757 Icelandic and 454 British men not diagnosed with prostate cancer. Example: project PROMARK– id: 202059 Data Analysis – Insight Extraction
  46. 46. Throughput & Output Innovation “Insights” from Project Portfolios • Diagnostic Tools • Treatment • Drug • Protocol • Biomarker • Biorepository • Gene • Metabolite • Clinical Trial • Method • Patent • Device • Material • Infrastructure • Software • System • Prototype • Study • Publication • Company • Education • Employment • Dissemination • *Impact • *Outcome
  47. 47. Throughput - Publications
      Documents (FP7 Core; FP7 Extended; H2020 Core; H2020 Extended):
      • Rest/Other Pubs: 4205; 42916; 500; 5657
      • Pubs in PubMed: 25980; 68521; 1324; 8590
  48. 48. Throughput – # Patents by ICD Class, FP7 Extended
      • Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified: 6
      • Certain infectious and parasitic diseases: 37
      • Congenital malformations, deformations and chromosomal abnormalities: 11
      • Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism: 18
      • Mental and behavioural disorders: 9
      • Diseases of the digestive system: 22
      • Injury, poisoning and certain other consequences of external causes: 21
      • Diseases of the eye and adnexa: 41
      • Endocrine, nutritional and metabolic diseases: 46
      • Diseases of the skin and subcutaneous tissue: 30
      • Diseases of the genitourinary system: 74
      • Diseases of the respiratory system: 64
      • Diseases of the musculoskeletal system and connective tissue: 25
      • Diseases of the nervous system: 68
      • Diseases of the circulatory system: 63
      • Neoplasms: 132
  49. 49. Output – Innovations – FP7-Extended (bar chart, axis range 0 to 40000). Categories: Treatment, Standard, Publication, Prototype, Protocol, Protein, Metabolite, Material, Gene, Employment, Education, Drug, Dissemination, Diagnostic Tool, Clinical Trial, Biorepository, Biomarker, Device, Infrastructure, Method, Software, System, Study
  50. 50. Output – Pubs in Patents
      Funder: number of publications analysed; share of publications cited in patents at least once
      • National Institutes of Health (US): 397886; 4,4%
      • Wellcome Trust (UK): 97434; 6,8%
      • European Commission: 84038; 5,5%
      • National Science Foundation (US): 52366; 4,5%
      • Medical Research Council (UK)*: 45246; 10,0%
      • Research Councils UK*: 39214; 2,9%
      • Biotechnology and Biological Sciences Research Council (UK)*: 22260; 9,8%
      • National Health and Medical Research Council (Australia): 21181; 2,3%
      • Swiss National Science Foundation (Switzerland): 15961; 5,3%
      • Austrian Science Fund (Austria): 13816; 5,6%
  51. 51. Output – Creation of New Companies
      ● 430 newly created companies in FP7
      ● 51 of which in FP7-Core
      ● Sample of FP7-Core projects with 2 or more new companies formed (project number, project acronym: number of spin-offs):
        - 201924, EDICT: 3
        - 223744, DOPAMINET: 2
        - 201418, READNA: 2
        - 278832, hiPAD: 2
        - 279039, ComplexINC: 2
  52. 52. Collaboration Networks ICD Ch9 Diseases of the Circulatory System Technological Diffusion - Organization networks (public vs private, geographic location, etc): size, density, key bridge organizations, across fields, fine detail within a subfield
  53. 53. Input Throughput Output Impact Keeping Track of the Whole Process - automated - granular - scalable - applicable to other settings Tracking Innovation Activities Project portfolios, PubMed, etc. Insight extractors & other NLP algorithms
  54. 54. Input Throughput Output Impact Linking across - funders/programs - organization (type, location, etc) - ICD class - Time -> IMPACT Tracking Innovation Activities Project portfolios, PubMed, etc. Insight extractors & other NLP algorithms
  55. 55. Impact
  56. 56. Input Throughput Output Impact Keeping Track of the Whole Process - automated - granular - scalable - applicable to other settings Academic, Economic, Societal Impact
  57. 57. Academic Impact Topic Modelling Preliminary Results
  58. 58. Topic Modelling Publications • > 5 million • H2020, FP7 • 20% of sample from 40+ funders of D4I Deep Learning NLP Expert 442 Topics 9 major categories Linked to funders, organizations, authors countries, etc.
  59. 59. Topic Modelling: Identifying Topics
      • Citations: Clinicopathologic and 11C-Pittsburgh compound B implications of Thal amyloid phase across the Alzheimer’s disease spectrum; An autoradiographic evaluation of AV-1451 Tau PET in dementia; Deciphering Interactions of Acquired Risk Factors and ApoE-mediated Pathways in Alzheimer's Disease; What is normal in normal aging? Effects of aging, amyloid and Alzheimer's disease on the cerebral cortex and the hippocampus; Soluble apoE complex: mechanism and therapeutic target for APOE4-induced AD risk; Role of genes linked to sporadic Alzheimer's disease risk in the production of β-amyloid peptides; Proteolytic Cleavage of Apolipoprotein E4 as the Keystone for the Heightened Risk Associated with Alzheimer’s Disease
      • MeSH: alzheimer disease; amyloid beta peptides; amyloid; neurodegenerative diseases; Brain; apolipoprotein e4; amyloidosis
      • Text: Amyloid Alzheimer Apoe Neurodegeneration Neurodegenerative Abeta Brain Dementia Aggregation Fibrils Tau Cognitive Pathology Plaques Deposition impairment aging
      • Phrases: alzheimer disease; neurodegenerative diseases; amyloid fibrils; amyloid deposition
      • Keywords: alzheimer disease; neurodegeneration; amyloid; dementia; geriatrics
      • Wikipedia terms: Alzheimer's_disease; Neurodegeneration; Apolipoprotein_E; Amyloid; Neuropathology
      What is this Topic about? Alzheimer’s disease
  60. 60. Topic Modelling – What for? (1/2) • identify active areas of research: discover hidden themes (topics) • understand what is actually produced: calc topic distributions per document / project(grant) / funder • analyze active research areas on several dimensions (e.g., geographic regions, funders, etc.)
  61. 61. • discover clusters and communities, assess research collaboration: topic based similarity analysis • identify emerging research areas: topic based trend analysis • assess coverage, identify gaps or new challenges: compare funded research • assess the relevance and impact of research in the society using new indicators Topic Modelling – What for? (2/2)
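The step "calc topic distributions per document / project (grant) / funder" can be illustrated with a short aggregation. This is a minimal sketch under stated assumptions: the per-document topic distributions are taken as given (the slides do not describe the model's output format), each document is attributed to one group, and the function name and toy data are invented for illustration.

```python
from collections import defaultdict


def topic_shares_by_group(doc_topics, doc_group):
    """Average per-document topic distributions within each group (e.g. a funder)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for doc_id, dist in doc_topics.items():
        group = doc_group.get(doc_id)
        if group is None:
            continue  # documents without a known group are skipped
        counts[group] += 1
        for topic, weight in dist.items():
            sums[group][topic] += weight
    return {
        group: {topic: total / counts[group] for topic, total in topics.items()}
        for group, topics in sums.items()
    }


# Toy data, invented for illustration only.
doc_topics = {
    "doc1": {"alzheimer disease": 0.75, "brain function": 0.25},
    "doc2": {"alzheimer disease": 0.25, "brain function": 0.75},
    "doc3": {"breast cancer": 1.0},
}
doc_group = {"doc1": "Funder A", "doc2": "Funder A", "doc3": "Funder B"}
shares = topic_shares_by_group(doc_topics, doc_group)
```

The same function works at project or country level by swapping the `doc_group` mapping, which is what makes comparisons across several dimensions cheap once the document-level distributions exist.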
  62. 62. Topic Modelling & ICD Chapters Topic Modelling • automation • granularity • bottom up • process is not field-related • changes in set of topics over time • ICD Chapters provide another piece of information
  63. 63. Topic Modelling – Major Categories
      Topic category: estimated share of research output in PubMed; number of research topics in the Data4Impact topic model
      • 1. Infectious Diseases: 7,2%; 34
      • 2. Non-Communicable Diseases: 18,6%; 86
      • 3. Health systems, public health & epidemiology: 14,5%; 63
      • 4. Diagnostics, treatment development, surgery: 6,4%; 26
      • 5. Molecular cell biology: 26,1%; 118
      • 6. Methods, models, technologies, databases: 11,5%; 46
      • 7. Physiology: 3,2%; 15
      • 8. Cognition and behaviour: 4,6%; 18
      • 9. Other: 7,9%; 36
      • Total: 100,0%; 442
  64. 64. Distributed (Big) Data analytics HCI design & user experience GPU Topic Modelling Identify topic trends
  65. 65. Distributed (Big) Data analytics HCI design & user experience GPU Topic Modelling Trendy Topics
  66. 66. Topic Modelling Old-fashioned Topics Relational DBs Programming
  67. 67. Topic Modelling Important but declining (?) Genetic algorithms P2P networks & content distribution
  68. 68. Academic Impact – Common Topics (topicid: title)
      318: protein interaction / binding; 365: molecular dynamics & protein structure; 275: gene expression analysis; 69: brain function; 111: snps & genetic association; 209: Diabetes; 315: depression & anxiety; 68: genome sequencing; 470: hiv epidemiology; 284: breast cancer; 319: cardiovascular disease (risk); 109: smoking and public health; 48: kidney disease; 403: genetics (mutation, disease); 351: escherichia coli infections; 226: graphene & nanotechnology; 121: obesity; 312: lung / pulmonary disease
  69. 69. Academic Impact – Rare Topics (topicid: title)
      123: eating disorders; 306: arsenic exposure & public health; 397: ovarian cancer; 164: gastric cancer; 465: glioblastoma; 269: genomics & exome sequencing; 248: psoriasis; 117: mosquitoes & public health; 462: hepatitis B infection (hbv); 47: lung cancer; 212: oral / dental health; 327: thyroid disease, hormone, cancer; 296: hodgkin lymphoma; 71: clinical biomarkers & diagnosis; 11: multiple sclerosis; 490: pet imaging; 489: pharmacokinetics; 101: epilepsy
  70. 70. Academic Impact – Timeliness of Research
      Funder: share of research output in top-10% fastest growing research topics
      • National Health and Medical Research Council (Australia): 24,7%
      • Research Councils UK*: 23,5%
      • European Commission: 19,5%
      • National Institutes of Health (US): 16,7%
      • Swiss National Science Foundation (Switzerland): 16,2%
      • Wellcome Trust (UK): 14,5%
      • Biotechnology and Biological Sciences Research Council (UK)*: 11,2%
      • Medical Research Council (UK)*: 11,1%
      • Total PubMed: 9,9%
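One way to compute a "share of output in the top-10% fastest growing topics" indicator is sketched below. The slides do not state how topic growth was measured, so the late/early publication ratio used here is an assumption, and the function names and toy counts are hypothetical.

```python
import math


def fastest_growing_topics(early_counts, late_counts, top_share=0.10):
    """Return the set of topics in the top `top_share` by late/early growth ratio."""
    growth = {
        topic: late_counts.get(topic, 0) / early
        for topic, early in early_counts.items()
        if early > 0  # topics with no early output are left out of this sketch
    }
    n_top = max(1, math.ceil(len(growth) * top_share))
    ranked = sorted(growth, key=growth.get, reverse=True)
    return set(ranked[:n_top])


def timeliness_share(funder_counts, fast_topics):
    """Share of a funder's publications that fall in the fast-growing topics."""
    total = sum(funder_counts.values())
    if total == 0:
        return 0.0
    in_fast = sum(n for topic, n in funder_counts.items() if topic in fast_topics)
    return in_fast / total


# Toy data, invented for illustration: ten topics, of which "t9" grows fastest.
early = {f"t{i}": 100 for i in range(10)}
late = {f"t{i}": 100 + 10 * i for i in range(10)}
fast = fastest_growing_topics(early, late)
share = timeliness_share({"t9": 20, "t1": 60, "t2": 20}, fast)
```

Applying `timeliness_share` to the whole corpus instead of one funder gives the baseline row, which is how a funder's value can be read against "Total PubMed".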
  71. 71. Academic Impact – Timeliness of Investment (fast-growing topics)
      Topic name: estimated share of research output in the EU Framework Programmes; estimated share of research output in PubMed
      • Copy number variations (genome): 0,5%; 0,2%
      • Graphene & nanotechnology: 1,3%; 0,4%
      • Complement activation: 0,9%; 0,2%
      • DNA sequence processing: 0,3%; 0,2%
      • Cleft palate: <0,1%; 0,3%
      • Gut microbiota: 0,4%; 0,2%
  72. 72. Academic Impact – EC Funded Topics (topicid: title)
      226: graphene & nanotechnology; 69: brain function; 111: snps & genetic association; 318: protein interaction / binding; 351: escherichia coli infections; 228: proteomics & mass spectrometry; 433: climate change; 68: genome sequencing; 365: molecular dynamics & protein structure; 400: influenza virus; 272: vaccination & immunization; 275: gene expression analysis; 266: hiv infection; 258: embryonic stem cells; 71: clinical biomarkers & diagnosis; 117: mosquitoes & public health; 254: alzheimer disease; 403: genetics (mutation, disease)
  73. 73. Academic Impact Topic View: Cardiovascular Diseases
      Funder rank:
      1. National Institutes of Health (US)
      2. Medical Research Council (UK)*
      3. European Commission
      4. Wellcome Trust (UK)
      5. British Heart Foundation (UK)
      6. National Health and Medical Research Council (Australia)
      7. Research Councils UK*
      8. Swedish Research Council (Sweden)
      9. Chief Scientist Office (UK)
      10. Cancer Research UK
      Topic Size: large - x2 of average topic in PubMed
      Topic Trend: growing - 1.25 times larger in 2012-18 than in 2005-11
      Topic Exclusivity: low (many funders investing in topic)
  74. 74. Academic Impact – Summary Topic modelling 1. Automated 2. Granular 3. Bottom up 4. Not field related Publication Links + Topics & Trends allow for comparisons across: - Funders - Projects - Authors/Organizations - Geographic locations - Over time
  75. 75. Social Media Impact topic models topic searches search results indicators News Blogs Fora Twitter
  76. 76. Most discussed topics (bar chart, axis range 0 to 450000). Indicator: rank topics by the number of mentions. We show the top-20 topics. Dates: 13 January – 07 February 2019
  77. 77. Topics’ Engagement (bar chart, axis range 0% to 35%). Indicator: engaging articles. Meaning: the % of articles of each topic that have at least one share on Facebook. Dates: 13 January – 07 February 2019
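The engagement indicator defined on this slide (the percentage of a topic's articles with at least one Facebook share) is simple to compute once share counts are available. A minimal sketch, assuming the input is a list of (topic, share count) pairs; the function name and toy data are hypothetical.

```python
from collections import defaultdict


def engagement_by_topic(articles):
    """Percentage of articles per topic that have at least one Facebook share.

    `articles` is an iterable of (topic, facebook_shares) pairs.
    """
    totals = defaultdict(int)
    engaged = defaultdict(int)
    for topic, shares in articles:
        totals[topic] += 1
        if shares >= 1:
            engaged[topic] += 1
    return {topic: 100.0 * engaged[topic] / totals[topic] for topic in totals}


# Toy data, invented for illustration only.
articles = [("flu", 0), ("flu", 3), ("flu", 1), ("flu", 0), ("diabetes", 0), ("diabetes", 0)]
engagement = engagement_by_topic(articles)
```

Because the indicator only asks whether an article was shared at all, a single viral article does not inflate a topic's score the way a raw share count would.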
  78. 78. Flu Indicator: Mentions of flu across the 138 topic models, in news, blogs, fora & twitter Dates: 13 January – 31 March 2019
  79. 79. Cardiovascular risk factors Indicator: risk factors of cardiovascular diseases We show the Share-of-Voice for each factor Dates: 13 January – 31 March 2019
  80. 80. Economic impact
  81. 81. Tracking of data from company websites. Why?
      • Current methodologies are affected by low and dropping response rates, relatively high running costs and substantial data lags
      • Big data offers data scalability, completeness and speed
      • Growing interest in big data, e.g. future editions of the European Innovation Scoreboard are to contain data derived from big data approaches
  82. 82. Process (how?) Input data from Cordis + Orbis -> Scraping/Crawling -> Language recognition/Translation -> Database with text data from company websites -> Randomly selecting and labelling a subset of data -> Model development -> List of innovation mentions by stage and type -> Aggregating to a number of unique innovations -> Database with company innovation counts -> Aggregation -> Visualisation
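The step from "list of innovation mentions" to "company innovation counts" can be pictured as a de-duplication followed by a count. The sketch below is only an illustration: the slides do not say how mentions were matched, so the normalisation rule, the record fields and the function names are all assumptions.

```python
from collections import Counter


def normalise(name):
    """Crude normalisation so that trivially different mentions collapse together."""
    return " ".join(name.lower().split())


def count_unique_innovations(mentions):
    """Count unique (innovation type, name) pairs per company.

    `mentions` is an iterable of dicts with keys "company", "type" and "name".
    """
    unique = {(m["company"], m["type"], normalise(m["name"])) for m in mentions}
    return Counter(company for company, _type, _name in unique)


# Toy mentions, invented for illustration only.
mentions = [
    {"company": "Company A", "type": "product", "name": "Rapid Assay Kit"},
    {"company": "Company A", "type": "product", "name": "rapid  assay kit"},
    {"company": "Company A", "type": "licensing", "name": "Patent licence agreement"},
    {"company": "Company B", "type": "service", "name": "Sequencing service"},
]
counts = count_unique_innovations(mentions)
```

Here the two spellings of the same product collapse into one innovation, so Company A ends up with two unique innovations rather than three mentions.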
  83. 83. Classification of innovations (what?) Innovations Innovation type Input data Company URL link Innovation output Product innovation Service, process, other innovation Innovation activity Licensing activities Private/public funding attracted Certification & standardisation M&A + Extraction of entities (product names, trademarks, copyright) associated with innovation outputs and activities
  84. 84. Key results: FP7-Core set. 2097 FP7 & H2020 companies analysed in total, over 1.5 million URL links harvested, over 15,000 innovation texts identified
  85. 85. Key results: FP7-Core Set
      Indicator: indicator value (FP7-Core projects)
      • Number of companies analysed in the FP7-Core set: 1395
      • Estimated share of enterprises with evidence of innovation activities: 46.0%
      • Average number of innovation outputs and activities identified per company: 16.1
      • Estimated share of highly innovative enterprises: 7.4%
      • Estimated share of enterprises with evidence of licensing activities (incl. patent/trademark license agreements): 9.3%
      • Estimated share of enterprises involved in activities related to acquisitions: 20.0%
      • Estimated share of enterprises with evidence of private investment/capital attracted: 8.0%
  86. 86. Examples of innovations identified
  87. 87. Examples of innovations identified
  88. 88. Uptake of R&I by companies Estimated uptake of innovation outputs and activities in FP7-Core projects, by ICD class
  89. 89. Uptake of R&I activities: targeted approach Aiming at a simple but powerful first-line screening tool for HRF mutations, we have developed and validated a reverse-hybridization assay (HRF StripAssay) for the rapid and simultaneous detection of 22 most common HRF mutations: H20N, H20P, I268T, V377I (HIDS); R260W, D303N, L305P, T348M, L353P, Y570C (CAPS); C30R, C33Y, D42Del, T50M, C70R, C73W, R92Q (TRAPS); M680I(G/A), M680I(G/C), M694I(G/A), V726A (FMF). Reliable genotyping of recombinant mutant clones and a selection of reference DNA samples was achieved by means of teststrips presenting parallel arrays of allele-specific oligonucleotides. We demonstrated that the prototype HRF StripAssay is capable of detecting all 22 mutations, as well as identifying homozygotes by the absence of the corresponding wild-type signal.
  90. 90. Summary
      • Company websites proved to be a rich source of data for innovation outputs and activities
      • State-of-the-art web scraper and NLP model developed; approach is scalable and can handle multiple languages
      • New data and indicators which can be reproduced in frequent batches
  91. 91. Summary
      Useful for:
      - Monitoring and ex-post evaluation: first use cases for the EIS built; possible to link company innovations to previous research activities
      - Storytelling: rich source of data for innovation success stories and case studies
      - Proposal evaluation: innovation track record, previous commercialisation activities, investment attracted, etc.
      Caveats, weaknesses and areas for further work:
      - Process and service innovations captured to a lesser degree
      - Eudamed (EU database for CE marked medical devices and technologies, opening in 2020) offers a rich source of data for further work
  92. 92. Societal and health impact
  93. 93. Linking medicines to R&I. Why?
      • No data currently tracked in a systematic way on the contributions of R&I to new products on the market
      • Large investments made in translational medicine and close-to-market research, but little known about the uptake
      • New products on the market are a proxy for economic impact, but also for health/societal impact, e.g. orphan medicines, new non-generic medicines, medicines treating highly resistant pathogens
  94. 94. Process (how?)
  95. 95. Key results: human medicinal products authorized by the EMA
  96. 96. Selected results: top-5 medicines with the strongest links to FP7
      Medicine name (active substance; marketing authorisation holder): total number of mentions of medicine name & active substance
      • Orfadin (Nitisinone; Swedish Orphan Biovitrum International AB): 4290
      • Alkindi (Hydrocortisone; Diurnal Europe B.V.): 3144
      • Ferriprox (Deferiprone; Apotex Europe BV): 2789
      • Herceptin (Trastuzumab; Roche Registration GmbH): 1210
      • Aplidin (Plitidepsin; Pharma Mar, S.A.): 650
  97. 97. Example: Orfadin
  98. 98. Example: Alkindi
  99. 99. Example: Alkindi
  100. 100. Key results: human medicinal products authorized by the EMA
  101. 101. Summary
       • To the best of our knowledge, Data4Impact is the first project to systematically link medicinal products & clinical trials to R&I activities
       • Data highly useful for storytelling and impact stories, as well as monitoring and ex-post evaluation
       • Once Eudamed data become available in mid-2020, big data will cover all key stages of the R&I lifecycle:
         - Basic research: throughput/output data + measures of academic impact
         - Translational research: clinical trials
         - Applied research, close-to-market research: EMA data on medicines, Eudamed data on medical devices & technologies
  102. 102. Clinical guidelines: Overview
       • Clinical guidelines, systematic reviews and treatment recommendation documents provide traces of clinical and professional practice
       • Proprietary data from Minso Solutions AB, which maintains a database, Clinical Impact (CI:TM) (except WHO, Cochrane and NICE documents, which are available in PubMed)
       • The coverage is nearly complete at the government level for Sweden, Denmark, Norway, Germany (at the S3 level), and the UK (NICE and SIGN guidelines), with good coverage of WHO guideline documents and Cochrane Systematic Reviews
       • In total, 855 clinical guidelines had 3,684 (2,073 fractional) references that were matched to 1,781 publications found in the D4I database
  103. 103. Indicators
      • Traditional bibliometric indicators based on clinical guideline citations
        1. E.g. fractionalization at funder level; normalization of publication and citation counts to comparable research; if needed, usage of time averages
      • Combined citation and text-based metrics
        2. Subject classification of clinical guideline documents
           - Vector space embedding of references (based on reference/text combination)*
           - Conceptual embeddings of references (based on MeSH terms of references)**
        3. Reference weight in the text of clinical guidelines (CGs): identification and categorization of named entities within the clinical guidelines***
      Together with citation metrics, these modes of analysis aim to identify significant relationships between cited references, named entities, topics, and reference functions.
      * Eklund, J. (2018). The importance of scientific references in their contexts. Poster presented at the 23rd Nordic Workshop on Bibliometrics and Research Policy 2018, Borås, 7-9 November.
      ** Eklund, J., Gunnarsson Lorentzen, D. & Nelhans, G. (2019). MESH classification of clinical guidelines using conceptual embeddings of references. Manuscript accepted to ISSI, 17th International Society of Scientometrics and Informetrics Conference, Rome, 2-5 September.
      *** Manuscript in preparation.
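      As a rough illustration of fractionalization at funder level, the sketch below assumes the common scheme in which a reference credited to n funders counts 1 for each funder under full counting and 1/n under fractional counting. The slides do not spell out the exact scheme used in Data4Impact, and the funder names and data are hypothetical.

```python
from collections import defaultdict


def count_by_funder(references):
    """Return (full, fractional) reference counts per funder.

    `references` maps a reference id to the list of funders credited for it
    (a hypothetical input format, not the D4I database schema).
    """
    full = defaultdict(int)
    fract = defaultdict(float)
    for funders in references.values():
        unique = set(funders)          # ignore repeated mentions of a funder
        if not unique:
            continue                   # reference without funding information
        share = 1.0 / len(unique)      # assumed scheme: equal split over funders
        for funder in unique:
            full[funder] += 1
            fract[funder] += share
    return dict(full), dict(fract)


# Hypothetical toy data: three guideline references and their funders.
refs = {
    "ref1": ["FunderA"],
    "ref2": ["FunderA", "FunderB"],
    "ref3": ["FunderA", "FunderB", "FunderC", "FunderC"],
}
full, fract = count_by_funder(refs)
for funder in sorted(full):
    print(funder, full[funder], round(fract[funder], 2))
```

      Under this scheme the fractional counts add up to the number of references that carry funding information, which is why a fractional total is smaller than the corresponding full total.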
  104. 104. Funder (EC breakdown)
      Funder_type | Number (full) | Number (fract.)
      EC_funder (FP7/H2020) | 115 | 78.2
      European nat’l funders | 1,859 | 1,317.9
      International funders | 1,710 | 676.9
      Total sum | 3,684 | 2,073.0

      Funder | Number (full) | Number (fract.)
      EC_FP7-CORE | 74 | 49.9
      EC_FP7-EXTENDED | 28 | 18.2
      EC_H2020-EXTENDED | 1 | 0.1
      EC_other | 12 | 10.0
      Total sum | 115 | 78.2
  105. 105. Funders (top 20)
      Funder_full | Funder_country | Number (full) | Number (fract.)
      National Institutes of Health | US | 1,645 | 624.6
      Medical Research Council | UK | 585 | 452.4
      Wellcome Trust | UK | 555 | 416.9
      NHMRC - National Health and Medical Research | Australia | 156 | 85.5
      Cancer Research UK | UK | 122 | 85.6
      RCUK - Research Councils UK | UK | 85 | 37.7
      Chief Scientist Office | UK | 82 | 66.5
      EC_FP7-CORE | EU | 74 | 49.9
      British Heart Foundation | UK | 69 | 34.6
      Swiss National Science Foundation | Switzerland | 64 | 41.9
      Arthritis Research UK | UK | 29 | 27.5
      World Health Organization | International | 29 | 27.3
      EC_FP7-EXTENDED | EU | 28 | 18.2
      AKA - Academy of Finland | FIN | 27 | 9.8
      Biotechnology and Biological Sciences Research Council | UK | 15 | 9.6
      EC_other | EU | 12 | 10.0
      NWO - Netherlands Organisation for Scientific Research | Netherlands | 12 | 6.6
      Austrian Science Fund FWF | Austria | 11 | 9.1
      ARC - Australian Research Council | Australia | 10 | 5.1
      Other (N=26 funders) | - | 74 | 54
      Sum | - | 3,684 | 2,073
  106. 106. Guideline providers [Bar chart: EC projects matched with guideline citations (n=115), by guideline provider; axis from 0 to 30]
  107. 107. Topical analysis of reference contexts
      • In the generated topic model, each word is associated with a probability distribution over topics
      • For each reference, a symmetric context window of size k is used as a pseudo-document, and the most probable topic is calculated for that context window
      Illustration of a context window around reference ref264: congue risus feugiat ref264 tincidunt lorem nullam
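      The context-window idea can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the per-word topic probabilities are made-up toy values, and averaging the word distributions is a simplification that stands in for proper inference with a trained topic model.

```python
def context_window(tokens, index, k):
    """Return up to k tokens on each side of tokens[index], without the marker itself."""
    left = tokens[max(0, index - k):index]
    right = tokens[index + 1:index + 1 + k]
    return left + right


def most_probable_topic(window, word_topics):
    """Average the topic distributions of the known words in the window.

    `word_topics` maps word -> {topic id: probability}. Returns a
    (topic id, averaged probability) pair, or None if no word is known.
    """
    totals = {}
    known = 0
    for word in window:
        dist = word_topics.get(word)
        if dist is None:
            continue                     # word not covered by the toy model
        known += 1
        for topic, prob in dist.items():
            totals[topic] = totals.get(topic, 0.0) + prob
    if known == 0:
        return None
    best = max(totals, key=totals.get)
    return best, totals[best] / known


tokens = "congue risus feugiat ref264 tincidunt lorem nullam".split()
window = context_window(tokens, tokens.index("ref264"), k=2)

# Toy per-word topic probabilities (assumed values, for illustration only).
toy_model = {
    "risus": {0: 0.9, 1: 0.1},
    "feugiat": {0: 0.6, 1: 0.4},
    "tincidunt": {0: 0.2, 1: 0.8},
    "lorem": {0: 0.7, 1: 0.3},
}
print(window)
print(most_probable_topic(window, toy_model))
```

      Varying k changes which words enter the pseudo-document, so the same in-text citation can receive different topics at different window sizes.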
  108. 108. Original text:
      Asthma, a chronic respiratory condition affecting 300 million people globally ( aref15080825 ), causes inflammation of the lungs as well as structural and functional remodelling of the airways. It is characterised by recurrent attacks of breathlessness and wheezing with varying degrees of frequency and severity, which is caused by swelling of the bronchial tubes resulting in airflow limitation (WHO 2011). Although the causes of asthma are not completely understood, risk factors are known to include inhaling asthma triggers such as allergens, tobacco smoke and chemical irritants. Asthma is incurable and the prevalence is increasing, particularly in children and young adults ( aref22157151 ), however appropriate management can control the disorder and enable people to enjoy a high quality of life (WHO 2011).
      Context window (after preprocessing):
      asthma a chronic respiratory condition affecting million people globally aref causes inflammation of the lungs as well as structural and functional remodelling of the airways
      Topics assigned to the context window:
      Topic 346 (0.8149): asthma, copd, allergic, airway, disease, fev, ige, respiratory, lung, symptoms
      Topic 78 (0.0689): pressure, lung, pulmonary, respiratory, gas, lungs, ventilation, volume, breathing, alveolar
  109. 109. Topical coherence Using distance measures defined on spaces of probability distributions, such as the Bhattacharyya distance and the Hellinger distance, we measure the divergence between the topics assigned to the same reference in different contexts, as well as between the topics assigned to context windows of different sizes for a specific in-text citation.
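      For discrete topic distributions p and q, both measures follow from the Bhattacharyya coefficient BC = sum_i sqrt(p_i * q_i): the Hellinger distance is sqrt(1 - BC) and the Bhattacharyya distance is -ln(BC). The sketch below uses these standard definitions; the function names and the example distributions are illustrative and not taken from the Data4Impact code.

```python
import math


def bhattacharyya_coefficient(p, q):
    """Overlap between two discrete distributions over the same topics."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def hellinger_distance(p, q):
    """Hellinger distance: 0 for identical, 1 for non-overlapping distributions."""
    bc = bhattacharyya_coefficient(p, q)
    return math.sqrt(max(0.0, 1.0 - bc))   # clamp guards against rounding error


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance: 0 for identical, infinite for non-overlapping."""
    bc = bhattacharyya_coefficient(p, q)
    return math.inf if bc == 0 else -math.log(bc)


# Toy topic distributions for the same reference in two contexts (assumed values).
context_a = [0.80, 0.15, 0.05]
context_b = [0.10, 0.70, 0.20]
print(hellinger_distance(context_a, context_b))
print(bhattacharyya_distance(context_a, context_b))
```

      The Hellinger distance is bounded between 0 and 1, which makes it convenient for comparing many reference contexts on a common scale; the Bhattacharyya distance is unbounded.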
  110. 110. Clinical guideline impact
      • Professional impact: one step closer to the implementation of research within the clinic
      • Case: references in context
        - Generic method for academic citations
      In Data4Impact:
      1. Subject classification of the citing document based on the cited documents’ MeSH terms
      2. Distinguishing between kinds of references in guideline documents
      3. Establishing the "topicality" of each reference based on a model trained on Europe PMC articles
  111. 111. Top-20 Twitter topics (n: ~31M tweets) [Bar chart: number of tweets per topic]
      Topic | Topic name | Num tweets
      433 | climate change | 9,949,906
      272 | vaccination | 1,760,780
      175 | measles and newborn screening | 1,457,110
      245 | stress disorders | 898,758
      209 | diabetes mellitus | 858,118
      294 | adhd | 706,055
      315 | depression | 703,844
      348 | transplantation | 699,582
      121 | weight loss and obesity | 696,612
      319 | cardiovascular risk factors | 647,843
      254 | alzheimer disease | 637,668
      362 | cancer therapy | 570,636
      123 | eating disorders | 513,989
      240 | hypertension and blood pressure | 452,499
      302 | myocardium and heart failure | 445,434
      284 | breast cancer | 415,986
      366 | schizophrenia and bipolar disorder | 407,553
      344 | dendritic cells and immunity | 397,980
      169 | asthma | 383,321
      373 | env. exposure and air pollution | 381,212
  112. 112. Topic fluctuation Jan-Feb [Chart: activity over time for topics 3, 123, 175, 254, 272 and 362; vertical axis from 0 to 70,000] Topics: 3: anorexia, 123: bulimia, 175: measles, 254: Alzheimer, 272: vaccination, 362: cancer
  113. 113. Virality For ten topics that are prominent in terms of virality, the most retweeted tweet together with its URL.
      ID | Topic | Retweets | URL
      47 | lung cancer | 145,421 |
      3 | psychometrics | 35,353 |
      450 | iron deficiency and anemia | 4,401 |
      491 | acute lymphoblastic leukemia | 11,338 |
      324 | embryonic development | 3,534 |
      433 | climate change | 47,547 |
      175 | measles and newborn screening | 15,561 |
      272 | vaccination | 11,923 |
      348 | transplantation | 60,692 |
      362 | cancer therapy | 5,031 |
      [Figure labels: 47 lung cancer, 491 leukemia, 433 climate change, 272 vaccination, 348 transplantation]
  114. 114. Task 5.4.3 Twitter conversation analysis
      • Builds on other WP5.4 activities, but takes a somewhat different approach to collecting data:
        - Focuses on relationships between social media posts (retweets, @tweets, #tweets)
        - Possible to construct meaningful tests as "scripted dialogs"
        - Helps weed out spam
        - Amenable to content-based text analysis at the conversation level (e.g. sentiment analysis, topic modelling)
  115. 115. Referring to research in thread
      First collected tweet in thread:
      - [tweet id='13441' replyto='14018'] Independent research has shown that individuals who were vaccinated for the flu had 5.5 times more respiratory illness than those who were not vaccinated. [/tweet]
      - (A number of replies omitted; thread length: 313)
      - [tweet id='216387' replyto='216418'] In the light of new info, why not? It happens all the time. [/tweet]
      - (Replies omitted, showing those with reference)
      - [tweet id='216302' replyto='216387'] which is??? DOI:10.1371/journal.pntd.0005179 [/tweet]
      - [tweet id='216261' replyto='216387'] 'Analysis of year 3 results of phase III trials of Dengvaxia suggest high rates of protection of vaccinated partial dengue immunes but high rates of hospitalizations during breakthrough dengue infections of persons who were vaccinated when seronegative...' DOI:10.1371/journal.pntd.0005179 [/tweet]
      -- [tweet id='216241' replyto='216387'] Phase III Trials, among our 9-year olds! FACT. DOI:10.1371/journal.pntd.0005179 [/tweet]
      --- [tweet id='215757' replyto='216241'] Phase 2 was all that is required for release Phase 3 was 'extra' 'Extra' studies are always done throughout the commercial lifetimes of drugs & vaccines Consequences of phase 3 results are nowhere near what group wud have us believe DOI:10.1371/journal.pntd.0005179 [/tweet]
  116. 116. Vaccination on Twitter Topic bursts, user behaviour and referring to research in discussions
  117. 117. Topic burst
      • Identify a day when activity is more than 50% above the daily average
      • The burst extends up to the next day with activity below the average
      • This period is compared to previous and following periods of equal length
      • This example: 4-day-long burst in topic 272 (vaccination)
      [Chart: activity per day for topics 3, 123, 175, 254, 272 and 362, 14 Jan to 10 Feb; vertical axis from 0 to 70,000]
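      The burst rule above can be sketched as follows. Two points are assumptions rather than statements from the slides: the daily average is taken over the whole series, and the first day that falls below the average is treated as lying outside the burst. The function names and the counts are hypothetical.

```python
def find_bursts(daily_counts, factor=1.5):
    """Return (start, end) index pairs of bursts, with `end` exclusive.

    A burst starts on a day whose count exceeds `factor` times the daily
    average and runs until the next day whose count is below the average.
    """
    if not daily_counts:
        return []
    average = sum(daily_counts) / len(daily_counts)
    bursts = []
    day = 0
    while day < len(daily_counts):
        if daily_counts[day] > factor * average:
            end = day + 1
            while end < len(daily_counts) and daily_counts[end] >= average:
                end += 1
            bursts.append((day, end))
            day = end
        else:
            day += 1
    return bursts


def comparison_periods(start, end, n_days):
    """Return the previous and following periods of equal length, or None if out of range."""
    length = end - start
    previous = (start - length, start) if start - length >= 0 else None
    following = (end, end + length) if end + length <= n_days else None
    return previous, following


# Hypothetical daily tweet counts for one topic.
counts = [10, 10, 10, 40, 30, 25, 10, 10, 10, 10, 10, 10]
for start, end in find_bursts(counts):
    print((start, end), comparison_periods(start, end, len(counts)))
```

      Returning index ranges rather than tweet subsets keeps the sketch independent of how the tweets are stored; the same ranges can then be used to slice retweet networks or hashtag counts for the three periods.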
  118. 118. • RT networks: similar structures; the amount of RTs increases when activity is high (RT shares shown on the slide: 48%, 55%, 42.5%)
      • Word clouds based on hashtags: seemingly a topical shift during the burst
      User groups and their relative activity:
      User group | Previous (144,869 tweets) | Burst (194,712) | Next (115,557)
      Top 1% most active share (overall: 16%) | 12 | 12 | 19
      Next 9% share (overall: 17%) | 20 | 18 | 18
      90% least active share (overall: 67%) | 68 | 70 | 63
      The least active user group is more prominent when general activity is high, while the most active user group is more prominent when activity is low.
  119. 119. RT and coupled hashtag networks from burst period. Hashtag clusters:
      • "Deniers": measles, vaxxed, mmr, autism, study, flu, hpv, informedconsent, vaxwoke, cdc, vaccineinjury, learntherisk, maga, gardasil, vaccineskill
      • "Non-deniers 1": measles, vaccineswork, flu, hpv, antivax, vaxfactsfebruary, vaccinessavelives, immunization, antivaxxers, mumps, rotavirus, ethiopia, law, ebola
      • "Non-deniers 2": measles, vaccineswork, publichealth, science, humanitariancrisis, scientificreport, antivax, vaccinessavelives, venezuela, crisis, humanitarianaid, help, antivaxxers, vaccinesaresafe, misinformation, scicomm, itrustvaccines, mmr, factsmatter
  120. 120. 9,647 plain text biographies from Twitter profiles classified using a rule-based method; 30% matched as:
      [Pie chart: Academic 27%, Academically trained 11%, Other professional 23%, Media 38%, Policy/decision maker 1%]
      Class | Keyword example
      Science student | student, studying
      Graduated | MS, MA, graduate
      University faculty | lectur, prof., professor
      Other scientist | technician, lab manager, -ologist
      Education and outreach | curator, teacher, librarian
      Applied science organization | nonprofit, philantropy
      Other professional | recruiter, entrepreneur, manager
      Media professional | journalis, publisher
      Policy/decision maker | congressman, senator, parliament
      Ekström, B. (2019). Developing a rule-based method for identifying researchers on Twitter: The case of vaccine discussions. Poster accepted to ISSI, 17th International Society of Scientometrics and Informetrics Conference, Rome, 2-5 September.
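      A rule-based classifier of this kind could look roughly like the sketch below. The class names and keywords come from the slide; the regular expressions built from them, the first-match-wins order and the function name are assumptions made for illustration, not the method described in the cited poster.

```python
import re

I = re.IGNORECASE

# (class, patterns) pairs, checked in order; the first matching class wins (assumed).
RULES = [
    ("Science student", [re.compile(r"\b(student|studying)\b", I)]),
    # MS and MA are matched case-sensitively to avoid hits on ordinary words.
    ("Graduated", [re.compile(r"\b(MS|MA)\b"), re.compile(r"\bgraduate\b", I)]),
    ("University faculty", [re.compile(r"\blectur\w*|\bprof\.|\bprofessor\b", I)]),
    ("Other scientist", [re.compile(r"\btechnician\b|\blab manager\b|\w+ologist\b", I)]),
    ("Education and outreach", [re.compile(r"\b(curator|teacher|librarian)\b", I)]),
    ("Applied science organization", [re.compile(r"\bnonprofit\b|\bphilant\w*", I)]),
    ("Other professional", [re.compile(r"\b(recruiter|entrepreneur|manager)\b", I)]),
    ("Media professional", [re.compile(r"\bjournalis\w*|\bpublisher\b", I)]),
    ("Policy/decision maker", [re.compile(r"\b(congressman|senator|parliament)\b", I)]),
]


def classify_bio(bio):
    """Return the first class whose keywords occur in the biography, else None."""
    for label, patterns in RULES:
        if any(pattern.search(bio) for pattern in patterns):
            return label
    return None


# Hypothetical biographies.
for bio in [
    "PhD student in epidemiology",
    "Professor of immunology",
    "Freelance journalist covering health",
    "Dog lover, coffee addict",
]:
    print(bio, "->", classify_bio(bio))
```

      Biographies that match no rule return None, in line with the slide's observation that only part of the biographies could be matched.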
  121. 121. How can we use Twitter-bio personas? - Retweet data
  122. 122. How can we use Twitter-bio personas? - Conversation data?
  123. 123. Data4Impact has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 770531. Thank you for your attention! Data4Impact Consortium Visit our website: Follow us on Twitter and SlideShare: @Data4Impact