twitter.com/openminted_eu
beyond Open Access
MAKING SENSE OF LARGE VOLUMES OF SCIENTIFIC CONTENT
Stelios Piperidis
Athena Research & Innovation Centre
spip@ilsp.athena-innovation.gr
…about scientific literature?
The global research community generates over 1.5 million new
scholarly articles per annum.
The STM report (2009)
… some 90% of papers … are never cited.
… 50% of papers are never read by anyone other than their
authors, referees and journal editors
Lokman I. Meho, The rise and rise of citation analysis, 2007
… one paper published every 30 seconds
… 70,000 papers published on a single protein, the tumor
suppressor p53
Spangler et al., Automated Hypothesis Generation based on Mining
Scientific Literature, 2014
Emerging solution(s)
Machine reading
process textual sources, organise and classify in various
dimensions, extract main (indexical) information items,
… and understanding
identify and extract entities and relations between entities, facilitate
the transformation of unstructured textual sources into structured
data
… and predicting
enable the multidimensional analysis of structured data to extract
meaningful insights and improve the ability to predict
Structuring and mining
textual data
many examples from medical research
An example from social sciences:
study social confrontation in Greek society,
with a focus on the crisis years,
based on newspaper corpora
What claims did social agents (parties, unions,
professional associations, etc.) make, against which
government/state bodies, which instruments did they use,
and how were the claims reported in different newspapers?
Study social confrontation
example
Κατάληψη στα Υποθηκοφυλακεία Πειραιώς και
Σαλαμίνας αποφάσισε ο Δικηγορικός Σύλλογος
Πειραιώς (ΔΣΠ), στις 26 και 27 Απριλίου 2011,
διαμαρτυρόμενος για τα σοβαρότατα
προβλήματα λειτουργίας που παρουσιάζουν.
The Piraeus Bar Association (DSP) decided to
occupy the land registries of Piraeus and
Salamis on 26 and 27 April 2011,
in protest at the very serious operational
problems they exhibit.
Study social confrontation
example
[Figure: claims described by Form, Actor/Addressee, Issue, Time/Location]
[Figure: ILSP-NLP IE workflow: Named Entity Recognition, Chunking, Dependency Parsing, Co-reference Resolution, Aggregation]
[Figure: Analytics Stack: Summarize/Export, Visualise statistics]
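A minimal sketch of how one extracted claim could be stored and aggregated, assuming the claim fields named above (Form, Actor/Addressee, Issue, Time/Location). The `Claim` class and `count_by` helper are hypothetical names for illustration, not part of the ILSP-NLP workflow, and the values are filled in by hand from the example sentence rather than produced by an extractor.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Claim:
    """One protest claim (hypothetical record; field names follow the slide)."""
    form: str                 # how the claim was made, e.g. "occupation"
    actor: str                # who makes the claim
    addressee: Optional[str]  # who it is directed at; None if not stated
    issue: str                # what the claim is about
    start: date               # first day of the action
    end: date                 # last day of the action
    locations: Tuple[str, ...]


# Hand-filled from the example sentence; not the output of a real extractor.
example = Claim(
    form="occupation",
    actor="Piraeus Bar Association",
    addressee=None,  # the sentence does not name an addressee
    issue="operational problems of the land registries",
    start=date(2011, 4, 26),
    end=date(2011, 4, 27),
    locations=("Piraeus", "Salamis"),
)


def count_by(claims: Iterable[Claim], field: str) -> Counter:
    """Aggregate claims by one field, e.g. 'form' or 'actor'."""
    return Counter(getattr(claim, field) for claim in claims)


forms = count_by([example], "form")
```

With many such records, counts per form, actor or year are the kind of statistics an aggregation step could pass on to export and visualisation.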
Main objective
Establish an open and sustainable Text and
Data Mining (TDM) platform and infrastructure
where researchers can collaboratively create,
discover, share and re-use knowledge from a
wide range of text-based scientific and
scholarly-related sources.
Key aspects
infrastructure - focus on interoperability
build on existing TDM tools - no new algorithms
service oriented - discovery, re-use of content & tools
community driven - user centric requirements
open science - openness at all levels
The landscape
Text Mining Researchers
Content Providers
End Users
Computing Infrastructures
the project
• Started: June 2015
• Duration: 3 years
• Total budget: 6,068,074 Euros
16 Partners
• 6 mining research groups
• 3 content providers
• 1 data center
• 1 library association
• 2 legal experts
• 6 community related partners
• 2 SMEs
Partners
Athena RIC
Univ. of Manchester (NaCTeM)
Univ. of Darmstadt
INRA
EMBL-EBI
Agro-Know
LIBER
Univ. of Amsterdam
Open University UK
EPFL
CNIO
Univ. of Sheffield (GATE)
GESIS
GRNET
Frontiers
Univ. of Stirling
the challenges
Content
Barriers and obstacles due to non-availability, technical restrictions,
copyright law or licensing issues.
No uniform way to search for, retrieve and access content for TDM.
Services
How to identify the most fitting one? Do I have permission to use it?
How to combine with other services I have access to or I need? How
to use them on my content?
Processing
Where to deploy? Are my machines powerful enough? How can I
get access to powerful machines? Where to store intermediate and
final results? How to ensure persistence of storage?
Bring all stakeholders together!
Main routes
accessible content
Metadata and transfer protocols
• Document literature content, language resources, data categories taxonomies, provenance information
• Generic and domain-specific metadata descriptions
• Identify standards for metadata harvesting and federated search in distributed repositories
IPR and licensing
• Study IPR restrictions for reuse of sources
  • Exceptions?
  • What about non-commercial research?
• Translate the legal & policy aspects into authentication and authorization specifications (GEANT’s eduGAIN, …)
  • User-to-service and service-to-service interactions
Starting with repositories and OA
publishers
via OpenAIRE and CORE
In close collaboration with the
FUTURETDM project
http://project.futuretdm.eu/
Community driven
Scholarly communication, life sciences, agriculture, social sciences
From the very beginning…
Requirements, content, barriers, expected outcomes.
… to the very end
Create applications, validate and evaluate the results.
Use cases (1)
Scholarly communication analytics
OpenAIRE, CORE, Frontiers
• Semantic search and discovery of open scientific outcomes
• Map of academia – scholarly communication network
• Research monitoring and analytics
Life sciences
EBI, Human Brain Project
• Assisted curation of the EMBL-EBI chemical databases for metabolomics
• Curation of the neurosciences resources KnowledgeBase and Neurolex
Use cases (2)
Agriculture and biodiversity
INRA, Agro-Know, EFSA
• Enrich agricultural databases to assist food- and water-borne disease outbreak alerts and product recalls
• Image, figure and dataset discovery in the AGRIS FAO online service
Social sciences
GESIS
• Develop and evaluate methods for the automatic detection and linking of named entities, citation traces and intentions in social science publications
Expectations from today’s workshop
• Establish contact and dialogue with content providers, especially OA content providers
• Understand current practices, problems and limitations
• Look into the emerging requirements
• Explore the technical, legal, policy and organisational challenges content providers face in making their data open for text and data mining
• Develop a common vision and strategy
twitter.com/openminted_eu
facebook.com/openminted
bit.do/openmintedlinkedin
vimeo.com/openminted
bit.do/openmintedplus
THANK YOU!

OpenMinTeD: Making Sense of Large Volumes of Data
