Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

OSFair2017 training | From Open Access to Open Science: making sense of scientific content


Published on

Stelios Piperidis talks about new opportunities for improved analytics that arise from Open Science and the broad use of interoperable text and data mining resources, tools and services on homogeneously accessible research literature

Training title:TDM unlocking a goldmine of information

Training overview:
Text and Data Mining (TDM) is a natural ‘next step’ in open science. It can lead to new and unexpected discoveries and increase the impact of publications and repositories. This workshop showcases examples of successful TDM and infrastructural solutions for researchers. We will also discuss what is needed to make most of infrastructures and how publishers and repositories can open up their content.


Published in: Science
  • Be the first to comment

  • Be the first to like this

OSFair2017 training | From Open Access to Open Science: making sense of scientific content

  1. 1. #openminted_eu Stelios Piperidis Athena Research & Innovation Centre OPEN SCIENCE FAIR ATHENS, 6 SEPTEMBER 2017
  2. 2. The global research community generates ~2.5 million new scholarly articles per year (English only) … one paper published every 12seconds … 70,000 papers published on a single protein, the tumor suppressor p53
  3. 3. ● 1,8 billion websites & 3,46 billion internet users, on 25 September 2016. ● 24 million wireless sensors and actuators worldwide (553% up, between 2011 and 2016) ● 16 zettabytes of useful data (16 Trillion GB) by 2020 ● YouTube claims to upload 24 hours of video every minute, making the site a hugely significant data aggregator. ● Every second, on average, around 6,000 tweets are tweeted on Twitter, which corresponds to over 350,000 tweets sent per minute, >500 million tweets per day and around 200 billion tweets per year. ● 74,200,000 pages existed on Facebook, with 7 million apps and websites integrated with Facebook on 30/5/2016
  4. 4. process textual sources, organise and classify in various dimensions, extract main (indexical) information items, identify and extract entities and relations between entities, facilitate the transformation of unstructured textual sources into structured data enable the multidimensional analysis of structured data to extract meaningful insights and improve the ability to predict
  5. 5. Text Types Newswire Scientific Literature Tweets/blogs Patents Clinical/medical records Textbooks, monographs Online forums …. Languages English French German Spanish Portuguese Italian Polish …. Tasks Translation Information Extraction Semantic Search Question Answering Sentiment Analysis Summarization Knowledge Discovery …. Domains Finance/Business Health Biology Social Sciences Humanities ….
  6. 6. Establish an open and sustainable Text and Data Mining (TDM) platform and infrastructure where researchers can discover, collaboratively create, share and re-use knowledge from a wide range of text based scientific and scholarly related sources.
  7. 7. OpenMinted sets out to create an open, service-oriented e-Infrastructure for Text and Data Mining (TDM) of scientific and scholarly content. Content/Corpora Services/tools Annotated corpora and CORE
  8. 8. content web services "ancillary" resources
  9. 9. Scientific pubs = Research data ANNOTATED DATASET DERIVED KNOWLEDGE 1st layer 2nd layer at the level of licensing conditions SCIENTIFIC DATA PROCESSING TOOLS/SERVICES
  10. 10. production Level IaaS Cloud Open Source OpenStack compliant software stack Pithos+ Object storage Cyclades Virtual Machines Management Builds on GRNET DataCenters 3 locations (Athens, Crete, Epirus) 1000+ servers, 16PB Raw Storage, x10G InterconnectsMember of EGI Federated Cloud
  11. 11. • Model for describing content, language/knowledge resources, tools/services (aka components) • OMTD-SHARE schema • Allows search and browse of publications • Maps local schemata to OMTD-SHARE schema • Provides access to full text • In cooperation with respective projects (OpenAIRE, CORE)
  12. 12. • Provides access to external sources for metadata of TDM related resources: • Maven for tools (e.g. GATE, UIMA, uimaFIT, etc.) or machine learning models. • LR repositories (e.g. META-SHARE) for metadata of language resources. • Docker for dockerized tools and services.
  13. 13. • Provides content to OpenMinTeD • Search on publication sources (OpenAIRE, CORE) • Builds corpora of publications • Stores archives of content • Different storage backends • Pithos+ • Local filesystem
  14. 14. Slide by Petr Knoth, CORE/OU
  15. 15. • Create/modify workflows of TDM components • Execute workflows with user supplied content • Provide friendly UI • Used in biomedical research • Cooperation with 4 international projects • Ingest TDM tool descriptions from the registry • Start/stop/monitor workflows • Integrate with OMTD Store Service to supply content and store results
  16. 16. Rather novice users who want to find services (end to end) that fill their needs in an off-the-shelf type of situation. Understand basic usage of NLP and TDM services, but not the details. They know how to connect components, which content they must work on to get the required results. They need to develop end to end applications. agnostic to the internal specifics of TDM, but they need to integrate and operate TDM services into daily workflows.
  17. 17. Publishers and repository managers (research libraries). Expert language technology oriented people, who are using specific technologies and frameworks to develop and enhance their services. Non NLP expert developers, creating TDM modules based on off the shelf libraries and tools (e.g. Python, Jupyter). Not familiar with NLP frameworks and terminology but eager to publish their services.
  18. 18. Feature extraction Data citation Research analytics Curation of databases and lexica in Chembolomics & neuroinformatics Extracting information from tables for food safety alerts Data citation From the very beginning… Requirements, content, barriers, expected outcomes. … to the very end Create applications, validate and evaluate the results.
  19. 19. Unless possible conflict with NC
  20. 20.