Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Upcoming SlideShare
What to Upload to SlideShare
Download to read offline and view in fullscreen.


OpenMinTeD, LIBER conference 2017

Download to read offline

This presentation was part of a workshop on text and data mining at the LIBER conference in Patras, July 2017. Presenting: Natalia Manola

Related Books

Free with a 30 day trial from Scribd

See all
  • Be the first to like this

OpenMinTeD, LIBER conference 2017

  1. 1. Overview LIBER Conference 5 July 2017, Patras, Greece Natalia Manola Athena Research and Innovation Centre
  2. 2. The problem PART I
  3. 3. A few sobering facts on content production LIBER conference - PATRAS, 5 July 2017 ● 1,8 billion websites & 3,46 billion internet users, on 25 September 2016. ● 24 million wireless sensors and actuators worldwide (553% up, between 2011 and 2016) ● 16 zettabytes of useful data (16 Trillion GB) by 2020 ● YouTube claims to upload 24 hours of video every minute, making the site a hugely significant data aggregator. ● Every second, on average, around 6,000 tweets are tweeted on Twitter, which corresponds to over 350,000 tweets sent per minute, >500 million tweets per day and around 200 billion tweets per year. ● 74,200,000 pages existed on Facebook, with 7 million apps and websites integrated with Facebook on 30/5/2016 3
  4. 4. … And some facts on scientific literature LIBER conference - PATRAS, 5 July 2017 The global research community generates ~2.5 million new scholarly articles per year (English only) The STM report (2015) … some 90% of papers … are never cited (82% in the humanities) … of those articles that are cited, only 20 percent have actually been read … 50% of papers are never read by anyone other than their authors, referees and journal editors Lokman I. Meho, The rise and rise of citation analysis, 2007 … one paper published every 12seconds … 70,000 papers published on a single protein, the tumor suppressor p53 Spangler et al, Automated Hypothesis Generation based on Mining Scientific Literature, 2014 4
  5. 5. How can we make sense of this data? 5 PART II
  6. 6. TDM - AN Emerging solution Machine reading process textual sources, organise and classify in various dimensions, extract main (indexical) information items, … and “understanding” identify and extract entities and relations between entities, facilitate the transformation of unstructured textual sources into structured data … and predicting enable the multidimensional analysis of structured data to extract meaningful insights and improve the ability to predict LIBER conference - PATRAS, 5 July 2017 6
  7. 7. However, … Multitude of solutions catering for different Text Types Newswire Scientific Literature Tweets/blogs Patents Clinical/medical records Textbooks, monographs Online forums …. Languages English French German Spanish Portuguese Italian Polish …. Tasks Translation Information Extraction Semantic Search Question Answering Sentiment Analysis Summarization Knowledge Discovery …. Domains Finance/Business Health Biology Social Sciences Humanities …. Creating a fragmented landscape LIBER conference - PATRAS, 5 July 2017 7
  8. 8. A complex and fragmented Landscape LIBER conference - PATRAS, 5 July 2017 Text Mining Researchers Computing Infrastructures Content Providers End Users 8
  9. 9. The components 9 PART III
  10. 10. 1. Share content • Document literature content • Share in a meaningful way: what does Open Access really mean? IPR and licensing • Study IPR restrictions for reuse of sources as well as possible exceptions • Promote clarity and standardisation of legal rights and obligations Challenges • Rights statement vs. Open licenses (for repositories) • No access to full text. We live in a metadata world • No standard protocols, formats and APIs for access and retrieval • No capacity to handle extra traffic LIBER conference - PATRAS, 5 July 2017 10
  11. 11. Proposed solution : Make TDM enabled hubs LIBER conference - PATRAS, 5 July 2017 11 Literature Repositories OA Journals Data Repositories Aggregators Archives Metadata Full text Data OpenAIRE CORE PMC Europe … Guidelines APIs TDM Research networks WIkiPedia/ Media/Research … Open Data Open Protocols OpenAIRE + OpenMinTeD
  12. 12. 2. Share TDM Services • Document language processing/text mining services and workflows in a meaningful way for domain discipline researchers • Document language/knowledge resources, data categories taxonomies, provenance information Interoperable services • Common way of presenting annotated results • Combine services into workflows • Combine content and language resources with services and workflows • Combine automatic and manual/crowdsourcing annotation services IPR and licensing • Translate the legal & policy aspects into specifications for lawful user-to- service and service-to-service interactions Challenges • Bring text miners close to the researcher problems and needs • Semantic interoperability (not just technical) LIBER conference - PATRAS, 5 July 2017 12
  13. 13. 3. Use/Share computing resources • Capacities and capabilities Interoperable services at the lower level • Common way of deploying operations/jobs • Authentication and Authorisation services: Single Sign On (SSO) • Accounting Challenges • Legal, organisational, … LIBER conference - PATRAS, 5 July 2017 13
  14. 14. The OpenMinted platform 14 PART III
  15. 15. OpenMinted framework & focus LIBER conference - PATRAS, 5 July 2017 15 OpenMinted sets out to create an open, service-oriented e-Infrastructure for Text and Data Mining (TDM) of scientific and scholarly content. … Content/Corpora Services/tools Annotated corpora
  16. 16. Register and Discover TDM Services and tools Link to Content hubs - Share corpora Run a TDM job Store, document, Publish and Share results (ANNOTATED CORPORA) Our Services 16 LIBER conference - PATRAS, 5 July 2017 Build your own service – Combine components into a Workflow and SHARE
  17. 17. key goalsapart from interoperability Recognise that the results of TDM, i.e., annotations, are valuable research data that should be preserved, shared, re-used. Scientific publications are data, and should abide to the FAIR principles of data. LIBER conference - PATRAS, 5 July 2017
  18. 18. who is openminted for PART IV
  19. 19. End users as consumers Domain specific researchers & research communities Rather novice users and who want to find services (end to end) that fill their needs in an off the shelf type of situation. (>100.000) Application developers / RI data scientists Understand basic usage of NLP and TDM services, but not the details. They know how to connect components, which content they must work on to get the required results. They need to develop end to end applications. (>10.000) Infrastructure operators agnostic to the internal specifics of TDM, but they need to integrate and operate TDM services into daily workflows. (<100) LIBER conference - PATRAS, 5 July 2017
  20. 20. content and services contributors FOR Content Publishers and repository managers (research libraries). (<1000) For services Expert language technology oriented people, who are using specific technologies and frameworks to develop and enhance their services. (< 500) Non NLP expert developers, creating TDM modules based on off the shelf libraries and tools (e.g. Python, Jupyter). Not familiar with NLP frameworks and terminology but are eager to publish their small services. (<5.000) LIBER conference - PATRAS, 5 July 2017
  21. 21. challenges PART v
  22. 22. LIBER conference - PATRAS, 5 July 2017 interoperability At which level of component A technical issue that requires consensus building Legal issues at all levels (IPRs, liabilities) Go Open! Policies & rules of engagement for content /service providers and consumers EOSC compatible priorities policies Legal issues 2 31
  23. 23. LIBER conference - PATRAS, 5 July 2017 Beta release in AUGUST 2017 REAL TIME Building corpora: OpenAIRE CORE Uploading OWN corpora Registering a service Running a service Viewing annotations Storing results in zenodo
  24. 24. sustainability? What is the role of the libraries of (OA) publishers in TDM? How does Open Access translate to Open Science? How can they help researchers achieve the best in their knowledge extraction endeavour? What is the role of e- Infrastructures like OpenAIRE and OpenMinTeD? LIBER conference - PATRAS, 5 July 2017
  25. 25. Join us in the Openscience fair athens, sept 6-8, 2017
  26. 26. THANK YOU! Questions? natalia manola

This presentation was part of a workshop on text and data mining at the LIBER conference in Patras, July 2017. Presenting: Natalia Manola


Total views


On Slideshare


From embeds


Number of embeds