Overview of Text and Data Mining Challenges and Solutions

Overview
LIBER Conference
5 July 2017, Patras, Greece
Natalia Manola
Athena Research and Innovation Centre

A few sobering facts on content production
LIBER conference - PATRAS, 5 July 2017
● 1,8 billion websites & 3,46 billion internet users, on 25 September 2016.
● 24 million wireless sensors and actuators worldwide (553% up, between 2011 and
2016)
● 16 zettabytes of useful data (16 Trillion GB) by 2020
● YouTube claims to upload 24 hours of video every minute, making the site a
hugely significant data aggregator.
● Every second, on average, around 6,000 tweets are tweeted on Twitter, which
corresponds to over 350,000 tweets sent per minute, >500 million tweets per day
and around 200 billion tweets per year.
● 74,200,000 pages existed on Facebook, with 7 million apps and websites
integrated with Facebook on 30/5/2016
3

… And some facts on scientific literature
The global research community generates ~2.5 million new scholarly
articles per year (English only)
The STM report (2015)
… some 90% of papers … are never cited (82% in the humanities)
… of those articles that are cited, only 20 percent have actually been read
… 50% of papers are never read by anyone other than their authors,
referees and journal editors
Lokman I. Meho, The rise and rise of citation analysis, 2007
… one paper published every 12seconds
… 70,000 papers published on a single protein, the tumor suppressor p53
Spangler et al, Automated Hypothesis Generation based on Mining Scientific
Literature, 2014
4

How can we make sense of this data?
5
PART II

TDM - AN Emerging solution
Machine reading
process textual sources, organise and classify in various dimensions, extract
main (indexical) information items,
… and “understanding”
identify and extract entities and relations between entities, facilitate the
transformation of unstructured textual sources into structured data
… and predicting
enable the multidimensional analysis of structured data to extract meaningful
insights and improve the ability to predict
6

However, …
Multitude of solutions catering for different
Text Types
Newswire
Scientific Literature
Tweets/blogs
Patents
Clinical/medical records
Textbooks, monographs
Online forums
….
Languages
English
French
German
Spanish
Portuguese
Italian
Polish
….
Tasks
Translation
Information Extraction
Semantic Search
Question Answering
Sentiment Analysis
Summarization
Knowledge Discovery
….
Domains
Finance/Business
Health
Biology
Social Sciences
Humanities
….
Creating a fragmented landscape
7

A complex and fragmented Landscape
Text Mining Researchers
Computing Infrastructures
Content Providers
End Users
8

1. Share content
• Document literature content
• Share in a meaningful way: what does Open Access really mean?
IPR and licensing
• Study IPR restrictions for reuse of sources as well as possible exceptions
• Promote clarity and standardisation of legal rights and obligations
Challenges
• Rights statement vs. Open licenses (for repositories)
• No access to full text. We live in a metadata world
• No standard protocols, formats and APIs for access and retrieval
• No capacity to handle extra traffic
10

Proposed solution : Make TDM enabled hubs
11
Literature
Repositories
OA Journals
Data
Repositories
Aggregators
Archives
Metadata
Full text
Data
OpenAIRE
CORE
PMC Europe
…
Guidelines APIs
TDM
Research
networks
WIkiPedia/
Media/Research
…
Open Data
Open Protocols
OpenAIRE +
OpenMinTeD

2. Share TDM Services
• Document language processing/text mining services and workflows in a
meaningful way for domain discipline researchers
• Document language/knowledge resources, data categories taxonomies,
provenance information
Interoperable services
• Common way of presenting annotated results
• Combine services into workflows
• Combine content and language resources with services and workflows
• Combine automatic and manual/crowdsourcing annotation services
IPR and licensing
• Translate the legal & policy aspects into specifications for lawful user-to-
service and service-to-service interactions
Challenges
• Bring text miners close to the researcher problems and needs
• Semantic interoperability (not just technical)
12

3. Use/Share computing resources
• Capacities and capabilities
Interoperable services at the lower level
• Common way of deploying operations/jobs
• Authentication and Authorisation services: Single Sign On (SSO)
• Accounting
Challenges
• Legal, organisational, …
13

The OpenMinted platform
14
PART III

OpenMinted framework & focus
15
OpenMinted sets out to create an open,
service-oriented e-Infrastructure for Text
and Data Mining (TDM) of scientific and
scholarly content.
…
Content/Corpora Services/tools Annotated corpora

Register and Discover TDM Services and tools
Link to Content hubs - Share corpora
Run a TDM job
Store, document, Publish and Share results (ANNOTATED CORPORA)
Our Services
16
Build your own service – Combine components into a
Workflow and SHARE

key goalsapart from interoperability
Recognise that the results
of TDM, i.e., annotations, are
valuable research data that
should be preserved, shared,
re-used.
Scientiﬁc publications are
data, and should abide to the
FAIR principles of data.

End users as consumers
Domain specific researchers & research communities
Rather novice users and who want to find services (end to end) that fill their
needs in an off the shelf type of situation. (>100.000)
Application developers / RI data scientists
Understand basic usage of NLP and TDM services, but not the details. They
know how to connect components, which content they must work on to get the
required results. They need to develop end to end applications. (>10.000)
Infrastructure operators
agnostic to the internal specifics of TDM, but they need to integrate and
operate TDM services into daily workflows. (<100)

content and services contributors
FOR Content
Publishers and repository managers (research libraries). (<1000)
For services
Expert language technology oriented people, who are using speciﬁc
technologies and frameworks to develop and enhance their services. (< 500)
Non NLP expert developers, creating TDM modules based on off the shelf
libraries and tools (e.g. Python, Jupyter). Not familiar with NLP frameworks
and terminology but are eager to publish their small services. (<5.000)

interoperability
At which level of
component
A technical issue that
requires consensus
building
Legal issues at all
levels (IPRs,
liabilities)
Go Open!
Policies & rules of
engagement for
content /service
providers and
consumers
EOSC compatible
priorities
policies Legal issues
2 31

Beta release in AUGUST 2017
REAL TIME Building
corpora:
OpenAIRE
CORE
Uploading OWN
corpora
Registering a
service
Running a service
Viewing annotations
Storing results in
zenodo

sustainability?
What is the role of the libraries
of (OA) publishers in TDM?
How does Open Access
translate to Open Science?
How can they help researchers
achieve the best in their
knowledge extraction
endeavour?
What is the role of e-
Infrastructures like OpenAIRE
and OpenMinTeD?

Join us in the Openscience fair
athens, sept 6-8, 2017
www.opensciencefair.eu

THANK YOU!
Questions?
natalia manola
natalia@di.uoa.gr

Overview of Text and Data Mining Challenges and Solutions

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Overview of Text and Data Mining Challenges and Solutions

Similar to Overview of Text and Data Mining Challenges and Solutions (20)

More from openminted_eu

More from openminted_eu (15)

Recently uploaded

Recently uploaded (20)

Overview of Text and Data Mining Challenges and Solutions