1
twitter.com/openminted_eu
Peter Mutschke
ITOC Workshop
Philadelphia – February 20, 2016
Open Mining Infrastructure for
Text & Data (OpenMinTeD)
2
Goal of Text Mining
This is where the footer goes
Implementation of transformational processes that uncover knowledge in unstructured text:
- salient content items
- hidden relationships between content items
... to assist researchers and scientific data curators in making sense of the textual data
3
The phases of text mining
[Figure: four-stage pipeline, STAGE 1 to STAGE 4. Labels in the figure: Information Retrieval, NLP Analysis, Entity Recognition, Information Extraction, Data Mining, Knowledge Discovery.]
taken from ICT2015 presentation (N. Manola)
OPENMINTED - The Open Mining Infrastructure for Text and Data
4
Challenges
Text Mining (TM)
- remains a fragmented set of tools
- requires particular technological and analytical skills as well as domain knowledge
- no shared knowledge of how to apply it
- lack of a central infrastructure (may rule out the use of TM for small research groups)
High entry costs: need to share infrastructure costs
5
Putting it all together
OpenMinTeD
Establish an open and sustainable Text and Data Mining (TDM) platform and infrastructure where researchers can collaboratively create, discover, share and re-use knowledge from a wide range of text-based scientific and scholarly sources
6
OpenMinTeD – working on
many fronts
ACCESSIBLE CONTENT: Via standardised programmatic interfaces and access rules
DISCOVERABLE SERVICES: Well-documented, easily discoverable text mining services and workflows which process, analyse and annotate text
EFFICIENT PROCESSING: Operate on public e-Infrastructures via standardised APIs
TDM COMMUNITIES: Different scientific communities have different challenges
VALUE ADDED APPS: Community-driven applications to illustrate the value of the infrastructure. Engage with industry.
taken from ICT2015 presentation (N. Manola)
7
Bridging the gap between different
communities
8
The project
Starts: June 2015
Duration: 3 years
16 Partners:
- 6 mining research groups
- 3 content providers
- 1 data center
- 1 library association
- 2 legal experts
- 6 community related partners
- 2 SMEs
Athena RIC
Univ. of Manchester (NaCTeM)
Univ. of Darmstadt
INRA
EMBL-EBI
Agro-Know
LIBER
Univ. of Amsterdam
Open University UK
EPFL
CNIO
Univ. of Sheffield (GATE)
GESIS
GRNET
Frontiers
Univ. of Stirling
PARTNERS
taken from ICT2015 presentation (N. Manola)
9
OpenMinTeD users
- TM consumers, to advance their science
- Service providers, to enhance their tools
- TM researchers, to share their algorithms
- Content providers, to enrich their content
10
Infrastructural approach
- OpenMinTeD does not build new services, but adopts and adapts existing services for new communities
- Focuses on interoperability across text mining services and content providers
- Creates an open and collaborative space for researchers to use the best-fitting text mining services available
11
The architecture
[Figure: layered architecture diagram with the following labels]
Users: researchers, curators, text-miners and new services developers
Platform services: Registry, Workflow Management, Auth2 & Policy management, Annotator, Accounting
Layer 1: Interoperability of text mining services (platforms or components)
Mining Platforms (several), Proprietary architectures
Layer 2: Interoperability of language resources & corpora
Language resources and corpora registry service
Language resources (several)
Publisher text corpus, OpenAIRE/CORE text corpus, PMC text corpus, other text corpora, other types of text corpora
Layer 3: Interoperability to shared storage and computing resources
Data centres (several), data centre in public cloud
taken from ICT2015 presentation (N. Manola)
12
Bottom-up approach
OpenMinTeD works with 4 use cases, which give their requirements and evaluate the results:
- RESEARCH ANALYTICS
- SOCIAL SCIENCES
- AGRICULTURE
- LIFE SCIENCES
taken from ICT2015 presentation (N. Manola)
13
Science driven approach
14
GESIS: Infrastructure for the
Social Sciences
15
GESIS Research Data Cycle
[Figure: cycle diagram with the labels Study planning, Data collection, Data analysis, Archiving and registering, Searching]
16
Difficulties in Information Seeking
17
Problems Processing Search Results
18
Usefulness of TM enhanced search services
19
Social Science Use case
Develop and evaluate methods for automatic detection and linking of named entities in Social Science publications in order to advance reliable and context-sensitive retrieval and linking of relevant entities
20
Enhancing Search in Text and Data
- classical named entity recognition and disambiguation of relevant entities (names, places, organizations, terms) to enhance automatic indexing
- recognition of vague variable mentions to enhance linking of data and publications
- enrichment of data with context information from text to enhance retrievability of data sets
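The first point, entity recognition for automatic indexing, can be illustrated with a minimal dictionary (gazetteer) lookup. This is only a sketch under stated assumptions and not the method used in OpenMinTeD: the gazetteer entries, the identifier scheme (`org:ISSP`, `place:CZ`, ...) and the function names are hypothetical. It shows how two surface forms of the same entity resolve to a single index term.

```python
import re
from typing import NamedTuple


class Mention(NamedTuple):
    start: int
    end: int
    text: str
    entity_id: str
    entity_type: str


# Hypothetical gazetteer: lower-cased surface form -> (entity id, entity type).
GAZETTEER = {
    "czech republic": ("place:CZ", "place"),
    "issp": ("org:ISSP", "organization"),
    "international social survey programme": ("org:ISSP", "organization"),
    "religiosity": ("term:religiosity", "term"),
}


def find_mentions(text, gazetteer=GAZETTEER):
    """Return all gazetteer matches in text, case-insensitively."""
    # Longest surface forms first, so a long name wins over a shorter
    # form that happens to start at the same position.
    forms = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(f) for f in forms) + r")\b",
        re.IGNORECASE,
    )
    mentions = []
    for m in pattern.finditer(text):
        entity_id, entity_type = gazetteer[m.group(1).lower()]
        mentions.append(
            Mention(m.start(), m.end(), m.group(1), entity_id, entity_type)
        )
    return mentions


def index_terms(text):
    """Collapse all mentions to a sorted list of unique entity ids."""
    return sorted({m.entity_id for m in find_mentions(text)})


print(index_terms("The ISSP surveyed religiosity in the Czech Republic."))
```

A real system would add statistical recognition and context-based disambiguation; the lookup above only normalises known surface forms.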
21
Identifying references to survey variables
Olga Nešporová, Zdeněk R. Nešpor (2009). “Religion: An Unsolved Problem for the Modern Czech Nation”
ISSP 2008
Link Database
v39: Believe in life after death
v40: Believe in Heaven
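A minimal sketch of how a sentence from a publication could be linked to survey variables such as the two above. The variable labels are taken from the slide; the matching rule (share of label words found in the sentence), the stopword list, the threshold and the example sentences are assumptions for illustration only, not the project's actual linking method.

```python
import re

# Variable labels as shown on the slide (ISSP 2008).
VARIABLES = {
    "v39": "Believe in life after death",
    "v40": "Believe in Heaven",
}

# Small hypothetical stopword list; function words carry no signal here.
STOPWORDS = {"in", "the", "a", "of"}


def tokens(text):
    """Lower-cased word set without stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def link_variables(sentence, variables=VARIABLES, threshold=0.6):
    """Return (variable id, coverage) pairs, best match first.

    Coverage is the share of a label's words that occur in the sentence.
    """
    sentence_tokens = tokens(sentence)
    links = []
    for var_id, label in variables.items():
        label_tokens = tokens(label)
        coverage = len(label_tokens & sentence_tokens) / len(label_tokens)
        if coverage >= threshold:
            links.append((var_id, round(coverage, 2)))
    return sorted(links, key=lambda pair: -pair[1])


print(link_variables("Few respondents believe in life after death."))
print(link_variables("Many still believe in heaven."))
```

Exact word overlap misses paraphrases such as "belief in an afterlife". Handling such vague mentions is the harder problem the use case addresses.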
22
Benefits from user perspective
- semantic search: understanding the contextual meaning of (search) terms
- fuzzy phrase search: search for attitudes and survey questions in texts (under vagueness)
- link retrieval: search and retrieval of links between text and data
- dataset retrieval: facilitating search for research data in data catalogues at the level of items and variables
23
Contact us
www.openminted.eu
peter.mutschke@gesis.org
twitter.com/openminted_eu
facebook.com/openminted
bit.do/openmintedlinkedin
vimeo.com/openminted
bit.do/openmintedplus

OpenMinTeD: Its Uses and Benefits for the Social Sciences
