Distributing Text Mining tasks
with librAIry
Carlos Badenes-Olmedo
Jose Luis Redondo-Garcia
Oscar Corcho
Ontology Engineering Group
Universidad Politécnica de Madrid
Spain
cbadenes@fi.upm.es
@cbolmedo DocEng’17
oeg-upm.net
Motivation
Master's / PhD
Thesis
Journals
book
collection
interesting
newspaper
article
related
documents
Alternatives
Given a text,
retrieve similar documents
among a huge amount of textual files
from multiple sources
Alternatives: digital publishers, reference managers, text mining tools
Each compared against: given a text, multiple sources, similar documents, huge amount
librAIry
User Purposes: exploration, software development
Given a text,
retrieve similar documents
among a huge amount of textual files
from multiple sources
librAIry: text mining framework oriented to
MANAGE
• textual data
• multiple sources: OAI-PMH, Elsevier, PDFs…
• large corpus: distributed, high-performance…
ANNOTATE
• NLP: entities, multi-words, PoS
• sections
• rhetorical parts
DISCOVER
• topics
• similarities
• trends
EXTEND
• plug&play
• standards based
• emergent flows
• multi-language
librAIry Model
Resources:
Actions / States:
domain document part
contains describes
annotation
Open Annotation Data Model (W3C)
Dublin Core Metadata Initiative (DCMI)
has has has
Linked Data Principles
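The resource model above can be sketched as a few plain data classes. This is only an illustration, not librAIry's actual schema: the class and field names are hypothetical, the reading of the diagram labels ("a domain contains documents", "a part describes a document", "every resource has annotations") is an assumption, and the `example.org` URIs are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    # Loosely follows the Open Annotation idea of a body attached to a target.
    target: str  # URI of the annotated resource
    kind: str    # hypothetical label, e.g. "topic" or "entity"
    body: str    # the annotation value


@dataclass
class Resource:
    uri: str    # Linked Data style: every resource is identified by a URI
    title: str  # Dublin Core style descriptive metadata (dc:title)
    annotations: List[Annotation] = field(default_factory=list)

    def annotate(self, kind: str, body: str) -> Annotation:
        """Attach a new annotation to this resource ("has")."""
        annotation = Annotation(target=self.uri, kind=kind, body=body)
        self.annotations.append(annotation)
        return annotation


@dataclass
class Document(Resource):
    pass


@dataclass
class Part(Resource):
    describes: str = ""  # URI of the document this part describes (assumed direction)


@dataclass
class Domain(Resource):
    contains: List[str] = field(default_factory=list)  # URIs of contained documents


# Usage with placeholder URIs
doc = Document(uri="http://example.org/documents/1", title="A PhD thesis")
abstract = Part(uri="http://example.org/parts/1", title="Abstract", describes=doc.uri)
domain = Domain(uri="http://example.org/domains/1", title="Theses", contains=[doc.uri])
doc.annotate("topic", "text mining")
```

Keeping only URIs in `contains` and `describes`, rather than nested objects, mirrors the Linked Data principle of referring to resources by identifier.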
librAIry Model
Events: resource.state
Example: document.created, document.updated
Routing-Key: resource.state (e.g. document.created; binding pattern document.#)
Body: URI + timestamp
Advanced Message Queuing Protocol (AMQP)
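The event routing above can be illustrated without a broker. The sketch below reimplements the AMQP topic-exchange matching rule locally (`*` matches exactly one word, `#` matches zero or more words) and builds an event whose routing key is `resource.state` and whose body carries the URI and a timestamp. The function names, the JSON body layout and the `example.org` URI are assumptions made for illustration, not librAIry's wire format.

```python
import json
import time


def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if routing_key matches an AMQP topic binding pattern."""
    p = pattern.split(".")
    k = routing_key.split(".")

    def match(i: int, j: int) -> bool:
        if i == len(p):
            # Pattern exhausted: match only if the key is exhausted too.
            return j == len(k)
        if p[i] == "#":
            # '#' may absorb zero or more words of the routing key.
            return any(match(i + 1, j2) for j2 in range(j, len(k) + 1))
        if j < len(k) and (p[i] == "*" or p[i] == k[j]):
            # '*' or a literal word consumes exactly one word.
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


def make_event(resource: str, state: str, uri: str):
    """Build (routing_key, body) for a resource state change."""
    routing_key = f"{resource}.{state}"
    # Hypothetical body layout: the slides only say "URI + timestamp".
    body = json.dumps({"uri": uri, "timestamp": int(time.time())})
    return routing_key, body


# Usage: a module bound with "document.#" receives every document event.
key, body = make_event("document", "created", "http://example.org/documents/1")
if topic_matches("document.#", key):
    print("deliver", key, body)
```

With this scheme a module declares interest through its binding pattern alone, so new modules can subscribe to existing events without changes to the publishers.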
librAIry Modules
Sources: Master's / PhD Thesis, Journals, book collection, PDFs
Harvesters: OAI-PMH Harvester, Elsevier Harvester, File Harvester
event-bus
Modules: Tokenizer Annotator, LDA Modeler, JSD Annotator
librAIry UDM
Rest API
http://librairy.github.io
newspaper article -> related documents
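The step from topic models to related documents can be sketched as follows, assuming that JSD stands for Jensen-Shannon divergence and that it is computed over per-document topic distributions produced by the LDA Modeler. The function names and the `1 - JSD` similarity score are illustrative choices, not necessarily what the JSD Annotator does internally.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two distributions, in [0, 1] with log base 2."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def similarity(p, q):
    """Assumed similarity score: 1 for identical distributions, 0 for disjoint ones."""
    return 1.0 - jensen_shannon_divergence(p, q)


# Usage with made-up topic distributions over three topics
article = [0.7, 0.2, 0.1]
thesis = [0.6, 0.3, 0.1]
cookbook = [0.05, 0.05, 0.9]
print(similarity(article, thesis) > similarity(article, cookbook))
```

Unlike plain KL divergence, this measure is symmetric and stays finite when one distribution assigns zero probability to a topic, which makes it convenient for comparing sparse topic vectors.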
librAIry
librAIry Usage
Real Scenarios
• Paper Repository (EU Project): http://drinventor.dia.fi.upm.es
• Support to Decision Makers for analyzing patents and public aid
• Book Recommender: http://librairy.linkeddata.es
next steps
1. Fine-grained resource definition
 - document -> MODELS -> domains
 - ‘annotations’ instead of ‘parts’ and ‘domains’
2. Resource URI in routing-keys
3. Passive-writes on Storage operations
http://librairy.github.io
Work supported by project Datos 4.0 (reference TIN2016-78011-C4-4-R),
funded by the Spanish Ministry MINECO and by FEDER
