oeg-upm.net
Cross-Lingual Search Engine
Carlos Badenes-Olmedo 1
Jose Luis Redondo García 2
Oscar Corcho 1
1 Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
2 Amazon Research
Cambridge, UK
Outline
We will learn how to:
• Analyze Document Collections (5 min)
- index documents into a document-oriented database.
- generate charts from corpus statistics.
• Create RESTful APIs from Topic Models (40 min)
- train topic models from documents in the database.
- publish models as Docker images on Docker Hub.
• Annotate Documents with Multi-lingual Topics (25 min)
- load existing topic models.
- make inferences from unseen texts.
- add labels to documents from topic distributions.
• Browse Multi-lingual Corpora (10 min)
- explore document collections by filters and semantic similarities.
Disclaimer
• This demo aims to highlight the difficulties in building a
multilingual search engine from topic-based annotations.
• The quality of the model depends on the number of documents
used to train it. In this demo we will work with only 200 documents
for efficiency reasons (the models used during the experiments were
trained with around 20,000 documents).
Let's go!
1. Clone our demo project, available on GitHub:
git clone https://github.com/librairy/demo.git
2. Move into the root folder:
cd demo/
3. Run it:
docker-compose up -d
librAIry
• http://librairy.linkeddata.es/
• Suite of services aimed at analyzing large-scale document collections.
• It combines NLP techniques, machine learning algorithms and semantic knowledge.
[Architecture diagram: EXPLORER, MODEL, API, NLP, REPOSITORY]
Carlos Badenes-Olmedo, Jose Luis Redondo-Garcia, and Oscar Corcho. 2017. Distributing Text Mining tasks with librAIry. In Proceedings of the 2017 ACM Symposium on Document Engineering (DocEng '17). ACM, New York, NY, USA, 63-66.
librAIry NLP
• http://librairy.linkeddata.es/nlp
• RESTful API built on top of open-source NLP tools that creates tokens, bags of words and annotations from unstructured texts by:
- part-of-speech tagging (and filtering),
- stemming (lemmas),
- entity recognition,
- DBpedia linking (Spotlight),
- WordNet synsets.
• English, Spanish, French and Portuguese available (Italian and German coming soon).
• HTTP (JSON-serialized), or TCP (Avro-serialized) for efficiency.
• e.g.:
“Natural language processing combines linguistics, computer science, information engineering, and artificial intelligence”
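To make the "bag-of-words" output concrete, here is a minimal sketch in plain Python applied to the example sentence. It only lowercases and counts word tokens; it does not call the librAIry NLP service and does not reproduce its part-of-speech filtering, lemmas or entity annotations. The function name `bag_of_words` is ours, for illustration only.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, extract alphabetic tokens and count each one."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


sentence = (
    "Natural language processing combines linguistics, computer science, "
    "information engineering, and artificial intelligence"
)

bow = bag_of_words(sentence)

# Print each token with its frequency, most frequent first.
for token, count in bow.most_common():
    print(token, count)
```

A real pipeline would typically drop function words such as "and" (for instance by keeping only nouns, verbs and adjectives) before counting.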
librAIry API
• http://localhost:8081 (user: demo, password: 2019)
• RESTful API that exposes our algorithms to:
- save/parse documents efficiently into a document-oriented repository
- create and distribute probabilistic topic models as RESTful services
- annotate documents with hierarchies of concepts
- search for semantically similar texts.
• It is designed to work with both local and remote repositories, so that it can be easily integrated into existing environments, by defining:
- dataSink: where the results of the operations will be persisted
- dataSource: where the data will be read from
Documents
1. Customize the following file:
json/load-en-documents.json
2. Make an API request to /documents by running the following script:
./steps/1-load-documents-en.sh
3. Documents will be available in the repository:
http://localhost:8983/solr/banana
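For those who prefer Python to the shell script, the request in step 2 could look roughly like the sketch below. The host, port, credentials and the /documents path come from the slides; the use of HTTP Basic authentication, the POST method and the JSON content type are assumptions, and the helper names `build_request` and `load_and_send` are ours. The body is read unchanged from the JSON file, so no field names are guessed here.

```python
import base64
import json
import urllib.request

API_URL = "http://localhost:8081"  # librAIry API address from the slides
USER, PASSWORD = "demo", "2019"    # demo credentials from the slides


def build_request(path: str, body: dict) -> urllib.request.Request:
    """Build an authenticated JSON POST request (assumes HTTP Basic auth)."""
    token = base64.b64encode(f"{USER}:{PASSWORD}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=API_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def load_and_send(json_path: str, path: str) -> int:
    """Read a request body from a JSON file, send it and return the HTTP status."""
    with open(json_path, encoding="utf-8") as handle:
        body = json.load(handle)
    with urllib.request.urlopen(build_request(path, body)) as response:
        return response.status
```

With the demo running, `load_and_send("json/load-en-documents.json", "/documents")` would submit the customized file; check the script in steps/ for the exact call the demo uses.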
EUROVOC Categories
CATEGORY  DESCRIPTION
4361      communication systems
4488      data processing
2817      intellectual property
2524      pollution
4415      technology
Models
1. Customize the following file:
json/topic-model-en.json
2. Make an API request to /topics by running the following script:
./steps/2-create-topics-en.sh
3. A model will be available on Docker Hub:
https://hub.docker.com/
Customize Models
• Model parameters can be set in the request to modify the training process:
Parameter    Default              Description
alpha        0.1                  topic distributions per document
beta         0.01                 topic distributions per word
maxdocratio  0.7                  maximum presence (ratio) of a word
minfreq      5                    minimum presence (#docs) of a word
pos          NOUN VERB ADJECTIVE  allowed part-of-speech tags
…            …                    …
• Just add a parameters section to the request body:
json/topic-model-en.json
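As a rough illustration, the sketch below adds such a section to a request body in Python. The parameter names are the ones listed on this slide; the key name `parameters`, the value types and the helper `with_parameters` are assumptions, so check json/topic-model-en.json for the exact layout the API expects.

```python
import json


def with_parameters(body: dict, **overrides) -> dict:
    """Return a copy of the request body with a merged 'parameters' section.

    Assumption: the section is a JSON object stored under the key "parameters".
    """
    updated = dict(body)
    updated["parameters"] = {**body.get("parameters", {}), **overrides}
    return updated


# Hypothetical starting body; the real file contains more fields.
request_body = {"parameters": {"alpha": 0.1}}

# Override one default and add two more training parameters.
customized = with_parameters(
    request_body, alpha=0.5, minfreq=10, pos="NOUN VERB"
)

print(json.dumps(customized, indent=2))
```

Parameters that are not overridden keep whatever value the body already had; parameters omitted altogether fall back to the service defaults.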
Topics
1. Load the model as a service in the Docker descriptor:
models/docker-compose.yml
2. Run the service:
docker-compose -f models/docker-compose.yml up
3. Explore the topics at:
http://localhost:7777
Annotations
1. Customize the following file to annotate documents with topics:
json/annotate-en.json
2. Make an API request to /annotations by running the following script:
./steps/4-create-annotations-en.sh
3. Monitor progress with:
docker-compose logs -f
Inferences
• Let’s discover the main topics in this text:
E-commerce (electronic commerce) is the activity of electronically buying or selling of products on online services or over the Internet. Electronic commerce draws on technologies such as mobile commerce, electronic funds transfer, supply chain management, Internet marketing, online transaction processing, electronic data interchange (EDI), inventory management systems, and automated data collection systems. E-commerce is in turn driven by the technological advances of the semiconductor industry, and is the largest sector of the electronics industry.
http://localhost:7777/classes
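Once a model returns a topic distribution for a text, "main topics" can be read off by ranking the topics by weight. The sketch below shows one simple way to do that with a fixed threshold. It is an illustration only: the threshold value, the example distribution and the function name `main_topics` are hypothetical and are not taken from the librAIry model service.

```python
def main_topics(distribution, threshold=0.2):
    """Return the indices of topics whose weight reaches the threshold,
    ordered from the most to the least relevant."""
    ranked = sorted(enumerate(distribution), key=lambda pair: pair[1], reverse=True)
    return [index for index, weight in ranked if weight >= threshold]


# Hypothetical distribution over four topics for one document.
example_distribution = [0.05, 0.60, 0.30, 0.05]

labels = [f"topic-{index}" for index in main_topics(example_distribution)]
print(labels)
```

Labels built this way can then be stored with the document and used as filters.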
Search Engine
• http://localhost:8080/
• Browse the corpus by:
- topic filters
- language filters
- document similarities
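Document similarity over topic models is often computed by comparing topic distributions. The sketch below uses the Jensen-Shannon divergence as one common choice; this is an assumption for illustration, since the slides do not state which measure the explorer uses.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)


def similarity(p, q):
    """Turn the divergence into a similarity score: 1 = identical, 0 = disjoint."""
    return 1.0 - js_divergence(p, q)


# Hypothetical topic distributions of two documents.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
print(similarity(doc_a, doc_b))
```

The measure is symmetric, so the order of the two documents does not matter.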
Multi-lingual Documents
1. Customize the following file:
json/load-es-documents.json
2. Make an API request to /documents by running the following script:
./steps/5-load-documents-es.sh
3. Documents will be available in the repository:
http://localhost:8983/solr/banana
Multi-lingual Models
1. Customize the following file:
json/topic-model-es.json
2. Make an API request to /topics by running the following script:
./steps/6-create-topics-es.sh
3. A model will be available on Docker Hub:
https://hub.docker.com/
Multi-lingual Topics
1. Add the Spanish model as a service in the Docker descriptor:
models/docker-compose.yml
2. Run the service:
docker-compose -f models/docker-compose.yml up
3. Explore the topics at:
http://localhost:7778
Multi-lingual Annotations
1. Customize the following file:
json/annotate-es.json
2. Make an API request to /annotations by running the following script:
./steps/7-create-annotations-es.sh
3. Monitor progress with:
docker-compose logs -f
Inferences
• Let’s discover the main topics in this text:
El comercio electrónico puede utilizarse en cualquier entorno en el que se intercambien documentos entre
empresas: compras o adquisiciones, finanzas, industria, transporte, salud, legislación y recolección de ingresos o
impuestos.
http://localhost:7778/classes
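With one model service per language (English on port 7777, Spanish on port 7778), a client has to send each text to the service that matches its language. The sketch below only builds such a request. The ports and the /classes path come from the slides; the POST method, the `{"text": ...}` body and the helper name `classes_request` are hypothetical, so consult the model service's own documentation for the real contract.

```python
import json
import urllib.request

# Model services as exposed in the demo (ports taken from the slides).
MODEL_SERVICES = {
    "en": "http://localhost:7777",
    "es": "http://localhost:7778",
}


def classes_request(lang: str, text: str) -> urllib.request.Request:
    """Build a request to the /classes endpoint of the model for `lang`.

    Assumption: the endpoint accepts a JSON POST with a "text" field.
    """
    if lang not in MODEL_SERVICES:
        raise ValueError(f"no model service configured for language: {lang}")
    return urllib.request.Request(
        url=MODEL_SERVICES[lang] + "/classes",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


spanish_request = classes_request("es", "El comercio electrónico ...")
print(spanish_request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it once the Spanish model service is running.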
Cross-lingual Search Engine
• http://localhost:8080/
• Browse the corpus by:
- topic filters
- language filters
- document similarities
External Annotations
• External models available online can be used to annotate documents.
• Use the following models (trained on the complete JRC-Acquis dataset) to annotate your documents:
- http://librairy.linkeddata.es/jrc-en-model
- http://librairy.linkeddata.es/jrc-es-model
- http://librairy.linkeddata.es/jrc-fr-model
• Then create a cross-lingual search engine from these annotations.
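One way to picture how annotations enable cross-lingual search: if documents in different languages carry labels from a shared label space, they can be matched by label overlap without translating the text. The sketch below assumes exactly that and uses Jaccard similarity over label sets. The document ids, labels and function names are hypothetical, and the slides do not say that librAIry matches documents this way.

```python
def jaccard(labels_a, labels_b):
    """Share of labels two documents have in common (0.0 when both are empty)."""
    a, b = set(labels_a), set(labels_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cross_lingual_matches(query_labels, corpus, min_score=0.5):
    """Rank documents of any language by label overlap with the query."""
    scored = [
        (jaccard(query_labels, doc["labels"]), doc["id"], doc["lang"])
        for doc in corpus
    ]
    return sorted((item for item in scored if item[0] >= min_score), reverse=True)


# Hypothetical annotated corpus with language-independent labels.
corpus = [
    {"id": "es-1", "lang": "es", "labels": ["t1", "t2", "t3"]},
    {"id": "es-2", "lang": "es", "labels": ["t4"]},
    {"id": "en-1", "lang": "en", "labels": ["t1", "t2"]},
]

# Labels of an English query document.
matches = cross_lingual_matches(["t1", "t2"], corpus)
print(matches)
```

Here the Spanish document sharing two of three labels is retrieved alongside the English one, while the unrelated Spanish document is filtered out.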
Multi-lingual Annotations
1. Make an API request to /annotations (English documents) by running the following script:
./steps/8-create-external-annotations-en.sh
2. Make an API request to /annotations (Spanish documents) by running the following script:
./steps/9-create-external-annotations-es.sh
Cross-lingual Search Engine
• http://localhost:8080/
• Browse the corpus by:
- topic filters
- language filters
- document similarities