Crosslingual search-engine

oeg-upm.net
Cross-Lingual Search Engine
Carlos Badenes-Olmedo 1
Jose Luis Redondo García 2
Oscar Corcho 1
1 Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
2 Amazon Research
Cambridge, UK

Outline
We will learn how to:
• Analyze Document Collections (5 min)
- index documents into a document-oriented database.
- generate charts from corpus statistics.
• Create RESTful APIs from Topic Models (40 min)
- train topic models from documents in the database.
- publish models as Docker images on Docker Hub.
• Annotate Documents with Multi-lingual Topics (25 min)
- load existing topic models.
- make inferences on unseen texts.
- add labels to documents from topic distributions.
• Browse Multi-lingual Corpora (10 min)
- explore document collections by filters and semantic similarities.

Disclaimer
• This demo aims to highlight the difficulties in building a
multilingual search engine from topic-based annotations.
• The quality of a model depends on the number of documents
used to train it. For efficiency, this demo works with only 200 documents
(the models used in the experiments were trained on around 20,000 documents).

Let's go!
1. Clone our demo project, available on GitHub:
git clone https://github.com/librairy/demo.git
2. Move into the root folder:
cd demo/
3. Run it with:
docker-compose up -d
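
Once the stack is up, one way to confirm that the services respond is to probe the ports used later in this demo. The following is a minimal sketch and not part of the demo project: only the port numbers come from the slides, while the labels in the dictionary and the helper name `is_listening` are chosen here for illustration.

```python
import socket

# Ports taken from the URLs used later in these slides; the labels are ours.
SERVICES = {
    "librAIry API": 8081,
    "repository (Solr)": 8983,
    "search engine": 8080,
}


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "up" if is_listening("localhost", port) else "not reachable yet"
        print(f"{name} (port {port}): {state}")
```

Containers may need some time to start, so a "not reachable yet" result right after `docker-compose up -d` does not necessarily indicate a failure.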

librAIry
• http://librairy.linkeddata.es/
• Suite of services for analyzing large-scale document collections.
• It combines NLP techniques, machine learning algorithms and semantic knowledge.
• Components: EXPLORER, MODEL, API, NLP, REPOSITORY
Carlos Badenes-Olmedo, Jose Luis Redondo-Garcia, and Oscar Corcho. 2017. Distributing Text Mining tasks with librAIry. In Proceedings of the 2017 ACM Symposium on Document Engineering (DocEng '17). ACM, New York, NY, USA, 63-66.

librAIry NLP
• http://librairy.linkeddata.es/nlp
• RESTful API built on top of open-source NLP tools that creates tokens, bags-of-words and annotations from
unstructured texts by:
- part-of-speech tagging (and filtering),
- stemming (lemmas),
- entity recognition,
- DBpedia linking (Spotlight),
- WordNet synsets.
• English, Spanish, French and Portuguese are available (Italian and German coming soon).
• HTTP (JSON serialized), or TCP (Avro serialized) for efficiency.
• e.g.:
“Natural language processing combines linguistics, computer science, information engineering, and artificial intelligence”

librAIry API
• http://localhost:8081 (user: demo, password: 2019)
• RESTful API that exposes our algorithms to:
- save/parse documents efficiently into a document-oriented
repository
- create and distribute probabilistic topic models as RESTful
services
- annotate documents with hierarchies of concepts
- search for semantically similar texts.
• It is designed to work with either local or remote repositories,
so that it can be easily integrated into existing environments,
by defining:
- dataSink: where the results of the operations are persisted
- dataSource: where the data is read from

Documents
1. Customize the following file:
json/load-en-documents.json
2. Make an API request to /documents by running the following script:
./steps/1-load-documents-en.sh
3. Documents will be available in the repository:
http://localhost:8983/solr/banana
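
For readers who prefer not to use the shell script, the request can be sketched in Python. This is an illustration under stated assumptions, not the content of `1-load-documents-en.sh`: it assumes the API accepts a JSON body via POST and that the demo credentials are sent with HTTP Basic authentication. The helper names `build_request` and `load_body` are hypothetical.

```python
import base64
import json
import urllib.request

API_URL = "http://localhost:8081"   # librAIry API address from the slides
USER, PASSWORD = "demo", "2019"     # demo credentials from the slides


def build_request(endpoint: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    token = base64.b64encode(f"{USER}:{PASSWORD}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{API_URL}{endpoint}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",  # assumption: HTTP Basic auth
        },
        method="POST",                           # assumption: POST
    )


def load_body(path: str) -> dict:
    """Read a request body such as json/load-en-documents.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Usage (requires the demo stack to be running):
#   request = build_request("/documents", load_body("json/load-en-documents.json"))
#   with urllib.request.urlopen(request) as response:
#       print(response.status)
```

The same helper could be pointed at /topics or /annotations with the corresponding JSON file, since the later steps follow the same pattern.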

EUROVOC Categories
CATEGORY  DESCRIPTION
4361      communication systems
4488      data processing
2817      intellectual property
2524      pollution
4415      technology

Models
1. Customize the following file:
json/topic-model-en.json
2. Make an API request to /topics by running the following script:
./steps/2-create-topics-en.sh
3. A model will be available on Docker Hub:
https://hub.docker.com/

Customize Models
• Model parameters can be set in the request to modify the training process:
Parameter    Default              Description
alpha        0.1                  prior on the topic distribution of each document
beta         0.01                 prior on the word distribution of each topic
maxdocratio  0.7                  maximum fraction of documents in which a word may appear
minfreq      5                    minimum number of documents in which a word must appear
pos          NOUN VERB ADJECTIVE  allowed part-of-speech tags
…            …                    …
• Just add a parameters section in the request body:
json/topic-model-en.json
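
To make the effect of minfreq and maxdocratio concrete, the sketch below filters a vocabulary by document frequency. It is an illustration of the idea only: the function `filter_vocabulary` is hypothetical, the inclusive bounds are an assumption, and the `parameters` key merely mirrors the "parameters section" mentioned above without claiming to be the exact JSON schema.

```python
from collections import Counter

# Defaults from the parameter table; the surrounding JSON layout is assumed.
request_body = {"parameters": {"minfreq": 5, "maxdocratio": 0.7}}


def filter_vocabulary(docs, minfreq, maxdocratio):
    """Keep words found in at least `minfreq` documents and in at most
    a `maxdocratio` fraction of all documents."""
    # Count each word once per document (document frequency).
    doc_freq = Counter(word for doc in docs for word in set(doc))
    upper_limit = maxdocratio * len(docs)
    return {word for word, df in doc_freq.items() if minfreq <= df <= upper_limit}


# Tiny corpus of tokenized documents, with a lower minfreq to suit its size.
docs = [["a", "b"], ["a", "c"], ["a", "b"], ["a"]]
vocabulary = filter_vocabulary(docs, minfreq=2, maxdocratio=0.7)
```

Here "a" appears in every document and is dropped as too frequent, "c" appears in a single document and is dropped as too rare, and only "b" remains.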

Topics
1. Add the model as a service to the Docker Compose descriptor:
models/docker-compose.yml
2. Run the service with:
docker-compose -f models/docker-compose.yml up
3. Explore the topics at:
http://localhost:7777

EUROVOC Categories
2524 pollution
4415 technology

Annotations
1. Customize the following file to annotate documents with topics:
json/annotate-en.json
2. Make an API request to /annotations by running the following script:
./steps/4-create-annotations-en.sh
3. Monitor progress with:
docker-compose logs -f
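
Annotating a document means turning its topic distribution into a set of labels. The slides do not state which rule librAIry applies, so the sketch below uses a simple stand-in: keep the topics whose weight exceeds the uniform share 1/K. The function name and the topic ids are hypothetical.

```python
def labels_from_distribution(distribution, threshold=None):
    """Return the ids of topics whose weight exceeds `threshold`,
    most relevant first. By default the threshold is the uniform share 1/K."""
    if threshold is None:
        threshold = 1.0 / len(distribution)
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    return [topic for topic, weight in ranked if weight > threshold]


# Hypothetical topic distribution inferred for one document.
distribution = {"t0": 0.6, "t1": 0.3, "t2": 0.05, "t3": 0.05}
labels = labels_from_distribution(distribution)
```

With four topics the default threshold is 0.25, so the document is labelled with "t0" and "t1" only.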

Inferences
• Let’s discover the main topics in this text:
E-commerce (electronic commerce) is the activity of electronically buying or selling of products on online services or
over the Internet. Electronic commerce draws on technologies such as mobile commerce, electronic funds transfer,
supply chain management, Internet marketing, online transaction processing, electronic data interchange (EDI),
inventory management systems, and automated data collection systems. E-commerce is in turn driven by the
technological advances of the semiconductor industry, and is the largest sector of the electronics industry
http://localhost:7777/classes

Search Engine
• Browse the corpus by
- topic filters
- language filters
- document similarities
http://localhost:8080/
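
Document similarity over topic-based annotations can be pictured as a comparison of topic distributions. The slides do not name the metric used by the search engine, so the following sketch assumes one common choice, a similarity derived from the Jensen-Shannon divergence. The function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_similarity(p, q):
    """1 minus the Jensen-Shannon divergence (base 2), a value in [0, 1]."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
    return 1.0 - divergence


# Hypothetical topic distributions of three documents.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]
closer = jensen_shannon_similarity(doc_a, doc_b) > jensen_shannon_similarity(doc_a, doc_c)
```

Identical distributions score 1 and distributions with no topic in common score 0, so `doc_a` ends up closer to `doc_b` than to `doc_c`.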

Multi-lingual Documents
1. Customize the following file:
json/load-es-documents.json
2. Make an API request to /documents by running the following script:
./steps/5-load-documents-es.sh
3. Documents will be available in the repository:
http://localhost:8983/solr/banana

Multi-lingual Models
1. Customize the following file:
json/topic-model-es.json
2. Make an API request to /topics by running the following script:
./steps/6-create-topics-es.sh
3. A model will be available on Docker Hub:
https://hub.docker.com/

Multi-lingual Topics
1. Add the Spanish model as a service to the Docker Compose descriptor:
models/docker-compose.yml
2. Run the service with:
docker-compose -f models/docker-compose.yml up
3. Explore the topics at:
http://localhost:7778

EUROVOC Categories
2524 pollution
4415 technology

Multi-lingual Annotations
1. Customize the following file to annotate documents with topics:
json/annotate-es.json
2. Make an API request to /annotations by running the following script:
./steps/7-create-annotations-es.sh
3. Monitor progress with:
docker-compose logs -f

Inferences
• Let’s discover the main topics in this text:
El comercio electrónico puede utilizarse en cualquier entorno en el que se intercambien documentos entre
empresas: compras o adquisiciones, finanzas, industria, transporte, salud, legislación y recolección de ingresos o
impuestos.
http://localhost:7778/classes

Cross-lingual Search Engine
• Browse the corpus by
- topic filters
- language filters

External Annotations
• External models available online can be used to annotate documents.
• Use the following models (trained with the complete JRC-Acquis
dataset) to annotate your documents:
- http://librairy.linkeddata.es/jrc-en-model
- http://librairy.linkeddata.es/jrc-es-model
- http://librairy.linkeddata.es/jrc-fr-model
• And create a cross-lingual search engine from these annotations.
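
The idea behind a cross-lingual search engine built from annotations can be sketched as follows, assuming that the models for each language annotate documents with a shared, language-independent label set. Everything in the example is illustrative: the class `AnnotatedDoc`, the document ids, and the use of EUROVOC category ids as labels are assumptions, and label overlap (Jaccard) is a stand-in for whatever ranking the actual engine uses.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedDoc:
    doc_id: str                               # hypothetical identifier
    lang: str                                 # e.g. "en" or "es"
    labels: set = field(default_factory=set)  # language-independent labels


def filter_docs(docs, topic=None, lang=None):
    """Apply the topic and language filters; None disables a filter."""
    return [
        d for d in docs
        if (topic is None or topic in d.labels) and (lang is None or d.lang == lang)
    ]


def related(query, docs):
    """Rank other documents by Jaccard overlap of their labels with the query,
    regardless of language; documents without shared labels are dropped."""
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    scored = [(jaccard(query.labels, d.labels), d)
              for d in docs if d.doc_id != query.doc_id]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [d for score, d in ranked if score > 0]


corpus = [
    AnnotatedDoc("en-1", "en", {"4415", "4488"}),
    AnnotatedDoc("es-1", "es", {"4415"}),
    AnnotatedDoc("es-2", "es", {"2524"}),
]
spanish_technology = filter_docs(corpus, topic="4415", lang="es")
related_to_english = related(corpus[0], corpus)
```

Because the labels do not depend on the language, the English document "en-1" retrieves the Spanish document "es-1" without any translation step.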

Multi-lingual Annotations
1. Make an API request to /annotations (English documents) by running the following script:
./steps/8-create-external-annotations-en.sh
2. Make an API request to /annotations (Spanish documents) by running the following script:
./steps/9-create-external-annotations-es.sh

Cross-lingual Search Engine
• Browse the corpus by
- topic filters
- language filters
