FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Ganguly, Haithem Afli, Andy Way and Gareth F. Jones

FaDA: Fast document aligner with word embedding
Pintu Lohar, Debasis Ganguly, Haithem Afli, Andy Way and Gareth F. Jones
ADAPT Centre, School of Computing, Dublin City University

Contents
• Objective
• Introduction to FaDA
• Methodology used
• Word vector-based similarity
• Architecture of the whole system
• Experiments
• Results
• Conclusions and future work

Objective
• To align documents written in two different languages within a large collection of comparable documents.
• The alignment procedure should be fast, with less than quadratic time complexity.

Example of comparable documents
• The same news published in two languages

Introduction to FaDA
• FaDA (Fast Document Aligner) is a free/open-source tool for aligning bilingual documents.
• It is a fast alignment tool with linear time complexity.

Methodology used
• A cross-lingual information retrieval (CLIR)-based document-alignment system with word vector-based similarity measurements.

Why word vector-based similarity?
• The CLIR-based approach takes into account only text-based similarity, without addressing the underlying semantic match between the words.
• The word vector-based approach considers the semantic similarity between the words.

Word vector-based similarity
• Query likelihood
Figure: panels (a) and (b), where q1, q2, q3 → query terms and dots → words of a document in 2D space.
The centroid of the document in Figure (a) is closer to the query terms than the centroid of the document in Figure (b).
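
The centroid comparison can be illustrated with a minimal sketch. It assumes each word is already mapped to a vector, that a document is summarised by the mean of its word vectors, and that closeness is measured by cosine similarity averaged over the query terms. The slide does not give the exact formula, so the function names and the averaging step are assumptions for illustration only.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length word vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def wvec_similarity(query_vectors, doc_vectors):
    """Assumed measure: mean cosine between each query term and the document centroid."""
    c = centroid(doc_vectors)
    return sum(cosine(q, c) for q in query_vectors) / len(query_vectors)

# Toy 2D data mirroring the two panels: q1, q2, q3 and two documents.
queries = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
doc_a = [[1.0, 0.8], [1.1, 1.2], [0.8, 1.0]]      # words cluster near the query terms
doc_b = [[-1.0, 0.5], [-0.8, -0.2], [0.1, -1.0]]  # words lie far from the query terms

score_a = wvec_similarity(queries, doc_a)
score_b = wvec_similarity(queries, doc_b)
```

With this toy data, `score_a` is larger than `score_b`, which corresponds to the document in panel (a) being the better match.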

Combination of word vector-based and text-based similarity
• α is the linear interpolation parameter denoting the relative contributions of the text-overlap and word vector-based similarities.
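
A linear interpolation of two scores can be sketched as below. The slide does not show which of the two similarities α multiplies, so this sketch assumes that α weights the text-overlap score and (1 − α) weights the word vector-based score; the function name `combined_score` is hypothetical.

```python
def combined_score(text_sim, wvec_sim, alpha):
    """Linearly interpolate a text-overlap score and a word-vector score.

    Assumption: alpha weights the text-overlap score and (1 - alpha)
    weights the word-vector score; the opposite convention is equally possible.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * text_sim + (1.0 - alpha) * wvec_sim

# alpha = 1.0 uses only text overlap, alpha = 0.0 only word vectors.
only_text = combined_score(0.4, 0.8, 1.0)
only_vectors = combined_score(0.4, 0.8, 0.0)
balanced = combined_score(0.4, 0.8, 0.5)
```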

System architecture of FaDA
Figure: System architecture of FaDA. Stages shown: bilingual documents (source documents, target documents) → indexing (source index, target index) → pseudo query of each document → translate by dictionary → translated query terms → compare top n documents → combine word-vector and text similarity → select with best score → retrieved target document.
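
The stages of the architecture can be walked through with a toy sketch. This is not FaDA's code: every function name is hypothetical, the index is a plain in-memory inverted index, and the way the pseudo query is chosen (the most frequent terms, up to a fraction `tau` of the distinct terms) and the way `m` limits translations per term are assumptions made only to keep the example runnable.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def pseudo_query(text, tau):
    """Assumed selection: most frequent terms, up to a fraction tau of distinct terms."""
    counts = Counter(text.lower().split())
    k = max(1, int(tau * len(counts)))
    return [term for term, _ in counts.most_common(k)]

def translate(terms, dictionary, m):
    """Replace each term by up to m dictionary translations; unknown terms are dropped."""
    return [t for term in terms for t in dictionary.get(term, [])[:m]]

def retrieve_top_n(query_terms, index, n):
    """Rank target documents by the number of matching query terms."""
    hits = Counter()
    for term in query_terms:
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return [doc_id for doc_id, _ in hits.most_common(n)]

def align(source_doc, target_docs, index, dictionary, score, tau, m, n):
    """Return the id of the best-scoring target among the top n candidates, or None."""
    query = translate(pseudo_query(source_doc, tau), dictionary, m)
    candidates = retrieve_top_n(query, index, n)
    if not candidates:
        return None
    return max(candidates, key=lambda d: score(query, target_docs[d]))

def overlap(query, doc):
    """Stand-in score: number of query terms that occur in the document."""
    return len(set(query) & set(doc.lower().split()))

target_docs = ["le chat dort", "la bourse chute"]
dictionary = {"cat": ["chat"], "sleeps": ["dort"], "market": ["bourse"], "falls": ["chute"]}
target_index = build_index(target_docs)

match_cat = align("the cat sleeps", target_docs, target_index, dictionary, overlap, tau=1.0, m=7, n=10)
match_market = align("market falls", target_docs, target_index, dictionary, overlap, tau=1.0, m=7, n=10)
```

Because each source document is compared only with its top n retrieved candidates rather than with every target document, the number of similarity computations grows with the number of source documents instead of with the number of document pairs. In the full system, `overlap` would be replaced by the combined text and word-vector score.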

Baseline
• Based on the “Jaccard similarity coefficient”, which measures the term overlap between document pairs.
• The “cosine similarity-based” and “named entity matching-based” approaches did not work well and hence were not used as baselines.
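
The Jaccard similarity coefficient of two term sets A and B is |A ∩ B| / |A ∪ B|. The sketch below computes it on whitespace-separated tokens and then pairs each source document with its best target by comparing every pair, which is where the quadratic cost of such a baseline comes from. It assumes both documents are already expressed in a common vocabulary; the tokenisation and the function names are illustrative, not the baseline's actual implementation.

```python
def jaccard(doc_a, doc_b):
    """Jaccard coefficient of the term sets of two documents."""
    a = set(doc_a.lower().split())
    b = set(doc_b.lower().split())
    if not a and not b:
        return 0.0  # convention chosen here for two empty documents
    return len(a & b) / len(a | b)

def align_exhaustively(sources, targets):
    """For each source, return the index of the target with the highest Jaccard score.

    Performs len(sources) * len(targets) comparisons.
    """
    return [
        max(range(len(targets)), key=lambda j: jaccard(source, targets[j]))
        for source in sources
    ]

sources = ["a b c", "x y z"]
targets = ["x y q", "a b d"]
pairs = align_exhaustively(sources, targets)
```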

Tuning (Euronews data):
Optimal parameter settings:
i. λ = 0.9
ii. Number of translation terms M = 7
iii. Query-to-document ratio τ = 0.6

Conclusions and future work
• FaDA uses a CLIR-based approach, which is much faster than the baseline (which has quadratic time complexity).
• The performance is further enhanced by the word vector embedding-based approach.
• In future, we would like to apply our approach to other language pairs.

Questions? and/or Suggestions!
