This report discusses three submissions based on the Duet architecture to the Deep Learning track at TREC 2019. For the document retrieval task, we adapt the Duet model to ingest a "multiple field" view of documents—we refer to the new architecture as Duet with Multiple Fields (DuetMF). A second submission combines the DuetMF model with other neural and traditional relevance estimators in a learning-to-rank framework and achieves improved performance over the DuetMF baseline. For the passage retrieval task, we submit a single run based on an ensemble of eight Duet models.
Adversarial and reinforcement learning-based approaches to information retrievalBhaskar Mitra
Traditionally, machine learning based approaches to information retrieval have taken the form of supervised learning-to-rank models. Recently, other machine learning approaches—such as adversarial learning and reinforcement learning—have started to find interesting applications in retrieval systems. At Bing, we have been exploring some of these methods in the context of web search. In this talk, I will share couple of our recent work in this area that we presented at SIGIR 2018.
Strong Baselines for Neural Semi-supervised Learning under Domain ShiftSebastian Ruder
Oral presentation given at ACL 2018 about our paper Strong Baselines for Neural Semi-supervised Learning under Domain Shift (http://aclweb.org/anthology/P18-1096).
Deep neural methods have recently demonstrated significant performance improvements in several IR tasks. In this lecture, we will present a brief overview of deep models for ranking and retrieval.
This is a follow-up lecture to "Neural Learning to Rank" (https://www.slideshare.net/BhaskarMitra3/neural-learning-to-rank-231759858)
Neural Models for Information RetrievalBhaskar Mitra
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models will also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks leading to new challenges and opportunities for existing neural representation learning approaches for text.
We begin this talk with a discussion on text embedding spaces for modelling different types of relationships between items which makes them suitable for different IR tasks. Next, we present how topic-specific representations can be more effective than learning global embeddings. Finally, we conclude with an emphasis on dealing with rare terms and concepts for IR, and how embedding based approaches can be augmented with neural models for lexical matching for better retrieval performance. While our discussions are grounded in IR tasks, the findings and the insights covered during this talk should be generally applicable to other NLP and machine learning tasks.
This report discusses three submissions based on the Duet architecture to the Deep Learning track at TREC 2019. For the document retrieval task, we adapt the Duet model to ingest a "multiple field" view of documents—we refer to the new architecture as Duet with Multiple Fields (DuetMF). A second submission combines the DuetMF model with other neural and traditional relevance estimators in a learning-to-rank framework and achieves improved performance over the DuetMF baseline. For the passage retrieval task, we submit a single run based on an ensemble of eight Duet models.
Adversarial and reinforcement learning-based approaches to information retrievalBhaskar Mitra
Traditionally, machine learning based approaches to information retrieval have taken the form of supervised learning-to-rank models. Recently, other machine learning approaches—such as adversarial learning and reinforcement learning—have started to find interesting applications in retrieval systems. At Bing, we have been exploring some of these methods in the context of web search. In this talk, I will share couple of our recent work in this area that we presented at SIGIR 2018.
Strong Baselines for Neural Semi-supervised Learning under Domain ShiftSebastian Ruder
Oral presentation given at ACL 2018 about our paper Strong Baselines for Neural Semi-supervised Learning under Domain Shift (http://aclweb.org/anthology/P18-1096).
Deep neural methods have recently demonstrated significant performance improvements in several IR tasks. In this lecture, we will present a brief overview of deep models for ranking and retrieval.
This is a follow-up lecture to "Neural Learning to Rank" (https://www.slideshare.net/BhaskarMitra3/neural-learning-to-rank-231759858)
Neural Models for Information RetrievalBhaskar Mitra
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models will also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks leading to new challenges and opportunities for existing neural representation learning approaches for text.
We begin this talk with a discussion on text embedding spaces for modelling different types of relationships between items which makes them suitable for different IR tasks. Next, we present how topic-specific representations can be more effective than learning global embeddings. Finally, we conclude with an emphasis on dealing with rare terms and concepts for IR, and how embedding based approaches can be augmented with neural models for lexical matching for better retrieval performance. While our discussions are grounded in IR tasks, the findings and the insights covered during this talk should be generally applicable to other NLP and machine learning tasks.
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...Parang Saraf
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
5 Lessons Learned from Designing Neural Models for Information RetrievalBhaskar Mitra
Slides from my keynote talk at the Recherche d'Information SEmantique (RISE) workshop at CORIA-TALN 2018 conference in Rennes, France.
(Abstract)
Neural Information Retrieval (or neural IR) is the application of shallow or deep neural networks to IR tasks. Unlike classical IR models, these machine learning (ML) based approaches are data-hungry, requiring large scale training data before they can be deployed. Traditional learning to rank models employ supervised ML techniques—including neural networks—over hand-crafted IR features. By contrast, more recently proposed neural models learn representations of language from raw text that can bridge the gap between the query and the document vocabulary.
Neural IR is an emerging field and research publications in the area has been increasing in recent years. While the community explores new architectures and training regimes, a new set of challenges, opportunities, and design principles are emerging in the context of these new IR models. In this talk, I will share five lessons learned from my personal research in the area of neural IR. I will present a framework for discussing different unsupervised approaches to learning latent representations of text. I will cover several challenges to learning effective text representations for IR and discuss how latent space models should be combined with observed feature spaces for better retrieval performance. Finally, I will conclude with a few case studies that demonstrates the application of neural approaches to IR that go beyond text matching.
Neural Models for Information RetrievalBhaskar Mitra
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models may also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks leading to new challenges and opportunities for existing neural representation learning approaches for text.
In this talk, I will present my recent work on neural IR models. We begin with a discussion on learning good representations of text for retrieval. I will present visual intuitions about how different embeddings spaces capture different relationships between items, and their usefulness to different types of IR tasks. The second part of this talk is focused on the applications of deep neural architectures to the document ranking task.
Document ranking using qprp with concept of multi dimensional subspacePrakash Dubey
qPRP is the new model for IR. Existing qPRP approach considers term present in different section of document equally. Our belief is that representing the document as multidimensional subspace will give better result. Because each section of document as its own importance.
Paper presentation for the final course Advanced Concept in Machine Learning.
The paper is @Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data"
http://jmlr.org/proceedings/papers/v32/chenf14.pdf
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
A Large number of digital text information is generated every day. Effectively searching, managing and
exploring the text data has become a main task. In this paper, we first present an introduction to text
mining and LDA topic model. Then we deeply explained how to apply LDA topic model to text corpus by
doing experiments on Simple Wikipedia documents. The experiments include all necessary steps of data
retrieving, pre-processing, fitting the model and an application of document exploring system. The result of
the experiments shows LDA topic model working effectively on documents clustering and finding the
similar documents. Furthermore, the document exploring system could be a useful research tool for
students and researchers.
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...Parang Saraf
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
5 Lessons Learned from Designing Neural Models for Information RetrievalBhaskar Mitra
Slides from my keynote talk at the Recherche d'Information SEmantique (RISE) workshop at CORIA-TALN 2018 conference in Rennes, France.
(Abstract)
Neural Information Retrieval (or neural IR) is the application of shallow or deep neural networks to IR tasks. Unlike classical IR models, these machine learning (ML) based approaches are data-hungry, requiring large scale training data before they can be deployed. Traditional learning to rank models employ supervised ML techniques—including neural networks—over hand-crafted IR features. By contrast, more recently proposed neural models learn representations of language from raw text that can bridge the gap between the query and the document vocabulary.
Neural IR is an emerging field and research publications in the area has been increasing in recent years. While the community explores new architectures and training regimes, a new set of challenges, opportunities, and design principles are emerging in the context of these new IR models. In this talk, I will share five lessons learned from my personal research in the area of neural IR. I will present a framework for discussing different unsupervised approaches to learning latent representations of text. I will cover several challenges to learning effective text representations for IR and discuss how latent space models should be combined with observed feature spaces for better retrieval performance. Finally, I will conclude with a few case studies that demonstrates the application of neural approaches to IR that go beyond text matching.
Neural Models for Information RetrievalBhaskar Mitra
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models may also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks leading to new challenges and opportunities for existing neural representation learning approaches for text.
In this talk, I will present my recent work on neural IR models. We begin with a discussion on learning good representations of text for retrieval. I will present visual intuitions about how different embeddings spaces capture different relationships between items, and their usefulness to different types of IR tasks. The second part of this talk is focused on the applications of deep neural architectures to the document ranking task.
Document ranking using qprp with concept of multi dimensional subspacePrakash Dubey
qPRP is the new model for IR. Existing qPRP approach considers term present in different section of document equally. Our belief is that representing the document as multidimensional subspace will give better result. Because each section of document as its own importance.
Paper presentation for the final course Advanced Concept in Machine Learning.
The paper is @Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data"
http://jmlr.org/proceedings/papers/v32/chenf14.pdf
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
A Large number of digital text information is generated every day. Effectively searching, managing and
exploring the text data has become a main task. In this paper, we first present an introduction to text
mining and LDA topic model. Then we deeply explained how to apply LDA topic model to text corpus by
doing experiments on Simple Wikipedia documents. The experiments include all necessary steps of data
retrieving, pre-processing, fitting the model and an application of document exploring system. The result of
the experiments shows LDA topic model working effectively on documents clustering and finding the
similar documents. Furthermore, the document exploring system could be a useful research tool for
students and researchers.
A Simple Introduction to Word EmbeddingsBhaskar Mitra
In information retrieval there is a long history of learning vector representations for words. In recent times, neural word embeddings have gained significant popularity for many natural language processing tasks, such as word analogy and machine translation. The goal of this talk is to introduce basic intuitions behind these simple but elegant models of text representation. We will start our discussion with classic vector space models and then make our way to recently proposed neural word embeddings. We will see how these models can be useful for analogical reasoning as well applied to many information retrieval tasks.
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...IJERA Editor
Assigning documents to related categories is critical task which is used for effective document retrieval. Automatic text classification is the process of assigning new text document to the predefined categories based on its content. In this paper, we implemented and performed comparison of Naïve Bayes and Centroid-based algorithms for effective document categorization of English language text. In Centroid Based algorithm, we used Arithmetical Average Centroid (AAC) and Cumuli Geometric Centroid (CGC) methods to calculate centroid of each class. Experiment is performed on R-52 dataset of Reuters-21578 corpus. Micro Average F1 measure is used to evaluate the performance of classifiers. Experimental results show that Micro Average F1 value for NB is greatest among all followed by Micro Average F1 value of CGC which is greater than Micro Average F1 of AAC. All these results are valuable for future research
PhD thesis defense.
This manuscript describes a methodology designed and implemented to realise the recommendation of vocabularies based on the content of a given website. The goal of the proposed approach is to generate vocabularies by reusing existing schemas. The automatic recommendation helps to leverage websites to self-described web entities in the Web of Data; understandable by both humans and machines. In this direction, the implemented approach is wrapped within a broader methodology of turning a website in a machine understandable node by using technologies that have been developed in the scope of the Semantic Web vision. Transforming a website to a machine understandable entity is the first step required by the websites side in order to narrow the gap with web agents and enable the structured content consumption without the need of implementing an Application Programming Interface (API) that would provide read-write functionality. The motivation of the thesis stems from the fact that the data provided via an API is already presented on the corresponding website in most of the cases.
Keyphrase Extraction And Source Code Similarity Detection- A Survey Nakul Sharma
This is the presentation given at chsn2020. For full article please visit the website:https://iopscience.iop.org/article/10.1088/1757-899X/1074/1/012027 or https://doi.org/10.1088/1757-899X/1074/1/012027
A Novel Approach for Keyword extraction in learning objects using text miningIJSRD
Keyword extraction, concept finding are in learning objects is very important subject in today’s eLearning environment. Keywords are subset of words that contains the useful information about the content of the document. Keyword extraction is a process that is used to get the important keywords from documents. In this proposed System Decision tree algorithm is used for feature selection process using wordnet dictionary. WordNet is a lexical database of English which is used to find similarity from the candidate words. The words having highest similarity are taken as keywords.
Automatic Query Expansion Using Word Embedding Based on Fuzzy Graph Connectiv...YogeshIJTSRD
The aim of information retrieval systems is to retrieve relevant information according to the query provided. The queries are often vague and uncertain. Thus, to improve the system, we propose an Automatic Query Expansion technique, to expand the query by adding new terms to the user s initial query so as to minimize query mismatch and thereby improving retrieval performance. Most of the existing techniques for expanding queries do not take into account the degree of semantic relationship among words. In this paper, the query is expanded by exploring terms which are semantically similar to the initial query terms as well as considering the degree of relationship, that is, “fuzzy membership- between them. The terms which seemed most relevant are used in expanded query and improve the information retrieval process. The experiments conducted on the queries set show that the proposed Automatic query expansion approach gave a higher precision, recall, and F measure then non fuzzy edge weights. Tarun Goyal | Ms. Shalini Bhadola | Ms. Kirti Bhatia "Automatic Query Expansion Using Word Embedding Based on Fuzzy Graph Connectivity Measures" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-5 , August 2021, URL: https://www.ijtsrd.com/papers/ijtsrd45074.pdf Paper URL: https://www.ijtsrd.com/computer-science/artificial-intelligence/45074/automatic-query-expansion-using-word-embedding-based-on-fuzzy-graph-connectivity-measures/tarun-goyal
Abstract: Traditional approaches for document classification need data which is labelled for the construction reliable classifiers which are even accurate. Unfortunately, data which is already labelled are rarely available, and often too costly to obtain. For the given learning task for which data which is trained is unavailable, abundant labelled data may be there for a different and related domain. One would like to use the related labelled data as auxiliary information to accomplish the classification task in the target domain. Recently, the paradigm of transfer learning has been introduced to enable effective learning strategies when auxiliary data obey a different probability distribution. A co-clustering based classification algorithm has been previously proposed to tackle cross-domain text classification. In this work, we extend the idea underlying this approach by making the latent semantic relationship between the two domains explicit. This goal is achieved with the use of Wikipedia. As a result, the pathway that allows propagating labels between the two domains not only captures common words, but also semantic concepts based on the content of documents. We empirically demonstrate the efficacy of our semantic-based approach to cross-domain classification using a variety of real data.Keywords: Classification, Clustering, Cross-domain Text Classification, Co-clustering, Labelled data, Traditional Approaches.
Title: Co-Clustering For Cross-Domain Text Classification
Author: Rayala Venkat, Mahanthi Kasaragadda
ISSN 2350-1022
International Journal of Recent Research in Mathematics Computer Science and Information Technology
Paper Publications
International Journal of Engineering and Science Invention (IJESI)inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
The goal of information retrieval (IR) is to provide users with those documents that will satisfy their information need. The information need can be understood as forming a pyramid, where only its peak is made visible by users in the form of a conceptual query.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
Improving Text Categorization with Semantic Knowledge in Wikipediachjshan
Text categorization, especially short text categorization, is a difficult and challenging task since the text data is sparse and multidimensional. In traditional text classification methods, document texts are represented with “Bag of Words (BOW)” text representation schema, which is based on word co-occurrence and has many limitations. In this paper, we mapped document texts to Wikipedia concepts and used the Wikipedia-concept-based document representation method to take the place of traditional BOW model for text classification. In order to overcome the weakness of ignoring the semantic relationships among terms in document representation model and utilize rich semantic knowledge in Wikipedia, we constructed a semantic matrix to enrich Wikipedia-concept-based document representation. Experimental evaluation on five real datasets of long and short text shows that our approach outperforms the traditional BOW method.
Similar to FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Ganguly, Haithem Afli, Andy Way and Gareth F. Jones (20)
Keynote at the Insight@DCU Deep Learning Workshop (https://www.eventbrite.ie/e/insightdcu-deep-learning-workshop-tickets-45474212594) on successes and frontiers of Deep Learning, particularly unsupervised learning and transfer learning.
Talk on Optimization for Deep Learning, which gives an overview of gradient descent optimization algorithms and highlights some current research directions.
Human Evaluation: Why do we need it? - Dr. Sheila CastilhoSebastian Ruder
Talk at the 8th NLP Dublin meetup (https://www.meetup.com/NLP-Dublin/events/241198412/) by Dr. Sheila Castilho, postdoc at ADAPT Centre, Dublin City University.
Transfer Learning for Natural Language ProcessingSebastian Ruder
Slides on Transfer Learning for Natural Language Processing by Sebastian Ruder. Talk given at Natural Language Processing Copenhagen Meetup on 31 May 2017.
Making sense of word senses: An introduction to word-sense disambiguation and...Sebastian Ruder
Talk at 6th NLP Dublin meetup (https://www.meetup.com/NLP-Dublin/events/237998517/) by Alfredo Maldonado, Postdoc at ADAPT Centre at Trinity College Dublin.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
Monitor common gases, weather parameters, particulates.
What is greenhouse gasses and how many gasses are there to affect the Earth.moosaasad1975
What are greenhouse gasses how they affect the earth and its environment what is the future of the environment and earth how the weather and the climate effects.
Seminar of U.V. Spectroscopy by SAMIR PANDASAMIR PANDA
Spectroscopy is a branch of science dealing the study of interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflect spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light received by the analyte.
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Studia Poinsotiana
I Introduction
II Subalternation and Theology
III Theology and Dogmatic Declarations
IV The Mixed Principles of Theology
V Virtual Revelation: The Unity of Theology
VI Theology as a Natural Science
VII Theology’s Certitude
VIII Conclusion
Notes
Bibliography
All the contents are fully attributable to the author, Doctor Victor Salas. Should you wish to get this text republished, get in touch with the author or the editorial committee of the Studia Poinsotiana. Insofar as possible, we will be happy to broker your contact.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Sérgio Sacani
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes
on Io’s surface have been monitored from both spacecraft and ground-based telescopes.
Here, we present the highest spatial resolution images of Io ever obtained from a groundbased telescope. These images, acquired by the SHARK-VIS instrument on the Large
Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images
show that a plume deposit from a powerful eruption at Pillan Patera has covered part
of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high resolution imaging of Io’s surface using adaptive
optics at visible wavelengths.
This presentation explores a brief idea about the structural and functional attributes of nucleotides, the structure and function of genetic materials along with the impact of UV rays and pH upon them.
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Ganguly, Haithem Afli, Andy Way and Gareth F. Jones
1. FaDA: Fast document aligner with
word embedding
Pintu Lohar, Debasis Ganguly, Haithem Afli,
Andy Way and Gareth F. Jones
ADAPT Centre, School of Computing, Dublin
City University
2. Contents
• Objective
• Introduction to FaDA
• Methodology used
• Word vector-based similarity
• Architecture of the whole system
• Experiments
• Results
• Conclusions and future work
3. Objective
• To align the documents in two different
languages within a large collection of
comparable documents.
• Alignment procedure should be faster with less
than quadratic time complexity.
5. Introduction to FaDA
• FaDA (Fast Document Aligner) is a free/open-
source tool for aligning bilingual documents .
• It is a fast alignment tool with linear time
complexity.
6. Methodology used
• Crosslingual information retrieval (CLIR)-
based document-alignment system with word
vector-based similarity measurements.
7. Why word vector-based similarity ?
• CLIR-based approach takes into account only
text-based similarity without addressing the
underlying semantic match between the words.
• The word vector-based approach considers the
semantic similarity between the words.
9. Word vector-based similarity
• Query likelihood
Where, q1, q2, q3 → query terms
dots → words of a document in 2D space.
The centroid of document in Figure (a) is closer to the query terms than
document in Figure (b)
10. Combination of word vector-based
and text-based similarity
• α is the linear interpolation parameter denoting
the relative contributions from the text overlap
and word vector-based similarities
13. Baseline
• Based on “Jaccard similarity coefficient”
which measures the term overlaps between
document pairs.
• “Cosine similarity-based” and “Named
Entity matching-based” approaches did not
work well hence not used as baseline.
14. Tuning (Euronews data) :
Optimal parameter settings:
i. λ = 0.9
ii. Number of translation terms M = 7 and
iii. Query to document ratio τ = 0.6
16. Conclusions and future work
Uses CLIR-based approach which is much faster
than the baseline (with quadratic time complexity).
The performance is further enhanced by word
vector embedding-based approach.
In future , we would like to apply our approach to
other language pairs.