The document discusses multilingualism in information retrieval systems. It examines challenges in creating systems that allow users to input queries and receive results in multiple languages. Several existing systems are described that have implemented multilingual features, such as translating queries or documents. However, challenges remain around accurate translation and handling diverse languages and regions. Future research areas discussed include developing better translation tools and testing systems on a wider range of languages and users.
1. Running Head: MULTILINGUALISM IN INFORMATION RETRIEVAL SYSTEMS 1
Multilingualism in Information Retrieval Systems
Ariel Hess
University of North Texas
INFO 5206
May 5, 2017
Summary/Author’s Note:
Multilingualism in information retrieval systems is a topic that researchers have spent
countless hours examining. The challenge of creating a system that allows the user to input a
query containing multiple languages and receive results in multiple languages is one that will
continue to be examined. Information retrieval systems can be adjusted to include features
designed to translate documents and queries. This paper examines the different strategies used
for text translation, the projects that have implemented them, and the challenges they face.
Introduction
Most search engines provide only a monolingual search interface for documents written mostly
in English (Chen, Lee & Yang, 2009, p. 4). Users often translate their query into English
before using a search engine. The goal of a Multilingual Retrieval System is to allow users to
search for information in multiple languages and retrieve information in multiple languages.
This is done through Cross Language Retrieval, which allows the user to ask a question in one
language and retrieve the information in another.
A survey of academic users was conducted to gain a better understanding of why users want
access to documents in different languages, and to see whether users of a Digital Library would
want access to a multilingual retrieval system. Most users wanted this access for educational
purposes: they would use a Multilingual Information Retrieval System to complete assignments
that require searching for documents in a language other than English. The study showed that
some users felt it would be too difficult to search for documents that contain more than one
language (He, Luo & Wu, 2012, p. 188). The overall takeaway from the survey is that user needs
must be better understood to determine whether such a system works with the preexisting
Information Retrieval System and its users. Developers want to dismantle the barrier between
the user query and multilingual documents. This can be done by adjusting the Information
Retrieval System to incorporate multilingualism through translation tools and various other
techniques.
Generally, a Multilingual Retrieval System works by first retrieving documents from a
separate collection for each language. A monolingual list of results is retrieved from each
collection, and these lists are then merged to create a single multilingual list. Each system
can be adapted to cater to the needs of the organization. Different tools are employed to ensure
compatibility. A Multilingual Retrieval System generally focuses on one or all of the following:
document, query, and translation.
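The merging step described above can be illustrated with a minimal sketch. It assumes that each
monolingual collection returns (document id, score) pairs and that scores from different
collections are not directly comparable, so each list is rescaled to the range 0 to 1 before
merging. The function names, document ids, and scores are hypothetical, and min-max
normalization is only one simple merging strategy, not the method of any system reviewed here.

```python
from typing import Dict, List, Tuple

Result = Tuple[str, float]  # (document id, retrieval score)


def min_max_normalize(results: List[Result]) -> List[Result]:
    """Rescale the scores of one monolingual list to the range [0, 1]."""
    if not results:
        return []
    scores = [score for _, score in results]
    low, high = min(scores), max(scores)
    if high == low:
        # All scores are equal, so treat every document as equally relevant.
        return [(doc_id, 1.0) for doc_id, _ in results]
    return [(doc_id, (score - low) / (high - low)) for doc_id, score in results]


def merge_result_lists(per_language: Dict[str, List[Result]]) -> List[Tuple[str, str, float]]:
    """Merge monolingual lists into one multilingual list, best score first."""
    merged = []
    for language, results in per_language.items():
        for doc_id, score in min_max_normalize(results):
            merged.append((language, doc_id, score))
    # Sort by normalized score (descending); ties fall back to language and id.
    merged.sort(key=lambda item: (-item[2], item[0], item[1]))
    return merged


# Hypothetical monolingual result lists with scores on different scales.
monolingual_lists = {
    "en": [("en-1", 12.0), ("en-2", 6.0), ("en-3", 3.0)],
    "es": [("es-1", 0.9), ("es-2", 0.3)],
}
multilingual_list = merge_result_lists(monolingual_lists)
for language, doc_id, score in multilingual_list:
    print(language, doc_id, round(score, 2))
```

Normalizing first keeps a collection with large raw scores from dominating the merged list.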
Approach/Methods
A Multilingual Retrieval System is carried out with a variety of tools and features. The
system has three levels of concern: query, translation, and document. These areas are addressed
through different techniques, such as creating a dictionary-based model. Each Multilingual
Retrieval System has its own features and deploys different methods for retrieval. These
methods are adaptable and can be tailored to the audience the system is intended for.
Text mining is the process of deriving quality information from unstructured text (Chen,
Lee, & Yang, 2009, p. 4). “Text mining in a multilingual setting [is also incorporated as] an
automated process that is designed to discover the relationship between languages” (Hsiao, Lee
& Yang, 2009, p. 648). Three techniques are often employed to deal with the problem of creating
a multilingual-friendly system: using a machine translation system, using a bilingual
dictionary or terminology base, and using a statistical/probabilistic model based on parallel
texts.
Query translation is a strategy in which the user's query is translated into each language
present in the multilingual collection, generating a monolingual information retrieval process
per language (Cumbreras, Lopez & Santiago, 2011, p. 414). The most common query search depends
on concepts of natural language. A dictionary-based tool uses a bilingual list of words to
translate terms into different languages. A machine translation tool translates every document
in the corpus into multiple languages. Corpus-based retrieval tools use knowledge acquisition
techniques to discover cross-lingual relationships and use them in Multilingual Retrieval
Systems. This method uses word alignment to generate bilingual corpora, which establish
relationships between words in different languages. These relationships are in turn used to
create a translation table for query translation. It is recommended that the corpus be virtual
to save storage and time. These three methods are grouped together because of their relation to
each other. Query translation is made possible by dictionary-based tools. Once the query is
translated, the information is obtained from a corpus, which may have its documents clustered.
The documents in the corpora are commonly indexed based on a single keyword or a group of
keywords that can be easily found during searching. A Multilingual Comparable Corpus is another
tool, made up of translated documents that have the same topics. Many of the text mining themes
are based on this method (Hsiao, Lee & Yang, 2009, p. 650).
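A minimal sketch of dictionary-based query translation, as described above, is given here. The
bilingual dictionaries, the language codes, and the function names are hypothetical and chosen
only for illustration; a real system would load a full bilingual dictionary or a translation
table built from aligned corpora.

```python
from typing import Dict, List

# Hypothetical bilingual dictionaries: English term -> candidate translations.
BILINGUAL_DICTIONARIES: Dict[str, Dict[str, List[str]]] = {
    "es": {"library": ["biblioteca"], "book": ["libro"], "search": ["buscar", "búsqueda"]},
    "fr": {"library": ["bibliothèque"], "book": ["livre"], "search": ["chercher", "recherche"]},
}


def translate_query(query: str, target_language: str) -> List[str]:
    """Translate a query word by word; unknown words are kept unchanged."""
    dictionary = BILINGUAL_DICTIONARIES[target_language]
    translated: List[str] = []
    for word in query.lower().split():
        # Keep every candidate translation, since the dictionary alone cannot
        # tell which sense the user intended.
        translated.extend(dictionary.get(word, [word]))
    return translated


def translate_for_all_collections(query: str) -> Dict[str, List[str]]:
    """Produce one monolingual query per language in the multilingual collection."""
    return {lang: translate_query(query, lang) for lang in BILINGUAL_DICTIONARIES}


queries = translate_for_all_collections("library book search")
for language, terms in queries.items():
    print(language, terms)
```

Each translated query can then be run against the collection for its language as an ordinary
monolingual search.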
Thesaurus-based multilingual retrieval takes related terms in a document that are
commonly used and indexes them. This method can be carried out in Multilingual Information
Retrieval through mapping between thesauri of different languages (Chen, Lee, & Yang, 2009,
p. 6).
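One way to picture the mapping between thesauri is through shared concept identifiers, as in
the sketch below. The two small thesauri, the concept ids, and the function names are
assumptions made for illustration and are not taken from the cited work.

```python
from typing import Dict, List

# Hypothetical thesauri: each maps a term to a shared concept identifier.
THESAURI: Dict[str, Dict[str, str]] = {
    "en": {"heart attack": "C001", "myocardial infarction": "C001", "stroke": "C002"},
    "sv": {"hjärtinfarkt": "C001", "stroke": "C002", "slaganfall": "C002"},
}


def terms_for_concept(concept_id: str, language: str) -> List[str]:
    """List every term in one language's thesaurus that names the given concept."""
    return sorted(term for term, cid in THESAURI[language].items() if cid == concept_id)


def map_term(term: str, source_language: str, target_language: str) -> List[str]:
    """Map a term to its equivalents in another language via the shared concept."""
    concept_id = THESAURI[source_language].get(term.lower())
    if concept_id is None:
        return []  # The term is not in the source thesaurus.
    return terms_for_concept(concept_id, target_language)


print(map_term("heart attack", "en", "sv"))
print(map_term("stroke", "en", "sv"))
```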
The methods addressed above are all interchangeable with any system that is
implementing a multilingual extension. The intended purpose of tools such as corpora is to
ensure a repository is available to access the intended information. The benefit of clustering
corpora is that it provides a narrower grouping of documents and text that are comparable.
Applications
The following sections provide examples of existing systems that have added a multilingual
feature to an existing Information Retrieval System or created a new system. Multilingualism is
designed to be incorporated into an already existing system. The following systems are examined
for how they implemented multilingualism in their pre-existing system.
SveMed
SveMed uses terms from the Medical Subject Headings (MeSH) thesaurus, which contains a
controlled vocabulary, and translates these terms into different languages. These terms are
arranged in a hierarchical tree, and when deciding which terms to index, the indexer tries to
select the most specific term possible. These terms are then indexed and can be retrieved by
performing a truncation search. This is to ensure that user-submitted queries return results.
The interface uses a thesaurus-based database to translate the medical terms into three
different languages and distinguish information between the document terms (Gavel & Andersson,
2014, p. 272). It uses the Solr search engine and relies solely on query expansion. “The search
interface allows the user to search terms in English, Swedish, or Norwegian, and browse for MeSH
terms” (Gavel & Andersson, 2014, p. 274). A great advantage of this search interface is that it
allows the user to select which language to search in.
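The idea of a truncation search over indexed terms can be sketched in a few lines. This is
plain Python rather than Solr, and the index contents, document ids, and function name are
hypothetical; it only illustrates how a truncated query such as "infect*" can match several
indexed terms at once.

```python
from typing import Dict, List

# Hypothetical index: indexed term -> ids of documents indexed under that term.
INDEX: Dict[str, List[str]] = {
    "infarction": ["doc1"],
    "infection": ["doc2", "doc3"],
    "inflammation": ["doc4"],
    "insulin": ["doc5"],
}


def truncation_search(pattern: str, index: Dict[str, List[str]]) -> List[str]:
    """Return document ids for matching terms; a trailing '*' matches any ending."""
    pattern = pattern.lower()
    if pattern.endswith("*"):
        stem = pattern[:-1]
        matching_terms = [term for term in index if term.startswith(stem)]
    else:
        matching_terms = [term for term in index if term == pattern]
    doc_ids = set()
    for term in matching_terms:
        doc_ids.update(index[term])
    return sorted(doc_ids)


print(truncation_search("infect*", INDEX))
print(truncation_search("inf*", INDEX))
```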
GHSOM
“Growing hierarchical self-organizing map (GHSOM) constructs hierarchical structure of
expandable maps. Algorithms are developed after the relationships between other languages
based on the hierarchical map has been determined” (Chen, Lee & Yang, 2009, p. 7). A
part-of-speech tagger is used to select nouns from the text that will be used as keywords. The
queries are then processed into vectors that represent the overall meaning of the document.
Once the keywords have been selected, they are reduced to their roots. Training aids in the
encoding of bilingual documents to ensure users can access the information in these documents.
The expandable maps allow for better results.
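The step of turning selected keywords into vectors can be sketched as follows. This is not the
GHSOM algorithm itself; it only illustrates a simple term-count representation and assumes the
keywords have already been selected by a tagger and reduced to their roots. The keyword lists
and function names are hypothetical.

```python
from collections import Counter
from typing import List


def build_vocabulary(keyword_lists: List[List[str]]) -> List[str]:
    """Collect every distinct keyword across the documents, in sorted order."""
    return sorted({keyword for keywords in keyword_lists for keyword in keywords})


def to_vector(keywords: List[str], vocabulary: List[str]) -> List[int]:
    """Count how often each vocabulary term occurs among one document's keywords."""
    counts = Counter(keywords)
    return [counts[term] for term in vocabulary]


# Hypothetical keyword lists, one per document.
documents = [["map", "language", "map"], ["language", "retrieval"]]
vocabulary = build_vocabulary(documents)
vectors = [to_vector(keywords, vocabulary) for keywords in documents]
print(vocabulary)
print(vectors)
```

Vectors of this kind are the sort of input that a self-organizing map could be trained on.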
Merge Model
The system starts with a user query that is carried out by the Cross Lingual
Information Retrieval system. The query is sent to three different collections, and three sets
of results are returned. The merge model is designed to combine the three monolingual lists into
one multilingual list. In this model, sixty-two features are extracted from the three levels of
Multilingual Retrieval Systems: query, document, and translation (Chen, Tsai, & Wang, 2011,
p. 638). A learning-based ranking algorithm called FRank is employed to rank items based on
relevance. This learning-based merge model has room for improvement.
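The general idea of ranking merged results from extracted features can be sketched with a
simple weighted sum. This is not FRank and does not use the sixty-two features of the cited
study; the three feature names, the fixed weights, and the candidate documents are hypothetical,
and a learning-based method would fit the weights from training data instead.

```python
from typing import Dict, List

# Hypothetical weights, one per feature; a learning-to-rank method would fit these.
WEIGHTS: Dict[str, float] = {
    "retrieval_score": 0.6,         # document-level feature
    "translation_confidence": 0.3,  # translation-level feature
    "query_coverage": 0.1,          # query-level feature
}


def merge_score(features: Dict[str, float]) -> float:
    """Combine a document's features into one relevance score (weighted sum)."""
    return sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())


def rank(candidates: Dict[str, Dict[str, float]]) -> List[str]:
    """Order document ids from all collections by their combined score."""
    return sorted(candidates, key=lambda doc_id: merge_score(candidates[doc_id]), reverse=True)


# Hypothetical candidates drawn from three monolingual result lists.
candidates = {
    "en-1": {"retrieval_score": 0.9, "translation_confidence": 1.0, "query_coverage": 1.0},
    "zh-1": {"retrieval_score": 0.95, "translation_confidence": 0.4, "query_coverage": 0.5},
    "ja-1": {"retrieval_score": 0.5, "translation_confidence": 0.8, "query_coverage": 1.0},
}
ranking = rank(candidates)
print(ranking)
```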
ICE-TEA
Interactive Cross-Language search Engine with Translation Enhancement (ICE-TEA) performs
query translation based on an interactive Multilingual Information Access system. The language
resource used is a bilingual dictionary that translates English to Chinese. “Translation
enhancement is a feature of this system that provides users the original returned documents and
their translations. [The] system implements post-translation query expansions” (He, Wu & Xu,
2012, p. 527). The system is designed to allow users to delete any returned translations that
are not needed. The system allows users to interact with various stages of the Multilingual
Information Access system (He, Wu & Xu, 2012, p. 536). The system will need further development
to allow for better retrieval of relevant documents. Users can become more involved in the
information retrieval process with the help of this system.
BRUJA
BRUJA is a question answering system for the management of multilingual collections. This
system uses Cross Lingual Information Retrieval to retrieve documents from a multilingual
system. This is a common practice in multilingual systems. The system produces more correct
answers in Spanish than in other languages. This system uses a machine translation resource,
which requires a word-level alignment algorithm for the translations (Cumbreras, Lopez &
Santiago, 2011, p. 420).
What the systems have in common is the use of some form of query translation to bridge
the gap between the query and the documents. Each system’s goal is to enable the user to search
for information in multiple languages. The systems mention the involvement of a Cross Lingual
Retrieval System in the Multilingual Retrieval System. These two systems work together to
connect the user to the information requested. The user submits a query, and a tool translates
the query into the language corresponding to each collection. A list of monolingual results is
then returned for each collection. These lists are merged with the merge model explained above.
This model is only one example and can be adjusted to cater to any other system. The process of
organizing the multilingual documents differs depending on the use of the system. Documents can
be translated and then divided into comparable clusters or comparable corpora. Keywords are
often taken from documents and then translated into various languages before being searched in
the system. The sample systems and methods explained above help the user from the input of the
query to the receipt of the information.
ML News Clustering
Multilingual Document Clustering involves dividing a set of documents written in two
languages into clusters in such a way that similar documents are in the same cluster. News
clustering is popular because of the vast amount of news available to users. This study uses a
language-independent representation of news documents by focusing on clustering the news
documents according to their content. The authors started by using comparable multilingual news
articles (Fresno, Martinez & Montalvo, 2015, p. 522). Named entities played a role in natural
language processing tasks such as machine translation, clustering, summarization, and
extraction. The comparable corpora used were in Spanish and English. Expected Density is a
measure that can be used in a multilingual setting to determine the quality of the clusters
(Fresno, Martinez & Montalvo, 2015, p. 528).
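A minimal sketch of clustering news documents by the named entities they share is given below.
It assumes the named entities have already been extracted and relies on the fact that many
names are written the same way in Spanish and English. The single-pass clustering procedure,
the Jaccard similarity threshold, and the sample documents are assumptions made for
illustration and are not the method of the cited study.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of entities two sets have in common (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_by_entities(documents: Dict[str, Set[str]], threshold: float = 0.3) -> List[List[str]]:
    """Assign each document to the most similar existing cluster, or start a new one."""
    clusters: List[List[str]] = []
    cluster_entities: List[Set[str]] = []
    for doc_id, entities in documents.items():
        best_index, best_similarity = -1, 0.0
        for index, seen in enumerate(cluster_entities):
            similarity = jaccard(entities, seen)
            if similarity > best_similarity:
                best_index, best_similarity = index, similarity
        if best_index >= 0 and best_similarity >= threshold:
            clusters[best_index].append(doc_id)
            cluster_entities[best_index] |= entities
        else:
            clusters.append([doc_id])
            cluster_entities.append(set(entities))
    return clusters


# Hypothetical news items, each reduced to its set of named entities.
news = {
    "en-1": {"Madrid", "UNESCO", "Spain"},
    "es-1": {"Madrid", "UNESCO", "España"},
    "en-2": {"NASA", "Mars"},
    "es-2": {"NASA", "Marte", "Mars"},
}
clusters = cluster_by_entities(news)
print(clusters)
```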
Challenges/Limitations
Each article reviewed explains the challenges of creating a multilingual retrieval system.
A large amount of text has multiple meanings in different languages. This poses a problem when
indexed terms are translated into a term that is represented in the system. Multilingualism in
Information Retrieval Systems is a challenge due to the limitations of the existing programs
that are available. The resources available are limited to main items such as query
translation. Many developers want to steer away from translators due to the inaccuracy of some
translations. When words are translated into another language, the developer runs the risk of
the word being translated incorrectly because of a missed meaning or inadequate translation
tools for languages derived from a specific region. For example, Spanish has many regions of
origin, which means a viable translation system must be equipped to translate different
regional variants of Spanish words. This has not been developed.
Some translation systems are not equipped to handle the translation of proper nouns. A machine
translation system is deemed impractical due to the large amount of text being translated
(Dhavachelvan & Sujatha, 2011, p. 116). The larger the text, the slower the retrieval time.
It is important to choose comprehensive keywords to increase the chance of retrieving relevant
documents (Peters, et al., 2011, p. 5). In some languages there is no way to change a verb to a
noun, which is why some systems require the keyword to contain a noun (Peters, et al., 2011,
p. 11). These challenges are common in an information setting where the user is looking for
information in either their native or nonnative language.
Future Research
Future research should include the creation of large bilingual text corpora, large-scale
text databases for testing, and a database of lexical semantic relations (Fluhr, n.d.,
para. 24). Systems need to be tested in various languages. The Cross-Language Evaluation Forum
(CLEF) spent 2000 to 2005 researching implemented systems that have multilingual features for
digital media. CLEF noticed that most of the systems examined pre-processed the document
collection and adopted linguistic processors and language resources such as POS taggers
(Peters, 2011, p. 677).
Future testing should include a wide range of users in the test group. A group of test
users who are all from one specific region does not allow for accurate results. The test group
needs to be diverse. Questions about multilingualism should be asked to determine how users
would use the system and whether it would be necessary to implement.
User knowledge needs to be improved. The challenge of implementing a new system that
involves more than one language can frustrate native English speakers and nonnative English
speakers alike. A study showed “the language choices made by the students while searching for
information on the Internet seemed to indicate that the students used their native languages
just as much as they used English. This is a reflection of the rising multilingualism and
multiculturalism in the online environment and the fact that English is not as dominant as it
was some years ago” (Ajiferuke, et al., 2016, p. 498). Adequate time needs to be set aside to
train users how to search and use such a system. Organizations need to decide whether
implementing a Multilingual Retrieval System will be beneficial to their user audience.
Discussion
Multilingualism in information retrieval systems is a concept that is still in its beginning
stages. It is a challenge to take a document that is written in multiple languages and translate
it into the language used in the search query. “Multilingualism plays a role in the quality and
effectiveness of communication services offered [to users]” (Menard, 2011, p. 15).
Multilingualism is needed not only in library systems; a museum also felt the need to offer this
service to its users. This feature was used to allow users to search images that have been
indexed in multiple languages.
A Multilingual Information Retrieval System provides document retrieval techniques that
enable a user to enter a query, including a natural language query, in a desired one of a plurality
of supported languages, and retrieve documents from a database that includes documents in at
least one other language of the plurality of supported languages (Libby, et al., 1999, p. 8).
A variety of articles were examined, each discussing different but related aspects of
Multilingual Retrieval Systems. Significant improvements can be made to existing retrieval
systems that are implementing multilingual features. Multilingualism is designed to be
incorporated into an already existing Information Retrieval System. Many tools are currently
available, and others still need to be developed. Currently these systems are limited to
dictionary-based tools, corpora, clustering, indexing, and thesaurus-based tools. These tools
have been beneficial to the development of these systems but need to be enhanced because of the
errors that can arise.
References
García-Cumbreras, M. Á, Martínez-Santiago, F., & Ureña-López, L. A. (2011, 10). Architecture and
evaluation of BRUJA, a multilingual question answering system. Information Retrieval, 15(5),
413-432. doi:10.1007/s10791-011-9177-5
Fluhr, C. (n.d.). Multilingual Information Retrieval. Retrieved from
http://www.cslu.ogi.edu/HLTsurvey/ch8node7.html
Gavel, Y., & Andersson, P. (2014, 06). Multilingual query expansion in the SveMed bibliographic
database: A case study. Journal of Information Science, 40(3), 269-280.
doi:10.1177/0165551514524685
Libby, E. D., Palk, W., Yu, E. S., & Li, M. (1999). U.S. Patent No. 6006221. Washington, DC: U.S.
Patent and Trademark Office.
Montalvo, S., Martínez, R., & Fresno, V. (2015, 08). Quality prediction of multilingual news
clustering: An experimental study. Journal of Information Science, 41(4), 518-530.
doi:10.1177/0165551515586671
Ménard, E. (2011, 07). Search Behaviours of Image Users: A Pilot Study on Museum Objects.
Partnership: The Canadian Journal of Library and Information Practice and Research, 6(1).
doi:10.21083/partnership.v6i1.1433
Nzomo, P., Ajiferuke, I., Vaughan, L., & Mckenzie, P. (2016, 09). Multilingual Information Retrieval
& Use: Perceptions and Practices Amongst Bi/Multilingual Academic Users. The Journal of
Academic Librarianship, 42(5), 495-502. doi:10.1016/j.acalib.2016.06.012
Peters, C., Braschler, M., & Clough, P. (2011, 09). Evaluation for Multilingual Information Retrieval
Systems. Multilingual Information Retrieval, 129-169. doi:10.1007/978-3-642-23008-0_5
Sujatha, P., & Dhavachelvan, P. (2011, 10). A Review on the Cross and Multilingual Information Retrieval. International
Journal of Web & Semantic Technology, 2(4), 115-124. doi:10.5121/ijwest.2011.2409
Tsai, M., Chen, H., & Wang, Y. (2011, 09). Learning a merge model for multilingual information
retrieval. Information Processing & Management, 47(5), 635-646.
doi:10.1016/j.ipm.2009.12.002
Wu, D., He, D., & Luo, B. (2012, 04). Multilingual needs and expectations in digital libraries. The
Electronic Library, 30(2), 182-197. doi:10.1108/02640471211221322
Wu, D., He, D., & Xu, X. (2012, 08). A study of relevance feedback techniques in interactive
multilingual information access. Library Hi Tech, 30(3), 523-544.
doi:10.1108/07378831211266645
Yang, H., Hsiao, H., & Lee, C. (2011, 09). Multilingual document mining and navigation using self-
organizing maps. Information Processing & Management, 47(5), 647-666.
doi:10.1016/j.ipm.2009.12.003
Yang, H., Lee, C., & Chen, D. (2009, 02). A method for multilingual text mining and retrieval using
growing hierarchical self-organizing maps. Journal of Information Science, 35(1), 3-23.
doi:10.1177/0165551508088968
Zhang, X., Liu, J. N., & Atwell, E. (n.d.) Multilingual Information Retrieval in World Wide Web.
Retrieved from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.165.90&rep=rep1&type=pdf