International Journal on Natural Language Computing (IJNLC) Vol. 1, No.3, October 2012
DOI: 10.5121/ijnlc.2012.1301
A RULE-BASED APPROACH FOR ALIGNING
JAPANESE-SPANISH SENTENCES FROM A
COMPARABLE CORPORA
Jessica C. Ramírez1 and Yuji Matsumoto2
Information Science, Nara Institute of Science and Technology, Nara, Japan
1 Jessicrv1@yahoo.com.mx
2 matsu@is.naist.jp
ABSTRACT
The performance of a Statistical Machine Translation (SMT) system is directly proportional to
the quality and size of the parallel corpus it uses. However, for some language pairs such corpora
are severely lacking. Our long-term goal is to construct a Japanese-Spanish parallel corpus to be
used for SMT, since useful Japanese-Spanish parallel corpora are scarce. To address this
problem, in this study we propose a method for extracting Japanese-Spanish parallel sentences from
Wikipedia using POS tagging and a rule-based approach. The main focus of this approach is the syntactic
features of both languages. Human evaluation was performed over a sample and shows promising results
in comparison with the baseline.
KEYWORDS
Comparable Corpora, POS tagging, Sentence alignment, Machine Translation
1. INTRODUCTION
Much research in recent years has focused on constructing aligned data resources semi-automatically
and automatically, as these are essential for many Natural Language Processing tasks; however, for
some language pairs there is still a huge lack of annotated data.
Manual construction of a parallel corpus requires high-quality translators; besides, it is
time-consuming and expensive. With the proliferation of the internet and the immense amount of data
it holds, a number of researchers have proposed using the World Wide Web as a large-scale corpus [5].
However, due to redundant and ambiguous information on the web, we must find methods of
extracting only the information that is useful for a given task [4].
Extracting parallel sentences from a comparable corpus is a challenging task: even though two
documents may refer to the same topic, it is possible that they do not have a single sentence
in common.
In this study, we propose an approach for extracting Japanese-Spanish parallel sentences from
Wikipedia using Part-of-Speech rule-based alignment and dictionary-based translation. As
comparable corpora we use Wikipedia articles, together with a dictionary extracted from the
Wikipedia links and Aulex, a free Japanese-Spanish dictionary.
2. RELATED WORKS
The use of Wikipedia as a data resource in NLP is fairly new, and thus research is fairly limited.
There are, however, works showing promising results. [2] attempts to extract named entities
from Wikipedia and presents two disambiguation methods, using cosine similarity and an SVM.
It first detects a named entity in Wikipedia using an IE technique, and then disambiguates
between multiple entities by measuring the similarity of the context articles with cosine
similarity and a taxonomy kernel.
The work most directly related to this research is [1]. Their research uses two approaches to
measure similarity between sentences in Wikipedia. First, they introduce an MT-based approach, using
Jaccard similarity and the ‘Babel MT system of Altavista’1 to align the sentences. The second
approach, the link-based bilingual lexicon, relies on the hyperlinks in every sentence and on
a dictionary extracted from Wikipedia. Their results were best with the
second approach, especially for articles that are literal translations of each other.
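As a small illustration of the Jaccard similarity mentioned above, the following sketch compares two tokenised sentences. The function name, the example sentences and the treatment of two empty sentences are assumptions made for illustration; they are not taken from [1].

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Return |A ∩ B| / |A ∪ B| for two token collections."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty sentences
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical example: a machine-translated sentence and a candidate sentence.
translated = "the dog drinks water".split()
candidate = "the dog drinks cold water".split()
score = jaccard_similarity(translated, candidate)
print(score)
```

Four of the five distinct words are shared, so the score is 4/5.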
Our approach differs in that we convert the articles into their equivalent POS tags and align only
the sentences that conform to the rules we add. We then use two Japanese-Spanish
dictionaries as a seed lexicon.
3. BACKGROUND
3.1. Comparable Corpora2
A comparable corpus is a collection of texts about a given topic in two or more languages. An
example is the ‘Yomiuri Shimbun3 corpora’, a corpus extracted from the newspaper's daily news in
both English and Japanese. Although the news in both languages is the same, the articles are not
a proper translation of each other.
Comparable corpora are used for many NLP tasks, such as information retrieval, machine
translation and bilingual lexicon extraction. For languages with a scarcity of resources,
comparable corpora are a prime alternative in NLP research.
3.2. Wikipedia
Wikipedia4 is a multilingual web-based encyclopedia with articles on a wide range of topics, in
which the texts are aligned across different languages.
Wikipedia is the successor of Nupedia, an online encyclopedia written by experts in different
fields that no longer exists. Wikipedia arose as a single-language project (English) in January
2001 to support Nupedia. Wikipedia differs from Nupedia mainly in that anyone can write for it,
and writers do not need to be experts in the field they write about.
1 Babel is an online multilingual translation system.
2 Corpora is a plural of corpus.
3 読売新聞 (Yomiuri Shimbun) is a Japanese newspaper that also has an English version. http://www.yomiuri.co.jp/
4 http://en.wikipedia.org/wiki/Main_Page
Wikipedia is written collaboratively by volunteers (called “wikipedians”) from all around the
world. Because these volunteers come from many nationalities, they write in
many different languages. Wikipedia has articles written in more than 200 languages, with
a different number of articles in each language.
The topics range from science, covering many different fields such as informatics, biology and
anthropology, to entertainment, such as album names, artists and actors, and fictional characters
such as James Bond.
Wikipedia is not just a simple encyclopaedia; it has some features that make it suitable
for NLP research. These features are:
3.2.1. Redirect pages
Redirect pages are a very suitable resource for eliminating redundant articles, that is, for
avoiding the existence of two articles referring to the same topic.
They are used in the case of:
• Synonyms, like ‘altruism’, which redirects to ‘selflessness’
• Abbreviations, like ‘USA’, which redirects to ‘The United States of America’
• Variations in spelling, like ‘colour’, which redirects to ‘color’
• Nicknames and pseudonyms, like ‘Einstein’, which redirects to ‘Albert Einstein’
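The redirect cases listed above can be pictured as a title-to-title mapping. The sketch below is only illustrative: the function name resolve_redirect, the max_hops guard and the in-memory dictionary are assumptions, not part of the system described in this paper.

```python
def resolve_redirect(title, redirects, max_hops=10):
    """Follow redirect entries until a title with no redirect is reached."""
    seen = set()
    # The 'seen' set and 'max_hops' guard against redirect loops.
    while title in redirects and title not in seen and len(seen) < max_hops:
        seen.add(title)
        title = redirects[title]
    return title


# Redirect table built from the examples given in the text.
redirects = {
    "altruism": "selflessness",
    "USA": "The United States of America",
    "colour": "color",
    "Einstein": "Albert Einstein",
}
print(resolve_redirect("USA", redirects))
```

Titles that are not redirects are returned unchanged, so the function can be applied to every link target before looking up an article.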
3.2.2. Disambiguation pages
Disambiguation pages are pages that contain the list of the different senses of a word.
3.2.3. Hyperlinks
Articles contain words or entities that have an article of their own. When a user clicks such a
link, they are taken to the article about that word.
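In the wiki markup of an article, such internal links are written as [[Target]] or [[Target|label]]. A minimal sketch of collecting them with a regular expression follows; the pattern and the function name extract_links are assumptions for illustration, and the pattern makes no attempt to filter out file or category links.

```python
import re

# [[Target]], [[Target|label]] or [[Target#Section|label]]
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")


def extract_links(wikitext):
    """Return (target_title, anchor_text) pairs for each internal link."""
    links = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # Without an explicit label, the target title is what the reader sees.
        label = (match.group(2) or target).strip()
        links.append((target, label))
    return links


sample = "[[Albert Einstein|Einstein]] was born in [[Germany]]."
links = extract_links(sample)
print(links)
```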
3.2.4. Category pages
Category pages are pages without articles that list the members of a particular category and its
subcategories. These pages have titles that start with “Category:”, followed by the name of
the particular category.
Categorization is a Wikipedia project that attempts to assign a category to each article.
Categories are assigned manually by wikipedians, and therefore not all pages have a category.
Some articles belong to multiple categories. For example, the article “Dominican Republic”
belongs to three categories: “Dominican Republic”, “Island countries” and “Spanish-speaking
countries”. Thus the article “Dominican Republic” appears in three different category
pages.
4. Methodology
4.1. General Description
Figure 1 shows a general view of the methodology. First, we extract from Wikipedia all the
aligned links, i.e. Wikipedia article titles. We then extract the Japanese and Spanish articles
about the same topic, eliminate the unnecessary data (pre-processing) and split the articles into
sentences. Next, we use a POS tagger to add the lexical category to each word in a given sentence,
and choose the sentences that match according to their lexical categories. We then use the
dictionaries to make a word-to-word translation. Finally, we obtain the sentences considered to be parallel.
Figure 1. Methodology
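The word-to-word translation step can be sketched as follows. The text does not specify how candidate sentences are scored, so the overlap count used here is an assumption, as are the function names, the toy dictionary and the use of dictionary (base) forms of the words.

```python
def translate_words(words, dictionary):
    """Translate each word that has a dictionary entry; skip the rest."""
    return {dictionary[w] for w in words if w in dictionary}


def best_alignment(ja_words, es_sentences, dictionary):
    """Return (index, score) of the Spanish sentence sharing most translated words."""
    translated = translate_words(ja_words, dictionary)
    best_index, best_score = None, 0
    for index, es_words in enumerate(es_sentences):
        score = len(translated & set(es_words))  # assumed score: shared words
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score


# Toy data, for illustration only.
toy_dictionary = {"犬": "perro", "水": "agua", "飲む": "beber"}
ja = ["犬", "水", "飲む"]
es = [["gato", "comer", "pescado"], ["perro", "beber", "agua"]]
result = best_alignment(ja, es, toy_dictionary)
print(result)
```

A sentence with no dictionary match at all yields (None, 0), which corresponds to leaving it unaligned.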
4.2. Dictionary Extraction from Wikipedia
The goal of this phase is the acquisition of Japanese-Spanish-English tuples of Wikipedia article titles in
order to acquire translations. Wikipedia provides links in each article to the corresponding articles in different
languages.
Every article page in Wikipedia has, on the left-hand side, some boxes labelled ‘navigation’, ‘search’, ‘toolbox’
and finally ‘in other languages’. The last one lists all the languages available for that article, although the
articles in the different languages do not all have exactly the same contents. In most cases English articles are longer
or have more information than the same article in other languages, because most of the Wikipedia
collaborators are native English speakers.
4.2.1. Methodology
Take all article titles that are nouns or named entities and look in the articles’ contents for the
box called ‘In other languages’. Verify that it has at least one link. If the box exists, it links to
the same article in other languages. Extract the titles in these other languages and align them with
the original article title. For instance, the Spanish article titled ‘economía’ (economics) is
translated into Japanese as ‘keizaigaku’ (経済学). When we click Spanish or Japanese in the
other languages box we obtain an article about the same topic in the other language; this gives us
the translation.
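The extraction described above can be sketched as follows, assuming the inter-language links have already been read into a mapping from each Spanish title to its titles in other languages. The input layout, the function name build_dictionary and the second toy entry are hypothetical.

```python
def build_dictionary(articles):
    """articles: Spanish title -> {language code: title in that language}.

    Returns (Japanese, Spanish, English) tuples; English may be None.
    """
    entries = []
    for es_title, links in articles.items():
        if "ja" not in links:
            continue  # no Japanese counterpart, so nothing to align
        entries.append((links["ja"], es_title, links.get("en")))
    return entries


# Toy input: the first entry follows the example in the text.
articles = {
    "economía": {"ja": "経済学", "en": "Economics"},
    "artículo sin enlaces": {},
}
dictionary = build_dictionary(articles)
print(dictionary)
```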
4.3. Extract Japanese and Spanish articles
We used the Japanese-Spanish dictionary (4.2.) to select the articles with links in Japanese and
Spanish.
4.4. Pre-processing
We eliminate the irrelevant information from Wikipedia articles to make processing easier and
faster.
The steps are as follows.
1. Remove from the pages all irrelevant information, such as images, menus and characters such
as “()”, “&quot”, “*”, etc.
2. Verify whether a link is a redirected article and extract the original article.
3. Remove all stopwords, i.e. general words that do not give information about a specific topic,
such as “the”, “between”, “on”, etc.
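Steps 1 and 3 can be sketched with the Python standard library alone. The character set and the three-word stopword list below come from the examples in the text; the function names and the sample sentence are assumptions, and a real stopword list would be much longer and language-specific.

```python
import html
import re

STOPWORDS = {"the", "between", "on"}  # tiny illustrative list from the text


def clean_text(text):
    """Step 1: drop entity residue and the listed special characters."""
    text = html.unescape(text)              # turns &quot; into a plain quote
    text = re.sub(r'[()"*]', " ", text)     # remove ( ) " and *
    return re.sub(r"\s+", " ", text).strip()


def remove_stopwords(text, stopwords=STOPWORDS):
    """Step 3: keep only the words that are not in the stopword list."""
    return [w for w in text.split() if w.lower() not in stopwords]


raw = "The dog (a mammal) sat &quot;quietly&quot; on * the mat"
tokens = remove_stopwords(clean_text(raw))
print(tokens)
```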
4.5. Splitting into Sentences and POS tagging
For splitting the Spanish articles into sentences we used the NLTK toolkit5, which is a well-known
platform for building Python programs that work with language data.
To tag the Spanish sentences we used FreeLing6, which is an open-source suite of language analyzers
specialized in the Spanish language.
5 http://nltk.org/
6 http://nlp.lsi.upc.edu/freeling/
For splitting the Japanese articles into sentences and words, and for adding a word category, we
used MeCab7, which is a
Part-of-Speech and morphological analyser for Japanese.
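To make the sentence-splitting step concrete, here is a deliberately simplified stand-in written with regular expressions only. It does not call NLTK, FreeLing or MeCab and will not handle abbreviations or quoted speech; the function names and the example texts are assumptions.

```python
import re


def split_spanish(text):
    """Split after . ! or ? when the next sentence starts with a capital, ¿ or ¡."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-ZÁÉÍÓÚÑ¿¡])", text.strip())
    return [p for p in parts if p]


def split_japanese(text):
    """Split on the Japanese full stop, keeping it attached to its sentence."""
    return [s for s in re.findall(r"[^。]+。?", text.strip()) if s.strip()]


es_sentences = split_spanish("El perro bebe agua. ¿Tiene sed? Sí.")
ja_sentences = split_japanese("犬は水を飲みます。猫は魚を食べます。")
print(es_sentences)
print(ja_sentences)
```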
4.6. Constructing the Rules
Japanese is a Subject-Object-Verb language, while Spanish is a Subject-Verb-Object language.
Figure 2. Basic Japanese-Spanish sentence order
Figure 2 shows the basic order of the sentences in both Japanese and Spanish, using as an example
the sentence “The dog drinks water”. The Japanese sentence ‘犬は水をのみます’ (Inu wa mizu
wo nomimasu), ‘The dog drinks water’, is translated into Spanish as ‘El perro bebe agua’.
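The word-order difference can be expressed as a reordering of labelled chunks. In this sketch the role labels 'S', 'O' and 'V' and the function name sov_to_svo are assumptions; the chunks are taken from the example sentence and are assumed to be labelled already.

```python
def sov_to_svo(chunks):
    """chunks: (role, text) pairs in Japanese order, with roles 'S', 'O', 'V'."""
    by_role = {role: text for role, text in chunks}
    # Emit the chunks in Spanish order, skipping any role that is absent.
    return [(role, by_role[role]) for role in ("S", "V", "O") if role in by_role]


japanese = [("S", "犬は"), ("O", "水を"), ("V", "飲みます")]
reordered = sov_to_svo(japanese)
print(reordered)
```

After reordering, the chunk roles line up position by position with ‘El perro’ (S), ‘bebe’ (V) and ‘agua’ (O).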
Table 1. Japanese-Spanish rules
Rule | Spanish characteristics | Japanese characteristics | Rule description (Japanese => Spanish)
Noun | Affects the gender of the adjective | Adjectives do not have gender | Noun+desu => noun; Adj => Noun (gender) Adj
Named Entity | Always starts with a capital letter | This distinction does not exist | NE => NE (capital letter)
Adjective | Has gender and number | Adjective (Na), adjective (I) | Adj (fe/male) => Adj (NA/I)
Question | Delimited by question marks ¿? | The sentence ends in か (ka) | (sentence + か) => (¿ + sentence + ?)
Pronouns | Can be omitted according to the context | Can be omitted, as in Spanish | Pron => Pron
Table 1 shows some of the rules applied in this work. These rules were created taking into account
the morphological and syntactic characteristics of each language. For example, in Japanese
adjectives have no gender, while in Spanish gender is indispensable.
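As an illustration of how one such rule might be checked, the sketch below encodes the Question rule from Table 1: a Japanese sentence ending in か should pair with a Spanish sentence delimited by ¿ and ?. The function names and example sentences are assumptions, and the か test is a surface simplification, not the tagger-based check used in this work.

```python
def is_japanese_question(sentence):
    """True if the sentence ends in か once final punctuation is removed."""
    core = sentence.rstrip("。？? ")
    return core.endswith("か")


def is_spanish_question(sentence):
    """True if the sentence is delimited by ¿ and ?."""
    s = sentence.strip()
    return s.startswith("¿") and s.endswith("?")


def question_rule_matches(ja_sentence, es_sentence):
    """Both sentences must be questions, or neither of them."""
    return is_japanese_question(ja_sentence) == is_spanish_question(es_sentence)


print(question_rule_matches("これは水ですか。", "¿Es esto agua?"))
print(question_rule_matches("犬は水を飲みます。", "¿Es esto agua?"))
```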
7 http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/hardy/japanese/
Japanese: 犬は水を飲みます。 => 犬は (Subject) 水を (Object) 飲みます。 (Verb)
Spanish: El perro bebe agua. => El perro (Subject) bebe (Verb) agua. (Object)
5. Experimental Evaluation
For the evaluation of the proposed method, we took a sample of 20 random Japanese and Spanish
articles. The experiments were based on two approaches: the hyperlink approach [1] as a
baseline, and the rule-based approach. We downloaded the April 2012 Wikipedia XML data8.
We used the Aulex9 dictionary because the dictionary extracted from Wikipedia contains mostly
nouns and named entities. To align other grammatical forms, such as verbs and adjectives, we
require another dictionary.
Table 2 shows the results obtained with both the baseline [1] and our approach. In column 1,
“Correct Identification” means that the sentences had the highest scores and the alignment was
correct. “Partial Matching” refers to sentences in the source and target languages that have a noun
phrase in common. Lastly, “Incorrect Identification” refers to sentences that had the highest
scores although not even one word matched.
Table 2. Results
 | Baseline (Hyperlinks) | Our Approach (Rule-Based POS)
Correct Identification | 13 | 42
*Partial Matching | 51 | 46
Incorrect Identification | 36 | 12
Total | 100 | 100
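The differences between the two columns of Table 2 can be checked with a few lines of arithmetic. The figures are those of the table; the variable names are only illustrative.

```python
# (baseline, our approach) counts from Table 2
results = {
    "Correct Identification": (13, 42),
    "Partial Matching": (51, 46),
    "Incorrect Identification": (36, 12),
}

baseline_total = sum(base for base, _ in results.values())
ours_total = sum(ours for _, ours in results.values())

# Positive values mean the rule-based approach has more items in that row.
differences = {label: ours - base for label, (base, ours) in results.items()}

print(baseline_total, ours_total)
print(differences)
```

Both columns sum to 100. Relative to the baseline, the rule-based approach has 29 more correct identifications, 5 fewer partial matches and 24 fewer incorrect identifications.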
5.2. Discussion
We noticed that the sentences from the first part of an article, which is usually the definition
of the title of the article, have more correct identifications with both approaches. Overall, the
rule-based approach performed better than the baseline. This is because, with the baseline,
whenever the hyperlink word or the title of the article is repeated in a given sentence, the
sentence automatically receives the best score, even if it is redundant.
Some identifications did not perform as well as we expected because we need to add more rules. This
can be done manually or by using bootstrapping methods, which is a very interesting point for future work.
We have noticed that with this method it is possible to construct new sentences, even if
they are not in both articles.
6. CONCLUSIONS AND FUTURE WORKS
This study focuses on aligning Japanese-Spanish sentences by using a rule-based approach. We
have demonstrated the feasibility of using Wikipedia’s features for aligning several languages.
We have used POS tags and constructed rules for aligning the sentences in both the source and
target articles.
8 The Wikipedia data is increasing constantly.
9 http://aulex.org/ja-es/
The same method can be applied to any pair of languages in Wikipedia, and to other types of
comparable corpora.
For future work, we will explore the use of English as a pivot language, and the automatic
construction of a corpus by translating.
ACKNOWLEDGEMENTS
We would like to thank Yuya R. for her contribution and helpful comments.
REFERENCES
[1] Adafre, Sisay F. & De Rijke, Maarten, (2006) “Finding Similar Sentences across Multiple Languages
in Wikipedia”, In Proceedings of EACL-06, pages 62-69.
[2] Bunescu, Razvan & Pasca, Marius, (2006) “Using Encyclopedic Knowledge for Named Entity
Disambiguation”, In Proceedings of EACL-06, pages 9-16.
[3] Fung, Pascale & Cheung, Percy, (2004) “Multi-level Bootstrapping for Extracting Parallel Sentences
from a Quasi-Comparable Corpus”, In Proceedings of the 20th International Conference on
Computational Linguistics, pages 350
[4] Ramírez, Jessica, Asahara, Masayuki & Matsumoto, Yuji, (2008) “Japanese-Spanish Thesaurus
Construction Using English as a Pivot”, In Proceedings of The Third International Joint Conference
on Natural Language Processing (IJCNLP), Hyderabad, India, pages 473-480.
[5] Rigau, German, Magnini, Bernardo, Agirre, Eneko & Carroll, John, (2002) “A Roadmap to
Knowledge Technologies”, In Proceedings of COLING Workshop on A Roadmap for Computational
Linguistics, Taipei, Taiwan.
[6] Tillmann, Christoph, (2009) “A Beam-Search Extraction Algorithm for Comparable Data”, In
Proceedings of ACL, pages 225-228.
[7] Tillmann, Christoph & Xu, Jian-Ming, (2009) “A Simple Sentence-Level Extraction Algorithm for
Comparable Data”, In Proceedings of HLT/NAACL, pages 93-96.
Authors
Jessica C. Ramírez
She received her M.S. degree from Nara Institute of Science and Technology (NAIST)
in 2007. She is currently pursuing a Ph.D. degree. Her research interests include
machine translation and word sense disambiguation.
Yuji Matsumoto
He received his M.S. and Ph.D. degrees in information science from Kyoto University in 1979 and in 1989.
He is currently a Professor at the Graduate School of Information Science, Nara Institute of Science and
Technology. His main research interests are natural language understanding and machine learning.

Evaluation of Medium-Sized Language Models in German and English Language
kevig
 
IMPROVING DIALOGUE MANAGEMENT THROUGH DATA OPTIMIZATION
IMPROVING DIALOGUE MANAGEMENT THROUGH DATA OPTIMIZATIONIMPROVING DIALOGUE MANAGEMENT THROUGH DATA OPTIMIZATION
IMPROVING DIALOGUE MANAGEMENT THROUGH DATA OPTIMIZATION
kevig
 
Document Author Classification Using Parsed Language Structure
Document Author Classification Using Parsed Language StructureDocument Author Classification Using Parsed Language Structure
Document Author Classification Using Parsed Language Structure
kevig
 
RAG-FUSION: A NEW TAKE ON RETRIEVALAUGMENTED GENERATION
RAG-FUSION: A NEW TAKE ON RETRIEVALAUGMENTED GENERATIONRAG-FUSION: A NEW TAKE ON RETRIEVALAUGMENTED GENERATION
RAG-FUSION: A NEW TAKE ON RETRIEVALAUGMENTED GENERATION
kevig
 
Performance, energy consumption and costs: a comparative analysis of automati...
Performance, energy consumption and costs: a comparative analysis of automati...Performance, energy consumption and costs: a comparative analysis of automati...
Performance, energy consumption and costs: a comparative analysis of automati...
kevig
 
EVALUATION OF MEDIUM-SIZED LANGUAGE MODELS IN GERMAN AND ENGLISH LANGUAGE
EVALUATION OF MEDIUM-SIZED LANGUAGE MODELS IN GERMAN AND ENGLISH LANGUAGEEVALUATION OF MEDIUM-SIZED LANGUAGE MODELS IN GERMAN AND ENGLISH LANGUAGE
EVALUATION OF MEDIUM-SIZED LANGUAGE MODELS IN GERMAN AND ENGLISH LANGUAGE
kevig
 

More from kevig (20)

Identification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian LanguagesIdentification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian Languages
 
Effect of Query Formation on Web Search Engine Results
Effect of Query Formation on Web Search Engine ResultsEffect of Query Formation on Web Search Engine Results
Effect of Query Formation on Web Search Engine Results
 
Investigations of the Distributions of Phonemic Durations in Hindi and Dogri
Investigations of the Distributions of Phonemic Durations in Hindi and DogriInvestigations of the Distributions of Phonemic Durations in Hindi and Dogri
Investigations of the Distributions of Phonemic Durations in Hindi and Dogri
 
May 2024 - Top10 Cited Articles in Natural Language Computing
May 2024 - Top10 Cited Articles in Natural Language ComputingMay 2024 - Top10 Cited Articles in Natural Language Computing
May 2024 - Top10 Cited Articles in Natural Language Computing
 
Effect of Singular Value Decomposition Based Processing on Speech Perception
Effect of Singular Value Decomposition Based Processing on Speech PerceptionEffect of Singular Value Decomposition Based Processing on Speech Perception
Effect of Singular Value Decomposition Based Processing on Speech Perception
 
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
Identifying Key Terms in Prompts for Relevance Evaluation with GPT ModelsIdentifying Key Terms in Prompts for Relevance Evaluation with GPT Models
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
 
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
Identifying Key Terms in Prompts for Relevance Evaluation with GPT ModelsIdentifying Key Terms in Prompts for Relevance Evaluation with GPT Models
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
 
IJNLC 2013 - Ambiguity-Aware Document Similarity
IJNLC  2013 - Ambiguity-Aware Document SimilarityIJNLC  2013 - Ambiguity-Aware Document Similarity
IJNLC 2013 - Ambiguity-Aware Document Similarity
 
Genetic Approach For Arabic Part Of Speech Tagging
Genetic Approach For Arabic Part Of Speech TaggingGenetic Approach For Arabic Part Of Speech Tagging
Genetic Approach For Arabic Part Of Speech Tagging
 
Rule Based Transliteration Scheme for English to Punjabi
Rule Based Transliteration Scheme for English to PunjabiRule Based Transliteration Scheme for English to Punjabi
Rule Based Transliteration Scheme for English to Punjabi
 
Improving Dialogue Management Through Data Optimization
Improving Dialogue Management Through Data OptimizationImproving Dialogue Management Through Data Optimization
Improving Dialogue Management Through Data Optimization
 
Document Author Classification using Parsed Language Structure
Document Author Classification using Parsed Language StructureDocument Author Classification using Parsed Language Structure
Document Author Classification using Parsed Language Structure
 
Rag-Fusion: A New Take on Retrieval Augmented Generation
Rag-Fusion: A New Take on Retrieval Augmented GenerationRag-Fusion: A New Take on Retrieval Augmented Generation
Rag-Fusion: A New Take on Retrieval Augmented Generation
 
Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...
Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...
Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...
 
Evaluation of Medium-Sized Language Models in German and English Language
Evaluation of Medium-Sized Language Models in German and English LanguageEvaluation of Medium-Sized Language Models in German and English Language
Evaluation of Medium-Sized Language Models in German and English Language
 
IMPROVING DIALOGUE MANAGEMENT THROUGH DATA OPTIMIZATION
IMPROVING DIALOGUE MANAGEMENT THROUGH DATA OPTIMIZATIONIMPROVING DIALOGUE MANAGEMENT THROUGH DATA OPTIMIZATION
IMPROVING DIALOGUE MANAGEMENT THROUGH DATA OPTIMIZATION
 
Document Author Classification Using Parsed Language Structure
Document Author Classification Using Parsed Language StructureDocument Author Classification Using Parsed Language Structure
Document Author Classification Using Parsed Language Structure
 
RAG-FUSION: A NEW TAKE ON RETRIEVALAUGMENTED GENERATION
RAG-FUSION: A NEW TAKE ON RETRIEVALAUGMENTED GENERATIONRAG-FUSION: A NEW TAKE ON RETRIEVALAUGMENTED GENERATION
RAG-FUSION: A NEW TAKE ON RETRIEVALAUGMENTED GENERATION
 
Performance, energy consumption and costs: a comparative analysis of automati...
Performance, energy consumption and costs: a comparative analysis of automati...Performance, energy consumption and costs: a comparative analysis of automati...
Performance, energy consumption and costs: a comparative analysis of automati...
 
EVALUATION OF MEDIUM-SIZED LANGUAGE MODELS IN GERMAN AND ENGLISH LANGUAGE
EVALUATION OF MEDIUM-SIZED LANGUAGE MODELS IN GERMAN AND ENGLISH LANGUAGEEVALUATION OF MEDIUM-SIZED LANGUAGE MODELS IN GERMAN AND ENGLISH LANGUAGE
EVALUATION OF MEDIUM-SIZED LANGUAGE MODELS IN GERMAN AND ENGLISH LANGUAGE
 

Recently uploaded

学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
171ticu
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
IJECEIAES
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
mamamaam477
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
Rahul
 
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student MemberIEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
VICTOR MAESTRE RAMIREZ
 
Manufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptxManufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptx
Madan Karki
 
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEMTIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
HODECEDSIET
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
Victor Morales
 
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.pptUnit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
KrishnaveniKrishnara1
 
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Sinan KOZAK
 
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball playEric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
enizeyimana36
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
camseq
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
insn4465
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
MDSABBIROJJAMANPAYEL
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
IJECEIAES
 
Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
Hitesh Mohapatra
 
New techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdfNew techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdf
wisnuprabawa3
 
Heat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation pptHeat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation ppt
mamunhossenbd75
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
IJECEIAES
 
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMSA SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
IJNSA Journal
 

Recently uploaded (20)

学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
 
Engine Lubrication performance System.pdf
Engine Lubrication performance System.pdfEngine Lubrication performance System.pdf
Engine Lubrication performance System.pdf
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student MemberIEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
 
Manufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptxManufacturing Process of molasses based distillery ppt.pptx
Manufacturing Process of molasses based distillery ppt.pptx
 
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEMTIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
 
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.pptUnit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
 
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024
 
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball playEric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
Eric Nizeyimana's document 2006 from gicumbi to ttc nyamata handball play
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
 
Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
 
New techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdfNew techniques for characterising damage in rock slopes.pdf
New techniques for characterising damage in rock slopes.pdf
 
Heat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation pptHeat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation ppt
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
 
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMSA SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS
 

A RULE-BASED APPROACH FOR ALIGNING JAPANESE-SPANISH SENTENCES FROM A COMPARABLE CORPORA

International Journal on Natural Language Computing (IJNLC) Vol. 1, No.3, October 2012
DOI : 10.5121/ijnlc.2012.1301

A RULE-BASED APPROACH FOR ALIGNING JAPANESE-SPANISH SENTENCES FROM A COMPARABLE CORPORA

Jessica C. Ramírez¹ and Yuji Matsumoto²
Information Science, Nara Institute of Science and Technology, Nara, Japan
¹ Jessicrv1@yahoo.com.mx ² matsu@is.naist.jp

ABSTRACT
The performance of a Statistical Machine Translation (SMT) system is directly related to the quality and size of the parallel corpus it uses. However, for some language pairs such corpora are scarce. Our long-term goal is to construct a Japanese-Spanish parallel corpus to be used for SMT, since useful Japanese-Spanish parallel corpora are lacking. To address this problem, in this study we propose a method for extracting Japanese-Spanish parallel sentences from Wikipedia using POS tagging and a rule-based approach. The approach focuses on the syntactic features of both languages. Human evaluation was performed on a sample and shows promising results in comparison with the baseline.

KEYWORDS
Comparable Corpora, POS tagging, Sentence alignment, Machine Translation

1. INTRODUCTION
Much research in recent years has focused on constructing semi-automatically and automatically aligned data resources, which are essential for many Natural Language Processing tasks; however, for some pairs of languages there is still a huge lack of annotated data.
Manual construction of a parallel corpus requires high-quality translators, and it is time-consuming and expensive. With the proliferation of the internet and the immense amount of data on it, a number of researchers have proposed using the World Wide Web as a large-scale corpus [5]. However, due to redundant and ambiguous information on the web, we must find methods of extracting only the information that is useful for a given task [4].
Extracting parallel sentences from a comparable corpus is a challenging task because, even when two documents refer to the same topic, they may not have a single sentence in common. In this study, we propose an approach for extracting Japanese-Spanish parallel sentences from Wikipedia using Part-of-Speech rule-based alignment and dictionary-based translation. We use Wikipedia articles as the comparable corpus, together with a dictionary extracted from the Wikipedia links and Aulex, a free Japanese-Spanish dictionary.
2. RELATED WORKS
The use of Wikipedia as a data resource in NLP is fairly new, and thus research is fairly limited. There are, however, works showing promising results. [2] attempts to extract named entities from Wikipedia and presents two disambiguation methods using cosine similarity and SVM. It first detects a named entity in Wikipedia using an IE technique and then disambiguates between multiple entities using the similarity of the context articles, computed with cosine similarity and a taxonomy kernel.
The work most directly related to this research is [1]. Their research uses two approaches to measure similarity between sentences in Wikipedia. First, they introduce an MT-based approach, using Jaccard similarity and the ‘Babel MT system of Altavista’¹ to align the sentences. The second approach, the link-based bilingual lexicon, uses the hyperlinks in every sentence, by means of a dictionary extracted from Wikipedia, to select the similar sentences. Their results were best with the second approach, especially for articles that are literal translations of each other.
Our approach differs in that we convert the articles into their equivalent POS tags and align only the sentences that conform to the rules we add. We then use two Japanese-Spanish dictionaries as a seed lexicon.

3. BACKGROUND
3.1. Comparable Corpora²
A comparable corpus is a collection of texts about a given topic in two or more languages. For example, the ‘Yomiuri Shimbun³ corpora’ is a corpus extracted from the newspaper's daily news in both English and Japanese. Although the news in both languages is the same, the content is not a proper translation. Comparable corpora are used for many NLP tasks, such as information retrieval, machine translation and bilingual lexicon extraction. For languages with scarce resources, comparable corpora are a prime alternative in NLP research.

3.2. Wikipedia
Wikipedia⁴ is a multilingual web-based encyclopedia with articles on a wide range of topics, in which the texts are aligned across different languages. Wikipedia is the successor of Nupedia, an online encyclopedia written by experts in different fields that no longer exists. Wikipedia arose as a single-language project (English) in January 2001 to support Nupedia. Wikipedia differs from Nupedia mainly in that anyone can write for it, and writers do not need to be experts in the field they write about.

¹ Babel is an online multilingual translation system.
² Corpora is the plural of corpus.
³ 読売新聞 is a Japanese newspaper that also has an English version. http://www.yomiuri.co.jp/
⁴ http://en.wikipedia.org/wiki/Main_Page
Wikipedia is written collaboratively by volunteers (called “wikipedians”) from different places all around the world. Because these volunteers come from many nationalities and write in many different languages, Wikipedia currently has articles written in more than 200 languages, with different numbers of articles in each language. The topics range from science, covering many different fields such as informatics, biology and anthropology, to entertainment, such as album names, artists and actors, and fictional characters such as James Bond.
Wikipedia is not just a simple encyclopaedia; it has some features that make it suitable for NLP research. These features are:

3.2.1. Redirect pages
Redirect pages are a very suitable resource for eliminating redundant articles, that is, for avoiding the existence of two articles referring to the same topic. They are used in the case of:
• Synonyms, like ‘altruism’, which redirects to ‘selflessness’
• Abbreviations, like ‘USA’, which redirects to ‘The United States of America’
• Variations in spelling, like ‘colour’, which redirects to ‘color’
• Nicknames and pseudonyms, like ‘Einstein’, which redirects to ‘Albert Einstein’

3.2.2. Disambiguation pages
Disambiguation pages are pages that contain a list of the different senses of a word.

3.2.3. Hyperlinks
Articles contain words or entities that have an article about them. When a user clicks such a link, he or she is taken to the article about that word.

3.2.4. Category pages
Category pages are pages without articles that list the members of a particular category and its subcategories. These pages have titles that start with “Category:” followed by the name of the particular category.
Categorization is a Wikipedia project that attempts to assign a category to each article. Categories are assigned manually by wikipedians, and therefore not all pages have a category item.
Some articles belong to multiple categories. For example, the article “Dominican Republic” belongs to three categories: “Dominican Republic”, “Island countries” and “Spanish-speaking countries”. Thus the article Dominican Republic appears in three different category pages.
4. METHODOLOGY
4.1. General Description
Figure 1 shows a general view of the methodology. First, we extract from Wikipedia all the aligned links, i.e. Wikipedia article titles, and then the Japanese and Spanish articles about the same topic. We then eliminate the unnecessary data (pre-processing) and split the articles into sentences. Next, we use a POS tagger to add the lexical category to each word in a given sentence, and choose the sentences that match according to their lexical categories. Finally, we use the dictionaries to make a word-to-word translation and obtain the sentences considered parallel.

Figure 1. Methodology
4.2. Dictionary Extraction from Wikipedia
The goal of this phase is to acquire Japanese-Spanish-English tuples of Wikipedia article titles in order to obtain translations.
Wikipedia provides links in each article to the corresponding articles in different languages. Every article page in Wikipedia has on the left-hand side some boxes labelled ‘navigation’, ‘search’, ‘toolbox’ and finally ‘in other languages’. The last box lists all the languages available for that article, although the articles in the different languages do not all have exactly the same contents. In most cases English articles are longer or have more information than the same article in other languages, because most of the Wikipedia collaborators are native English speakers.

4.2.1. Methodology
Take all article titles that are nouns or named entities and look in the article's contents for the box called ‘In other languages’. Verify that it has at least one link. If the box exists, it links to the same article in other languages. Extract the titles in these other languages and align them with the original article title.
For instance, the Spanish article titled ‘economía’ (economics) is translated into Japanese as ‘keizaigaku’ (経済学). When we click Spanish or Japanese in the other-languages box, we obtain an article about the same topic in the other language, which gives us the translation.

4.3. Extract Japanese and Spanish articles
We used the Japanese-Spanish dictionary (Section 4.2) to select the articles with links in both Japanese and Spanish.

4.4. Pre-processing
We eliminate the irrelevant information from Wikipedia articles to make processing easier and faster. The steps are as follows.
1. Remove from the pages all irrelevant information, such as images, menus, and characters such as “()”, “&quot”, “*”, etc.
2. Verify whether a link is a redirected article and extract the original article.
3. Remove all stopwords, i.e. general words that do not give information about a specific topic, such as “the”, “between”, “on”, etc.

4.5. Splitting into Sentences and POS tagging
For splitting the Spanish articles into sentences we used the NLTK toolkit⁵, a well-known platform for building Python scripts. To tag the Spanish sentences we used FreeLing⁶, an open-source suite of language analyzers specialized in the Spanish language.

⁵ http://nltk.org/
⁶ http://nlp.lsi.upc.edu/freeling/
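The three pre-processing steps of Section 4.4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the redirect table, the stopword list, the exact set of noise characters and the function names `resolve_redirect` and `clean_text` are all assumptions made for illustration.

```python
import re

# Hypothetical redirect table: redirect title -> target article title.
REDIRECTS = {"USA": "The United States of America", "colour": "color"}

# Tiny illustrative stopword list; a real list would be much longer.
STOPWORDS = {"the", "between", "on"}

# Characters and entities mentioned in step 1 (the exact set is an assumption).
NOISE = re.compile(r"&quot;?|[()*]")


def resolve_redirect(title, redirects=REDIRECTS):
    """Step 2: follow redirects until a non-redirect title is reached."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)  # guards against redirect loops
        title = redirects[title]
    return title


def clean_text(text, stopwords=STOPWORDS):
    """Steps 1 and 3: strip noise characters, then drop stopwords."""
    text = NOISE.sub(" ", text)
    return [tok for tok in text.split() if tok.lower() not in stopwords]
```

For example, `resolve_redirect("USA")` returns the target title from the toy table, and `clean_text` returns a token list with the bracket, quote-entity and asterisk characters removed. Removing images and menus from real Wikipedia markup would need a proper wikitext parser, which this sketch does not attempt.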
For splitting the Japanese articles into sentences and words and adding a word category, we used MeCab⁷, a Part-of-Speech and morphological analyser for Japanese.

4.6. Constructing the Rules
Japanese is a Subject-Object-Verb language, while Spanish is a Subject-Verb-Object language.

犬は水を飲みます。 => 犬は (Subject) 水を (Object) 飲みます。 (Verb)
El perro bebe agua. => El perro (Subject) bebe (Verb) agua. (Object)
Figure 2. Basic Japanese-Spanish sentence order

Figure 2 shows the basic order of the sentences in both Japanese and Spanish, using as an example the sentence “The dog drinks water”. The Japanese sentence ‘犬は水をのみます’ (Inu wa mizu wo nomimasu, ‘The dog drinks water’) is translated into Spanish as ‘El perro bebe agua’.

Table 1. Japanese-Spanish rules
Characteristic | Description (Spanish) | Description (Japanese) | Rule (Japanese => Spanish)
Noun | Affects the gender of the adjective | Adjectives do not have gender | Noun+desu => noun; Adj => Noun (gender) Adj
Named Entity | Always starts with a capital letter | This distinction does not exist | NE => NE (Capital letter)
Adjective | With gender and number | Adjective (Na), adjective (I) | Adj (fe/male) => Adj (NA/I)
Question | Delimited by question marks ¿? | The sentence ends in か (ka) | (sentence + か) => (¿ + sentence + ?)
Pronouns | Can be omitted according to the context | Can be omitted, as in Spanish | Pron => Pron

Table 1 shows some of the rules applied in this work. The rules were created taking into account the morphological and syntactic characteristics of each language. For example, in Japanese adjectives have no gender, while in Spanish gender is indispensable.

⁷ http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/hardy/japanese/
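To make the idea of matching sentences by lexical category more concrete, here is a small Python sketch of a single word-order rule (Japanese SOV to Spanish SVO) followed by a dictionary check. It is only an illustration under stated assumptions: the simplified tags (`NOUN`, `VERB`, `PART`, `DET`), the toy lexicon, the overlap score and the 0.5 threshold are hypothetical. They are not the output format of MeCab or FreeLing, and they are not the rules or scoring used in the paper.

```python
# Toy seed lexicon (Japanese -> Spanish); entries are illustrative only.
LEXICON = {"犬": "perro", "水": "agua", "飲みます": "bebe"}


def ja_pattern_as_svo(tagged_ja):
    """Drop particles and move a sentence-final verb right after the first noun."""
    tags = [tag for _, tag in tagged_ja if tag != "PART"]
    if len(tags) > 1 and tags[-1] == "VERB" and "NOUN" in tags[:-1]:
        subject_end = tags.index("NOUN") + 1
        tags = tags[:subject_end] + ["VERB"] + tags[subject_end:-1]
    return tags


def es_pattern(tagged_es):
    """Drop determiners, which have no counterpart in the Japanese tag sequence."""
    return [tag for _, tag in tagged_es if tag != "DET"]


def lexicon_overlap(tagged_ja, tagged_es, lexicon=LEXICON):
    """Fraction of Japanese non-particle words whose translation occurs in the Spanish sentence."""
    es_words = {word.lower() for word, _ in tagged_es}
    content = [word for word, tag in tagged_ja if tag != "PART"]
    if not content:
        return 0.0
    hits = sum(1 for word in content if lexicon.get(word) in es_words)
    return hits / len(content)


def is_candidate_pair(tagged_ja, tagged_es, threshold=0.5):
    """Accept a pair when the POS patterns agree and enough words translate."""
    return (ja_pattern_as_svo(tagged_ja) == es_pattern(tagged_es)
            and lexicon_overlap(tagged_ja, tagged_es) >= threshold)


# The example sentence of Figure 2, tagged by hand for this sketch.
ja = [("犬", "NOUN"), ("は", "PART"), ("水", "NOUN"), ("を", "PART"), ("飲みます", "VERB")]
es = [("El", "DET"), ("perro", "NOUN"), ("bebe", "VERB"), ("agua", "NOUN")]
print(is_candidate_pair(ja, es))
```

Separating the POS-pattern check from the dictionary check mirrors the order described in Section 4.1, where sentences are first matched by lexical category and only then translated word by word.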
5. EXPERIMENTAL EVALUATION
For the evaluation of the proposed method, we took a sample of 20 random Japanese and Spanish articles. The experiments were based on two approaches: the hyperlink approach [1] as a baseline and the rule-based approach.
We downloaded the Wikipedia XML data for April 2012⁸. We also used the Aulex⁹ dictionary, because the dictionary extracted from Wikipedia contains mostly nouns and named entities; to align other grammatical forms, such as verbs and adjectives, we require another dictionary.
Table 2 shows the results obtained with both the baseline [1] and our approach. “Correct Identification” means sentences with high scores whose alignment was correct. “Partial Matching” refers to sentences in the source and target languages that have a noun phrase in common. Finally, “Incorrect Identification” refers to sentences with high scores in which, however, not even one word matched.

Table 2. Results
 | Baseline (Hyperlinks) | Our Approach (Rule-Based POS)
Correct Identification | 13 | 42
*Partial Matching | 51 | 46
Incorrect Identification | 36 | 12
Total | 100 | 100

5.2. Discussion
We noticed that sentences from the first part of an article, which is usually the definition of the article's title, are identified correctly more often in both approaches.
Overall, the rule-based approach performed better than the baseline. This is because, with hyperlink-based scoring, a sentence in which the hyperlinked word or the title of the article is repeated automatically receives the best score, even if it is redundant.
Some identifications did not perform as well as we expected because more rules are needed; these can be added manually or by using bootstrapping methods, which is a very interesting point for future work.
We have also noticed that with this method it is possible to construct new sentences, even if they are not in both articles.

6. CONCLUSIONS AND FUTURE WORKS
This study focuses on aligning Japanese-Spanish sentences using a rule-based approach. We have demonstrated the feasibility of using Wikipedia's features for aligning several languages. We have used POS tags and constructed rules for aligning the sentences in the source and target articles.

⁸ The Wikipedia data is increasing constantly.
⁹ http://aulex.org/ja-es/
The same method can be applied to any pair of languages in Wikipedia, and to other types of comparable corpora. For future work, we will explore the use of English as a pivot language, and the automatic construction of a corpus by translation.

ACKNOWLEDGEMENTS

We would like to thank Yuya R. for her contribution and helpful comments.

REFERENCES

[1] Adafre, Sisay F. & De Rijke, Maarten, (2006) "Finding Similar Sentences across Multiple Languages in Wikipedia", In Proceedings of EACL-06, pages 62-69.
[2] Bunescu, Razvan & Pasca, Marius, (2006) "Using Encyclopedic Knowledge for Named Entity Disambiguation", In Proceedings of EACL-06, pages 9-16.
[3] Fung, Pascale & Cheung, Percy, (2004) "Multi-level Bootstrapping for Extracting Parallel Sentences from a Quasi-Comparable Corpus", In Proceedings of the 20th International Conference on Computational Linguistics, page 350.
[4] Ramírez, Jessica, Asahara, Masayuki & Matsumoto, Yuji, (2008) "Japanese-Spanish Thesaurus Construction Using English as a Pivot", In Proceedings of the Third International Joint Conference on Natural Language Processing (IJCNLP), Hyderabad, India, pages 473-480.
[5] Rigau, German, Magnini, Bernardo, Agirre, Eneko & Carroll, John, (2002) "A Roadmap to Knowledge Technologies", In Proceedings of the COLING Workshop on A Roadmap for Computational Linguistics, Taipei, Taiwan.
[6] Tillmann, Christoph, (2009) "A Beam-Search Extraction Algorithm for Comparable Data", In Proceedings of ACL, pages 225-228.
[7] Tillmann, Christoph & Xu, Jian-Ming, (2009) "A Simple Sentence-Level Extraction Algorithm for Comparable Data", In Proceedings of HLT/NAACL, pages 93-96.

Authors

Jessica C. Ramírez
She received her M.S. degree from Nara Institute of Science and Technology (NAIST) in 2007. She is currently pursuing a Ph.D. degree. Her research interests include machine translation and word sense disambiguation.

Yuji Matsumoto
He received his M.S. and Ph.D. degrees in information science from Kyoto University in 1979 and 1989, respectively. He is currently a Professor at the Graduate School of Information Science, Nara Institute of Science and Technology. His main research interests are natural language understanding and machine learning.