This document proposes a method for translating legal sentences in two steps: 1) segmenting legal sentences according to their logical structure using conditional random fields, and 2) selecting translation rules for the segmented parts using a maximum entropy model with linguistic and contextual features. The method was tested on a corpus of 516 English-Japanese sentence pairs and improved over baseline systems both in recognizing logical structures and in BLEU scores.
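As a rough sketch of the second step, a maximum entropy rule selector can be emulated with scikit-learn's multinomial logistic regression (an equivalent model); the feature names and rule labels below are invented placeholders, not the paper's actual feature set.

```python
# Minimal sketch of maximum-entropy translation-rule selection.
# Feature names and rule labels are hypothetical placeholders.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Each segmented sentence part is described by linguistic/contextual features.
train_feats = [
    {"first_word": "if", "has_modal": True, "segment_role": "condition"},
    {"first_word": "the", "has_modal": True, "segment_role": "statement"},
    {"first_word": "unless", "has_modal": False, "segment_role": "exception"},
]
train_rules = ["RULE_COND", "RULE_MAIN", "RULE_EXC"]  # translation-rule ids

vec = DictVectorizer()
X = vec.fit_transform(train_feats)
clf = LogisticRegression(max_iter=1000)  # multinomial logistic = maxent
clf.fit(X, train_rules)

test = vec.transform([{"first_word": "if", "has_modal": True,
                       "segment_role": "condition"}])
print(clf.predict(test)[0])  # most probable translation rule
```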
GENERATING SUMMARIES USING SENTENCE COMPRESSION AND STATISTICAL MEASURES (ijnlc)
In this paper, we propose a compression-based multi-document summarization technique that incorporates word bigram probability and a word co-occurrence measure. First, we implement a graph-based technique to achieve sentence compression and information fusion. In the second step, we use hand-crafted rule-based syntactic constraints to prune the compressed sentences. Finally, we use a probabilistic measure that exploits word co-occurrence within a sentence to obtain the summaries. The system can generate summaries for any user-defined compression rate.
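The graph-based compression step is reminiscent of word-graph fusion; a toy sketch of that idea, with fabricated sentences and bigram statistics, might look as follows.

```python
# Toy word-graph sentence fusion: merge sentences into a graph and take
# the lightest start-to-end path; edge weights invert bigram frequency.
# Bigram counts here are fabricated for illustration.
import networkx as nx
from collections import Counter

sentences = [
    "the committee approved the proposal",
    "the committee approved the new budget proposal",
]
bigrams = Counter()
G = nx.DiGraph()
for s in sentences:
    toks = ["<s>"] + s.split() + ["</s>"]
    for a, b in zip(toks, toks[1:]):
        bigrams[(a, b)] += 1
for (a, b), c in bigrams.items():
    G.add_edge(a, b, weight=1.0 / c)  # frequent bigrams are cheaper

path = nx.shortest_path(G, "<s>", "</s>", weight="weight")
print(" ".join(path[1:-1]))  # -> the committee approved the proposal
```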
The growth of interpreting calls for more objective and automatic quality measurement. We hold the basic idea that 'translating means translating meaning', so interpretation quality can be assessed by comparing the meaning of the interpreting output with the source input. Specifically, we propose a translation unit, the 'chunk', called a Frame, which comes from frame semantics, and its components, called Frame Elements (FEs), which come from FrameNet, and we explore their matching rate between target and source texts. A case study in this paper verifies the usability of semi-automatic graded semantic scoring for human simultaneous interpreting and shows how to use frame and FE matches to score. Experimental results show that the semantic-scoring metrics correlate significantly with human judgment.
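A minimal sketch of the frame/FE matching rate, assuming frames and FEs are already annotated as label sets (the labels below are illustrative, not system output):

```python
# Graded semantic scoring sketch: overlap of annotated Frames and
# Frame Elements (FEs) between source and interpreted output.
# Frame/FE labels are hypothetical annotations, not FrameNet output.
def match_rate(source_units, target_units):
    if not source_units:
        return 0.0
    return len(set(source_units) & set(target_units)) / len(set(source_units))

source_frames = ["Commerce_buy", "Motion"]
target_frames = ["Commerce_buy"]
source_fes = ["Buyer", "Goods", "Theme", "Goal"]
target_fes = ["Buyer", "Goods"]

frame_score = match_rate(source_frames, target_frames)  # 0.5
fe_score = match_rate(source_fes, target_fes)            # 0.5
print(0.5 * frame_score + 0.5 * fe_score)  # simple combined score
```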
An Approach To Automatic Text Summarization Using Simplified Lesk Algorithm A... (ijctcm)
Text summarization is a way to produce a text that contains the most significant information of the original text(s). Various methodologies have been developed so far, relying on parameters such as the position, format and type of sentences in the input text, the formatting of particular words, and the frequency of a word in the text. But these parameters vary across languages and input sources, and the performance of such algorithms is greatly affected as a result. The proposed approach summarizes a text without depending on those parameters. Here, the relevance of sentences within the text is derived using the Simplified Lesk algorithm and WordNet, an online lexical database. The approach is not only independent of text format and sentence position; because the sentences are first ranked by relevance before summarization, the summarization percentage can also be varied according to need. The proposed approach gives around 80% accurate results at 50% summarization of the original text with respect to manually produced summaries, evaluated on 50 texts of different types and lengths. We achieved satisfactory results even down to 25% summarization of the original text.
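A possible reading of the Simplified Lesk relevance computation, sketched with NLTK's WordNet interface; summing gloss overlaps per sentence is our interpretation, not necessarily the authors' exact formula.

```python
# Simplified Lesk sketch: score each word by the best overlap between
# its WordNet glosses and the sentence context, then sum per sentence.
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def simplified_lesk_score(word, context_words):
    best = 0
    for synset in wn.synsets(word):
        gloss = set(synset.definition().lower().split())
        best = max(best, len(gloss & context_words))
    return best

def sentence_relevance(sentence):
    words = sentence.lower().split()
    context = set(words)
    return sum(simplified_lesk_score(w, context - {w}) for w in words)

sents = ["The bank approved the loan", "I like tea"]
for s in sorted(sents, key=sentence_relevance, reverse=True):
    print(s)  # sentences in descending relevance order
```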
This work concerns natural language text summarization. It is very difficult for a human to manually summarize large amounts of text. The greatest challenge for text summarization is to summarize content from a number of textual and semi-structured sources, including plain text, HTML pages and PDF documents. Summarization quality can be assessed by internal and external measures. Our proposed work, sentence-similarity-based text summarization using clusters, helps in finding subjective questions and answers on the internet; it provides short units of text that carry similar information. The work rests on sentence-similarity computation, which supports experiments on similar-text comparison. The extractive summarization system chooses a subset of similar sentences from the text. The proposed work uses parts of speech such as proper nouns, verbs and pronouns (he, she, they, etc.). With the help of part-of-speech information, important sentences are identified using statistical signals such as proper-noun counts and sentence similarity, based on internet information containing the selected sentences.
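A rough sketch of the described similarity computation, weighting proper nouns more heavily via NLTK POS tags (the weight values are arbitrary assumptions):

```python
# Sentence similarity sketch with extra weight on proper nouns (NNP).
# Weights and tag choices are illustrative assumptions.
import math
import nltk  # requires: punkt and averaged_perceptron_tagger data

def weighted_bag(sentence):
    tags = nltk.pos_tag(nltk.word_tokenize(sentence))
    bag = {}
    for word, tag in tags:
        w = 2.0 if tag in ("NNP", "NNPS") else 1.0  # boost proper nouns
        bag[word.lower()] = bag.get(word.lower(), 0.0) + w
    return bag

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

s1 = weighted_bag("Einstein developed the theory of relativity")
s2 = weighted_bag("Einstein proposed relativity")
print(cosine(s1, s2))
```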
Query Answering Approach Based on Document Summarization (IJMER)
The growth of online information has made thorough research on automatic text summarization necessary within the Natural Language Processing (NLP) community. The aim of this paper is to propose a novel language-independent automatic summarization approach that combines three main components: Rhetorical Structure Theory (RST), a query processing approach, and the Network Representation Approach (NRA). RST, as a theory of the major aspects of natural text structure, is used to extract the semantic relations behind the text. The query processing approach classifies the question type and finds the answer in a way that suits the user's needs. The NRA is used to create a graph representing the extracted semantic relations. The output is an answer which not only responds to the question but also gives the user an opportunity to find additional information related to it. We implemented the proposed approach and, as a case study, applied it to Arabic text in the agriculture field. The implemented approach succeeded in summarizing extension documents according to the user's query. The results have been evaluated using Recall, Precision and F-score measures.
A template based algorithm for automatic summarization and dialogue managemen... (eSAT Journals)
Abstract This paper describes an automated approach for extracting significant and useful events from unstructured text. The goal of the research is a methodology that helps extract important events such as dates, places, and subjects of interest. Ideally, the methodology also presents users with a shorter version of the text that contains all non-trivial information. We also discuss our implementation of algorithms that perform exactly this task. Key Words: Cosine Similarity, Information, Natural Language, Summarization, Text Mining
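A bare-bones illustration of template-style event extraction for dates, with a simplified regular-expression pattern standing in for the paper's templates:

```python
# Template-style extraction sketch: pull date-like events from raw text
# with a regular expression; the pattern is a simplified illustration.
import re

TEXT = "The workshop moved to 12 March 2021 in Pune after the 2020 edition."
DATE = re.compile(r"\b\d{1,2}\s+(January|February|March|April|May|June|July|"
                  r"August|September|October|November|December)\s+\d{4}\b")

for m in DATE.finditer(TEXT):
    print("date event:", m.group(0))  # -> date event: 12 March 2021
```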
Taxonomy extraction from automotive natural language requirements using unsup... (ijnlc)
In this paper we present a novel approach to semi-automatically learn concept hierarchies from natural
language requirements of the automotive industry. The approach is based on the distributional hypothesis
and the special characteristics of domain-specific German compounds. We extract taxonomies by using
clustering techniques in combination with general thesauri. Such a taxonomy can be used to support
requirements engineering in early stages by providing a common system understanding and an agreed-upon
terminology. This work is part of an ontology-driven requirements engineering process, which builds
on top of the taxonomy. Evaluation shows that this taxonomy extraction approach outperforms common
hierarchical clustering techniques.
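One characteristic trick with German compounds is that a compound is normally a hyponym of its head noun; a tiny sketch of that heuristic (the head list is an illustrative stand-in for a thesaurus):

```python
# Compound-based taxonomy sketch: a German compound is usually a
# hyponym of its head noun, so "Bremssystem" is-a "System".
# The head list is a tiny illustrative stand-in for a thesaurus.
HEADS = ["system", "sensor", "signal"]

def taxonomy_edges(terms):
    edges = []
    for t in terms:
        for h in HEADS:
            if t.lower().endswith(h) and t.lower() != h:
                edges.append((t, h.capitalize()))  # (hyponym, hypernym)
    return edges

print(taxonomy_edges(["Bremssystem", "Drucksensor", "Signal"]))
# [('Bremssystem', 'System'), ('Drucksensor', 'Sensor')]
```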
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING (kevig)
This paper describes the use of Naive Bayes to address the task of assigning function tags and context free
grammar (CFG) to parse Myanmar sentences. Part of the challenge of statistical function tagging for
Myanmar sentences comes from the fact that Myanmar has free-phrase-order and a complex
morphological system. Function tagging is a pre-processing step for parsing. In the task of function tagging, we use the functionally annotated corpus and tag Myanmar sentences with correct segmentation, POS (part-of-speech) tagging and chunking information. We propose Myanmar grammar rules and apply context free grammar (CFG) to find the parse tree of function-tagged Myanmar sentences. Experiments
show that our analysis achieves a good result with parsing of simple sentences and three types of complex sentences.
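The CFG half of such a pipeline can be illustrated with NLTK; the toy grammar below merely mimics hand-written rules over function tags and is not the paper's Myanmar grammar:

```python
# CFG parsing sketch with NLTK; the grammar is a toy stand-in for
# hand-crafted Myanmar rules over function-tagged chunks.
import nltk

grammar = nltk.CFG.fromstring("""
S -> SUBJ OBJ V
SUBJ -> 'N_subj'
OBJ -> 'N_obj'
V -> 'V_final'
""")
parser = nltk.ChartParser(grammar)

# A function-tagged sentence reduced to its tag sequence (SOV order):
for tree in parser.parse(['N_subj', 'N_obj', 'V_final']):
    print(tree)
```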
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION (cscpconf)
The internet has caused a humongous growth in the amount of data available to the common
man. Summaries of documents can help find the right information and are particularly effective
when the document base is very large. Keywords are closely associated with a document as they
reflect the document's content and act as indexes for the given document. In this work, we
present a method to produce extractive summaries of documents in the Kannada language. The
algorithm extracts key words from pre-categorized Kannada documents collected from online
resources. We combine GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse
Document Frequency) methods along with TF (Term Frequency) for extracting key words and
later use these for summarization. In the current implementation a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
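The GSS coefficient and TF-IDF can be combined roughly as below; multiplying the two scores is our guess at the spirit of the combination, and all counts are hypothetical:

```python
# Keyword scoring sketch: GSS coefficient times TF-IDF.
# Count variables: n = total docs, tc = docs in category c containing t,
# t_nc = docs outside c containing t, c_n = docs in category c.
import math

def gss(n, tc, t_nc, c_n):
    p_tc = tc / n                   # P(t, c)
    p_t_nc = t_nc / n               # P(t, not c)
    p_nt_c = (c_n - tc) / n         # P(not t, c)
    p_nt_nc = (n - c_n - t_nc) / n  # P(not t, not c)
    return p_tc * p_nt_nc - p_t_nc * p_nt_c

def tf_idf(tf, df, n_docs):
    return tf * math.log(n_docs / df)

# Hypothetical counts for a term in a pre-categorized collection:
score = gss(n=1000, tc=80, t_nc=20, c_n=100) * tf_idf(tf=5, df=100, n_docs=1000)
print(score)
```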
Keyword Extraction Based Summarization of Categorized Kannada Text Documents (ijsc)
The internet has caused a humongous growth in the number of documents available online. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document as they reflect the document's content and act as indices for a given document. In this work, we present a method to produce extractive summaries of documents in the Kannada language, given a limit on the number of sentences. The algorithm extracts key words from pre-categorized Kannada documents collected from online resources. We use two feature selection techniques for obtaining features from documents; we then combine scores obtained by GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) methods along with TF (Term Frequency) for extracting key words and later use these for summarization based on sentence rank. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
Complete agglomerative hierarchy document's clustering based on fuzzy luhn's ... (IJECEIAES)
Agglomerative hierarchical clustering is a bottom-up method in which the distances between documents can be obtained by extracting feature values using a topic-based latent Dirichlet allocation method. To reduce the number of features, term selection can be done using Luhn's Idea. These methods can be used to build better document clusters, but little research discusses this. Therefore, in this research, the term weighting calculation uses Luhn's Idea to select terms by defining upper and lower cut-offs, and then extracts term features using Gibbs-sampling latent Dirichlet allocation combined with term frequency and the fuzzy Sugeno method. The feature values serve as distances between documents, which are clustered with single-, complete- and average-link algorithms. The evaluations show that feature extraction with and without the lower cut-off differs little, but topic determination for each term based on term frequency and the fuzzy Sugeno method is better than the Tsukamoto method at finding more relevant documents. The use of the lower cut-off and fuzzy Sugeno Gibbs latent Dirichlet allocation for complete agglomerative hierarchical clustering yields consistent metric values. This clustering method is suggested as a better way to cluster documents so that clusters are more relevant to the gold standard.
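The linkage step maps onto SciPy's hierarchical clustering; in this sketch the feature vectors are fabricated stand-ins for the paper's LDA-plus-fuzzy-Sugeno features:

```python
# Agglomerative clustering sketch over document feature vectors with
# single, complete and average link; the vectors are fabricated
# stand-ins for the paper's LDA + fuzzy Sugeno features.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

docs = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
for method in ("single", "complete", "average"):
    Z = linkage(docs, method=method)
    print(method, fcluster(Z, t=2, criterion="maxclust"))
```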
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES (cscpconf)
This paper describes context free grammar (CFG) based grammatical relations for Myanmar sentences, combined with a corpus-based function tagging system. Part of the challenge of statistical function tagging for Myanmar sentences comes from the fact that Myanmar has free-phrase-order and a complex morphological system. Function tagging is a pre-processing step to
show grammatical relations of Myanmar sentences. In the task of function tagging, which tags
the function of Myanmar sentences with correct segmentation, POS (part-of-speech) tagging
and chunking information, we use Naive Bayesian theory to disambiguate the possible function
tags of a word. We apply context free grammar (CFG) to find out the grammatical relations of
the function tags. We also create a functionally annotated tagged corpus for Myanmar and propose grammar rules for Myanmar sentences. Experiments show that our analysis achieves good results with simple and complex sentences.
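Naive Bayes disambiguation of function tags can be pictured as choosing the tag that maximizes P(tag)·P(word|tag)·P(previous tag|tag); a sketch with fabricated counts and invented tag names:

```python
# Naive Bayes function-tag disambiguation sketch: pick the tag t that
# maximizes P(t) * P(word|t) * P(prev_tag|t). Counts are fabricated.
from collections import Counter, defaultdict

train = [("school", "SUBJ", "<s>"), ("school", "OBJ", "SUBJ"),
         ("go", "PRED", "OBJ"), ("school", "OBJ", "SUBJ")]

tag_c = Counter(t for _, t, _ in train)
word_c = defaultdict(Counter)
prev_c = defaultdict(Counter)
for w, t, p in train:
    word_c[t][w] += 1
    prev_c[t][p] += 1

def disambiguate(word, prev_tag):
    def score(t):
        n = tag_c[t]
        return (n / len(train)) * (word_c[t][word] / n) * (prev_c[t][prev_tag] / n)
    return max(tag_c, key=score)

print(disambiguate("school", "SUBJ"))  # -> OBJ
```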
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES (ijnlc)
A large amount of information lies dormant in historical documents and manuscripts. This information would go to waste if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text by optical character
recognition (OCR). For indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternate approach using word spotting can be effective to access large collections of document images. We propose a word spotting
technique based on codes for matching the word images of Devanagari script. The shape information is utilised for generating integer codes for words in the document image and these codes are matched for final retrieval of relevant documents. The technique is illustrated using Marathi document images.
Rhetorical Sentence Classification for Automatic Title Generation in Scientif... (TELKOMNIKA JOURNAL)
In this paper, we propose work on rhetorical corpus construction and a sentence classification model experiment that can specifically be incorporated into the automatic paper title generation task for scientific articles. Rhetorical classification is treated as sequence labeling. A rhetorical sentence classification model is useful in tasks which consider a document's discourse structure. We performed experiments using datasets from two domains: computer science (CS dataset) and chemistry (GaN dataset). We evaluated the models using 10-fold cross validation (0.70-0.79 weighted average F-measure) as well as on-the-run (0.30-0.36 error rate at best). Our models performed best when imbalanced data was handled with the SMOTE filter.
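The SMOTE step corresponds directly to imbalanced-learn's oversampler; a generic sketch with random placeholder data rather than the rhetorical corpus:

```python
# SMOTE oversampling sketch for an imbalanced label distribution.
# The data is randomly generated; only the pipeline shape is the point.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(110, 8))
y = np.array([0] * 100 + [1] * 10)  # heavily imbalanced classes

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_res))  # minority synthesized up
```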
A comparative analysis of particle swarm optimization and k means algorithm f... (ijnlc)
The volume of digitized text documents on the web has been increasing rapidly. As there is a huge collection of data on the web, there is a need for grouping (clustering) documents into clusters for speedy information retrieval. Document clustering is the grouping of documents such that the documents within each group are similar to each other and dissimilar to documents of other groups. Quality of
clustering result depends greatly on the representation of text and the clustering algorithm. This paper
presents a comparative analysis of three algorithms namely K-means, Particle swarm Optimization (PSO)
and hybrid PSO+K-means algorithm for clustering of text documents using WordNet. The common way of
representing a text document is bag of terms. The bag of terms representation is often unsatisfactory as it
does not exploit the semantics. In this paper, texts are represented in terms of synsets corresponding to a
word. Bag of terms data representation of text is thus enriched with synonyms from WordNet. K-means,
Particle Swarm Optimization (PSO) and hybrid PSO+K-means algorithms are applied for clustering of
text in Nepali language. Experimental evaluation is performed by using intra cluster similarity and inter
cluster similarity.
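The synset-enriched representation can be sketched as follows, with English WordNet standing in for the Nepali resources and plain K-means shown without the PSO hybrid:

```python
# Synset-enriched bag-of-words + K-means sketch. English WordNet is a
# stand-in; the paper works on Nepali text, and the PSO hybrid is omitted.
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

def enrich(text):
    extra = []
    for w in text.split():
        for syn in wn.synsets(w)[:1]:  # take one synset per word
            extra.extend(l.name() for l in syn.lemmas())
    return text + " " + " ".join(extra)

docs = ["car engine repair", "automobile motor service",
        "stock market prices", "share trading index"]
X = CountVectorizer().fit_transform(enrich(d) for d in docs)
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
```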
Sentence Extraction Based on Sentence Distribution and Part of Speech Tagging... (TELKOMNIKA JOURNAL)
Automatic multi-document summarization needs to find representative sentences not only by
sentence distribution to select the most important sentence but also by how informative a term is in a
sentence. Sentence distribution is suitable for obtaining important sentences by determining frequent and
well-spread words in the corpus but ignores the grammatical information that indicates instructive content.
The presence or absence of informative content in a sentence can be indicated by grammatical information
which is carried by part of speech (POS) labels. In this paper, we propose a new sentence weighting
method by incorporating sentence distribution and POS tagging for multi -document summarization.
Similarity-based Histogram Clustering (SHC) is used to cluster sentences in the data set. Cluster ordering
is based on cluster importance to determine the important clusters. Sentence extraction based on
sentence distribution and POS tagging is introduced to extract the representative sentences from the
ordered clusters. The results of the experiment on the Document Understanding Conferences (DUC) 2004
are compared with those of the Sentence Distribution Method. Our proposed method achieved better
results with an increasing rate of 5.41% on ROUGE-1 and 0.62% on ROUGE-2.
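One way to read the proposed weighting, with invented weight values: score a sentence by how frequent and well-spread its words are across the corpus, plus a bonus for informative POS labels.

```python
# Sentence weighting sketch: corpus-spread word frequency plus a bonus
# for informative POS tags (nouns/verbs). Weights are assumptions.
import nltk  # requires: punkt and averaged_perceptron_tagger data
from collections import Counter

def sentence_weight(sentence, word_freq, n_docs_with):
    score = 0.0
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence.lower())):
        spread = n_docs_with.get(word, 0)  # distribution across documents
        bonus = 1.5 if tag.startswith(("NN", "VB")) else 1.0
        score += word_freq.get(word, 0) * spread * bonus
    return score

docs = [["the court ruled today"], ["the ruling was appealed"]]
word_freq = Counter(w for d in docs for s in d for w in s.split())
n_docs_with = Counter(w for d in docs for w in set(" ".join(d).split()))
print(sentence_weight("the court ruled today", word_freq, n_docs_with))
```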
A survey on phrase structure learning methods for text classification (ijnlc)
Text classification is a task of automatic classification of text into one of the predefined categories. The
problem of text classification has been widely studied in different communities like natural language
processing, data mining and information retrieval. Text classification is an important constituent in many
information management tasks like topic identification, spam filtering, email routing, language
identification, genre classification, readability assessment etc. The performance of text classification
improves notably when phrase patterns are used. The use of phrase patterns helps in capturing non-local
behaviours and thus helps in the improvement of text classification task. Phrase structure extraction is the
first step to continue with the phrase pattern identification. In this survey, detailed study of phrase structure
learning methods have been carried out. This will enable future work in several NLP tasks, which uses
syntactic information from phrase structure like grammar checkers, question answering, information
extraction, machine translation, text classification. The paper also provides different levels of classification
and detailed comparison of the phrase structure learning methods.
An efficient approach to query reformulation in web search (eSAT Journals)
Abstract A wide range of problems in natural language processing, data mining, bioinformatics and information retrieval can be categorized as string transformation: given an input string, the system generates the top k most likely output strings related to that input. In this paper we propose a novel probabilistic method for string transformation that is both accurate and efficient. The approach uses a log linear model, a method for training the model, and an algorithm that generates the top k outputs. The log linear model is defined as a conditional probability distribution over an output string and a set of transformation rules, conditioned on the input string. The string generation algorithm, which is based on pruning, is guaranteed to produce the top k list. The proposed technique is applied to spelling error correction of queries and to query reformulation in web search. Previous work did not jointly consider spelling error correction and query reformulation, treated efficiency as a secondary concern, and did not focus on improving both accuracy and efficiency in string transformation. Experimental results on large-scale data show that the proposed method is highly accurate and efficient. Keywords: Log linear method, Query reformulation, Spelling Error correction.
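The scoring side of such a log linear model reduces to a weighted sum of fired rule features per candidate, normalized and truncated to the top k; a schematic sketch with invented rules and weights:

```python
# Log-linear string-transformation scoring sketch: each candidate is
# scored by a weighted sum of fired rule features; keep the top k.
# Rules, weights and candidates are invented for illustration.
import math

weights = {"rule:ei->ie": 1.2, "rule:identity": 0.1, "rule:drop_e": -0.5}

candidates = [
    ("recieve", ["rule:identity"]),
    ("receive", ["rule:ei->ie"]),
    ("reciev", ["rule:drop_e"]),
]

def score(rules):
    return sum(weights.get(r, 0.0) for r in rules)

scored = sorted(candidates, key=lambda c: score(c[1]), reverse=True)
z = sum(math.exp(score(r)) for _, r in scored)  # normalizer
for s, r in scored[:2]:                          # top-k, k = 2
    print(s, math.exp(score(r)) / z)             # conditional probability
```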
Semantic Rules Representation in Controlled Natural Language in FluentEditor (Cognitum)
Abstract. The purpose of this paper is to present a way of representing semantic rules (SWRL) in controlled natural language (English) in order to make the rules easier to understand for humans interacting with a machine. The rule representation is implemented in FluentEditor, an ontology editor with controlled natural language (CNL). The representation can be used in many domains where people interact with machines and use specialized interfaces to define knowledge in a system (a semantic knowledge base), e.g. representing medical knowledge and guidelines, procedures in crisis management, or the management of any coordination processes. Such knowledge bases can support decision making in any discipline, provided the knowledge is stored in a proper semantic way.
Myanmar news summarization using different word representations (IJECEIAES)
There is an enormous amount of information available in different sources and genres. In order to extract useful information from a massive amount of data, an automatic mechanism is required. Text summarization systems assist with content reduction, keeping the important information and filtering out the non-important parts of the text. Good document representation is very important in text summarization for obtaining relevant information. Bag-of-words cannot capture word similarity at the syntactic and semantic level, whereas word embeddings give a good document representation that captures and encodes the semantic relations between words. Therefore, a centroid based on word embedding representations is employed in this paper, and Myanmar news summarization based on different word embeddings is proposed. Myanmar local and international news are summarized using a centroid-based word embedding summarizer that exploits the effectiveness of the word embedding representation. Experiments were done on Myanmar local and international news datasets using different word embedding models, and the results are compared with the performance of bag-of-words summarization. Centroid summarization using word embeddings performs comprehensively better than centroid summarization using bag-of-words.
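The centroid-plus-embedding core is compact: average the embeddings of the document's words, then rank sentences by cosine similarity to that centroid. A sketch with a toy embedding table standing in for a trained model:

```python
# Centroid-based embedding summarizer sketch. The embedding table is a
# toy stand-in for a trained word-embedding model.
import numpy as np

emb = {"news": [1.0, 0.0], "report": [0.9, 0.1],
       "storm": [0.0, 1.0], "weather": [0.1, 0.9]}

def vec(sentence):
    vs = [emb[w] for w in sentence.split() if w in emb]
    return np.mean(vs, axis=0) if vs else np.zeros(2)

def cos(a, b):
    n = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / n) if n else 0.0

doc = ["storm hits the coast", "weather report issued", "unrelated news item"]
centroid = vec(" ".join(doc))
summary = max(doc, key=lambda s: cos(vec(s), centroid))
print(summary)  # -> weather report issued
```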
Conceptual framework for abstractive text summarization (ijnlc)
As the volume of information available on the Internet increases, there is a growing need for tools helping users to find, filter and manage these resources. While more and more textual information is available online, effective retrieval is difficult without proper indexing and summarization of the content. One possible solution to this problem is abstractive text summarization. The idea is to propose a system that accepts a single English document as input and processes it by building a rich semantic graph, then reducing this graph to generate the final summary.
A Survey of Various Methods for Text Summarization (IJERD Editor)
Document summarization means retrieving the short and important text from a source document. In this paper, we study various techniques. Plenty of techniques have been developed for summarization of English and other Indian languages, but very little effort has been made for Hindi. Here, we discuss various techniques with respect to features such as time and memory consumption, efficiency, accuracy, ambiguity and redundancy.
Source side pre-ordering using recurrent neural networks for English-Myanmar ... (IJECEIAES)
Word reordering has remained one of the challenging problems for machine translation when translating between language pairs with different word orders, e.g. English and Myanmar. Without reordering between these languages, a source sentence may be translated directly with a similar word order, and the translation cannot be meaningful. Myanmar is a subject-object-verb (SOV) language, so effective reordering is essential for translation. In this paper, we applied a pre-ordering approach using recurrent neural networks to pre-order the words of the source sentence into the target word order. This neural pre-ordering model is automatically derived from parallel word-aligned data with syntactic and lexical features based on dependency parse trees of the source sentences. It can generate arbitrary permutations, possibly non-local within the sentence, and can be combined into English-Myanmar machine translation. We exploited the model to reorder English sentences into Myanmar-like word order as a preprocessing stage for machine translation, obtaining quality improvements comparable to a baseline rule-based pre-ordering approach on the Asian Language Treebank (ALT) corpus.
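The rule-based pre-ordering baseline mentioned at the end can be pictured as a deterministic SVO-to-SOV rewrite over chunked input; the chunk labels here are assumptions, not the paper's parser output:

```python
# Toy rule-based pre-ordering baseline: move the verb chunk after the
# object to turn SVO (English) into SOV (Myanmar-like) order.
# Chunk labels here are assumed, not the paper's parser output.
def preorder_svo_to_sov(chunks):
    verbs = [c for c in chunks if c[1] == "V"]
    rest = [c for c in chunks if c[1] != "V"]
    return rest + verbs  # subject, object, ... then verb(s)

sentence = [("she", "S"), ("reads", "V"), ("books", "O")]
print([w for w, _ in preorder_svo_to_sov(sentence)])
# -> ['she', 'books', 'reads']
```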
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO... (cseij)
In this paper we combine our previous research in the field of the Semantic Web, especially ontology learning and population, with sentence retrieval. To do this we developed a new approach to sentence retrieval, modifying our previous TF-ISF method, which uses local context information, so that it takes into account only document-level information. This is quite a new approach to sentence retrieval, presented for the first time in this paper and compared to existing methods that use information from the whole document collection. Using this approach and the developed document-level sentence retrieval methods, it is possible to assess the relevance of a sentence using only the information from the retrieved sentence's document, and to define a document-level OWL representation for sentence retrieval that can be automatically populated. In this way the idea of the Semantic Web is supported through automatic and semi-automatic extraction of additional information from existing web resources. The additional information is formatted in an OWL document containing document sentence relevance for sentence retrieval.
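TF-ISF mirrors TF-IDF at the sentence level: term frequency within a sentence times the log-inverse of the number of sentences containing the term. A single-document sketch:

```python
# TF-ISF sketch: rank sentences of a single document by
# tf(term, sentence) * log(N / sentence_frequency(term)).
import math
from collections import Counter

doc = ["the cat sat on the mat", "the dog barked", "a cat chased the dog"]
N = len(doc)
sf = Counter(t for s in doc for t in set(s.split()))  # sentence frequency

def tf_isf(sentence):
    tf = Counter(sentence.split())
    return sum(c * math.log(N / sf[t]) for t, c in tf.items())

for s in sorted(doc, key=tf_isf, reverse=True):
    print(round(tf_isf(s), 3), s)
```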
Phrase identification is one of the most critical and widely studied tasks in Natural Language Processing (NLP). Verb phrase identification within a sentence is very useful for a variety of NLP applications. One of the core enabling technologies required in NLP applications is morphological analysis. This paper presents a Myanmar verb phrase identification and translation algorithm and develops a Markov model with morphological analysis. The system is based on a rule-based maximum matching approach. In machine translation, a large amount of information is needed to guide the translation process. Myanmar is an inflected language, and there are very few lexicons and little lexicon research for Myanmar compared to languages such as English, French and Czech. Therefore, this system proposes a Myanmar verb phrase identification and translation model based on the syntactic structure and morphology of the Myanmar language, using a Myanmar-English bilingual lexicon. A Markov model is also used to reformulate the translation probability of phrase pairs. Experimental results show that the proposed system can improve translation quality by applying morphological analysis to the Myanmar language.
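The rule-based maximum matching component can be sketched as greedy longest-match segmentation against a lexicon; a toy Latin-script lexicon stands in for Myanmar entries:

```python
# Greedy maximum-matching segmentation sketch against a lexicon.
# The toy Latin-script lexicon stands in for a Myanmar-English one.
LEXICON = {"go", "going", "to", "school", "now"}

def max_match(text):
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # longest candidate first
            if text[i:j] in LEXICON:
                tokens.append(text[i:j])
                i = j
                break
        else:                              # no match: emit one character
            tokens.append(text[i])
            i += 1
    return tokens

print(max_match("goingtoschoolnow"))  # ['going', 'to', 'school', 'now']
```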
A Novel Approach for Rule Based Translation of English to Marathi (aciijournal)
This paper presents a design for a rule-based machine translation system for the English-Marathi language pair. The system takes an English sentence as input and parses it with the Stanford parser, which handles the main source-side processing. An English-Marathi bilingual dictionary is created. The system takes the parsed output, separates the source text word by word and searches for the corresponding target words in the bilingual dictionary. Hand-coded rules are written for Marathi inflections, together with reordering rules. After applying the reordering rules, the English sentence is syntactically reordered to suit the Marathi language.
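The dictionary-plus-reordering pipeline can be miniaturized as below; the transliterated Marathi glosses and the single verb-final reordering rule are illustrative simplifications of the hand-coded rules:

```python
# Miniature rule-based English->Marathi pipeline sketch: word-by-word
# dictionary lookup, then one reordering rule (verb moved clause-final).
# Transliterated glosses and the rule are illustrative assumptions.
DICTIONARY = {"ram": "Ram", "eats": "khato", "mango": "amba"}
VERBS = {"eats"}

def translate(sentence):
    words = sentence.lower().split()
    verbs = [w for w in words if w in VERBS]
    reordered = [w for w in words if w not in VERBS] + verbs  # SVO -> SOV
    return " ".join(DICTIONARY.get(w, w) for w in reordered)

print(translate("Ram eats mango"))  # -> Ram amba khato
```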
TRANSLATING LEGAL SENTENCE BY SEGMENTATION AND RULE SELECTION
International Journal on Natural Language Computing (IJNLC) Vol. 2, No.4, August 2013
DOI : 10.5121/ijnlc.2013.2403
TRANSLATING LEGAL SENTENCE BY
SEGMENTATION AND RULE SELECTION
Bui Thanh Hung, Nguyen Le Minh and Akira Shimazu
Graduate School of Information Science,
Japan Advanced Institute of Science and Technology, Japan
hungbt@jaist.ac.jp, nguyenml@jaist.ac.jp, shimazu@jaist.ac.jp
ABSTRACT
A legal text is usually long and complicated, and it has characteristics that distinguish it from other everyday
texts. Translating a legal text is therefore generally considered difficult. This paper introduces an
approach that splits a legal sentence based on its logical structure, and presents a way of selecting appropriate
translation rules to improve phrase reordering in legal translation. We use a method which divides an
English legal sentence based on its logical structure to segment the legal sentence. New features with rich
linguistic and contextual information of the split sentences are proposed for rule selection. We apply maximum
entropy to combine rich linguistic and contextual information and integrate the features of split sentences
into a tree-based SMT system for legal translation. We obtain improvements in performance for legal
translation from English to Japanese over the Moses and Moses-chart systems.
KEYWORDS
Legal Translation, Logical Structure of a Legal Sentence, Phrase Reordering, CRFs, Maximum Entropy
Model, Linguistic and Contextual Information
1. INTRODUCTION
Legal translation is the task of translating texts within the field of law. Translating legal
texts automatically is difficult because legal translation requires exact precision,
authenticity and a deep understanding of legal systems. Because of the meticulous nature of their
composition (by experts), sentences in legal texts are usually long and complicated. When we
translate long sentences, parsing accuracy drops as the length of the sentence grows. This
inevitably hurts translation quality, and decoding long sentences is time-consuming,
especially for forest approaches. Splitting long sentences into sub-sentences therefore becomes a natural
way to improve machine translation quality.
A legal sentence represents a requisite and its effectuation [10], [17], [23]. Dividing a sentence
into shorter parts and translating them has the potential to improve translation quality. For a
legal sentence with the requisite-effectuation structure (its logical structure), dividing the sentence into
requisite-and-effectuation parts is simpler than dividing it into its clauses, because such
legal sentences have specific linguistic expressions that are useful for dividing. We first recognize
the logical structure of a legal sentence using a statistical learning model with linguistic
information. Then we segment the legal sentence into the parts of its structure and apply rule selection
to translate them with statistical machine translation (SMT) models.
In the phrase-based model [11], phrase reordering is a major problem because the target phrase
order differs significantly from the source phrase order for language pairs such as English-Japanese.
Linguistic and contextual information have been widely used to improve translation
performance. They help to reduce ambiguity, thus guiding the decoder to choose the correct
translation of a source text when reordering phrases. In this paper, we focus on selecting appropriate
translation rules to improve phrase reordering for tree-based statistical machine translation,
a model which operates on synchronous context-free grammars (SCFG) (Chiang [4], [5]). SCFG rules
for translation are represented using terminals (words or phrases), nonterminals and structural
information. An SCFG rule consists of a left-hand side (LHS) and a right-hand side (RHS). In general,
there are cases in which a source sentence pattern-matches multiple rules that produce quite
different phrase reorderings, as in the following example:
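The example did not survive this rendering. Schematically (our illustration, not the paper's original example), an extracted grammar may contain two rules with the same source side but different reorderings, e.g. for an English "X1 of X2" phrase translated into Japanese:

X → ⟨X1 of X2, X2 の X1⟩ (the two subphrases are swapped)
X → ⟨X1 of X2, X1 の X2⟩ (the subphrases stay in monotone order)

Only the first rule yields the natural Japanese genitive order, so choosing between such rules is exactly the rule selection problem addressed below.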
During decoding, without considering linguistic and contextual information for both nonterminals
and terminals, the decoder may make phrase reordering errors caused by inappropriate
translation rules. Rule selection is therefore important for tree-based statistical machine translation
systems: a rule contains not only terminals (words or phrases) but also nonterminals and
structural information. During decoding, when a rule is selected and applied to a source text, both
the lexical translations (for terminals) and the reorderings (for nonterminals) are determined.
Therefore, rule selection affects both lexical translation and phrase reordering.
We propose translating split sentences based on the logical structure of a legal sentence, together with rule
selection for legal translation. Specifically:
• We divide a legal sentence based on its logical structure in the first step.
• We apply a statistical learning method, Conditional Random Fields (CRFs), with rich
linguistic information to recognize the logical structure of a legal sentence, and the
recognized logical structure is used to divide the sentence.
• We use a maximum entropy-based rule selection model for tree-based English-Japanese
statistical machine translation in the legal domain. The maximum entropy-based rule
selection model combines local contextual information around rules and information about
the sub-trees covered by variables in rules.
• We propose using rich linguistic and contextual information of split sentences, for both
non-terminals and terminals, to select appropriate translation rules.
• We obtain substantial BLEU improvements over the Moses and Moses-chart baseline
systems.
This paper is organized as follows. Section 2 reviews related work. Section 3 describes our
method: how to segment a legal sentence into its logical structure and how to use rich linguistic and
contextual information of split sentences for rule selection. The experimental results are discussed
in Section 4, and conclusions and future work follow in Section 5.
2. PREVIOUS WORK
Machine translation works well for simple sentences, but a machine translation system faces
difficulty when translating long sentences, and as a result its performance degrades.
Most legal sentences are long and complex, so the translation model has a high probability of failing
in the analysis and produces poor translation results. One possible way to overcome this problem
is to divide long sentences into smaller units which can be translated separately. There are several
approaches to splitting long sentences into smaller segments in order to improve translation.
Xiong et al. [25] used Maximum Entropy Markov Models to learn translation boundaries
based on word alignments in hierarchical trees. They integrated soft constraints with beginning
and ending translation boundaries into the log-linear model decoder. They proposed a new
feature, translation boundary violation counting, to prefer hypotheses that are consistent with
the translation boundaries.
Sudoh et al. [22] proposed dividing a source sentence into smaller clauses using a syntactic
parser. They used a non-terminal symbol as a place-holder for a relative clause and
trained a clause translation model with these symbols. They proposed a graph-based clause alignment
method to build the non-terminal corpus. Their model can perform short- and
long-distance reordering simultaneously.
Goh et al. [7] proposed a rule-based method for splitting long sentences using linguistic
information and translated the sentences with the split boundaries. They used two types of
constraints, split conditions and block conditions, with Moses's "zone" and "wall" markers; an illustration follows.
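For illustration (our sketch of Moses's standard XML markup, not an example from [7]): split boundaries can be expressed directly in the decoder input, where a <wall/> token blocks reordering across the boundary and a <zone> ... </zone> pair forces the enclosed span to be translated as one unit:

when the change occurs <wall/> he shall notify the Minister without delay
the <zone> terms of the insured </zone> are added up together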
Each of these approaches has its strengths and weaknesses when applied to sentence partitioning.
However, for splitting legal sentences, dividing a legal sentence based on its logical structure
is preferable. Dividing a sentence into requisite-and-effectuation parts (its logical structure)
is simpler than dividing the sentence into its clauses, because such legal sentences have
specific linguistic expressions that are useful for dividing.
Our approach differs from those of previous works. We apply the logical structure of a legal
sentence to split legal sentences, using the characteristics and linguistic information of legal texts.
Bach et al. [1] used Conditional Random Fields (CRFs) to recognize the logical structure of a
Japanese legal sentence. We proceed in the same way as [1], [9] to recognize the logical structure
of an English legal sentence, and propose new features for recognizing it. The recognized logical
structure is then used to split long sentences. Our approach is useful for legal translation: it
preserves the structure of a legal sentence, reduces the analysis needed to decide the correct
syntactic structure of a sentence, removes ambiguous cases in advance, and promises good results.
Linguistic and contextual information have been used before and are very helpful for SMT
systems; many works use them to solve selection problems in SMT. Carpuat and
Wu [2] integrated word-sense disambiguation (WSD) and phrase-sense disambiguation (PSD)
into a phrase-based SMT system to solve the lexical ambiguity problem. Chan et al. [3]
incorporated a WSD system into a hierarchical SMT system, focusing on resolving ambiguity for
the terminals of translation rules. He et al. and Liu et al. extended WSD along the lines of
[2] to hierarchical decoders and incorporated the MERS model into a state-of-the-art syntax-based
SMT model, the tree-to-string alignment template model [8], [16]. Chiang et al. [6] used 11,001
features for statistical machine translation.
In this paper, we propose using rich linguistic and contextual information for English-Japanese
legal translation. Specifically:
• We recognize the logical structure of a legal sentence and divide the legal sentence
based on its logical structure as the first step.
• We use rich linguistic and contextual information for both non-terminals and
terminals. Linguistic and contextual information around terminals has never been used
before; we find that these new features are very useful for selecting appropriate translation
rules when integrated with the features of non-terminals.
• We propose a simple and efficient algorithm for extracting features for rule selection.
• We classify features using a maximum entropy-based rule selection model and
incorporate this model into a state-of-the-art syntax-based SMT model, the tree-based
model (Moses-chart).
• Our proposed method achieves better results for English-Japanese legal translation
in terms of BLEU scores.
Figure 1. The diagram of our proposed method
3. PROPOSED METHOD
The method of translating legal sentences by segmentation and rule selection proceeds in two
steps:
• Legal sentence segmentation
• Rule selection for legal translation
The diagram of our proposed method is shown in Fig. 1.
3.1. Legal sentence segmentation
To segment a legal sentence into its structure, we first recognize the logical structure of the
sentence. Most law sentences are implications, while the logical structure of a sentence defining
a term is of the equivalence type. An implication law sentence consists of a law requisite part and a
law effectuation part, which designate the legal logical structure described in [10], [17], [23].
Structures of a sentence in terms of these parts are shown in Fig. 2.
The requisite part and the effectuation part of a legal sentence are generally composed of three
parts: a topic part, an antecedent part and a consequent part. In a legal sentence, the consequent part
usually describes a law provision, and the antecedent part describes the cases in which the law
provision can be applied. The topic part describes a subject which is related to the law provision.
There are four cases (illustrated in Fig. 2), depending on which part the topic part attaches to:
case 0 (no topic part), case 1 (the topic part depends on the antecedent part), case 2 (the topic part
depends on the consequent part), and case 3 (the topic part depends on both the antecedent and the
consequent parts). Let us show examples of the four cases of the logical structure of a legal sentence.
The annotation in these examples and in the test corpus was carried out by a person who was an
officer of the Japanese government, and by persons who were students of a graduate law school and
a law school.
Figure 2. Four cases of the logical structure of a legal sentence
• Case 0:
<A> When a period of an insured is calculated, </A>
<C> it is based on a month. </C>
• Case 1:
<T1> For the person </T1>
<A> who is qualified for the insured after s/he was disqualified, </A>
<C> the terms of the insured are added up together. </C>
• Case 2:
<T2> For the amount of the pension by this law, </T2>
<A> when there is a remarkable change in the living standard of the nation of the other situation,
</A>
<C> a revision of the amount of the pension must be taken action promptly to meet the
situations. </C>
• Case 3:
<T3> For the Government, </T3>
<A> when it makes a present state and a perspective of the finance, </A>
<C> it must announce it officially without delay. </C>
In these examples, A refers to the antecedent part, C refers to the consequent part, and T1, T2, T3
refer to the topic parts which correspond to case 1, case 2, and case 3.
We use the sequence learning model described in [13], [15] to recognize the logical structure of a
legal sentence. We model the structure recognition task as a sequence labeling problem, in which
each sentence is a sequence of words. We consider implication types of legal sentences, and five
kinds of logical parts for the recognition of the logical structure of a legal sentence, as
follows:
• Antecedent part (A)
• Consequent part (C)
• Topic part T1 (corresponding to case 1)
• Topic part T2 (corresponding to case 2)
• Topic part T3 (corresponding to case 3)
In the IOB notation [13], [15], we thus have 11 kinds of tags: B-A, I-A, B-C, I-C, B-T1,
I-T1, B-T2, I-T2, B-T3, I-T3 and O (used for an element not included in any part). For example, an
element with tag B-A begins an antecedent part, while an element with tag B-C begins a
consequent part.
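For instance, under this scheme the Case 1 example above would begin (our illustration of the tagging, not taken from the corpus): For/B-T1 the/I-T1 person/I-T1 who/B-A is/I-A qualified/I-A for/I-A the/I-A insured/I-A ... the/B-C terms/I-C of/I-C the/I-C insured/I-C are/I-C added/I-C up/I-C together/I-C.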
We use Conditional Random Fields (CRFs) [13], [15] as the learning method, because the
recognition task of the logical structure of a legal sentence can be considered a sequence
labeling problem, and CRFs are an efficient and powerful framework for such tasks.
We propose new features for recognizing the logical structure of an English legal sentence
based on its characteristics and linguistic information. We designed the following set of features
(a sketch of how they could be assembled follows the list):
• Word form: the phonological or orthographic appearance of a word in a sentence.
• Chunking tag: the tag of syntactically correlated parts of words in a sentence.
• Part-of-speech features: the POS tags of the words in a sentence.
• The numbers of particular linguistic elements which appear in a sentence, namely:
+ Relative pronouns (e.g., where, who, whom, whose, that)
+ Punctuation marks (. , ; :)
+ Verb phrase chunks
+ Relative phrase chunks
+ Quotes
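As a concrete sketch (our illustration, not the authors' code; all names are hypothetical), per-token features of this kind could be assembled as follows before being written out in the column format expected by a CRF toolkit such as CRF++ [13]:

```python
# Sketch of per-token feature extraction for the structure recognizer.
# All names are hypothetical; the paper itself uses CRF++ [13].

RELATIVE_PRONOUNS = {"where", "who", "whom", "whose", "that"}
PUNCTUATION_MARKS = {".", ",", ";", ":"}

def token_features(words, pos_tags, chunk_tags, i):
    """Features for the i-th token of one sentence."""
    return {
        "word": words[i],          # word form
        "pos": pos_tags[i],        # part-of-speech tag
        "chunk": chunk_tags[i],    # chunking tag
        # sentence-level counts of particular linguistic elements
        "n_rel_pron": sum(w.lower() in RELATIVE_PRONOUNS for w in words),
        "n_punct": sum(w in PUNCTUATION_MARKS for w in words),
        "n_vp_chunks": sum(c == "B-VP" for c in chunk_tags),
        "n_quotes": sum(w in {'"', "``", "''"} for w in words),
    }
```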
We parse the individual English sentences with the Stanford parser [20] and use the CRF++ tool [13]
for the sequence labeling task. Experiments were conducted on an English-Japanese translation corpus.
We collected the corpus using the Japanese Law Translation Database System (available at
http://www.japaneselawtranslation.go.jp/). The corpus contains 516 sentence pairs. Table 1
shows statistics on the number of logical parts of each type.
We divided the corpus into 10 sets and conducted 10-fold cross-validation tests for recognizing
the logical structures of the sentences in the corpus. We evaluated the performance with precision,
recall, and F1 scores.
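The evaluation formulas did not survive this rendering; presumably they are the standard definitions:

precision = (number of correctly recognized logical parts) / (number of recognized logical parts)
recall = (number of correctly recognized logical parts) / (number of logical parts in the gold annotation)
F1 = 2 × precision × recall / (precision + recall)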
Experimental results on the corpus are described in Table 2.
Table 1. Statistics on logical parts of the corpus
Table 2. Experimental results for recognition of the logical structure of a legal sentence
Figure 3. Examples of sentence segmentation
After recognizing the logical structure of a legal sentence, we segment the sentence according to its structure.
Based on the logical structure of a legal sentence (Fig. 2), a sentence of each case is divided
as follows:
• Case 0:
Requisite part: [A]
Effectuation part: [C]
• Case 1:
Requisite part: [T1 A]
Effectuation part: [C]
• Case 2:
Requisite part: [A]
Effectuation part: [T2 C]
• Case 3:
Requisite part: [T3 A]
Effectuation part: [T3 C]
The example sentences of Section 3.1 are split as shown in Fig. 3; a sketch of this splitting procedure follows.
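As an illustration (our sketch, not the authors' code), a sentence whose words carry the recognized part labels can be split into its requisite and effectuation parts according to the four cases above:

```python
# Sketch: split a sentence with recognized logical-part labels
# ('A', 'C', 'T1', 'T2', 'T3', 'O') into (requisite, effectuation).

def split_legal_sentence(words, labels):
    part = lambda tag: " ".join(w for w, l in zip(words, labels) if l == tag)
    a, c = part("A"), part("C")
    t1, t2, t3 = part("T1"), part("T2"), part("T3")
    if t1:                                  # case 1: requisite = [T1 A]
        return (t1 + " " + a).strip(), c
    if t2:                                  # case 2: effectuation = [T2 C]
        return a, (t2 + " " + c).strip()
    if t3:                                  # case 3: T3 joins both parts
        return (t3 + " " + a).strip(), (t3 + " " + c).strip()
    return a, c                             # case 0: no topic part
```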
3.2. Rule Selection for Legal Translation
Rule selection is important for tree-based statistical machine translation systems, because a
rule contains not only terminals (words or phrases) but also nonterminals and structural
information. During decoding, when a rule is selected and applied to a source text, both the lexical
translations (for terminals) and the reorderings (for nonterminals) are determined. Therefore, rule
selection affects both lexical translation and phrase reordering. However, most current
tree-based systems ignore contextual information when selecting rules during decoding,
especially the information covered by nonterminals, which makes it hard for the decoder to
distinguish between rules. Intuitively, the information covered by nonterminals, as well as the
contextual information of rules, should be helpful for rule selection. Linguistic and
contextual information have been widely used to improve translation performance; they help
reduce ambiguity, thus guiding the decoder to choose the correct translation of a source text when
reordering phrases. In our research, we integrate dividing a legal sentence based on its logical
structure into the first step of rule selection. We propose a maximum entropy-based rule
selection model for tree-based English-Japanese statistical machine translation in the legal domain.
The maximum entropy-based rule selection model combines local contextual information around
rules and information about the sub-trees covered by variables in rules. The nice properties of the
maximum entropy model, which can combine lexical and syntactic evidence, thus make the rule
selection method more effective.
3.2.1. Maximum Entropy based rule selection model (MaxEnt RS model)
The rule selection task can be considered a multi-class classification task: for a source-side,
each corresponding target-side is a label. The maximum entropy approach (Berger et al., 1996)
is well suited to this classification problem. Therefore, we build a maximum
entropy-based rule selection (MaxEnt RS) model for each ambiguous hierarchical LHS (left-hand
side).
Following [4], [5], we use (α, γ) to represent an SCFG rule extracted from the training corpus,
where α and γ are the source and target strings, respectively. The nonterminals in α and γ are
represented by Xk, where k is an index indicating the one-to-one correspondence between nonterminals
on the source and target sides. Let e(Xk) denote the source text covered by Xk, and f(Xk)
the translation of e(Xk). Let C(α) be the context information of the source text matched
by α, and C(γ) the context information of the target text matched by γ. Under the MaxEnt
model, we have:
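The equation did not survive this rendering; it is presumably the standard maximum entropy formulation used in [16]:

$$P(\gamma \mid \alpha, C(\alpha), C(\gamma)) = \frac{\exp\left(\sum_i \lambda_i h_i(\gamma, \alpha, C(\alpha), C(\gamma))\right)}{\sum_{\gamma'} \exp\left(\sum_i \lambda_i h_i(\gamma', \alpha, C(\alpha), C(\gamma'))\right)}$$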
where hi is a binary feature function and λi is the feature weight of hi. The MaxEnt RS model
combines rich context information of grammar rules, as well as information about the subphrases
which will be reduced to the nonterminal X during decoding; this information is ignored by
Chiang's hierarchical model.
We design five kinds of features for a rule (α, γ): lexical, part-of-speech (POS), length, parent
and sibling features.
Table 3. Lexical features of nonterminals
3.2.2. Linguistic and Contextual Information For Rule Selection
a. Lexical features of nonterminals
Each hierarchical rule contains nonterminals. The features of nonterminals consist of lexical
features, part-of-speech features and length features:
Lexical features, which are the words immediately to the left and right of α, and the boundary
words of the subphrases e(Xk) and f(Xk);
Parts-of-speech (POS) features, which are the POS tags of the source words defined in the lexical
features;
Length features, which are the lengths of the subphrases e(Xk) and f(Xk).
Table 3 shows the lexical features of nonterminals. As an example, consider a rule together with its
source phrase, source sentence and alignment (the example figure is not reproduced here); the
features of this example are shown in Table 4.
Table 4. Lexical features of nonterminals of the example
b. Lexical features around nonterminals
These features are analogous to the features of nonterminals:
Lexical features, which are the words immediately to the left and right of the subphrases e(Xk) and
f(Xk);
Parts-of-speech (POS) features, which are the POS tags of the source words defined in the lexical
features.
Table 5 shows the lexical features around nonterminals. For an example rule (not reproduced here),
the lexical features around its nonterminals are shown in Table 6.
Table 5. Lexical features around nonterminals
Table 6. Lexical features around nonterminals of the example
Figure 4. Sub-tree covering nonterminal X1
c. Syntax features
Let r = (α, γ) be a translation rule, let e(α) be the source phrase covered by α, let Xk be a
nonterminal in α, and let T(Xk) be the sub-tree covering Xk.
Parent feature (PF):
the parent node of T(Xk) in the parse tree of the source sentence. The same sub-tree may have
different parent nodes in different training examples, so this feature may provide
information for distinguishing source sub-trees.
Figure 5. (a) S: parent feature of the sub-tree covering nonterminal X1;
(b) NP: sibling feature of the sub-tree covering nonterminal X1
Figure 6. Algorithm for extracting features
Sibling feature (SBF):
the sibling features of the root of T(Xk). This feature considers neighboring nodes which share
the same parent node. Fig. 4 shows the sub-tree covering nonterminal X1; Fig. 5(a) shows that the S node
is the parent feature of the sub-tree covering X1, and the NP node is its sibling feature, shown in Fig. 5(b).
These features, the lexical, part-of-speech, length, parent and sibling features, make rich use of
the information around a rule, including the contextual information of the rule and the information
of the sub-trees covered by nonterminals. They can be gathered following Chiang's rule
extraction method. We use Moses-chart [12] to extract phrases and rules, the Stanford Tagger
toolkit and Cabocha [14] to tokenize and tag the English and Japanese source sentences, and the
Stanford parser [20] to parse the English test sentences; after that, we use the algorithm in Fig. 6
to extract features.
In Moses-chart, the number of nonterminals in a rule is limited to at most 2, so a rule may have
at most 36 features. After extracting features from the training corpus, we use the toolkit
implemented in [24] to train a MaxEnt RS model for each ambiguous hierarchical LHS, as sketched below.
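The following sketch (our illustration, not the authors' code; names are hypothetical) shows how the extracted training events could be grouped so that one classifier is trained per ambiguous LHS:

```python
# Sketch: group (LHS, target-side, features) training events and keep
# only ambiguous LHSs, for which one MaxEnt RS classifier is trained.
from collections import defaultdict

def group_training_events(rule_instances):
    """rule_instances: iterable of (lhs, rhs, features) triples gathered
    from the word-aligned, parsed training corpus."""
    events = defaultdict(list)
    for lhs, rhs, features in rule_instances:
        events[lhs].append((features, rhs))       # rhs acts as class label
    return {lhs: evs for lhs, evs in events.items()
            if len({rhs for _, rhs in evs}) > 1}  # ambiguous LHSs only
```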
3.2.3. Integrating the MaxEnt RS Model into the Tree-Based Model
We integrate the MaxEnt RS model into the tree-based model during the translation of each
source sentence. Thus the MaxEnt RS models can help the decoder perform context-dependent
rule selection during decoding.
In Chiang [4], the log-linear model combines 8 features: the translation probabilities P(γ|α)
and P(α|γ), the lexical weights Pw(γ|α) and Pw(α|γ), the language model, the word penalty,
the phrase penalty, and the glue rule penalty. For the integration, we add two new features:
(1) Prs, the rule selection probability. This feature is computed by the MaxEnt RS model, which
gives the probability that the model selects a target-side γ given an ambiguous source-side α,
considering context information.
(2) Prsn = exp(1), the ambiguity reward. This feature is similar to the phrase penalty feature. In our
experiments, we find that some source-sides are not ambiguous and correspond to only one
target-side; for a non-ambiguous source-side α', the first feature Prs is simply set to 1.0. In fact,
such rules are not reliable, since they usually occur only once in the training corpus. Therefore, we
use this feature to reward ambiguous source-sides: during decoding, if an LHS has multiple
translations, this feature is set to exp(1), otherwise it is set to exp(0). A sketch of how these two
scores could be attached to a rule follows.
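A minimal sketch of how the two scores could be computed for one rule application (our illustration; 'models' and its 'prob' method are hypothetical stand-ins for the trained MaxEnt RS classifiers):

```python
import math

# Sketch: compute the two new feature scores for one rule application.
def rule_selection_scores(lhs, rhs, context_features, models):
    if lhs in models:                # ambiguous LHS: a classifier exists
        prs = models[lhs].prob(context_features, rhs)  # MaxEnt probability
        prsn = math.exp(1)           # reward for ambiguous source-sides
    else:                            # unambiguous LHS: only one target-side
        prs = 1.0                    # Prs fixed to 1.0
        prsn = math.exp(0)           # i.e. 1.0: no reward
    return prs, prsn
```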
Chiang [5] used the CKY (Cocke-Kasami-Younger) algorithm with cube pruning for
decoding. This method significantly reduces the search space by efficiently computing the
top-n items, rather than all possible items, at a node, using the k-best algorithms of Huang and
Chiang (2005) to speed up the computation. In cube pruning, the translation model is treated as
the monotonic backbone of the search space, while the language model score is a non-monotonic
cost that distorts the search space. Similarly, in the MaxEnt RS model, the source-side features form a
monotonic score, while the target-side features constitute a non-monotonic cost that can be seen as
part of the language model.
For translating a source sentence e_1^J, the decoder adopts a bottom-up strategy. All derivations
are stored in a chart structure, and each cell c[i, j] of the chart contains all partial derivations which
correspond to the source phrase e_i^j. For translating a source-side span [i, j], we first select all
possible rules from the rule table. Meanwhile, we can obtain the features of the MaxEnt RS model
which are defined on the source side, since they are fixed before decoding. During decoding, suppose
the rule X → (e_i^k X1 e_t^j, f_i'^k' X1 f_t'^j') with i ≤ k and k + 1 < t is selected for the source
phrase e_i^j; then we can gather the features which are defined on the target side of the subphrase X1
from the ancestor chart cell c[k+1, t-1], since the span [k+1, t-1] has already been covered. The new
feature scores Prs and Prsn can then be computed, and the cost of the derivation can be obtained.
When the whole sentence is covered, decoding is complete, and the best derivation of the source
sentence e_1^J is the item with the lowest cost in cell c[1, J].
Table 7. Statistical table of the training and test corpus
Table 8. Statistics of the test corpus
Table 9. Numbers of requisite and effectuation parts in the test data
The advantage of our integration is that we need not change the main decoding algorithm of the
SMT system. Furthermore, the weights of the new features can be trained together with the other
features of the translation model.
4. EXPERIMENTS
We conducted experiments on the English-Japanese translation corpus provided by the Japanese
Law Translation Database System. The training corpus consisted of 40,000 original English-Japanese
sentence pairs; the development and test sets consisted of 1,400 and 516 sentence pairs,
respectively. The statistics of the corpus are shown in Table 7. We tested on 516 English-Japanese
sentence pairs; Table 8 shows statistics of the test corpus. The test set is split with the method
described in Section 3.1. Table 9 shows the number of sentences and the statistics of the requisite
parts, the effectuation parts and the logical parts after splitting in the test set. We then applied
rule selection to the split sentences in the test set.
To run the decoder, we share the same pruning settings with the Moses and Moses-chart [12] baseline
systems. To train the translation model, we first run GIZA++ [18] to obtain word alignments in
both translation directions. Then we use Moses-chart to extract the SCFG grammar rules. We use the
Stanford Tagger [20] and Cabocha [14] toolkits to tokenize and tag the English and Japanese sentences.
We parse the split English test sentences with the Stanford parser [20] and gather the lexical and syntax
features for training the MaxEnt RS models. The maximum initial phrase length is set to 7 and the
maximum rule length of the source side is set to 5.
Lex = Lexical Features, POS = POS Features, Len = Length Feature, Parent = Parent Features,
Sibling = Sibling Features.
Table 10. BLEU-4 scores (case-insensitive) on English-Japanese corpus.
We use the SRI language modeling toolkit [21] to train the language models, and minimum
error rate training integrated in Moses-chart to tune the feature weights for the log-linear
model. Translation quality is evaluated with the BLEU metric [19], as calculated by
mteval-v12.pl with case-insensitive matching of n-grams, where n = 4. After using Moses-chart
to extract rules we have a rule table, into which we insert the two new scores.
We evaluate both the original test sentences and the split test sentences with the MaxEnt RS model.
We compare the results of four systems: Moses on the original test sentences (MM),
Moses-chart on the original test sentences (MC), Moses-chart on the split test sentences (MS),
and Moses-chart on the split test sentences with rule selection, i.e., our system (MR).
The results are shown in Table 10.
As described above, we add two new features, Prs and Prsn, to integrate the MaxEnt RS models into
Moses-chart. We do not need to change the main decoding algorithm of the SMT system, and the
weights of the new features can be trained together with the other features of the translation model.
In Table 10, the Moses system on the original test sentences (MM) obtained a BLEU score of 0.287,
the Moses-chart system on the original test sentences (MC) obtained 0.306, and the Moses-chart
system on the split sentences (MS) obtained 0.318; using all of the defined features to train the
MaxEnt RS models for Moses-chart on the split test sentences, our system obtained 0.329, an
absolute improvement of 4.2 BLEU points over the MM system, 2.3 over the MC system and 1.1 over
the MS system. To explore the utility of the context features, we also train the MaxEnt RS models on
different feature sets. We find that the lexical features of nonterminals and the syntax features are
the most useful features, since they generalize over all training examples. Moreover, the lexical
features around nonterminals also yield an improvement. None of these features is used in the baseline.
When we used the MS system to extract rules, we obtained the rules shown in Table 11. Table 12 shows
the number of source-sides of SCFG rules for the English-Japanese corpus. After extracting grammar
rules from the training corpus, there are 12,148 source-sides that match the split test sentences;
these are hierarchical LHSs (H-LHS, LHSs which contain nonterminals). Of the hierarchical LHSs,
52.22% are ambiguous (AH-LHS, H-LHSs which have multiple translations). This indicates that
the decoder faces a serious rule selection problem during decoding. We also counted the
source-sides of the best translations for the split test sentences. By incorporating the MaxEnt RS
models, the proportion of AH-LHSs increases to 67.36%, because we use the feature Prsn to reward
ambiguous hierarchical LHSs. This has some advantages. On the one hand, H-LHSs can capture
phrase reorderings.
Table 11. Statistics of the rules
Table 12. Number of possible source-sides of SCFG rules for the English-Japanese corpus and
number of source-sides of the best translations. H-LHS = hierarchical LHS, AH-LHS = ambiguous
hierarchical LHS
Table 13. Translation examples of test sentences of Case 3 in the MS system and our system (MR, all
features). The Japanese sentence in Japanese-English translation is the original sentence; the
English sentence in English-Japanese translation is the reference translation from the government
web page.
On the other hand, AH-LHSs are more reliable than non-ambiguous LHSs, since most non-ambiguous
LHSs occur only once in the training corpus. To understand how the MaxEnt RS models improve the
performance of the SMT system, we studied the best translations of the MS system and of our system.
We find that the MaxEnt RS models improve translation quality in two ways:
Better Phrase Reordering
Since the SCFG rules which contain nonterminals can capture the reordering of phrases, better rule
selection produces better phrase reordering. Table 13 shows translation examples of Case 3 test
sentences in the MS system and our system; our system achieves better phrase reordering than the
MS system.
Better Lexical Translation
The MaxEnt RS models also help the decoder perform better lexical translation than the MS system.
This is because the SCFG rules contain terminals: when the decoder selects a rule for a source-side,
it also determines the translations of the source terminals. Examples where our system obtains
better lexical translations than the MS system are shown in the underlined parts of Table 13.
The advantage of the proposed method arises from the translation model based on the logical
structure of a legal sentence, where the decoder searches over shortened inputs. Because we use
the logical structure of a legal sentence to split the sentence, the split sentence preserves its
structure, and the average length of the split sentences is much smaller than that of the unsplit
sentences. This is expected to help realize an efficient statistical machine translation search.
The syntax features and the lexical features of non-terminals are useful features since they can be
generalized over all training examples. The lexical features around non-terminals also yield an
improvement, because translation rules contain terminals (words or phrases), non-terminals and
structural information: terminals indicate the lexical translation, while non-terminals and
structural information can capture short- or long-distance reordering. Therefore, rich lexical and
contextual information can help the decoder capture the reordering of phrases. Since the rules
which contain non-terminals can capture the reordering of phrases, better rule selection produces
better phrase reordering. The maximum entropy models also help the decoder perform better lexical
translation than the baseline, because the rules contain terminals: when the decoder selects a rule
for a source side, it also determines the translations of the source terminals.
5. CONCLUSION AND FUTURE WORK
We have shown in this paper that translating legal sentences by segmentation and rule selection
can help improve legal translation. We divided a legal sentence based on its logical
structure and applied rule selection to the split sentences. We use rich linguistic and
contextual information for both non-terminals and terminals and integrate it into
maximum entropy-based rule selection, resolving each ambiguity among the translation rules
to help the decoder know which rules are suitable. Experiments show that this approach to legal
translation achieves improvements over the tree-based SMT (Moses-chart) and Moses systems.
In the future, we will investigate more sophisticated features to improve legal sentence
segmentation and the maximum entropy-based rule selection model. We will also apply the
proposed method to training and test its performance on a large-scale corpus.
REFERENCES
[1] Bach, N. X., Minh, N. L., and Shimazu, A., (2010). "RRE Task: The Task of Recognition of
Requisite Part and Effectuation Part in Law Sentences", in International Journal of Computer
Processing of Languages (IJCPOL), Volume 23, Number 2.
[2] Carpuat, M., and Wu, D., (2007). "Improving statistical machine translation using word sense
disambiguation". In Proceedings of EMNLP-CoNLL, pp. 61-72.
[3] Chan, Y. S., Ng, H. T., and Chiang, D., (2007). "Word sense disambiguation improves statistical
machine translation". In Proceedings of the 45th Annual Meeting of the Association for
Computational Linguistics, pp. 33-40.
[4] Chiang, D., (2005). "A hierarchical phrase-based model for statistical machine translation". In
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp.
263-270.
[5] Chiang, D., (2007). "Hierarchical phrase-based translation". Computational Linguistics,
33(2):201-228.
[6] Chiang, D., Knight, K., and Wang, W., (2009). “11,001 new features for statistical machine
translation”. In Proceedings of the Human Language Technology Conference of the North American
Chapter of the Association for Computational Linguistics.
[7] Goh, C., Onishi, T., and Sumita, E., (2011). "Rule-based Reordering Constraints for Phrase-based
SMT", EAMT.
[8] He, Z., Liu, Q., and Lin, S., (2008). “Improving statistical machine translation using lexicalized rule
selection”. Proceedings of the 22nd International Conference on Computational Linguistics, August,
pp. 321-328.
[9] Hung, B., T., Minh, N., L., and Shimazu, A., (2012). “Divide and Translate Legal Text Sentence by
Using its Logical Structure”, in Proceedings of 7th International Conference on Knowledge,
Information and Creativity Support Systems, pp. 18-23.
[10] Katayama, K., (2007). “Legal Engineering-An Engineering Approach to Laws in E-society Age”, in
Proc. of the 1st International Workshop on JURISIN.
[11] Koehn, P., (2010). “Statistical machine translation”, 488 pages, Cambridge press.
[12] Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen,
W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., and Herbst, E., (2007). “Moses: Open
Source Toolkit for Statistical Machine Translation”, Annual Meeting of the Association for
Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June.
[13] Kudo, T., (2003). “CRF++: Yet Another CRF toolkit”, http://crfpp.sourceforge.net/
[14] Kudo, T., (2003). “Yet Another Japanese Dependency Structure Analyzer”,
http://chasen.org/taku/software/cabocha/
[15] Lafferty, J., McCallum, A., and Pereira, F., (2001). “Conditional Random Fields: Probabilistic Models
for Segmenting and Labeling Sequence Data”, in Proceedings of ICML, pp. 282–289
[16] Liu, Q., He, Z., Liu, Y., and Lin, S., (2008). "Maximum Entropy based Rule Selection Model for
Syntax-based Statistical Machine Translation". Proceedings of EMNLP, Honolulu, Hawaii.
[17] Nakamura, M., Nobuoka, S., and Shimazu, A., (2007). “Towards Translation of Legal Sentences into
Logical Forms”, in Proceedings of the 1st International Workshop on JURISIN.
[18] Och, F., and Ney, H., (2000). “Improved Statistical Alignment Models”, Proc. of the 38th Annual
Meeting of the Association for Computational Linguistics, Hongkong, China, pp. 440-447, October.
[19] Papineni, K., Roukos, S., Ward, T., and Zhu, W., (2002). "BLEU: A Method for Automatic
Evaluation of Machine Translation", in Proceedings of the 40th Annual Meeting of the ACL, pp.
311-318.
[20] Socher, R., Bauer, J., Manning, C., D., and Andrew, Y., Ng., (2013). "Parsing With Compositional
Vector Grammars". Proceedings of ACL.
[21] Stolcke, A., (2002). "SRILM - An Extensible Language Modeling Toolkit", in Proceedings of the
International Conference on Spoken Language Processing, vol. 2, Denver, CO, pp. 901-904.
[22] Sudoh, K., Duh, K., Tsukada, H., Hirao, T., and Nagata, M., (2010). "Divide and Translate:
Improving Long Distance Reordering in Statistical Machine Translation”, in Proceedings of the Joint
5th Workshop on SMT and MEtricsMATR, pp 418-427.
[23] Tanaka, K., Kawazoe, I., and Narita, H., (1993). “Standard Structure of Legal Provisions-for the
Legal Knowledge Processing by Natural Language-(in Japanese)”, in IPSJ Research Report on
Natural Language Processing, pp.79-86.
[24] Tsuruoka, Y., (2011). "A Simple C++ Library for Maximum Entropy Classification".
http://www-tsujii.is.s.u-tokyo.ac.jp/tsuruoka/maxent/.
[25] Xiong, H., Xu, W., Mi, H., Liu, Y., and Lu, Q., (2009). “Sub-Sentence Division for Tree-Based
Machine Translation”, in Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP,
Short Papers, Singapore, pp. 137-140
Authors
Bui Thanh Hung
He received his M.S. degree from the Japan Advanced Institute of Science and Technology
(JAIST) in 2010. He is currently a Ph.D. student. His research interests include natural
language processing and machine translation.
Nguyen Le Minh
He received his M.S. degree from Vietnam National University in 2001 and his Ph.D. degree
from the Japan Advanced Institute of Science and Technology in 2004. He is currently an
Associate Professor at the Graduate School of Information Science, Japan Advanced
Institute of Science and Technology (JAIST). His main research interests are Natural
Language Processing, Text Summarization, Machine Translation, and Natural Language
Understanding.
Akira Shimazu
He received his M.S. and Ph.D. degrees from Kyushu University in 1973 and 1991, respectively.
He is currently a Professor at the Graduate School of Information Science, Japan Advanced
Institute of Science and Technology (JAIST). His main research interests are Natural
Language Processing, Dialog Processing, Machine Translation, and Legal Engineering.