This paper presents a method for automatically detecting and correcting erroneous characters in Chinese text. The method treats typo correction as an integral part of syntactic analysis. It considers both the original character and possible replacement characters from a list of confusable pairs during sentence parsing. The character that results in the best parse is identified as correct. The approach achieves substantially higher recall and precision than existing Chinese proofreaders, which do not perform a full syntactic analysis. An evaluation on 50 character pairs found an overall precision of 86.9% and recall of 96.3%. Cases involving characters that can only form words together tended to have perfect scores, while characters that can stand alone were more difficult to correct.
Correction of Erroneous Characters in Chinese Sentence Analysis
Andi Wu
andiwu@microsoft.com
George Heidorn
georgehe@microsoft.com
Zixin Jiang
jiangz@microsoft.com
Terence Peng
tpeng@microsoft.com
Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052-6399
U.S.A.
(425) 936-0985
Abstract
This paper presents a method of automatically detecting and correcting erroneous
characters in electronic Chinese texts. The characters to be corrected are those that are
easily confused with other characters because of their phonological or graphical
resemblance. The correction takes place as an integral part of syntactic analysis, where
both the original character in the text and some replacement character(s) are considered,
and the character that ends up in the final parse is identified as the correct one. The
accuracy of this method substantially surpasses existing proofreaders in both recall and
precision in the category of substitution errors. A demo of the system is available.
This paper presents a system for the automatic detection and correction of erroneous characters in Chinese text. All of the characters corrected by the system are characters that are easily confused because they are identical in sound or similar in shape. Correction takes place during sentence analysis: the analyzer considers both the character in the original text and the characters in its confusion set, and the correct character is the one that appears in the structural tree of the best analysis. The error-correction recall and precision obtained with this method greatly exceed those of existing Chinese proofreading systems.
Introduction
Written Chinese uses more than 5,000 distinctive characters and many of them have
similar sounds or shapes. As a result, users of Chinese tend to mistake one character for
another when typing if these two characters are similar phonetically or graphically.
When speech input is used, the confusion transfers from the human user to the computer:
the speech recognizer may make mistakes because of the phonetic similarities between
characters. The automatic detection and correction of those erroneous characters are
crucial to the usability of any word processing software for Chinese. They are also
required for robust parsing, where sentences containing wrong characters can still be
successfully analyzed.
Existing Chinese word processors, such as Microsoft Word 2000 and IBM WordPro ’97,
already have proofreaders that can detect and occasionally correct wrong characters.
However, their recall and precision rates are fairly low (both below 65%). These
proofreaders are usually based on a word segmenter, an n-gram model, and some local
rules and heuristics. Their performance is hard to improve beyond a certain point
because they do not analyze the structure of a sentence and therefore have no access to
the larger context in which a character is used.
The approach to be discussed in this paper treats typo correction as an integral part of
syntactic analysis. It is motivated by the reality that we must be able to process sentences
that have not been edited and therefore may contain incorrect characters. Implemented in
NLPWin, the general-purpose language understanding system being developed at
Microsoft Research (Heidorn 2000, Jensen et al 1993), it assumes the existence of a
parser, a Chinese grammar, and the word segmentation mechanism described in Wu and
Jiang (1998). In addition, we need a list of confusable pairs.
A confusable pair consists of two characters that are often mistaken for each other.
Examples of such pairs are 那/哪, 帐/账, 清/青, 杨/扬, 于/与, etc. The actual system we
have implemented can also handle confusable triples, quadruples and so on, but we will
focus on confusable pairs in this paper for ease of exposition. The membership of the
confusable-pairs list can be customized for different purposes. If the text to be
proofread was created with a phonologically based input method, the list is expected to
contain pairs that sound identical. Texts created with a stroke-based input method, on
the other hand, call for pairs that are similar in shape. In a pedagogical setting,
schoolteachers can construct a list containing only the pairs they want their students to
learn to distinguish. When the system is integrated with a statistically based input
method, the list can even be generated dynamically from the candidates suggested by
the statistical model; in that case, a confusable set could be the top n candidates with
competing scores.
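To make the data structure concrete, here is a minimal sketch of how such a confusable list might be represented and queried. The table contents, the `CONFUSABLES` name, and the `confusion_set` helper are illustrative assumptions, not part of the actual system; note that sets of any size handle pairs, triples, and quadruples uniformly.

```python
# A hypothetical representation of the confusable list. Each entry groups
# characters that are easily mistaken for one another; pairs, triples and
# larger sets are handled uniformly.
CONFUSABLES = [
    {"那", "哪"},
    {"帐", "账"},
    {"清", "青"},
    {"杨", "扬"},
    {"于", "与"},
]

def confusion_set(char: str) -> set[str]:
    """Return the alternative characters for `char`, or the empty set
    if `char` is not on the confusable list."""
    alternatives: set[str] = set()
    for group in CONFUSABLES:
        if char in group:
            alternatives |= group - {char}
    return alternatives

if __name__ == "__main__":
    print(confusion_set("帐"))  # {'账'}
    print(confusion_set("好"))  # set()
```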
The correction process
This section covers the computational details of the correction process. We will start
with a general description of the approach and then give a step-by-step illustration of the
mechanism by going through an example.
General description
As we have mentioned in the introduction, the original motivation for this project is to
have a robust parser that can successfully analyze a sentence even if the sentence contains
erroneous characters. Since the analysis of a sentence involves both lexical processing
and syntactic processing, it is quite natural that some errors will be corrected during
lexical analysis, some during syntactic analysis, and some during both. The actual
process depends on the nature of the erroneous character.
In cases where the intended character is part of a multiple-character word, the wrong
character in its place will make it impossible for this word to be looked up in the
dictionary. In the following sentence, for example, the wrong character 彷 in the place of
仿 will prevent us from looking up the two-character word 仿佛.
人们没有理她,彷佛她并不存在。
In this case, the error can be easily corrected if we have the confusable pair 彷/仿 in our
list. Knowing that 彷 can be mistakenly used in place of 仿, we will consider both
possibilities in our dictionary lookup. Given the character 彷, we find that it cannot be
used as a word by itself and it is not able to combine with any of the neighboring
characters to form a word. In other words, the span of the sentence would be broken at
this character. The character 仿, on the other hand, can combine with 佛 to form the
word 仿佛. Since only the use of 仿 can result in a spanning lexical array, we have
sufficient evidence that 彷 should be replaced by 仿 in this sentence. Consequently, 彷
will be removed from the lexical array even before syntactic analysis begins.
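The lexical-level decision in this first case can be sketched as follows. The toy word list and the `can_form_word` helper are hypothetical simplifications; the real system consults its full dictionary and checks whether the lexical array spans the whole sentence.

```python
# Hypothetical sketch: at dictionary-lookup time, decide whether a character
# or its confusable alternative can participate in any word at its position.
# If only the alternative can, the original character is dropped from the
# lexical array before parsing, as with 彷 -> 仿 above.

DICTIONARY = {"仿佛", "人们", "没有", "存在"}
SINGLE_CHAR_WORDS = {"理", "她", "并", "不"}  # 彷 and 仿 are not words alone

def can_form_word(char: str, sentence: str, pos: int) -> bool:
    """True if `char` at position `pos` is itself a word or combines with a
    neighboring character into a dictionary word (two-character words only,
    for simplicity)."""
    if char in SINGLE_CHAR_WORDS:
        return True
    if pos + 1 < len(sentence) and char + sentence[pos + 1] in DICTIONARY:
        return True
    if pos > 0 and sentence[pos - 1] + char in DICTIONARY:
        return True
    return False

sentence = "人们没有理她,彷佛她并不存在。"
pos = sentence.index("彷")
for candidate in ("彷", "仿"):
    print(candidate, can_form_word(candidate, sentence, pos))
# Only 仿 can form a word (仿佛), so 彷 is replaced before syntactic analysis.
```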
Sometimes, the erroneous character is an independent word but it is meant to be part of a
multiple-character word. In the following sentence, 辨 is mistakenly used for 辩, which
is part of the word 辩论. However, both 辨 and 论 can be independent words themselves
and therefore the wrong character will not result in a non-spanning lexical array.
目前律师和法院正在就这个棘手案件进行辨论。
If 辨/辩 is a confusable pair in our list, we will try to look up both 辨论 and 辩论 in the
dictionary. We will find that only 辩论 is a word in the dictionary. In addition, it has an
“IgnoreParts” attribute, which means that, once these two characters are combined into a
single word, it will no longer be possible for the component characters of this word to be
independent words. In our system, all the records that are subsumed by an “IgnoreParts”
word are removed from the lexical array unless this word overlaps with another word.
The wrong character 辨 in this sentence is not a component of 辩论 but it is in the span
of 辩论 and therefore structurally subsumed by it. As a result, it will be removed from
the lexical array before syntactic analysis, and only the correct character will end up in
the final parse. The use of “IgnoreParts” in this way is not completely safe, since
theoretically 辨 and 论 can be two independent words in the sentence. But so far there
has not been any empirical evidence that this is a problem.
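A minimal sketch of the "IgnoreParts" pruning step follows, assuming a simplified record type. The `Record` class and `prune_ignore_parts` function are hypothetical names; the real system also keeps subsumed records when the IgnoreParts word overlaps another word, which this sketch omits.

```python
# Hypothetical sketch of "IgnoreParts" pruning. A lexical record is a
# (start, end, word) span in the lattice. When a word carries IgnoreParts,
# every record strictly inside its span is removed. (The real system keeps
# such records if the IgnoreParts word overlaps another word.)

from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    start: int
    end: int          # exclusive
    word: str
    ignore_parts: bool = False

def prune_ignore_parts(lattice: list[Record]) -> list[Record]:
    covered = [(r.start, r.end) for r in lattice if r.ignore_parts]
    def subsumed(r: Record) -> bool:
        return any(
            s <= r.start and r.end <= e and (r.end - r.start) < (e - s)
            for (s, e) in covered
        )
    return [r for r in lattice if not subsumed(r)]

# 辩论 is an IgnoreParts word spanning positions 0-2; the single characters
# 辨, 辩 and 论 in its span are removed before parsing.
lattice = [
    Record(0, 2, "辩论", ignore_parts=True),
    Record(0, 1, "辨"),
    Record(0, 1, "辩"),
    Record(1, 2, "论"),
]
print(prune_ignore_parts(lattice))  # only the 辩论 record survives
```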
In cases where the multiple-character word does not have the “IgnoreParts” attribute, we
will keep both this word and its components while lowering the probabilities of the
components. Consider the following example, assuming that 定/订 is in the list of
confusable pairs.
我冲到楼下的咖啡厅,猛喝了几杯咖啡,然后直奔旅行社定票。
Here the typist typed 定 in place of 订. Looking up both 定票 and 订票 in the dictionary, we
find that 订票 is a word while 定票 is not. However, 订票 does not have the
“IgnoreParts” attribute and both 定 and 票 are independent words. We cannot remove
the words subsumed by 订票 in this case. Instead, we will raise the probability of 订票
and lower the probabilities of 定, 订, and 票. During syntactic analysis, we may have
several spanning parses, each consisting of different words. However, the parse
containing 订票 will have the highest score and be selected as the best parse. As a result,
only 订 will be kept in the final analysis.
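The probability adjustment can be sketched as follows. The boost and penalty factors here are illustrative guesses, not the system's actual values, and `adjust_probabilities` is a hypothetical name.

```python
# Hypothetical sketch of the probability adjustment for words without
# IgnoreParts: the multiple-character word is boosted and its subsumed
# single-character competitors are penalized, so that a spanning parse
# containing the longer word outscores the alternatives.

BOOST, PENALTY = 2.0, 0.1  # illustrative factors only

def adjust_probabilities(lattice: dict[str, float],
                         word: str,
                         parts: list[str]) -> dict[str, float]:
    adjusted = dict(lattice)
    adjusted[word] = adjusted.get(word, 1.0) * BOOST
    for part in parts:
        if part in adjusted:
            adjusted[part] *= PENALTY
    return adjusted

# 订票 is a word but has no IgnoreParts; 定, 订 and 票 stay in the lattice
# with lowered probabilities.
lattice = {"订票": 1.0, "定": 1.0, "订": 1.0, "票": 1.0}
print(adjust_probabilities(lattice, "订票", ["定", "订", "票"]))
# {'订票': 2.0, '定': 0.1, '订': 0.1, '票': 0.1}
```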
There are also cases where both the wrong character and the intended character can
combine with their neighbors to form words. In the following example, 注 is used in
place of 主, but both 注 and 主 can combine with 意 to form words.
我遇到极大的困难,你给我出个注意吧。
Assuming that 注/主 is one of our confusable pairs, we will look up both 注意 and 主意.
There is no lexical evidence here to tell us whether 注 or 主 is the correct character in this
sentence. However, the choice is very clear at the syntactic level. 注意 is a verb and 主
意 is a noun. Since the grammar requires a noun in the slot occupied by 注意/主意, only
the use of 主意 in this sentence can produce a spanning parse. If the alternative words
have the same parts of speech and share all other lexical properties, then the choice will
be random unless we have semantic information to prefer one to the other.
Another case where only syntactic information can help is the one where both the wrong
character and the intended character are independent words and neither of them can
combine with neighboring words to form a longer word. The following sentence is such
an example.
那里有二十余幢中外投资大厦相继与年底前开工。
In this sentence, the wrong character 与 is used instead of 于. With 与/于 as one of the
confusable pairs in our system, we will put both words in the chart and see which of them
can “survive” the parsing process. It turns out that we can get a spanning parse only if 于
is used. We can then conclude that the intended character is 于 rather than 与.
Step-by-step illustration
Now we will go through the actual computational steps, using the following input
sentence as an example. (A condensed code sketch of the whole procedure follows step (6).)
他再那里泡制了这篇混帐文章?
This sentence contains 4 wrong characters: 再 for 在, 那 for 哪, 泡 for 炮, and 帐 for 账.
Now assume that these 4 pairs are in the list of confusables in our system. The correction
procedure will go as follows in the process of sentence analysis.
(1) Segment the input sentence into single characters and look up each of them in
the dictionary. For our example sentence, this results in an array of lexical
records, represented as a word lattice because there can be more than one record
in a given span of the sentence. Notice that 帐 is not an independent word in the
dictionary, so instead of having a regular part of speech it is marked with (n),
which means that its “potential” part of speech is a noun.
(2) For each character, check the list of confusables to see if it is a member of one of
the pairs. If so, retrieve the other member of the pair, look it up in the
dictionary, and add the new lexical record(s) to the word lattice. In our example,
the expanded lattice contains additional records for 哪, 在, 炮 and 账.
(3) Look up multiple-character words in the dictionary, using both the original
characters and the characters added in Step 2, and add the new records to the
word lattice.
We see that neither member of the 再/在 pair is able to combine with its
neighboring characters to form words. This is a case where we have two
competing single-character words. On the other hand, both members of the 那/哪
pair are able to combine with the next character to form 那里 and 哪里
respectively. This is a case where we have two competing multiple-character
words. Of the other two pairs, 泡/炮 and 帐/账, only one member in each pair
can form a multiple-character word with its neighbor, resulting in 炮制 and 混账.
(4) For each multiple-character word in the word lattice, check to see if it has the
“IgnoreParts” attribute. If so, the single characters subsumed by this word are
removed; otherwise, the subsumed single characters are assigned low
probabilities.
By this time, it has already become quite clear that 帐 is a wrong character in the
original text and it should be replaced by 账, because only the latter was able to
survive lexical processing. 泡 is most likely to be wrong, too, since it has low
probability while its competitor 炮 is part of a two-character word that has high
probability. The remaining decisions are to be made in syntactic processing.
(5) Submit the lexical array to syntactic processing, where every lexical record in the
word lattice is placed in the parser’s chart and competes with the others until a
successful parse is found.
We can see that the four replacement characters, 在, 哪, 炮 and 账, are in the tree,
while all the wrong characters have disappeared.
(6) Output the leaves of the tree as the corrected sentence:
他在哪里炮制了这篇混账文章?
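The six steps above can be condensed into the following sketch. The toy dictionary, the `PARSE_PREFERENCE` table (standing in for the parser's ranking of competing records), and the greedy path selection are all simplifying assumptions; the real system builds a full chart over the lattice and parses it.

```python
# Condensed, hypothetical sketch of steps (1)-(6). The dictionary, the
# preference table and the greedy path selection are toy stand-ins; the
# real system builds a lattice of lexical records and runs a chart parser.

CONFUSABLES = {"再": "在", "那": "哪", "泡": "炮", "帐": "账"}
WORDS = {"那里", "哪里", "炮制", "混账", "文章"}
IGNORE_PARTS = {"那里", "哪里", "炮制", "混账"}
# Stand-in for the parser: which records a successful parse would keep.
# (In the real system, 哪里 wins over 那里 because the sentence is a
# question, and 在 wins over 再 for syntactic reasons.)
PARSE_PREFERENCE = {"在": 1, "哪里": 1, "炮制": 1, "混账": 1, "文章": 1}

def candidates_at(sentence, i):
    # Steps (1)-(2): the original character plus any confusable alternative.
    chars = [sentence[i]]
    if sentence[i] in CONFUSABLES:
        chars.append(CONFUSABLES[sentence[i]])
    return chars

def build_lattice(sentence):
    # Step (3): single characters plus the two-character words they form.
    lattice, n = [], len(sentence)
    for i in range(n):
        for a in candidates_at(sentence, i):
            lattice.append((i, i + 1, a))
            if i + 1 < n:
                for b in candidates_at(sentence, i + 1):
                    if a + b in WORDS:
                        lattice.append((i, i + 2, a + b))
    return lattice

def prune(lattice):
    # Step (4): remove records subsumed by an IgnoreParts word.
    covered = [(s, e) for (s, e, w) in lattice if w in IGNORE_PARTS]
    return [(s, e, w) for (s, e, w) in lattice
            if not any(cs <= s and e <= ce and e - s < ce - cs
                       for cs, ce in covered)]

def best_path(lattice, n):
    # Steps (5)-(6): greedy stand-in for picking the records that survive
    # the best parse and reading the corrected sentence off its leaves.
    out, i = [], 0
    while i < n:
        start, end, word = max(
            (r for r in lattice if r[0] == i),
            key=lambda r: (PARSE_PREFERENCE.get(r[2], 0), r[1]))
        out.append(word)
        i = end
    return "".join(out)

sentence = "他再那里泡制了这篇混帐文章"  # question mark omitted for brevity
print(best_path(prune(build_lattice(sentence)), len(sentence)))
# -> 他在哪里炮制了这篇混账文章
```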
Evaluation and Discussion
Evaluation
The typo correction mechanism described above has been tested on more than 50
frequent pairs of confusable characters using randomly extracted sentences. For each
pair A/B, we first assume that the correction is in the direction A → B and then in the
direction B → A. In each case, we call the character on the left the “text character”
and the one on the right the “replacement character”. Following standard practice, we
ran both a recall test and a precision test on each pair in both directions; both tests are
also sketched in code after the two procedures below.
The extracted sentences are hand-checked to make sure that they do not contain
erroneous characters. In the recall test, we did the following for each direction A → B,
where A is the text character and B is the replacement character.
(1) Collect 100 sentences that contain B;
(2) Replace all the instances of B with A;
(3) Parse the modified sentences with the correction mechanism and output the
corrected sentences.
(4) Compare the output with the original sentences to see how many errors have been
corrected.
(5) Recall rate = number of sentences corrected / total number of sentences.
In the precision test, we did the following for each direction A → B:
(1) Collect 100 sentences that contain A;
(2) Parse those sentences with the correction mechanism.
(3) Compare the output with the original sentences to see how many sentences remain
unchanged (where the As have not been mistakenly changed to Bs).
(4) Precision rate = number of sentences unchanged / total number of sentences.
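Both tests reduce to a simple harness, sketched below. The `correct` parameter stands in for running the full correction mechanism on one sentence; the function names and the toy corrector are hypothetical, and the example sentences are illustrative only.

```python
# Hypothetical sketch of the recall and precision tests defined above.

from typing import Callable

def recall_test(sentences_with_B: list[str], a: str, b: str,
                correct: Callable[[str], str]) -> float:
    """Seed errors by replacing B with A, then measure how many
    sentences the system restores exactly."""
    restored = sum(
        1 for s in sentences_with_B if correct(s.replace(b, a)) == s
    )
    return restored / len(sentences_with_B)

def precision_test(sentences_with_A: list[str],
                   correct: Callable[[str], str]) -> float:
    """Run correction on clean sentences containing A and measure how
    many are (correctly) left unchanged."""
    unchanged = sum(1 for s in sentences_with_A if correct(s) == s)
    return unchanged / len(sentences_with_A)

# Toy stand-in for the system: always turns 帐 into 账.
def toy(s: str) -> str:
    return s.replace("帐", "账")

print(recall_test(["记一笔账", "账单来了"], "帐", "账", toy))  # 1.0
print(precision_test(["他在帐篷里"], toy))                     # 0.0
```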
Here is a representative sample of the test results:

Text Character   Replacement Character   Recall Rate   Precision Rate
那                哪                       71%           90%
与                于                       87%           83%
帐                账                       98%           90%
嘛                吗                       82%           96%
辨                辩                       95%           74%
煅                锻                       100%          100%
彷                仿                       100%          100%
作                做                       30%           100%
在                再                       70%           97%
坐                座                       86%           93%
The overall precision rate across all the pairs tested is 86.9% and the overall recall
rate is 96.3%.
Discussion
Some interesting observations can be made on the results in the above table.
(1) 煅 → 锻 and 彷 → 仿 have perfect scores on both recall and precision. This is
because all four characters involved here are usually used as parts of
multiple-character words rather than words by themselves. In a given context,
only one member of the pair will be able to combine with neighboring characters
to form words. In these cases, the correction can be basically done at the lexical
level and the result is seldom affected by syntactic analysis.
(2) 那 → 哪 has a fairly high precision score but a low recall score. This is because 那
can be used wherever 哪 can be used unless there is clear evidence, such as a
question mark, that indicates that the sentence is a wh-question. (This is
comparable to there/where in English.) A similar situation is found in 嘛 → 吗,
where 吗 is only licensed in a yes/no question.
(3) 辨 → 辩 has a high recall score but a low precision score. Looking at the data, we
find that, in all the cases where 辨 is mistakenly changed to 辩, 辨 is used as an
independent word. Since 辩 can also be used as an independent word with the
same part of speech and other grammatical attributes, the chances of their
appearing in the final parse are almost equal, making the correction close to
random.
(4) 在 → 再 has a low recall score but a high precision score. Both 在 and 再 are very
active independent words and the grammatical contexts in which 再 appears are a
proper subset of the contexts where 在 appears. Therefore it is not so easy to
switch from 在 to 再.
(5) 作 → 做 has an extremely low recall score. This is because 作 is becoming a free
variant of 做 in contemporary written Chinese. In many words and grammatical
contexts where 做 should be used, 作 can also be used. In other words, the
confusion is being accepted by the general public. Therefore, this pair should
probably be removed from the list of confusables for practical purposes.
The test results also reveal the following limitations of the current approach:
(1) It assumes that every sentence can be successfully parsed as long as the wrong
characters have been corrected. Therefore, we will have a problem if the input is
not grammatical or is beyond the grammar coverage of the parser. In this case,
the correction is guaranteed to work only if the choice is already clear during
lexical processing.
(2) Some sentences can produce spanning parses in spite of the fact that they contain
erroneous characters. This usually happens when the wrong character is an
independent word and it happens to have the right syntactic properties to fit into a
parse.
(3) In cases where both the text character and the replacement character can combine
with their neighbors to form multiple-character words, the success of the
correction is largely a matter of chance if the words they form are of the same
length and have the same part of speech and other grammatical properties.
(4) The current approach only covers the correction of “substitution errors” where the
wrong character(s) and intended character(s) span the same position(s) in a
sentence. It does not cover cases where characters are accidentally added or
omitted.
(5) It still remains to be seen how long the list of confusable pairs can become. Since
every confusable character in a sentence introduces some additional records to the
lexical array, a sentence containing too many confusable characters will have a
more complex word lattice, which results in more ambiguity and noise and makes
parsing more difficult.
Despite these limitations, the current approach covers cases where no solution was
available before and achieves accuracy rates not matched by any existing
proofreader. While it does not solve every problem in Chinese proofreading, it
provides a better framework in which more problems can be solved.
References
Heidorn, G. E. (2000). Intelligent Writing Assistance. In Dale, R., Moisl, H., and
Somers, H. (eds.), A Handbook of Natural Language Processing: Techniques and
Applications for the Processing of Language as Text, pages 181-207. Marcel Dekker,
New York (published in August 2000).
Guo, Zhili and Zhaoming Qiu (1997). The candidate suggestion algorithm in a practical
Chinese error check system. In Chen, Liwei et al. (eds.), Language Engineering, pages
325-331. Tsinghua University, Beijing.
Jensen, K., Heidorn, G., and Richardson, S. (eds.) (1993). Natural Language Processing:
The PLNLP Approach. Kluwer, Boston.
Song, Rou, L. Ouyang and C. Qiu (1998). The functions of a Chinese proofreading
system. Publication and Printing, Vol. 3, 1998.
Sun, Cai and Zhengsheng Luo (1997). Research on the lexical errors in Chinese text. In
Chen, Liwei et al. (eds.), Language Engineering, pages 319-324. Tsinghua University,
Beijing.
Wu, Andi and Zixin Jiang (1998). Word segmentation in sentence analysis. In
Proceedings of the 1998 International Conference on Chinese Information Processing,
pages 169-180, Beijing, China.
Zhang, Yangsen and Bingqing Ding (1997). Research and practice on the lexical error
detecting system based on “bundling and filtering” in Chinese text automatic
proofreading. In Proceedings of the 1998 International Conference on Chinese
Information Processing, pages 392-397, Beijing, China.