This document describes a learning procedure that automatically acquires knowledge about verb-noun relations in Chinese. It uses an existing parser, a large corpus, and statistical methods to learn which verb-noun pairs typically occur in a verb-object relation and which in a modifier-head relation. The learned knowledge is then used to disambiguate parses, improving the accuracy of the original parser. In an evaluation on 500 sentences, the parser found the correct analysis for 350 sentences when using the acquired knowledge, a significant improvement.
Learning Verb-Noun Relations to Improve Parsing
Andi Wu
Microsoft Research
One Microsoft Way, Redmond, WA 98052
andiwu@microsoft.com
Abstract
The verb-noun sequence in Chinese often creates ambiguities in parsing. These ambiguities can usually be resolved if we know in advance whether the verb and the noun tend to be in the verb-object relation or the modifier-head relation. In this paper, we describe a learning procedure whereby such knowledge can be automatically acquired. Using an existing (imperfect) parser with a chart filter and a tree filter, a large corpus, and the log-likelihood-ratio (LLR) algorithm, we were able to acquire verb-noun pairs which typically occur either in verb-object relations or in modifier-head relations. The learned pairs are then used in the parsing process for disambiguation. Evaluation shows that the accuracy of the original parser improves significantly with the use of the automatically acquired knowledge.
1 Introduction
Computer analysis of natural language sentences is a challenging task, largely because of the ambiguities in natural language syntax. In Chinese, the lack of inflectional morphology often makes the resolution of those ambiguities even more difficult. One type of ambiguity is found in the verb-noun sequence, which can appear in at least two different relations, the verb-object relation and the modifier-head relation, as illustrated in the following phrases.
(1) 登记 手续 的 费用
dengji shouxu de feiyong
register procedure DE expense
“the expense of the registration procedure”
(2) 办理 手续 的 费用
banli shouxu de feiyong
handle procedure DE expense
“the expense of going through the procedure”
In (1), the verb-noun sequence “登记 手续” is an example of the modifier-head relation, while “办理 手续” in (2) is an example of the verb-object relation. The correct analyses of these two phrases are given in Figure 1 and Figure 2, where “RELCL” stands for “relative clause”:
Figure 1. Correct analysis of (1)
Figure 2. Correct analysis of (2)
However, with the set of grammar rules that cover the above phrases, and without any semantic or collocational knowledge of the words involved, there is nothing to prevent us from producing the wrong analyses in Figure 3 and Figure 4.
Figure 3. Wrong analysis of (1)
Figure 4. Wrong analysis of (2)
To rule out these wrong parses, we need to know that 登记 is a typical modifier of 手续, while 办理 typically takes 手续 as an object. The question is how to acquire such knowledge automatically. In the rest of this paper, we will present a learning procedure that learns those relations by processing a large corpus with a chart filter, a tree filter and an LLR filter. The approach is in the spirit of Smadja (1993) on retrieving collocations from text corpora, but is more integrated with parsing. We will show in the evaluation section how much the learned knowledge can help improve sentence analysis.
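The LLR filter is not spelled out in this excerpt; the log-likelihood-ratio statistic standardly used for collocation discovery (Dunning, 1993) scores the association between two words from a 2x2 contingency table of co-occurrence counts. The following is a minimal Python sketch of that statistic, offered as background rather than as the authors' implementation:

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (Dunning, 1993) for a 2x2 contingency table.

    k11: contexts where the verb and the noun co-occur
    k12: contexts where the verb occurs without the noun
    k21: contexts where the noun occurs without the verb
    k22: contexts where neither occurs
    """
    def s(*counts):
        # Entropy-style term: sum of k * log(k / total), skipping zero cells.
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)

    return 2 * (s(k11, k12, k21, k22)
                - s(k11 + k12, k21 + k22)
                - s(k11 + k21, k12 + k22))

# Higher scores indicate a stronger verb-noun association (counts are made up).
score = llr(k11=110, k12=2442, k21=111, k22=29114)
```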
2 The Learning Procedure
The syntactic ambiguity associated with the verb-noun sequence can be either local or global. The kind of ambiguity we have observed in (1) and (2) is global in nature: it exists even if the noun phrase is plugged into a larger structure or a complete sentence. There are also local ambiguities, where the ambiguity disappears once the verb-noun sequence is put into a broader context. In the following examples, the sentences in (3) and (4) can only receive the analyses in Figure 5 and Figure 6 respectively.
(3) 这 是 新 的 登记 手续。
zhe shi xin de dengji shouxu
this be new DE register procedure
“This is a new registration procedure.”
(4) 你 不 必 办理 手续。
ni bu bi banli shouxu
you not must handle procedure
“You don’t have to go through the procedure.”
Figure 5. Parse tree of (3)
Figure 6. Parse tree of (4)
In the processing of a large corpus, sentences with global ambiguities only have a random chance of being analyzed correctly, but sentences with local ambiguities can often receive correct analyses. Although local ambiguities will create some confusion in the parsing process, increase the size of the parsing chart, and slow down processing, they can be resolved in the end unless we run out of resources (in terms of time and space) before the analysis is complete. There should therefore be a sufficient number of cases in the corpus where the relationship between the verb and the noun is clear. An obvious strategy is to learn from the clear cases and use the learned knowledge to help resolve the unclear ones. If a verb-noun pair appears predominantly in the verb-object relation or the modifier-head relation throughout the corpus, we should prefer that relation everywhere else.
A simple way to learn such knowledge is to use a tree filter to collect all instances of each verb-noun pair in the parse trees of a corpus, count the number of times they appear in each relation, and then compare the frequencies to decide which relation is the predominant one for a given pair. Once we have the information that 登记 is typically a modifier of 手续 and that 办理 typically takes 手续 as an object, for instance, the phrase in (1) will only receive the analysis in Figure 1, and (2) only the analysis in Figure 2. However, this only works in idealized situations where the parser is doing an almost perfect job, in which case no learning would be necessary. In reality, the parse trees are not always reliable, and the relations extracted from them can contain a fair amount of noise. It is not hard to imagine a verb-noun pair that occurs only a couple of times in the corpus and is misanalyzed in every instance. If such noise is not filtered out, the acquired knowledge will mislead us and minimize the benefit we get from this approach.
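In outline, this naive tree-filter strategy amounts to the following sketch (Python, with a hypothetical parse-tree interface; the actual system's API is not described here):

```python
from collections import Counter

pair_relation_counts = Counter()

def collect_pairs(parse_tree):
    # verb_noun_relations() is an assumed accessor yielding
    # (verb, noun, relation) triples, where relation is either
    # "verb-object" or "modifier-head".
    for verb, noun, relation in parse_tree.verb_noun_relations():
        pair_relation_counts[(verb, noun, relation)] += 1

def predominant_relation(verb, noun):
    # The naive decision rule: simply compare raw frequencies.
    vo = pair_relation_counts[(verb, noun, "verb-object")]
    mh = pair_relation_counts[(verb, noun, "modifier-head")]
    return "verb-object" if vo >= mh else "modifier-head"
```

As the text notes, raw counts like these are vulnerable to low-frequency noise, which motivates the normalization described next.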
3. An obvious solution to this problem is to ignore
all the low frequency pairs and keep the high fre-
quency ones only, as wrong analyses tend to be
random. But the cut-off point is difficult to set if
we are only looking at the raw frequencies, whose
range is hard to predict. The cut-off point will be
too low for some pairs and too high for others. We
need a normalizing factor to turn the raw frequen-
cies into relative frequencies. Instead of asking
“which relation is more frequent for a given pair?”,
the question should be “of all the instances of a
given verb-noun pair in the corpus, which relation
has a higher percentage of occurrence?”. The
normalizing factor should then be the total count of
a verb-noun pair in the corpus regardless of the
syntactic relations between them. The normalized
frequency of a relation for a given pair is thus the
number of times this pair is assigned this relation
in the parses divided by this normalizing factor.
For example, if 登记 手续 occurs 10 times in the
corpus and is analyzed as verb-object 3 times and
modifier-head 7 times, the normalized frequencies
for these two relations will be 30% and 70% re-
spectively. What we have now is actually the
probability of a given pair occurring in a given re-
lationship. This probability may not be very accu-
rate, given the fact that the parse trees are not
always correct, but it should be a good approximation,
assuming that the corpus is large enough and most
of the potential ambiguities in the corpus are local
rather than global in nature.
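As a quick sketch of this computation, using the counts from the example just given:

  def normalized_frequency(relation_count, total_pair_count):
      # Relative frequency of one relation for a verb-noun pair.
      return relation_count / total_pair_count

  # The worked example from the text: the pair 登记+手续 occurs
  # 10 times, 3 times as verb-object and 7 times as modifier-head.
  assert normalized_frequency(3, 10) == 0.3  # 30% verb-object
  assert normalized_frequency(7, 10) == 0.7  # 70% modifier-head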
But how do we count the number of verb-noun
pairs in a corpus? A simple bigram count will un-
justly favor the modifier-head relation. While the
verb and the noun are usually adjacent when the
verb modifies the noun, they can be far apart when
the noun is the object of the verb, as illustrated in
(5).
(5) 他们 正在 办理 去 台湾 参加
tamen zhengzai banli qu taiwan canjia
they PROG handle go Taiwan participate
第十九届 国际 计算 语言学
dishijiujie guoji jisuan yuyanxue
19th international compute linguistics
会议 的 手续。
huiyi de shouxu
conference DE procedure
“They are going through the procedures for going to Taiwan for the 19th International Conference on Computational Linguistics.”
To get a true normalizing factor, we must count
all the potential dependencies, both local and long-
distance. This is also required because the tree-filter we use to collect pair relations considers both local and long-distance dependencies.
Since simple string matching is not able to get the
potential long-distance pairs, we resorted to the use
of a chart-filter. As the parser we use is a chart
parser, all the potential constituents are stored in
the chart, though only a small subset of those will
end up in the parse tree. Among the constituents
created in the chart for the sentence in (5), for in-
stance, we are supposed to find [办理] and [去台
湾参加第十九届国际计算语言学会议的手续]
which are adjacent to each other. The fact that 手
续 is the head of the second phrase then makes 手
续 adjacent to 办理. We will therefore be able to
get one count of 办理 followed by 手续 from (5)
despite the long span of intervening words between
them. The use of the chart-filter thus enables us to
make our normalizing factor more accurate. The
probability of a given verb-noun pair occurring in a
given relation is now the total count of this relation
in the parse trees throughout the corpus divided by
the total count of all the potential relations found in
all the charts created during the processing of this
corpus.
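In code, the estimate becomes the following; tree_counts is the output of the tree-filter sketch above, and chart_counts is assumed to come from the chart-filter, keyed by (verb, noun):

  def relation_probability(tree_counts, chart_counts,
                           verb, noun, relation):
      # P(relation | verb, noun): times the relation appears in the
      # parse trees, divided by all potential co-occurrences of the
      # pair found in the charts (local and long-distance alike).
      potential = chart_counts.get((verb, noun), 0)
      if potential == 0:
          return 0.0
      return tree_counts.get((verb, noun, relation), 0) / potential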
The cut-off point we finally used is 50%, i.e. a
pair+relation will be kept in our knowledge base if
the probability obtained this way is more than
50%. This may seem low, but it is a more demanding threshold than it appears, considering that verb-object and modifier-head are not the only relations that can hold between a verb and a noun. In (6), for example, 办理 is not related to 手续 in either way despite their adjacency.
(6) 他们 去 上海 办理 手续 所需
tamen qu shanghai banli shouxu suoxu
they go Shanghai handle procedure need
的 公证 材料。
de gongzheng cailiao
DE notarize material
“They went to Shanghai to handle the notarized material needed for the procedure.”
We will still find the 办理 手续 pair in the
chart, but it is not expected to appear in either the
verb-object relation or modifier-head relation in
the parse tree. Therefore, the baseline probability for any pair+relation might be far below 50%, and a value above 50% is a good indicator that a given pair does typically occur in a given relation. We can
also choose to keep all the pairs with their prob-
abilities in the knowledge base and let the prob-
abilities be integrated into the probability of the
complete parse tree at the time of parse ranking.
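A sketch of the resulting knowledge-base construction, with the 50% cut-off described above (the probability is retained so that it can also be used for parse ranking):

  THRESHOLD = 0.5  # the 50% cut-off discussed in the text

  def build_knowledge_base(tree_counts, chart_counts):
      # Keep each pair+relation whose chart-normalized probability
      # exceeds the threshold; store the probability for ranking.
      kb = {}
      for (verb, noun, relation), k in tree_counts.items():
          p = k / chart_counts[(verb, noun)]
          if p > THRESHOLD:
              kb[(verb, noun)] = (relation, p)
      return kb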
The results we obtained from the above proce-
dure are quite clean, in the sense that most of the
pairs that are classified into the two types of rela-
tions with a probability greater than 50% are cor-
rect. Here are some sample pairs that we learned.
Verb-Object:
检验 真理 test - truth
配置 资源 allocate - resources
经营 业务 manage - business
奉献 爱心 offer - love
欺骗 行人 cheat - pedestrians
Modifier-Head:
检验 标准 testing - standard
配置 方案 allocation - plan
经营 方式 management - mode
奉献 精神 offering - spirit
欺骗 行为 cheating - behavior
However, there are pairs that are correct but not
“typical” enough, especially in the verb-object re-
lations. Here are some examples:
具有 意义 have - meaning
具有 效力 have - impact
具有 色彩 have - color
具有 作用 have - function
具有 功效 have - effect
…
These are truly verb-object relations, but we may
not want to keep them in our knowledge base for
the following reasons. First of all, the verbs in
such cases usually can take a wide range of objects
and the strength of association between the verb
and the object is weak. In other words, the objects
are not “typical”. Secondly, those verbs tend not
to occur in the modifier-head relation with a fol-
lowing noun and we gain very little in terms of
disambiguation by storing those pairs in the
knowledge base. To prune away those pairs, we
used the log-likelihood-ratio algorithm (Dunning,
1993) to compute the degree of association be-
tween the verb and the noun in each pair. Pairs
where there is high “mutual information” between
the verb and noun would receive higher scores
while pairs where the verb can co-occur with many
different nouns would receive lower scores. Pairs
with association scores below a certain threshold
were then thrown out. This not only makes the
remaining pairs more “typical” but helps to clean
out more garbage. The resulting knowledge base
therefore has higher quality.
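For reference, Dunning's log-likelihood ratio can be computed from the 2x2 contingency table of a verb-noun pair as sketched below; the cut-off applied to these scores is a tuning decision not spelled out here:

  import math

  def _denorm_entropy(counts):
      # Unnormalized entropy: -sum k * ln(k / N); zero counts skipped.
      total = sum(counts)
      return -sum(k * math.log(k / total) for k in counts if k > 0)

  def log_likelihood_ratio(k11, k12, k21, k22):
      # Dunning (1993) G-squared score for a 2x2 contingency table:
      #   k11 = count(verb, noun)
      #   k12 = count(verb, other nouns)
      #   k21 = count(other verbs, noun)
      #   k22 = count(other verbs, other nouns)
      # High scores indicate strong verb-noun association.
      row = _denorm_entropy([k11 + k12, k21 + k22])
      col = _denorm_entropy([k11 + k21, k12 + k22])
      mat = _denorm_entropy([k11, k12, k21, k22])
      return 2 * (row + col - mat)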
3 Evaluation
The knowledge acquired by the method described
in the previous section is used in subsequent sen-
tence analysis to prefer those parses where the
verb-noun sequence is analyzed in the same way as
specified in the knowledge base. When processing a large corpus, what we typically do is analyze the corpus twice. The first pass is the learning
phase where we acquire additional knowledge by
parsing the corpus. The knowledge acquired is
used in the second pass to get better parses. This is
one example of the general approach of “improv-
ing parsing by parsing”, as described in (Wu et al., 2002).
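Schematically, the two-pass procedure looks like this; parser.parse and the learn argument are hypothetical interfaces standing in for the actual sentence analyzer and the learning procedure described in the previous section:

  def two_pass_analysis(corpus, parser, learn):
      # Pass 1: parse without extra knowledge, then learn typical
      # verb-noun relations from the resulting parses.
      first_pass = [parser.parse(sentence) for sentence in corpus]
      kb = learn(first_pass)
      # Pass 2: re-parse, letting the knowledge base guide the
      # choice among competing analyses.
      return [parser.parse(sentence, knowledge=kb)
              for sentence in corpus]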
To find out how much the learned knowledge
contributes to the improvement of parsing, we per-
formed a human evaluation. In the evaluation, we
used our existing sentence analyzer (Heidorn 2000,
Jensen et al 1993, Wu and Jiang 1998) to process a
corpus of 271,690 sentences to learn the verb-noun
relations. We then parsed the same sentences first
without the additional knowledge and then with the
acquired knowledge. Comparing the outputs, we
found that 16,445 (6%) of the sentences had differ-
ent analyses in the two passes. We then randomly
selected 500 sentences from those “diff” sentences
and presented them to a linguist from an independ-
ent agency who, given two different parses of the
same sentence, was asked to pick the parse she
judged to be more accurate. The order in which
the parses were presented was randomized so that
the evaluator had no idea as to which tree was from
the first pass and which one from the second pass.
The linguist’s judgment showed that, with the additional knowledge we acquired, 350 (70%) of those sentences parsed better, 85 (17%) parsed worse, and 65 (13%) had parses that were equally good or bad. In other
words, the accuracy of sentence analysis improved
significantly with the learning procedure discussed
in this paper.
Here is an example where the parse became better when the automatically acquired knowledge was used. Due to space limitations, only the parses of a fragment of the sentence are given here:
(7) 要 遵照 国家 测试 标准
yao zunzhao guojia ceshi biaozhun
want follow nation testing standard
“(You) must follow the national testing
standards.”
Because 遵照 is ambiguous between a verb (“follow”) and a preposition (“in accordance with”), this sentence fragment got the
parse tree in Figure 7 before the learned knowledge
was used, where 标准 was misanalyzed as the ob-
ject of 测试:
Figure 7. Old parse of (7)
During the learning process, we acquired “测试-
标准” as a typical pair where the two words are in
the modifier-head relationship. Once this pair was
added to our knowledge base, we got the correct
parse, where 遵照 is analyzed as a verb and 测试
as a modifier of 标准:
Figure 8. New parse of (7)
We later inspected the sentences where the
parses became worse and found two sources for the
regressions. The main source was of course errors
in the learned results, since they had not been
manually checked. The second source was an en-
gineering problem: the use of the acquired knowl-
edge required the use of additional memory and
consequently exceeded some system limitations
when the sentences were very long.
4 Future work
The approach described in this paper can be ap-
plied to the learning of many other typical syntac-
tic relations between words. We have already used
it to learn noun-noun pairs where the first noun is a
typical modifier of the second noun. This has
helped us to rule out incorrect parses where the
two nouns were not put into the same constituent.
Other relations we have been trying to learn in-
clude:
• Noun-noun pairs where the two nouns are in
conjunction (e.g. 新郎 新娘 “bridegroom and bride”);
• Verb-verb pairs where the two verbs are in
conjunction (e.g. 调查 研究 “investigate and
study”);
• Adjective-adjective pairs where two adjectives
are in conjunction (e.g. 年轻 漂亮 “young and
beautiful”);
• Noun-verb pairs where the noun is a typical
subject of the verb.
Knowledge of this kind, once acquired, will benefit
not only parsing, but other NLP applications as
well, such as machine translation and information
retrieval.
In terms of parsing, the benefit we get there is
similar to what we get in lexicalized statistical
parsing where parsing decisions can be based on
specific lexical items. However, the training of a statistical parser requires a tree bank, which is expensive to create, while our approach does not. Our
approach does require an existing parser, but this
parser does not have to be perfect and can be im-
proved as the learning goes on. Once the parser is
reasonably good, what we need is just raw text,
which is available in large quantities.
5 Conclusion
We have shown in this paper that parsing quality
can be improved by using the parser as an auto-
matic learner which acquires new knowledge in the
first pass to help analysis in the second pass. We
demonstrated this through the learning of typical
verb-object and modifier-head relations. With the
use of a chart-filter, a tree-filter and the LLR algo-
rithm, we are able to acquire such knowledge with
high accuracy. Evaluation shows that the quality
of sentence analysis can improve significantly with
the help of the automatically acquired knowledge.
References
Dunning, T. 1993. Accurate methods for the statistics
of surprise and coincidence. Computational Linguis-
tics, 19(1): 61-74.
Heidorn, G. E. 2000. Intelligent writing assistance, in A
Handbook of Natural Language Processing: Tech-
niques and Applications for the Processing of Lan-
guage as Text, Dale R., Moisl H., and Somers H. eds.,
Marcel Dekker, New York, pp. 181-207.
Jensen, K., G. Heidorn and S. Richardson. 1993. Natural Language Processing: the PLNLP Approach. Kluwer Academic Publishers, Boston.
Smadja, F. 1993. Retrieving collocations from text:
Xtract. Computational Linguistics, 19(1): 143-177.
Wu, Andi, J. Pentheroudakis and Z. Jiang. 2002. Dynamic lexical acquisition in Chinese sentence analysis. In Proceedings of the 19th International Conference on Computational Linguistics, pp. 1308-1312, Taipei, Taiwan.
Wu, Andi and Z. Jiang. 1998. Word segmentation in sentence analysis. In Proceedings of 1998 International Conference on Chinese Information Processing, pp. 169-180, Beijing, China.