Robust rule-based parsing



Material of the Natural Language Processing (NLP) Workshop with STIC-Asia representatives and the Nepal team.

August 30-31, 2007.
Patan Dhoka, Lalitpur, Nepal.




Presentation Transcript

  • Robust rule-based parsing (quick overview)
      • I. Robustness
      • II. Three robust rule-based parsers of English
      • III. Common features
      • IV. Example: identification of subjects in Syntex
  • I. Robustness (Aït-Mokhtar et al. 1997)
      • « the ability to provide useful analyses for real-world input text. By useful analyses, we mean analyses that are (at least partially) correct and usable in some automatic task or application »
      • implies:
          • one analysis (even partial) for any real-world input
          • the ability to process irregular input and to overcome errors
          • analysis efficiency
  • I. Types of robust parsers (Aït-Mokhtar et al. 1997)
      • based on traditional theoretical models, with rule-based and/or stochastic post-processing
          • Minipar (Lin 1995)
      • stochastic parsers
          • Charniak's parser (2000)
      • rule-based parsers
          • Non-Projective Dependency Parser (Tapanainen & Järvinen 1997)
          • Syntex (Bourigault 2007)
          • Cass (Abney 1990, 1995)
      • most parsers are hybrid
  • II.1 Non-Projective Dependency Parser (Tapanainen & Järvinen 1997)
      • Pipeline: Tagged text → Syntactic labeling → Selection of syntactic links → Pruning → Output
      • Resources: valency and subcategorization information
      • Syntactic labeling: « all legitimate surface-syntactic labels are added to the set of morphological readings »
      • Selection of syntactic links: « syntactic rules discard contextually illegitimate alternatives or select legitimate ones »
      • Pruning: general heuristics disambiguate the last of the syntactic links
  • II.1 Non-Projective Dependency Parser (Tapanainen & Järvinen 1997)
      • Rules establish dependency links between words
      • Rules are contextual: SELECT (@SUBJ) IF (1C AUXMOD HEAD);
          • Example: "How do you do?" ("do" is AUX, "you" is SUBJ)
          • If the preceding word is an unambiguous auxiliary, the current word is the subject of this auxiliary
      • Rules use syntactic links established by preceding rules
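The contextual rule on this slide can be illustrated with a minimal Python sketch. It follows the slide's prose reading of the rule (an unambiguous auxiliary immediately before the current word) and does not implement the parser's actual rule language. The `Token` class and the label names `@AUX`, `@ADVL`, `@MAINV` and `@OBJ` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    readings: set  # candidate syntactic labels still alive for this word


def select_subj_after_aux(tokens):
    """Keep only @SUBJ on a word whose left neighbour is unambiguously an auxiliary.

    Assumption: "unambiguous" means the neighbour has exactly one reading left.
    """
    for prev, cur in zip(tokens, tokens[1:]):
        if prev.readings == {"@AUX"} and "@SUBJ" in cur.readings:
            cur.readings = {"@SUBJ"}  # SELECT: discard every other reading
    return tokens


sentence = [
    Token("How", {"@ADVL"}),
    Token("do", {"@AUX"}),
    Token("you", {"@SUBJ", "@OBJ"}),  # ambiguous before the rule applies
    Token("do", {"@MAINV"}),
]
select_subj_after_aux(sentence)
print([(t.form, sorted(t.readings)) for t in sentence])
```

If the auxiliary itself were still ambiguous, the rule would not fire and both readings of "you" would survive for a later rule or heuristic.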
  • II.2 Syntex (Bourigault 2007)
      • Pipeline: Tagged text → Verb chunk → Non-recursive NP → Non-recursive SP → Object, Subject → Prep attachment → Output
      • Resources: endogenous and exogenous subcategorization information
      • Example phrases in the figure: "he will leave" (verb chunk), "the man", "happy tree friends", "from Paris", "This is the man", "This is the man from Paris"
  • II.2 Syntex (Bourigault 2007)
      • One module per syntactic relation. Each module processes the sentence from left to right.
      • Like the Non-Projective Dependency Parser, the rules
          • establish dependency relations between words
          • are contextual
          • use syntactic links established by preceding rules
      • The identification of a dependency link is formulated as a « path » to be followed through the existing links and grammatical categories
          • from governor to dependent
          • or from dependent to governor
      • Ambiguous relations: selection of potential governors + disambiguation with probabilities
      • Example: Those who think they are interested in water supply must vote
  • II.3 Cass (Abney 1990, 1995)
      • Pipeline: Tagged text → CHUNK FILTER (NP filter, Chunk filter) → CLAUSE FILTER (Raw Clause filter, Clause Repair filter) → PARSE FILTER → Output
      • Chunk filter: non-recursive chunks; internal structure remains ambiguous
          • [NP the happy tree friends] [VP will leave] [SP from [NP the happy tree friends]]
      • Raw Clause filter: subject-predicate relation; beginning and end of simplex clauses
          • [SUBJ This] [PRED is] [NP the man] [SP from Paris]
      • Clause Repair filter: repair if no subject-predicate relation
      • Parse filter: assembles recursive structures (subcategorization information)
          • [[This] [is] [NP the man] [SP from Paris]]
  • II.3 Cass (Abney 1990, 1995)
      • Each filter uses transducers: PP → (Prep|To)+ (NP|Vbg)
      • Each filter makes a decision (determinism), the safest one in case of ambiguity
          • « ambiguity is not propagated downstream »
      • Use of repair (also used in Syntex and NPDP, but less explicitly):
          • « repair consists in directly modifying erroneous structure without regard to the history of computation that produced the structure »
          • « when errors become apparent downstream, the parser attempts to repair them »
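The transducer pattern PP → (Prep|To)+ (NP|Vbg) can be approximated with an ordinary regular expression over category symbols. The sketch below assumes that "+" means "one or more" and that the input is a flat list of category symbols. The function name `bracket_pps` and the `Vbd` symbol in the example are hypothetical, and this is not Cass's own implementation.

```python
import re

# PP -> (Prep|To)+ (NP|Vbg), expressed over space-separated category symbols.
PP_PATTERN = re.compile(r"\b(?:(?:Prep|To) )+(?:NP|Vbg)\b")


def bracket_pps(categories):
    """Replace every span matching the PP pattern with a single 'PP' symbol.

    The substitution is deterministic: each span yields exactly one output,
    so no ambiguity is passed on to the next filter.
    """
    text = " ".join(categories)
    return PP_PATTERN.sub("PP", text).split()


print(bracket_pps(["Prep", "NP", "Vbd", "To", "Vbg"]))
```

A cascade would apply several such filters in sequence, each one reading the symbols produced by the previous one.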
  • II.3 Cass (Abney 1990, 1995)
      • Example of repair: In South Australia beds of boulders were deposited …
      • Erroneous structure output from the Chunk filter:
          • [SP In [NP South Australia beds]] [SP of [NP boulders]] [VP were deposited]
      • Raw Clause filter: no subject is found
      • Repair filter tries to find a subject by modifying the structure:
          • [SP In [NP South Australia]] [NP-SUBJ beds] [SP of [NP boulders]] [VP were deposited]
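The repair in this example can be mimicked with a small Python sketch. The heuristic used here (when no NP precedes the verb chunk, detach the last word of the first SP and relabel it NP-SUBJ) is an assumption made for illustration and is not Abney's actual repair procedure. Chunks are simplified to flat `(label, words)` pairs, and the function name is hypothetical.

```python
def repair_missing_subject(chunks):
    """Return a repaired copy of `chunks` when no subject candidate precedes the VP.

    chunks: list of (label, words) pairs in sentence order.
    The structure is modified directly, without consulting how it was built.
    """
    labels = [label for label, _ in chunks]
    if "VP" not in labels:
        return chunks
    vp = labels.index("VP")
    if any(label.startswith("NP") for label in labels[:vp]):
        return chunks  # a subject candidate already exists, nothing to repair
    for i, (label, words) in enumerate(chunks[:vp]):
        # Need a preposition plus at least two more words to split something off.
        if label == "SP" and len(words) > 2:
            return (chunks[:i]
                    + [("SP", words[:-1]), ("NP-SUBJ", words[-1:])]
                    + chunks[i + 1:])
    return chunks


chunk_output = [
    ("SP", ["In", "South", "Australia", "beds"]),
    ("SP", ["of", "boulders"]),
    ("VP", ["were", "deposited"]),
]
for label, words in repair_missing_subject(chunk_output):
    print(label, " ".join(words))
```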
  • III. Common features: incrementality
      • The parsing task is divided into subtasks
          • reduces the overall complexity of the main task: « factoring the problem into a sequence of small, well defined questions » (Abney 1990)
      • The sentence is parsed in several phases, each phase producing an intermediate structure
          • allows each phase to use the syntactic information left by the preceding phase: « the level of abstraction produced during the 1st phase (...) facilitates the description of deeper syntactic relations » (Aït-Mokhtar et al. 1997)
          • ease of maintenance
          • problem of circularity: it is difficult to choose in what order the relations should be identified (Bourigault 2007)
  • III. Common features: determinism and repair
      • Each parsing phase yields one solution
          • In case of ambiguity, the safest choice is made, even if some higher-level information is needed
          • Ambiguity is not propagated downstream
          • Most regular errors can be repaired later on
      • ≠ parallelism, backtracking
      • « The salient performance is not errors vs no errors, but the tradeoff between speed and error rate » (Abney 1990)
  • III. Common features: no syntactic theory
      • Difference between:
          • the theoretical study of the syntactic structures of language
          • automatic identification of grammatical relations in real-world texts
      • Difficulties in automatic syntactic analysis:
          • lack of knowledge (semantics/pragmatics for disambiguation)
          • deviation from the norm of the language
          • errors of preceding processing steps
      • Use of common grammatical knowledge
      • Hours of corpus observation to find clues for automatic identification
  • III. Common features: implicit grammatical knowledge
      • Bipartite architecture:
          • Lexical information
          • Recognition routines
      • No independent declaration of grammatical knowledge
      • Difficult / impossible to set apart:
          • Grammatical knowledge
          • Non grammar-based heuristics
      • No linguist / computer scientist job separation
          • Need both linguistic and programming know-how
          • A condition for scalability and robustness
  • IV. Example: the subject relation in Syntex
      • The identification of the subject relation is formulated as a « path » through the already identified grammatical relations:
          • start from the tensed verb
          • move to the left
          • stop when you encounter an ungoverned Noun
      • Example: the/Det cost/Noun of/Prep technology/Noun takes/Verb time/Noun to/Prep shrink/Noun
          • "takes" is the TENSED VERB and "cost" is the SUBJECT
          • Link labels in the figure: SUJ, DET, PREP, NOMPREP, OBJ
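The three-step path above (start from the tensed verb, move left, stop at an ungoverned noun) can be sketched in a few lines of Python. The `Word` class and the `governor` indices are assumptions made for illustration. The slide names the link labels but not their exact endpoints, so the links encoded below are one plausible reading, and the sentence is cut after "time" for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    form: str
    pos: str
    governor: Optional[int] = None  # index of the governing word, None if ungoverned


def find_subject(words, verb_index):
    """Walk leftwards from the tensed verb and return the index of the first ungoverned noun."""
    for i in range(verb_index - 1, -1, -1):
        if words[i].pos == "Noun" and words[i].governor is None:
            return i
    return None  # no subject found on this path


# Assumed links: the <- cost (DET), cost -> of (PREP), of -> technology (NOMPREP),
# takes -> time (OBJ). "cost" and "takes" are ungoverned at this stage.
words = [
    Word("the", "Det", governor=1),
    Word("cost", "Noun"),
    Word("of", "Prep", governor=1),
    Word("technology", "Noun", governor=2),
    Word("takes", "Verb"),
    Word("time", "Noun", governor=4),
]
subject = find_subject(words, verb_index=4)
print(words[subject].form)  # cost
```

"technology" is skipped because it is already governed by "of", which is why links left by earlier modules matter to this one.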
  • IV. Using existing links
      • The subject might be far from the tensed verb
      • Lots of configurations are possible:
          • Initiatives leading to cessation of smoking in workplaces are adopted (Gerund, PP, PP)
          • Those who think they are interested in water supply must vote. (Clause, Clause, PP)
          • No reference to the war, or to the alliance, should remain (PP, Conj, PP)
      • Existing links form dependency islands (~ syntagms or isolated words)
      • Following up the islands until a reasonable subject is found makes it possible to find subjects without describing all possible configurations or doing too much computing
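A rough Python sketch of the island idea: words connected by already established links are grouped under their topmost governor, and the search for a subject hops from island to island instead of enumerating configurations. The governor indices below are assumed for illustration (punctuation is dropped, and PP attachment is assumed not to have happened yet). The function names are hypothetical and this is not Syntex's implementation.

```python
def island_root(governors, i):
    """Follow governor links upwards until an ungoverned word is reached."""
    while governors[i] is not None:
        i = governors[i]
    return i


def islands(governors):
    """Group word indices that share the same topmost governor, in sentence order."""
    groups = {}
    for i in range(len(governors)):
        groups.setdefault(island_root(governors, i), []).append(i)
    return sorted(groups.values())


def find_subject_by_islands(forms, pos, governors, verb_index):
    """Hop leftwards island by island; stop at the first island headed by a noun."""
    for island in reversed(islands(governors)):
        root = island_root(governors, island[0])
        if root < verb_index and pos[root] == "Noun":
            return forms[root]
    return None


forms = ["No", "reference", "to", "the", "war", "or", "to", "the", "alliance", "should", "remain"]
pos = ["Det", "Noun", "Prep", "Det", "Noun", "Conj", "Prep", "Det", "Noun", "Verb", "Verb"]
# Assumed links: No <- reference, to -> war -> the (twice), should -> remain.
governors = [1, None, None, 4, 2, None, None, 8, 6, None, 9]

print(islands(governors))
print(find_subject_by_islands(forms, pos, governors, verb_index=9))
```

With these links the two PP islands and the conjunction are stepped over as units, and the walk stops on the island headed by "reference".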
  • IV. Ambiguities
      • Many persons have died in Darfur since the conflict began
      • A person sitting on the death row since the age of 16 is not the same as before.
      • Many adults believe education equates intelligence.
      • Those who think they are interested in water supply must vote.
      • When to stop? When to follow up? When to repair?
  • IV. Path decomposition
      • At each island, a decision is made by a dedicated sub-module (one type of island = one sub-module):
          • follow up to the island on the left
          • stop and identify a subject (without repair, or with repair)
          • change path direction (to the right, or to any other position in the sentence)
          • call another module
          • stop and return failure
      • Decisions are encoded as if-then rules that may test:
          • local and non-local context: lemmas, ms tags, links, presence of commas…
          • specific information left by other modules: encountered tags, activated modules…
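The "one type of island = one sub-module" design can be sketched as a dispatch table of small decision functions. Only three of the decisions listed above are modelled (follow left, stop with a subject, stop with failure). Repair, direction changes and calls to other modules are left out. The island dictionaries, the module names and the `Decision` enum are hypothetical, and the island typing of the example sentence is an assumption.

```python
from enum import Enum, auto


class Decision(Enum):
    FOLLOW_LEFT = auto()
    STOP_SUBJECT = auto()
    STOP_FAILURE = auto()


def pp_module(island):
    return Decision.FOLLOW_LEFT  # a PP is stepped over


def gerund_module(island):
    return Decision.FOLLOW_LEFT


def np_module(island):
    # if-then rule: an ungoverned NP is accepted as subject, a governed one is skipped
    return Decision.FOLLOW_LEFT if island["governed"] else Decision.STOP_SUBJECT


def wall_module(island):
    return Decision.STOP_FAILURE  # sentence boundary reached


SUBMODULES = {"PP": pp_module, "GERUND": gerund_module, "NP": np_module, "WALL": wall_module}


def walk(islands_left_of_verb):
    """Islands are given from the one nearest to the tensed verb to the farthest."""
    for island in islands_left_of_verb:
        decision = SUBMODULES[island["type"]](island)
        if decision is Decision.STOP_SUBJECT:
            return island["head"]
        if decision is Decision.STOP_FAILURE:
            return None
    return None


# "Initiatives leading to cessation of smoking in workplaces are adopted"
path = [
    {"type": "PP", "head": "in"},
    {"type": "PP", "head": "of"},
    {"type": "GERUND", "head": "leading"},
    {"type": "NP", "head": "Initiatives", "governed": False},
    {"type": "WALL", "head": None},
]
print(walk(path))
```

Keeping each decision in its own function means a new island type only needs a new entry in the table.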
  • IV. Path example: following up
      • Korea who we believe to have WMD is safe from us.
      • Labels in the figure: SUBJ; Clause, PP; PP module, Clause module
      • _ RelPron [[SUJ Pron] Verb]
  • IV. Path example: repair
      • Many adults believe education equates intelligence.
      • Labels in the figure: OBJ, SUBJ; Clause; Clause module
      • ## [[SUBJ NP] Verb [[OBJ [SUBJ NP] Verb ] Verb OBJ NP]]
  • IV. Path example: sub-module call
      • On the walls were scarlet banners
      • Labels in the figure: SUBJ; PP; PP module, Wall module
      • ## [PP] Verb NP
      • InvertedSubject module: _
  • IV. Path example: change path
      • On the contrary, war hysteria was continuous and deliberate, and acts such as looting, murdering, the slaughters of prisoners, were considered as normal.
      • Labels in the figure: PP module, Clause module, Conj, Adj, Noun, PP module, Commas module
      • +2.6 recall, -0.07 precision
      • All three political Parties at the federal level, and certainly at the provincial level in different sections, have parity clauses.
      • Although no directive was ever issued, it was known that the chief of the Department intended that within one week no reference to the war with Eurasia, or to the alliance, should remain
  • IV. Evaluation on Susanne Corpus

                     Tensed verb identification   Subject identification     Subject relation
                     (TreeTagger)                 (if tensed verb correct)   (correct tensed verb and correct subject)
        precision    94.87                        94.56                      89.51
        recall       89.76                        90.84                      81.53
        F-measure    92.24                        92.66                      85.33

      • Evaluation of shallow subjects only. The following are not identified or evaluated:
          • I've never seen the dog hiding his bones.
          • She wants me to clean my shoes
          • The book is read by the boy
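The F-measure rows can be checked against the precision and recall rows. The slide does not say which F-measure was used. The sketch below assumes the balanced one, F = 2PR / (P + R), and all three reported values are consistent with it after rounding to two decimals. The dictionary keys are shorthand labels chosen for this sketch.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) as given on the slide
reported = {
    "tensed verb identification": (94.87, 89.76, 92.24),
    "subject identification": (94.56, 90.84, 92.66),
    "subject relation": (89.51, 81.53, 85.33),
}

for name, (p, r, f) in reported.items():
    print(f"{name}: computed {f_measure(p, r):.2f}, reported {f:.2f}")
```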
  • Bibliography
      • Abney (1990): « Rapid Incremental Parsing with Repair », Proceedings of the 6th New OED Conference, University of Waterloo, Waterloo, Ontario.
      • Abney (1995): « Partial Parsing with finite state cascade », Natural Language Engineering, Cambridge University Press.
      • Aït-Mokhtar et al. (1997): « Incremental Finite State Parsing », Proceedings of ANLP-97, Washington.
      • Bourigault (2007): Syntex, analyseur syntaxique opérationnel, Thèse d'Habilitation à Diriger les Recherches, Université Toulouse - Le Mirail.
      • Charniak (2000): « A maximum-entropy-inspired parser », Proceedings of the North American Chapter of the Association for Computational Linguistics, pp. 132–139.
      • Lin (1995): « Dependency-based Evaluation of Minipar », Proceedings of JCAI.
      • Tapanainen & Järvinen (1997): « A Dependency Parser for English », Technical Reports, No. TR-1, Department of General Linguistics, University of Helsinki, March 1997.
      • TreeTagger:
      • Evaluation Corpus: