  • A few algorithms for discourse parsing have been proposed, based on cue phrases, cohesive devices, occurrences of identical or synonymous words/phrases, and certain similarities between two sentences.
  • The abstracts were segmented into a series of sentences using a program. Each sentence was converted into a vector of term weights using binary weighting: if a word occurs in the sentence, the value is "1"; otherwise, the value is "0". The dataset is formatted as a table with sentences as rows and words as columns. Stop words are excluded, each word is reduced to its base form, and only words occurring in at least 5 sentences are kept; low-frequency words are filtered out.
    1. Automatic Discourse Parsing of Dissertation Abstracts as Sentence Categorization
       Shiyan Ou, Chris Khoo, Dion Goh, Hui-Ying Heng
       Division of Information Studies, School of Communication & Information, Nanyang Technological University, Singapore
    2. Objective
      • To develop an automatic method to parse the discourse structure of sociology dissertation abstracts
      • To segment a dissertation abstract into five sections (macro-level structure):
          1. background
          2. problem statement/objectives
          3. research method
          4. research results
          5. concluding remarks
    3. Approach
      • Discourse parsing is treated as a text categorization problem: assign each sentence to 1 of the 5 sections or categories
      • Machine learning method
          • Decision tree induction using C5.0 within the Clementine data mining system
          • Rule induction
      • Part of a broader study to develop a method for multi-document summarization of dissertation abstracts
    4. Previous Studies
      • 2 approaches:
          • Hand-crafted algorithms using lexical and syntactic clues, e.g. Kurohashi & Nagao (1994)
          • Models developed using supervised machine learning, e.g. Nomoto & Matsumoto (1998), Marcu (1999), Le & Abeysinghe (2003)
    5. Previous Studies (cont.)
      • Features of studies:
          • Different kinds of text and domain, e.g. news articles, scientific articles
          • Different discourse models: micro-level vs. macro-level structure; theoretical perspectives, e.g. rhetorical relations
    6. Data Preparation
      • 300 sociology dissertation abstracts from 2001
          • Training set: 200 abstracts for constructing the classifier
          • Test set: 100 abstracts for evaluating the accuracy of the classifier
      • Abstracts were segmented into sentences with a simple program
      • Sentences were manually categorized into 1 of the 5 sections. Some problems:
          • Some abstracts do not follow the regular structure; these were deleted from the training and test sets (29 from the training set and 16 from the test set)
          • Some sentences can be assigned to multiple categories
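The segmentation step above can be sketched in a few lines. The slides do not describe the actual segmentation program, so this is a hypothetical stand-in that splits on terminal punctuation followed by a capitalized word:

```python
import re

def segment_sentences(abstract: str) -> list[str]:
    """Naively split an abstract into sentences.

    A rough sketch of the "simple program" mentioned in the slides
    (the original program is not described, so this is illustrative):
    split after '.', '!' or '?' when followed by whitespace and an
    uppercase letter.
    """
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', abstract.strip())
    return [p for p in parts if p]

sents = segment_sentences(
    "This study examines racial attitudes. Data were collected via surveys. "
    "Results reveal significant differences."
)
```

A real segmenter would also need to handle abbreviations and decimal points, which this sketch ignores.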
    7. Data Preparation (cont.)
      • Sentences were tokenized and words were stemmed using the Conexor parser
      • Each sentence was converted into a vector of binary term weights, using 1 and 0 to indicate whether a word occurs in the sentence
      • The dataset was formatted as a table with sentences as rows and words as columns
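The binary term-weight table described above can be built like this (an illustrative sketch; stemming with the Conexor parser, stop-word removal, and the frequency threshold are omitted here):

```python
def binary_term_table(sentences):
    """Build the sentence-by-word table of binary term weights:
    rows are sentences, columns are words, and a cell is 1 if the
    word occurs in the sentence, else 0."""
    tokenized = [s.lower().replace('.', '').split() for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    rows = [[1 if w in toks else 0 for w in vocab] for toks in tokenized]
    return vocab, rows

vocab, rows = binary_term_table([
    "This study examines racial attitudes.",
    "The data reveal racial differences.",
])
# Each cell is 1 iff the column's word occurs in the row's sentence.
```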
    8. Data Preparation (cont.)

       doc_sen  american  control  data  design  factor  implication  study  racial  …
       2.01     1         1        0     0       0       0            0      0       …
       2.02     1         0        1     1       0       0            0      0       …
       2.03     1         0        0     0       0       1            0      1       …
       2.04     0         0        0     0       1       0            1      0       …
       2.05     1         0        0     0       0       0            1      1       …
    9. Machine Learning
      • Used C5.0, a decision tree induction program within the Clementine data mining system
      • Used 10-fold cross-validation to estimate accuracy when developing the model
      • Minimum records per branch set at 5 to avoid overfitting
      • Different amounts of pruning were tried
      • Boosting not employed
      • Evaluation using the test set of 84 abstracts
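The training setup above can be sketched with scikit-learn's CART implementation as a stand-in for C5.0 (which is proprietary and not available in Python); the data here is random toy data, not the real abstracts:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# X: binary term-weight matrix (sentences x words); y: section labels 1-5.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))            # toy stand-in data
y = rng.integers(1, 6, size=200)

clf = DecisionTreeClassifier(min_samples_leaf=5)  # mirrors "min records per branch = 5"
scores = cross_val_score(clf, X, y, cv=10)        # 10-fold cross-validation
print(f"estimated accuracy: {scores.mean():.3f}")
```

C5.0's pruning severity has no exact sklearn equivalent; `ccp_alpha` (cost-complexity pruning) plays a comparable role.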
    10. Categorization Models
      • Three models were developed:
          • Model 1: indicative words present in the sentence
          • Model 2: indicative words + position of sentence
          • Model 3: indicative words + position of sentence + indicative words in neighboring sentences (before and after the sentence being categorized)
    11. Model 1: Estimated accuracy using 10-fold cross-validation

       Word frequency  Number of          Pruning severity
       threshold       words input     90%     95%     99%
       >125            30              50.7    50.7    50.7
       >100            44              51.1    50.8    50.1
       >75             75              51.6    51.0    50.7
       >50             153             56.5    56.4    55.5
       >35             242             57.5    57.9    56.2
       >20             454             56.4    55.6    56.3
       >10             876             54.4    54.4    53.7
       >5              1463            53.7    53.9    53.9
    12. Accuracy of Model 1 on the test abstracts
      • Applying the best model (word frequency threshold = 35, pruning = 95%) to the test sample:
          • For 100 abstracts (including 16 unstructured abstracts): accuracy = 50.04%
          • For 84 abstracts (excluding the 16 unstructured abstracts): accuracy = 60.8%
      • Preprocessing to filter out unstructured abstracts can improve categorization accuracy substantially
      • Only high-frequency words are useful for categorizing the sentences
      • Only a small number of indicator words (20) were selected by the decision tree program
    13. Model 2: Estimated accuracy using 10-fold cross-validation
        (word frequency threshold >35; 242 words input)

       Sentence position as an          Pruning severity
       additional attribute          80%     85%     90%     95%     99%
       No (Model 1)                  57.0    57.9    57.5    57.9    56.2
       Yes (Model 2)                 66.5    66.4    65.1    66.6    65.1

      • Result using the 84 test abstracts: accuracy = 71.6% (compared to 60.8% for Model 1)
      • Sentence position is useful in sentence categorization.
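Model 2 adds sentence position as one extra attribute alongside the term weights. The slides do not define the normalization exactly, but the rule threshold `sentence_position <= 0.44` suggests a value in (0, 1]; one plausible reading is rank divided by the number of sentences in the abstract:

```python
def add_sentence_position(rows):
    """Append the normalized sentence position (rank / number of
    sentences in the abstract) as an extra attribute, as Model 2 does.

    `rows` is one abstract's list of binary term vectors. The exact
    normalization is an assumption; this is one plausible reading.
    """
    n = len(rows)
    return [vec + [(i + 1) / n] for i, vec in enumerate(rows)]

rows = [[1, 0], [0, 1], [1, 1], [0, 0]]
augmented = add_sentence_position(rows)
```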
    14. Model 2: 2 rules for Section 1 (Background)
      • Rule 1 (confidence = 0.61)
          • sentence position <= 0.44
          • sentence does NOT contain the words: STUDY, EXAMINE, DATA, ANALYSIS, DISSERTA, PARTICIP, INVESTIG, SHOW, SCALE, SECOND, INTERVIE, STATUS, CONDUCT, REVEAL, AGE, FORM, PERCEPTI
      • Rule 2 (confidence = 0.36)
          • sentence position <= 0.44
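Rule 1 above translates directly into a predicate over a sentence's position and its stemmed words:

```python
# Indicator stems from Rule 1 for Section 1 (Background), as listed on the slide.
RULE1_EXCLUDED_STEMS = {
    "STUDY", "EXAMINE", "DATA", "ANALYSIS", "DISSERTA", "PARTICIP",
    "INVESTIG", "SHOW", "SCALE", "SECOND", "INTERVIE", "STATUS",
    "CONDUCT", "REVEAL", "AGE", "FORM", "PERCEPTI",
}

def rule1_background(position, stems):
    """Rule 1 for Section 1, expressed as a predicate: an early sentence
    (position <= 0.44) containing none of the listed indicator stems is
    classified as background (with confidence 0.61 in the ruleset)."""
    return position <= 0.44 and not (set(stems) & RULE1_EXCLUDED_STEMS)
```

Applying the rules in order of confidence and falling back to the default category (Section 4) would reproduce how a C5.0 ruleset classifies.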
    15. Model 2: 9 rules for Section 2 (Problem statement)
      • Rule 1 (confidence = 1.0)
          • sentence position <= 0.44
          • sentence contains: PERCEPTI
          • sentence does NOT contain: INTERVIE, COMPLETE
      • Rule 2 (confidence = 0.88)
          • sentence contains: DISSERTA
          • sentence does NOT contain: METHOD
      • Rule 3 (confidence = 0.83)
          • sentence contains: EXAMINE
          • sentence position <= 0.44
      • Words used in other rules: INVESTIGATE, EXAMINE, EXPLORE, STUDY, STATUS, SECOND
    16. Model 2: Words used in rules
      • Section 3 (Research method): COMPLETE, CONDUCT, FORM, METHOD, INTERVIE, SCALE, PARTICIP, TEST, ASSESS, DATA, ANALYSIS
      • Section 4 (Research results): REVEAL, SHOW (default category)
      • Section 5 (Concluding remarks): IMPLICAT, FUTURE
    17. Model 3
      • To investigate whether indicator words in neighboring sentences can improve categorization
      • Model = indicator words + position of sentence + indicator words in neighboring sentences (before and after the sentence being categorized)
      • Procedure:
          1. Extract indicator words from Model 1 and Model 2 (about 30)
          2. For each sentence, identify the nearest sentence (before and after) containing each indicator word
          3. For each nearest sentence containing an indicator word, calculate the distance (difference in sentence position)
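Steps 2 and 3 of the procedure can be sketched as follows. Here distance is the difference in sentence index, and encoding "no occurrence" as 0 is an assumption (the slides do not say how absent indicators are represented):

```python
def indicator_distances(sentences_words, indicators):
    """For each sentence, compute the distance to the nearest earlier and
    nearest later sentence containing each indicator word, following the
    Model 3 procedure. `sentences_words` is a list of per-sentence word
    sets; 0 means the indicator does not occur on that side."""
    features = []
    for i in range(len(sentences_words)):
        row = {}
        for word in indicators:
            before = [i - j for j in range(i) if word in sentences_words[j]]
            after = [j - i for j in range(i + 1, len(sentences_words))
                     if word in sentences_words[j]]
            row[f"{word}_before"] = min(before) if before else 0
            row[f"{word}_after"] = min(after) if after else 0
        features.append(row)
    return features

# Toy abstract: one word set per sentence.
docs = [{"study"}, {"data"}, {"reveal"}, {"implicat"}]
feats = indicator_distances(docs, ["study", "reveal"])
```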
    18. Model 3: Example of indicator words occurring before and after Sentence 13 in Doc 4
      • 3 models investigated:
          • Indicator words before the sentence being categorized
          • Indicator words after the sentence being categorized
          • Indicator words before and after the sentence being categorized
    19. Accuracy of Model 2 & Model 3 based on 84 test abstracts

       Section  No. of     Model 2     Model 3 correctly classified
                sentences  correctly   All indicator  Indicator     Indicator
                           classified  words          words before  words after
       1        173        71.10%      80.92%         79.77%        67.63%
       2        183        55.74%      48.63%         52.46%        49.18%
       3        189        49.74%      52.38%         52.38%        39.15%
       4        468        87.61%      91.03%         91.03%        89.31%
       5        29         58.62%      58.62%         58.62%        55.17%
       Total    1042       71.59%      73.99%         74.47%        68.62%
    20. Examples of Model 3 rules
      • Rule for Section 1 (confidence = 0.64)
          • sentence position <= 0.44
          • sentence does NOT contain: EXAMINE, INTERVIE, EXPLORE, DISSERTA, DATA, MOTHER, COMPARE
          • STUDY and PARTICIPANT do not appear in an earlier sentence
      • Rule for Section 3 (confidence = 0.818)
          • sentence position <= 0.5
          • sentence does NOT contain: STYLE
          • PARTICIPANT appears in an earlier sentence
    21. Model 4: Use of sequential association rules
      • Use of sequential pattern mining to identify sequential associations of the form:
          • word1 > word2 => section 4
          • word1 > word2 > section 3 => section 4
          • Various window sizes can be specified
      • Initial results are not promising, probably because of the small training sample
      • 1 possibility is to use transition probabilities:
          • section 3 => section 1
          • section 3 => section 2
          • section 3 => section 3
          • section 3 => section 4
          • Applied as a second pass to refine the predictions of the decision tree
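The transition probabilities proposed above can be estimated from the manually labeled training abstracts. This sketch computes P(next section | current section); how those probabilities would then refine the decision tree's predictions is not specified in the slides:

```python
from collections import Counter

def transition_probabilities(labeled_abstracts):
    """Estimate P(next section | current section) from training labels.
    `labeled_abstracts` is a list of abstracts, each a list of per-sentence
    section labels (1-5) in document order."""
    counts = Counter()   # (current, next) pair counts
    totals = Counter()   # counts of each section as a "current" state
    for sections in labeled_abstracts:
        for cur, nxt in zip(sections, sections[1:]):
            counts[(cur, nxt)] += 1
            totals[cur] += 1
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}

# Toy training labels: each inner list is one abstract's section sequence.
probs = transition_probabilities([[1, 2, 3, 3, 4, 5], [1, 2, 3, 4, 4, 5]])
```

A second pass could, for instance, down-weight predicted sections whose transition from the previous predicted section has near-zero probability.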
    22. Conclusion
      • Decision tree induction was used to develop a ruleset to parse the macro-level discourse structure of sociology dissertation abstracts
      • Discourse parsing was treated as a sentence categorization task
      • 3 models constructed:
          • Model 1: indicator words present in the sentence (60.8% accuracy)
          • Model 2: indicator words + sentence position (71.6% accuracy)
          • Model 3: indicator words + sentence position + indicator words in sentences before the sentence being categorized (74.5% accuracy)
    23. Future Work
      • More in-depth error analysis to determine whether some kind of inferencing can improve accuracy
      • Obtain 2 more manual codings so that inter-indexer consistency can be calculated
      • "Generalize" the models by replacing word tokens with synonym sets
      • Investigate use of SVM (support vector machines) as an alternative to decision tree induction
      • Develop a method to identify and process "unstructured" abstracts
      • Extend the work to journal article abstracts
