Pattern Mining to Chinese Unknown Word Extraction


    1. Pattern Mining to Chinese Unknown Word Extraction, 楊傑程 (third-year CS master's student, student ID 955202037), 2008/10/14
    2. Outline <ul><li>Introduction </li></ul><ul><li>Related Works </li></ul><ul><li>Unknown Word Detection </li></ul><ul><li>Unknown Word Extraction </li></ul><ul><li>Experiments </li></ul><ul><li>Conclusions </li></ul>
    3. Introduction <ul><li>With the growing popularity of Chinese, Chinese text processing has drawn a great deal of interest in recent years. </li></ul><ul><li>Before the knowledge in Chinese texts can be exploited, some preprocessing, such as Chinese word segmentation, must be done. </li></ul><ul><ul><li>There are no blanks to mark word boundaries in Chinese text. </li></ul></ul>
    4. Introduction <ul><li>Chinese word segmentation encounters two major problems: ambiguity and unknown words. </li></ul><ul><li>Ambiguity </li></ul><ul><ul><li>One un-segmented Chinese character string can have different segmentations depending on the context. </li></ul></ul><ul><li>Unknown Words </li></ul><ul><ul><li>Also known as out-of-vocabulary (OOV) words; mostly unfamiliar proper nouns or newly coined words. </li></ul></ul><ul><ul><ul><li>Ex: the sentence “ 王義氣熱衷於研究生命” would be segmented into </li></ul></ul></ul><ul><ul><ul><ul><li>“ 王 義氣 熱衷 於 研究 生命” </li></ul></ul></ul></ul><ul><ul><ul><ul><li>because “ 王義氣” is an uncommon personal name that is not in the vocabulary. </li></ul></ul></ul></ul>
    5. Introduction- types of unknown words <ul><li>In this paper, we focus on the Chinese unknown word problem. </li></ul>Types of Chinese unknown words: proper names, including personal names (Ex: 王小明 ), organization names (Ex: 華碩電腦 ), and abbreviations (Ex: 中油、中大 ); derived words (Ex: 總經理、電腦化 ); compounds (Ex: 電腦桌、搜尋法 ); and numeric type compounds (Ex: 1986 年、 19 巷 ).
    6. Introduction- unknown word identification <ul><li>Chinese word segmentation process: </li></ul><ul><li>Initial segmentation (dictionary-assisted) </li></ul><ul><ul><li>Correctly identified words are called known words. </li></ul></ul><ul><ul><li>Unknown words are wrongly segmented into two or more parts. </li></ul></ul><ul><ul><ul><li>Ex: the personal name 王小明 , after initial segmentation, </li></ul></ul></ul><ul><ul><ul><li>becomes 王 小 明 </li></ul></ul></ul><ul><li>Unknown word identification </li></ul><ul><ul><li>Characters belonging to one unknown word should be re-combined. </li></ul></ul><ul><ul><ul><li>Ex: re-combine 王 小 明 into 王小明 </li></ul></ul></ul>
    7. Introduction- unknown word identification <ul><li>How does unknown word identification work? </li></ul><ul><ul><li>A character can be a word ( 馬 ) or part of an unknown word ( 馬 + 英 + 九 ). </li></ul></ul><ul><li>Unknown Word Detection </li></ul><ul><ul><li>Find detection rules that distinguish monosyllabic words from monosyllabic morphemes. </li></ul></ul><ul><li>Unknown Word Extraction </li></ul><ul><ul><li>Focus on the detected morphemes and combine them into words. </li></ul></ul>
    8. Introduction- applied techniques <ul><li>In this paper, we apply continuity pattern mining to discover unknown word detection rules. </li></ul><ul><li>Then, we apply machine learning based methods, classification algorithms and (indirect) sequential learning, to extract unknown words. </li></ul><ul><ul><li>These utilize syntactic information, context information, and heuristic statistical information. </li></ul></ul><ul><li>Our unknown word identification method is a general method </li></ul><ul><ul><li>not limited to specific types of unknown words </li></ul></ul>
    9. Related Works- particular methods <ul><li>Research on Chinese word segmentation has now lasted for a decade. </li></ul><ul><li>At first, researchers applied different kinds of information to discover particular kinds of unknown words. </li></ul><ul><ul><li>Proper nouns (Chinese personal names, transliterated names, organization names) </li></ul></ul><ul><ul><ul><li>[Chen & Li, 1996], [Chen & Chen, 2000] </li></ul></ul></ul><ul><ul><ul><ul><li>Patterns, frequency, context information </li></ul></ul></ul></ul>
    10. Related Works- general methods (Rule-based) <ul><li>Later, researchers started to develop methods that extract all kinds of unknown words. </li></ul><ul><li>Rule-based detection and extraction: </li></ul><ul><ul><li>[Chen et al., 1998] </li></ul></ul><ul><ul><ul><li>Distinguish monosyllabic words from monosyllabic morphemes </li></ul></ul></ul><ul><ul><li>[Chen et al., 2002] </li></ul></ul><ul><ul><ul><li>Combine morphological rules with statistical rules to extract personal names, transliterated names, and compound nouns. (Precision: 89%, Recall: 68%) </li></ul></ul></ul><ul><ul><li>[Ma et al., 2003] </li></ul></ul><ul><ul><ul><li>Utilize the context-free grammar concept and propose a bottom-up merging algorithm </li></ul></ul></ul><ul><ul><ul><li>Adopt morphological rules and general rules to extract all kinds of unknown words. (Precision: 76%, Recall: 57%) </li></ul></ul></ul>
    11. Related Works- general methods (Machine Learning-based) <ul><li>Sequential learning: </li></ul><ul><ul><li>[T. G. Dietterich, 2002] </li></ul></ul><ul><ul><ul><li>Transform the sequential learning problem into a classification problem </li></ul></ul></ul><ul><ul><li>Direct methods, such as HMM and CRF </li></ul></ul><ul><ul><ul><li>[Goh et al., 2006] </li></ul></ul></ul><ul><ul><ul><ul><li>HMM+SVM (Precision: 63.8%, Recall: 58.3%) </li></ul></ul></ul></ul><ul><ul><ul><li>[Tsai et al., 2006] </li></ul></ul></ul><ul><ul><ul><ul><li>CRF (Recall: 73%) </li></ul></ul></ul></ul><ul><ul><li>Indirect methods, such as Sliding Window and Recurrent Sliding Windows </li></ul></ul>
    12. Related Works – Imbalanced Data <ul><li>Imbalanced data problem </li></ul><ul><ul><li>Ensemble methods </li></ul></ul><ul><ul><ul><li>[C. Li, 2007] </li></ul></ul></ul><ul><ul><ul><ul><li>Combine the learning ability of multiple base classifiers by voting. </li></ul></ul></ul></ul><ul><ul><li>Cost-sensitive learning and sampling </li></ul></ul><ul><ul><ul><li>[G. M. Weiss et al., 2007] </li></ul></ul></ul><ul><ul><ul><ul><li>Focus more on minority-class examples. </li></ul></ul></ul></ul><ul><ul><ul><li>[C. Drummond et al., 2003] </li></ul></ul></ul><ul><ul><ul><ul><li>Under-sampling outperforms over-sampling. </li></ul></ul></ul></ul><ul><ul><ul><li>[Seyda et al., 2007] </li></ul></ul></ul><ul><ul><ul><ul><li>Select the most informative instances. </li></ul></ul></ul></ul>
    13. Unknown Word Detection & Extraction <ul><li>Our idea is similar to [Chen et al., 2002]: </li></ul><ul><ul><li>Unknown word detection </li></ul></ul><ul><ul><ul><li>Continuity pattern mining to derive detection rules. </li></ul></ul></ul><ul><ul><li>Unknown word extraction </li></ul></ul><ul><ul><ul><li>Machine learning based: classification algorithms and (indirect) sequential learning. </li></ul></ul></ul><ul><li>We refer to: </li></ul><ul><ul><li>unknown word detection as “Phase 1” </li></ul></ul><ul><ul><li>unknown word extraction as “Phase 2”. </li></ul></ul>
    14. Unknown Word Detection & Extraction [System flow diagram] Phase 1 (unknown word detection): the 8/10 training corpus goes through initial segmentation and POS tagging, the mining tool (Prowl) derives detection rules, and the rules are judged on a 1/10 validation corpus. Phase 2 (unknown word extraction): the 8/10 corpus with detection tags trains a classification model; the 1/10 test corpus, after initial segmentation, POS tagging, and detection tagging, is classified, and the classification decision is judged against a 1/10 validation corpus.
    15. Unknown Word Detection <ul><li>Mining detection rules: </li></ul><ul><ul><li>Learn from the 8/10 training corpus </li></ul></ul><ul><ul><li>Continuity pattern mining </li></ul></ul><ul><ul><li>Focus on monosyllables. </li></ul></ul>
    16. Unknown word detection- Pattern Mining <ul><li>Pattern mining: </li></ul><ul><ul><li>Sequential patterns: </li></ul></ul><ul><ul><ul><li>“ 因為… , 所以…” </li></ul></ul></ul><ul><ul><ul><li>Required items must appear in the pattern order </li></ul></ul></ul><ul><ul><ul><li>Noise is allowed between the required items. </li></ul></ul></ul><ul><ul><li>Continuity patterns: </li></ul></ul><ul><ul><ul><li>“ 打 * 球” => “ 打棒球” : match, “ 打躲避球” : no match </li></ul></ul></ul><ul><ul><ul><li>Strict constraints on every item and on item order. </li></ul></ul></ul><ul><ul><ul><li>Efficient pattern mining </li></ul></ul></ul>
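The matching semantics above can be sketched in a few lines. This is a minimal illustration, not the presentation's actual matcher: a continuity pattern is a contiguous template in which `*` stands for exactly one item, which is why “ 打 * 球” matches “ 打棒球” but not “ 打躲避球”.

```python
def matches_continuity(pattern, seq):
    """Check whether a continuity pattern matches somewhere in `seq`.

    `pattern` is a list of items where "*" is a single-item wildcard;
    all other items must appear contiguously, in the given order.
    """
    n = len(pattern)
    for start in range(len(seq) - n + 1):
        window = seq[start:start + n]
        if all(p == "*" or p == s for p, s in zip(pattern, window)):
            return True
    return False

# The slide's example: the wildcard covers exactly one character.
print(matches_continuity(list("打*球"), list("打棒球")))    # True
print(matches_continuity(list("打*球"), list("打躲避球")))  # False
```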
    17. Unknown word detection- Continuity Pattern Mining <ul><li>Prowl </li></ul><ul><ul><li>[Huang et al., 2004] </li></ul></ul><ul><ul><ul><li>Starts with 1-frequent patterns </li></ul></ul></ul><ul><ul><ul><li>Extends to length-2 patterns from two adjacent 1-frequent patterns, then evaluates their frequency. </li></ul></ul></ul><ul><ul><ul><li>Iteratively extends to longer patterns. </li></ul></ul></ul>
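A level-wise miner in the spirit of Prowl can be sketched as follows; this is a simplified stand-in (it extends a frequent pattern by whatever item follows it in the data), not the tool's actual algorithm.

```python
from collections import Counter

def mine_continuity_patterns(sequences, min_support):
    """Level-wise continuity pattern mining: start from frequent
    length-1 patterns and repeatedly extend each frequent pattern by
    one adjacent item, keeping extensions that stay frequent."""
    counts = Counter(item for seq in sequences for item in seq)
    frequent = {(item,): c for item, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 1
    while frequent:
        ext = Counter()
        for seq in sequences:
            for i in range(len(seq) - k):
                if tuple(seq[i:i + k]) in frequent:
                    ext[tuple(seq[i:i + k + 1])] += 1
        frequent = {p: c for p, c in ext.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

pats = mine_continuity_patterns([list("葡萄汁"), list("葡萄皮")], 2)
# ("葡",), ("萄",), and ("葡", "萄") each occur twice; nothing longer survives.
```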
    18. Encoding <ul><li>Initial segmentation labels each word, by lexicon matching, as known (Y) or unknown (N) </li></ul><ul><ul><li>“ 葡萄” is in the lexicon => “ 葡萄” is labeled as a known word (Y) </li></ul></ul><ul><ul><li>“ 葡萄皮” is not in the lexicon => “ 葡萄皮” is labeled as an unknown word (N) </li></ul></ul><ul><li>Encoding examples: </li></ul><ul><ul><li>葡萄 (Na) → 葡 (Na) Y + 萄 (Na) Y </li></ul></ul><ul><ul><li>葡萄皮 (Na) → 葡 (Na) N + 萄 (Na) N + 皮 (Na) N </li></ul></ul>
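The encoding step above can be sketched directly from the two examples; the function name and data layout are illustrative, not from the presentation.

```python
def encode(segmented, lexicon):
    """Split each initially segmented word into characters and label
    every character Y (part of a lexicon word) or N (part of an
    out-of-lexicon word), mirroring the slide's encoding examples."""
    out = []
    for word, pos in segmented:
        label = "Y" if word in lexicon else "N"
        for ch in word:
            out.append((ch, pos, label))
    return out

lexicon = {"葡萄"}
print(encode([("葡萄", "Na")], lexicon))    # [('葡', 'Na', 'Y'), ('萄', 'Na', 'Y')]
print(encode([("葡萄皮", "Na")], lexicon))  # all N: 葡萄皮 is not in the lexicon
```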
    19. Create Detection Rules <ul><li>Rule pattern: </li></ul><ul><ul><li>(character, pos, label) </li></ul></ul><ul><ul><li>Max length = 3. </li></ul></ul><ul><ul><li>The character within “{ }” is the primary character of the rule. </li></ul></ul><ul><ul><li>Ex: ( { 葡 }, 萄 ): “ 葡” is a known word when “ 葡萄” appears. </li></ul></ul><ul><li>Rule accuracy: </li></ul><ul><ul><li>Ex: ( { 葡 (Na)}, 萄 (Na) ) := P( 葡 (Na) is a known word | 葡 (Na) followed by 萄 (Na) ) </li></ul></ul>Pattern counts from the encoding examples: ( 葡 (Na), 萄 (Na) ) : 2; ( 葡 (Na) Y, 萄 (Na) ) : 1; ( 葡 (Na) N, 萄 (Na) N ) : 1; ( 葡 (Na) Y, 萄 (Na) Y ) : 1; ( 葡 , 萄 ) : 2; ( 葡 (Na), 萄 ) : 2; ( 葡 , 萄 (Na) ) : 2
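Rule accuracy as defined above can be computed from the encoded character stream; this is a minimal sketch for length-2 rules only (the function and its arguments are illustrative).

```python
def rule_accuracy(stream, primary, follower):
    """Accuracy of the rule ({primary}, follower): among occurrences
    where `primary` is immediately followed by `follower`, the fraction
    in which the primary character is labeled Y (known word).
    `stream` is a list of (char, pos, label) triples from encoding."""
    total = hits = 0
    for (c1, p1, l1), (c2, p2, _) in zip(stream, stream[1:]):
        if (c1, p1) == primary and (c2, p2) == follower:
            total += 1
            hits += (l1 == "Y")
    return hits / total if total else 0.0

# Stream built from the slide's two encoding examples: ( 葡 (Na), 萄 (Na) )
# appears twice, once with 葡 labeled Y, so the rule accuracy is 1/2.
stream = [("葡", "Na", "Y"), ("萄", "Na", "Y"),   # from 葡萄
          ("葡", "Na", "N"), ("萄", "Na", "N"),   # from 葡萄皮
          ("皮", "Na", "N")]
print(rule_accuracy(stream, ("葡", "Na"), ("萄", "Na")))  # 0.5
```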
    20. Unknown Word Extraction <ul><li>Machine learning </li></ul><ul><ul><li>Classification </li></ul></ul><ul><ul><li>Sequential learning </li></ul></ul>
    21. Unknown Word Extraction- feature (POS) <ul><li>We use the TnT POS tagger to assign part-of-speech (POS) tags to terms. </li></ul><ul><li>Kinds of POS tags: </li></ul><ul><ul><li>Nouns (Na, Nb,…) </li></ul></ul><ul><ul><li>Verbs (VA, VB, VC,…) </li></ul></ul><ul><ul><li>Adjectives (A…) </li></ul></ul><ul><ul><li>Punctuation (comma, period,…) </li></ul></ul><ul><ul><li>… </li></ul></ul>
    22. Unknown Word Extraction- feature (term_attribute) <ul><li>After initial segmentation and application of the detection rules, each term carries a “ term_attribute ” label. </li></ul><ul><li>The six “ term_attributes ” are as follows: </li></ul><ul><ul><li>ms() monosyllabic word, Ex: 你、我、他 </li></ul></ul><ul><ul><li>ms(?) morpheme of an unknown word, Ex: “ 王 ”、“ 小 ”、“ 明 ” in “ 王小明 ” </li></ul></ul><ul><ul><li>ds() disyllabic word, Ex: 學校 </li></ul></ul><ul><ul><li>ps() polysyllabic word, Ex: 筆記型電腦 </li></ul></ul><ul><ul><li>dot() punctuation, Ex: “ ,”、 “。”… </li></ul></ul><ul><ul><li>none() none of the above, or a new term </li></ul></ul><ul><li>Target of unknown word extraction: at least one ms(?) </li></ul>Example: 運動會 ps(), ‧ dot(), 四年 ds(), 甲班 ds(), 王 ms(?), 姿 ms(?), 分 ms(?), ‧ dot(), 本校 ds(), 為 ms(), 響 ms(), 應 ms()
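Assigning these labels can be sketched as below. This is a simplified illustration: the `detected_morpheme` flag is assumed to come from the Phase 1 detection rules, and the `none()` case and the exact punctuation set are omitted or guessed.

```python
def term_attribute(term, detected_morpheme):
    """Map a segmented term to one of the slide's term_attribute labels.

    `detected_morpheme` is True when Phase 1 flagged this monosyllable
    as a morpheme of an unknown word."""
    PUNCT = set("，。、；：！？‧")  # assumed punctuation inventory
    if term in PUNCT:
        return "dot()"
    if len(term) == 1:
        return "ms(?)" if detected_morpheme else "ms()"
    if len(term) == 2:
        return "ds()"
    return "ps()"

print(term_attribute("運動會", False))  # ps()
print(term_attribute("王", True))       # ms(?)
print(term_attribute("本校", False))    # ds()
```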
    23. Data Processing- Sliding Window <ul><li>Sequential supervised learning </li></ul><ul><ul><li>Indirect method: transform sequential learning into classification learning </li></ul></ul><ul><ul><ul><li>Sliding Window </li></ul></ul></ul><ul><li>We train three lengths of SVM models to extract different lengths of unknown words, n = 2, 3, 4. </li></ul><ul><li>Each time we take n+2 terms (prefix and suffix included) as one window, then shift one token to the right to generate the next window, and so on. </li></ul><ul><ul><li>Window: n+2 terms (n-gram + prefix + suffix) </li></ul></ul><ul><ul><li>N-gram: n terms </li></ul></ul><ul><ul><ul><li>At least one ms(?) must exist among the n terms. </li></ul></ul></ul>Window layout: prefix t0, 3-gram t1 t2 t3, suffix t4.
    24. EX: 3-gram Model Sentence: 運動會 () ‧ () 四年 () 甲班 () 王 (?) 姿 (?) 分 (?) ‧ () 本校 () 為 () 響 () 應 () Windows: 運動會 ‧ 四年 甲班 王 (?) : discard (no ms(?) in the 3-gram); ‧ 四年 甲班 王 (?) 姿 (?) : negative; 四年 甲班 王 (?) 姿 (?) 分 (?) : negative; 甲班 王 (?) 姿 (?) 分 (?) ‧ : positive (the 3-gram is the unknown word 王姿分 ); 王 (?) 姿 (?) 分 (?) ‧ 本校 : negative.
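Window generation for this example can be sketched as follows; the data layout (term, attribute) pairs and the function name are illustrative, not from the presentation.

```python
def sliding_windows(terms, n):
    """Generate n+2-term windows (prefix + n-gram + suffix), keeping
    only windows whose n-gram contains at least one detected morpheme
    ms(?); all other windows are discarded, as in the 3-gram example."""
    windows = []
    for i in range(len(terms) - (n + 2) + 1):
        window = terms[i:i + n + 2]
        gram = window[1:-1]  # strip prefix and suffix
        if any(attr == "ms(?)" for _, attr in gram):
            windows.append(window)
    return windows

terms = [("運動會", "ps()"), ("‧", "dot()"), ("四年", "ds()"),
         ("甲班", "ds()"), ("王", "ms(?)"), ("姿", "ms(?)"),
         ("分", "ms(?)"), ("‧", "dot()"), ("本校", "ds()")]
wins = sliding_windows(terms, 3)
# The first window (3-gram ‧ 四年 甲班 ) is discarded; the kept window whose
# 3-gram is 王 姿 分 is the positive candidate for the unknown word 王姿分.
```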
    25. Unknown Word Extraction- feature (Statistical Information) <ul><li>Statistical information (exemplified by the 3-gram model; window layout: prefix t0, 3-gram t1 t2 t3, suffix t4): </li></ul><ul><ul><li>Frequency of the 3-gram. </li></ul></ul><ul><ul><li>p( prefix | 3-gram ), e.g. p( prefix | t1~t3 ) </li></ul></ul><ul><ul><li>p( suffix | 3-gram ), e.g. p( suffix | t1~t3 ) </li></ul></ul><ul><ul><li>p( first term of n | other n-1 consecutive terms ), e.g. p( t1 | t2~t3 ) </li></ul></ul><ul><ul><li>p( last term of n | other n-1 preceding terms ), e.g. p( t3 | t1~t2 ) </li></ul></ul><ul><ul><li>pos_freq( prefix ) / pos_freq( prefix in training positives ) </li></ul></ul><ul><ul><li>pos_freq( suffix ) / pos_freq( suffix in training positives ) </li></ul></ul>
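The conditional-probability features can be estimated from raw co-occurrence counts over the training windows. A minimal sketch (shown for a 2-gram window of the same prefix/gram/suffix layout; names and the brute-force counting are illustrative):

```python
def freq(windows, pred):
    """Count training windows satisfying a predicate."""
    return sum(1 for w in windows if pred(w))

def gram_features(windows, w):
    """Count-based features for window w = (prefix, t1, ..., tn, suffix):
    gram frequency, p(prefix | n-gram), p(suffix | n-gram),
    p(first term | remaining terms), p(last term | preceding terms)."""
    gram = w[1:-1]
    n_gram = freq(windows, lambda x: x[1:-1] == gram)
    p_prefix = freq(windows, lambda x: x[1:-1] == gram and x[0] == w[0]) / n_gram
    p_suffix = freq(windows, lambda x: x[1:-1] == gram and x[-1] == w[-1]) / n_gram
    p_first = n_gram / freq(windows, lambda x: x[2:-1] == gram[1:])
    p_last = n_gram / freq(windows, lambda x: x[1:-2] == gram[:-1])
    return n_gram, p_prefix, p_suffix, p_first, p_last

# Toy training windows: the 2-gram (X, Y) occurs twice, always after
# prefix "a", but with suffix "b" only once.
windows = [("a", "X", "Y", "b"), ("a", "X", "Y", "c"), ("d", "X", "Z", "b")]
feats = gram_features(windows, windows[0])
```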
    26. Data presentation <ul><li>Format for machine learning: each term in the window (prefix, t1, t2, …, suffix) is encoded as a POS vector (55 dimensions) plus a term_attribute vector (6 dimensions), and the statistical information (7 dimensions) is appended. </li></ul><ul><li>Dimensions are accumulated by concatenating the per-term vectors. </li></ul>
    27. Experiments <ul><li>Unknown word detection. </li></ul><ul><li>Unknown word extraction. </li></ul>
    28. Unknown Word Detection <ul><li>8/10 of the balanced corpus (460m words) as training data. </li></ul><ul><li>Use the pattern mining tool Prowl [Huang et al., 2004]. </li></ul><ul><li>1/10 of the balanced corpus as validation data. </li></ul><ul><li>Use accuracy and frequency as thresholds for the detection rules. </li></ul><ul><li>1/10 of the balanced corpus as real test data (for Phase 2): </li></ul><ul><ul><li>60.3% precision and 93.6% recall </li></ul></ul>Results by accuracy threshold (Threshold / Precision / Recall / F-measure (our system) / F-measure (AS system)): 0.7: 0.9324 / 0.4305 / 0.5890 / 0.7125; 0.8: 0.9008 / 0.5289 / 0.6665 / 0.7524; 0.9: 0.8343 / 0.7148 / 0.7699 / 0.7696; 0.95: 0.7640 / 0.8288 / 0.7951 / 0.7655; 0.98: 0.6860 / 0.8786 / 0.7704 / 0.7440. Results by frequency threshold (Freq >= / Precision / Recall / F-measure): 3: 0.7640 / 0.8288 / 0.7951; 7: 0.7113 / 0.8819 / 0.7875; 11: 0.6924 / 0.8932 / 0.7801; 19: 0.6736 / 0.8995 / 0.7703; 29: 0.6552 / 0.9092 / 0.7616.
    29. Unknown Word Extraction <ul><li>8/10 of the balanced corpus (460m words) as training data. </li></ul><ul><li>1/10 of the balanced corpus as testing data. </li></ul><ul><li>Imbalanced data solution: </li></ul><ul><ul><li>Ensemble method (voting) + random under-sampling </li></ul></ul><ul><ul><li>Use another 1/10 of the balanced corpus as validation data to find the sampling ratio: </li></ul></ul><ul><ul><ul><li>2-gram: 1:2 (positive : negative) </li></ul></ul></ul><ul><ul><ul><li>3-gram: 1:3 </li></ul></ul></ul><ul><ul><ul><li>4-gram: 1:6 </li></ul></ul></ul>
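The under-sampling plus voting scheme can be sketched as below. This is a toy stand-in, not the actual system: the threshold "classifier" replaces the SVM base learners, and all scores and class sizes are invented for illustration; only the mechanics (each base learner trains on a different random under-sample, predictions are combined by majority vote) match the slide.

```python
import random
from collections import Counter

def undersample(negatives, ratio, n_pos, rng):
    """Randomly keep `ratio` negatives per positive (the slide uses
    1:2 for 2-gram, 1:3 for 3-gram, and 1:6 for 4-gram models)."""
    return rng.sample(negatives, min(len(negatives), ratio * n_pos))

def ensemble_predict(classifiers, x):
    """Majority vote over the base classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

def train(pos, neg):
    """Toy base learner: midpoint threshold between the classes."""
    t = (min(pos) + max(neg)) / 2
    return lambda x: 1 if x >= t else 0

rng = random.Random(0)
positives = [1.0, 1.2, 0.9]                # few positive examples
negatives = [0.05 * i for i in range(16)]  # many more negatives
classifiers = [train(positives, undersample(negatives, 3, len(positives), rng))
               for _ in range(12)]         # 12 base classifiers, as in C1..C12
print(ensemble_predict(classifiers, 1.1), ensemble_predict(classifiers, 0.1))
```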
    30. Unknown Word Extraction <ul><li>Judging overlap and conflict among different candidate combinations of unknown words: </li></ul><ul><ul><li>[Chen et al., 2002] </li></ul></ul><ul><ul><ul><li>frequency (w) * length (w). </li></ul></ul></ul><ul><ul><ul><li>Ex: “ 律師 班 奈 特” => compare freq( 律師 + 班 )*3 with freq( 班 + 奈 + 特 )*3 </li></ul></ul></ul><ul><ul><li>Our method: </li></ul></ul><ul><ul><li>First solve overlaps within identical N-grams: </li></ul></ul><ul><ul><ul><li>P (combine | overlap) </li></ul></ul></ul><ul><ul><ul><ul><li>Ex: “ 單 親 家庭” : compare P( 單親 | 親 ) with P( 親家庭 | 親 ) </li></ul></ul></ul></ul><ul><ul><li>Then solve conflicts between different N-grams: </li></ul></ul><ul><ul><ul><li>Real frequency </li></ul></ul></ul><ul><ul><ul><ul><li>freq (X) - freq (Y), if X is included in Y. Ex: X = “ 醫學”、“學院” , Y = “ 醫學院” </li></ul></ul></ul></ul>
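The "real frequency" adjustment can be sketched in a few lines; the counts below are hypothetical numbers chosen to illustrate the slide's 醫學 / 學院 vs. 醫學院 example, and the function name is invented.

```python
def real_freq(candidate, freq, extracted_longer):
    """'Real frequency' of a shorter candidate X: occurrences of X that
    actually belong to an already-extracted longer word Y containing X
    do not count, i.e. freq(X) - freq(Y) for every such Y."""
    f = freq[candidate]
    for longer in extracted_longer:
        if candidate in longer and candidate != longer:
            f -= freq[longer]
    return f

# Hypothetical counts for X = 醫學 / 學院 and Y = 醫學院.
freq = {"醫學": 10, "學院": 7, "醫學院": 6}
print(real_freq("醫學", freq, ["醫學院"]))  # 10 - 6 = 4
print(real_freq("學院", freq, ["醫學院"]))  # 7 - 6 = 1
```

With these counts, 醫學院 keeps its 6 occurrences while the embedded 2-grams drop to 4 and 1, so the longer word wins the conflict.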
    31. Extraction result <ul><li>Comparison: </li></ul><ul><ul><li>[Ma et al., 2003] </li></ul></ul><ul><ul><ul><li>morphological rules + statistical rules + context-free grammar rules </li></ul></ul></ul><ul><ul><ul><li>Precision: 76%, Recall: 57% </li></ul></ul></ul><ul><ul><li>Our result: </li></ul></ul>n-gram / Precision / Recall / F1-score: 2-gram: 56.7% / 67.1% / 0.614; 3-gram: 63.3% / 80% / 0.707; 4-gram: 30.6% / 70.3% / 0.426; Total: 58.1% / 68.2% / 0.627.
    32. Ensemble Method Improvement Classification model / 2-gram (P, R, F1) / 3-gram (P, R, F1) / 4-gram (P, R, F1): C1: 0.518, 0.64, 0.572 / 0.542, 0.808, 0.649 / 0.252, 0.419, 0.315; C2: 0.569, 0.657, 0.61 / 0.627, 0.791, 0.7 / 0.219, 0.743, 0.338; C3: 0.535, 0.633, 0.58 / 0.563, 0.81, 0.664 / 0.222, 0.378, 0.28; C4: 0.557, 0.645, 0.598 / 0.574, 0.796, 0.667 / 0.305, 0.676, 0.42; C5: 0.555, 0.66, 0.603 / 0.549, 0.779, 0.644 / 0.205, 0.554, 0.299; C6: 0.536, 0.636, 0.582 / 0.568, 0.735, 0.641 / 0.23, 0.608, 0.333; C7: 0.557, 0.66, 0.604 / 0.611, 0.691, 0.648 / 0.211, 0.703, 0.325; C8: 0.541, 0.673, 0.6 / 0.579, 0.813, 0.676 / 0.226, 0.486, 0.309; C9: 0.548, 0.657, 0.598 / 0.587, 0.715, 0.645 / 0.215, 0.635, 0.321; C10: 0.543, 0.661, 0.596 / 0.599, 0.723, 0.655 / 0.232, 0.662, 0.344; C11: 0.533, 0.668, 0.593 / 0.607, 0.74, 0.667 / 0.24, 0.554, 0.335; C12: 0.538, 0.645, 0.587 / 0.587, 0.776, 0.669 / 0.299, 0.662, 0.412; Caverage: 0.544, 0.653, 0.594 / 0.583, 0.765, 0.66 / 0.238, 0.59, 0.336; Censemble: 0.567, 0.671, 0.614 / 0.633, 0.8, 0.707 / 0.306, 0.703, 0.426.
    33. Experiment- One phase <ul><li>What if we skip unknown word detection? </li></ul><ul><li>Two phases do work better. </li></ul>Classification performance: One Phase: Precision 40.8%, Recall 71.4%, F-score 0.52; Two Phases: Precision 58.1%, Recall 68.2%, F-score 0.627.
    34. Conclusions <ul><li>We adopt a two-phase method to solve the unknown word problem </li></ul><ul><ul><li>Unknown word detection </li></ul></ul><ul><ul><ul><li>Continuity pattern mining to derive detection rules. </li></ul></ul></ul><ul><ul><li>Unknown word extraction </li></ul></ul><ul><ul><ul><li>Machine learning based: classification algorithms and (indirect) sequential learning. </li></ul></ul></ul><ul><ul><ul><li>Imbalanced data solution </li></ul></ul></ul><ul><li>Our experiments show that two phases work better than one phase. </li></ul><ul><li>Future work: </li></ul><ul><ul><li>Utilize machine learning for detection. </li></ul></ul><ul><ul><li>Utilize more information (patterns, rules) to improve extraction precision. </li></ul></ul>
