LiCord: Language Independent Content Word Finder


Content Words (CWs) are important segments of text. In text mining, they are used for purposes such as topic identification, document summarization, and question answering. Identifying CWs usually requires various language-dependent tools. However, such tools are not available for many languages, and developing them for all languages is costly. On the other hand, given the recent growth of text content in various languages, language-independent text mining has great potential. To mine text automatically, CW finding that does not depend on language tools is a requirement. In this research, we devise a framework that identifies text segments as CWs in a language-independent way. We identify structural features that relate text segments to CWs. We compute the features over a large text corpus and apply machine learning-based classification to classify the segments as CWs. The proposed framework uses only a large text corpus and some training examples; apart from these, it does not require any language-specific tool. We conducted experiments with our framework for three languages, English, Vietnamese, and Indonesian, and found that it works with more than 83% accuracy.

Published in: Data & Analytics


  1. LiCord: Language Independent Content Word Finder
     Md-Mizanur Rahoman, Tetsuya Nasukawa, Hiroshi Kanayama & Ryutaro Ichise
     April 18, 2016
  2. Background
     • Hundreds of languages are in use today, but only a few of them can be mined automatically, because NLP resources for the rest are scarce or missing.
     • Creating NLP resources for all languages is not feasible.
     • A Content Word finding system can be considered a basic NLP resource for a language.
  3. Content Word
     Definition [ref: American Heritage Dictionary]: Content Words are nouns, most verbs, adjectives, and adverbs that
     • refer to some object, action, or characteristic
     • carry independent meaning
     • are usually open, i.e., new words can be added
     Example: “NO8DO” is the official motto of Seville.
     Usage: Content Words can be used for
     • (new) topic identification
     • document summarization
     • question answering, etc.
  4. Problem & Possible Solution
     Problem
     • Finding Content Words requires language-dependent NLP resources: language parsers, parallel corpora, etc.
     • Developing NLP resources for all languages is costly and “not feasible”.
     Possible solution
     • Morphological features of a text segment can indicate whether the segment is a Content Word.
     • A machine learning model can classify text segments as Content Words.
     • A big text corpus can generate balanced morphological features for such text segments.
  5. System Framework
     Model generation has four processes:
     • NGram Constructor: performs text segmentation
     • Function Word Decider: decides which frequent segments are Function Words
     • Feature Value Calculator: devises feature values for the segments
     • Classifier Learner: generates a classification model that decides whether segments are Content Words
  7. 1. NGram Constructor
     • segment text and construct variable-length (in tokens) n-grams
     • calculate n-gram frequencies
     Table: Variable-length n-grams and their frequencies for an exemplary corpus T = “Japan is an Asian country. Japan is a peaceful country”.
     n-gram size          n-grams and frequencies over T
     size 1 (uni-gram)    {[Japan−2], [is−2], [an−1], ..., [country−2], [a−1], ...}
     size 2 (bi-gram)     {[Japan is−2], [is an−1], ..., [Asian country−1], ...}
     size 3 (tri-gram)    {[Japan is an−1], [is an Asian−1], [an Asian country−1], ...}
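
The n-gram construction step can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `build_ngrams`, the `max_n` parameter, whitespace tokenization, and splitting sentences at `.`, `!` or `?` are all assumptions made for the example. It uses the exemplary corpus T from the slide.

```python
import re
from collections import Counter

def build_ngrams(text, max_n=3):
    """Count every n-gram of 1..max_n tokens, never crossing a sentence boundary."""
    counts = Counter()
    # Assumption: sentences end at '.', '!' or '?'; tokens are whitespace-separated.
    for sentence in re.split(r"[.!?]+", text):
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for start in range(len(tokens) - n + 1):
                counts[" ".join(tokens[start:start + n])] += 1
    return counts

corpus = "Japan is an Asian country. Japan is a peaceful country"
ngram_counts = build_ngrams(corpus)
print(ngram_counts["Japan"], ngram_counts["Japan is"], ngram_counts["Japan is an"])
```

Whitespace tokenization keeps the sketch short; a language without word spacing would need a different segmentation rule.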
  9. 2. Function Word Decider
     Function Words
     • express grammatical relationships with other words
     • have little lexical meaning or have ambiguous meaning
     • are frequent n-grams over a text document
     Example: “the”, “in”, “in spite of”, etc.
     Function Words are decided by
     • picking a threshold number of frequent n-grams
     • mapping the frequent n-grams to available translations of known Function Words
     • using the threshold only, if a translation service is not available
     n-gram           # of tokens   frq        frq%
     the              1             3124631    67.60
     in               1             1774988    38.40
     ...              ...           ...        ...
     united states    2             43698      0.94
     ...              ...           ...        ...
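
A rough sketch of the decision rule described on this slide, under stated assumptions: the name `decide_function_words`, the `top_k` threshold parameter, and the idea of passing translated Function Words as a plain list are hypothetical choices for illustration. The counts are the three rows shown in the slide's table.

```python
def decide_function_words(ngram_counts, top_k, known_function_words=None):
    """Return Function Word candidates among the top_k most frequent n-grams.

    If a list of known (e.g. translated) Function Words is available, keep only
    the frequent n-grams found in it; otherwise fall back to the threshold alone.
    """
    ranked = sorted(ngram_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    frequent = [ngram for ngram, _ in ranked[:top_k]]
    if known_function_words is None:
        return set(frequent)
    known = {w.lower() for w in known_function_words}
    return {ngram for ngram in frequent if ngram.lower() in known}

counts = {"the": 3124631, "in": 1774988, "united states": 43698}
threshold_only = decide_function_words(counts, top_k=3)
with_known_list = decide_function_words(
    counts, top_k=3, known_function_words=["the", "in", "in spite of"]
)
print(sorted(threshold_only))
print(sorted(with_known_list))
```

With the threshold alone, a frequent Content Word such as "united states" slips through, which suggests why the slide prefers mapping against known Function Words when a translation is available.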
  11. 3. Feature Value Calculator
     Select fifteen different morphological features of text and calculate their values for n-grams over a big corpus, for example:
     • where the n-grams appear, i.e., the beginning, middle, or end of sentences
     • how frequently the n-grams appear in the corpus
     • how the n-grams combine with Function Words, punctuation, etc.
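
The slide names only three kinds of features, not the full set of fifteen, so the following is just an illustrative sketch of how such values might be computed. The function `feature_values`, the feature names, and the exact definitions (position ratios, adjacency to a Function Word) are assumptions, not the authors' feature definitions.

```python
def feature_values(ngram, sentences, function_words):
    """Compute a few illustrative features for one n-gram.

    sentences: list of token lists; function_words: set of single-token Function Words.
    """
    target = ngram.split()
    n = len(target)
    begin = mid = end = next_to_fw = total = 0
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] != target:
                continue
            total += 1
            # An n-gram spanning the whole sentence is counted as "begin".
            if i == 0:
                begin += 1
            elif i + n == len(tokens):
                end += 1
            else:
                mid += 1
            before = tokens[i - 1] if i > 0 else None
            after = tokens[i + n] if i + n < len(tokens) else None
            if before in function_words or after in function_words:
                next_to_fw += 1
    if total == 0:
        return {"frequency": 0, "begin": 0.0, "mid": 0.0, "end": 0.0,
                "next_to_function_word": 0.0}
    return {"frequency": total, "begin": begin / total, "mid": mid / total,
            "end": end / total, "next_to_function_word": next_to_fw / total}

sentences = [["Japan", "is", "an", "Asian", "country"],
             ["Japan", "is", "a", "peaceful", "country"]]
function_words = {"is", "an", "a"}
japan = feature_values("Japan", sentences, function_words)
country = feature_values("country", sentences, function_words)
print(japan)
print(country)
```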
  13. 4. Classifier Learner (1/2)
     Construct frequency-range-wise classification models.
     Reason
     • Using all n-grams as training examples consumes a large amount of time.
     • Randomly picked n-grams do not represent the entire dataset.
     • Assumption: n-grams with the same frequency share the same kind of morphological features (over the corpus).
  14. 4. Classifier Learner (2/2)
     Construct frequency-range-wise classification models.
     Method
     • collect range-based n-grams: X(i,j) = {x | x ∈ N ∧ i ≤ frq(x) ≤ j}, where N = all n-grams in the corpus and x = an n-gram
     • select a threshold number of n-grams as training n-grams for each range
     • calculate features for each range-wise selected n-gram
     • learn a classification model from the training n-grams of each range
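
The first two steps of the method can be sketched as below. `collect_range` follows the set definition X(i,j) directly; `select_training_ngrams`, the `per_range` parameter, and random sampling within each range are assumptions, since the slides do not say how the threshold number of n-grams is chosen. The toy counts are made up for the example; the frequency ranges are the ones reported in the accuracy tables.

```python
import random

def collect_range(ngram_counts, i, j):
    """X(i, j): all n-grams x in the corpus with i <= frq(x) <= j."""
    return sorted(x for x, frq in ngram_counts.items() if i <= frq <= j)

def select_training_ngrams(ngram_counts, ranges, per_range, seed=0):
    """Pick at most per_range training n-grams from each frequency range."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    selected = {}
    for i, j in ranges:
        pool = collect_range(ngram_counts, i, j)
        selected[(i, j)] = rng.sample(pool, min(per_range, len(pool)))
    return selected

# Hypothetical toy frequencies, only to exercise the functions.
toy_counts = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 12, "f": 1, "g": 7}
ranges = [(1, 1), (2, 2), (3, 4), (5, 9), (10, 14)]
training = select_training_ngrams(toy_counts, ranges, per_range=1)
print(collect_range(toy_counts, 3, 4))
print({r: len(ngrams) for r, ngrams in training.items()})
```

Each range's selected n-grams would then get their feature values computed and be passed to a separate classifier, one model per range.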
  15. Experiment
     Check whether LiCord can identify Content Words language independently.
     • analyzed languages: English, Vietnamese, and Indonesian
     • training resource: Wikipedia Pages & Wikipedia Titles
       +ve: the n-gram (text segment) exists as a Wikipedia Title, e.g., [Seville], [official motto], etc.
       -ve: otherwise, e.g., [“NO8DO” is], [is the], etc.
     • classification algorithms: Support Vector Machine and C4.5 (tree-based algorithm)
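
The labeling rule for training examples can be sketched like this. The function name `label_ngrams` and the case-insensitive comparison are assumptions, and the small title set is a hypothetical stand-in for the real list of Wikipedia Titles; the example n-grams are the ones given on the slide.

```python
def label_ngrams(ngrams, wikipedia_titles):
    """Label an n-gram +ve (True) if it matches a title, -ve (False) otherwise."""
    # Assumption: matching ignores case; the slides do not specify this.
    titles = {title.lower() for title in wikipedia_titles}
    return {ngram: ngram.lower() in titles for ngram in ngrams}

# Hypothetical stand-in for the full set of Wikipedia Titles.
titles = {"Seville", "Official motto"}
labels = label_ngrams(["Seville", "official motto", "“NO8DO” is", "is the"], titles)
for ngram, is_positive in labels.items():
    print(ngram, "+ve" if is_positive else "-ve")
```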
  16. Language Independent Content Word Finding (1/2)
     Testing method: check whether the test n-grams are Content Words.
     Table: CW finding accuracy %
     Frequency range   English   Indonesian   Vietnamese
     (1,1)             76.68     90.56        90.30
     (2,2)             83.00     93.20        94.15
     (3,4)             84.37     94.23        94.76
     (5,9)             83.87     95.89        93.97
     (10,14)           87.09     96.15        94.95
     Average           83.25     93.80        93.54
  17. Language Independent Content Word Finding (2/2)
     Table: Newly discovered Content Words finding accuracy %
     Frequency range   English   Indonesian   Vietnamese
     (1,1)             27.90     11.34        10.63
     (2,2)             45.00     18.54        25.00
     (3,4)             52.11     24.45        27.56
     (5,9)             50.34     25.56        30.88
     (10,14)           61.90     29.89        35.13
     Average           47.45     21.95        22.50
     Finding: checking a large number of sentences for their specific morphological features over a big corpus can generate a machine learning model that finds Content Words.
  18. Conclusion
     • Language-independent Content Word finding is a requirement in present-day text mining.
     • We propose a supervised machine learning technique to classify text segments as Content Words.
     • Experimental results show that the proposed method can serve as a Content Word finder.
  19. Question & Suggestion
     Md-Mizanur Rahoman, mizan@nii.ac.jp
  20. Experiment 1 (1/2)
     Purpose: check whether LiCord can identify NEs (Named Entities) and act like a sentence parser.
     Identifying NEs: executed for some test sentences and compared with Wikifier and Spotlight.
     Table: Comparison of LiCord with Wikifier
     System      Recall
     Wikifier    33.33%
     LiCord      90.47%
     Table: Comparison of LiCord with Spotlight
     System      Recall
     Spotlight   83.33%
     LiCord      91.66%
  21. Experiment 1 (2/2)
     Acting as a parser: executed for some test sentences and compared with the Stanford parser for Content Words.
     Table: Comparison of LiCord with Parser
     Language    Recall
     English     92.30%
     Finding: checking a large number of sentences for their specific morphological features over a big corpus can support word segmentation.
