LiCord: Language Independent Content Word Finder
Md-Mizanur Rahoman, Tetsuya Nasukawa, Hiroshi Kanayama &
Ryutaro Ichise
April 18, 2016
Background
hundreds of languages are in use, but only a few of them can be
automatically mined, because the rest have few or no NLP resources
creating NLP resources for every language is not feasible
a Content Word finding system can be considered a basic NLP resource
for any language
Rahoman et.al., | LiCord | 2
Content Word
definition: Content Words [ref: American Heritage Dictionary]
are nouns, most verbs, adjectives, and adverbs that refer to some
object, action, or characteristic
carry independent meaning
are usually open-class, i.e., new words can be added
example: “NO8DO” is the official motto of Seville.
usage: Content Words can be used for
(new) topic identification
document summarization
question answering, etc.
Problem & Possible Solution
problem
finding Content Words requires language-dependent NLP resources
language parsers
parallel corpora, etc.
developing NLP resources for all languages is costly and “not feasible”
possible solution
morphological features of a text segment can indicate whether the
segment is a Content Word
a machine learning model can classify text segments as Content Words
a big text corpus can generate balanced morphological features for such
text segments
System Framework
the model generation has four processes:
NGram Constructor − perform text segmentation
Function Word Decider − decide which segments are Function Words
Feature Value Calculator − devise feature values for the segments
Classifier Learner − generate a classification model to classify the
segments as Content Words
1. NGram Constructor
segment text and construct variable-length n-grams (length counted in tokens)
calculate n-gram frequencies
Table: Variable-length n-grams and their frequencies for an exemplary
corpus T = “Japan is an Asian country. Japan is a peaceful country”.
n-gram size          n-grams and frequencies over T
size 1 (uni-gram)    {[Japan−2], [is−2], [an−1], ..., [country−2], [a−1], ...}
size 2 (bi-gram)     {[Japan is−2], [is an−1], ..., [Asian country−1], ...}
size 3 (tri-gram)    {[Japan is an−1], [is an Asian−1], [an Asian country−1], ...}
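The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `count_ngrams`, the whitespace tokenization, and the sentence split on `.`, `!` and `?` are all assumptions, and languages written without spaces would need a different segmenter.

```python
import re
from collections import Counter

def count_ngrams(text, max_n=3):
    """Count every n-gram of 1..max_n tokens without crossing sentence boundaries."""
    counts = Counter()
    # Assumption: sentences end at '.', '!' or '?'; tokens are whitespace-separated.
    for sentence in re.split(r"[.!?]+", text):
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

corpus = "Japan is an Asian country. Japan is a peaceful country"
ngrams = count_ngrams(corpus)
print(ngrams["Japan"], ngrams["Japan is"], ngrams["an Asian country"])
```

On the exemplary corpus this reproduces the frequencies in the table, for instance 2 for [Japan] and [Japan is] and 1 for [an Asian country].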
2. Function Word Decider
Function Words
express grammatical relationships with other words
have little lexical meaning or have ambiguous meaning
are frequent n-grams over a text document
example: “the”, “in”, “in spite of”, etc.
decided by
picking a threshold number of the most frequent n-grams
matching the frequent n-grams against available translations of known
Function Words
using the threshold alone if no translation service is available
n-gram          # of tokens   frq       frq%
the             1             3124631   67.60
in              1             1774988   38.40
...             ...           ...       ...
united states   2             43698     0.94
...             ...           ...       ...
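A hedged sketch of the decision rule follows. The function `decide_function_words` and its parameters are hypothetical, and "matching against translations" is read here as a simple set-membership filter, which is one plausible interpretation of the slide. The first three counts come from the table above; the count for "seville" is invented for illustration.

```python
from collections import Counter

def decide_function_words(ngram_counts, top_k, known_function_words=None):
    """Take the top_k most frequent n-grams and, if a list of known (translated)
    Function Words is available, keep only the n-grams found in that list."""
    frequent = [ngram for ngram, _ in ngram_counts.most_common(top_k)]
    if known_function_words is None:
        # No translation service available: fall back to the threshold alone.
        return set(frequent)
    return {ngram for ngram in frequent if ngram in known_function_words}

# "seville" and its count are made-up illustration data.
counts = Counter({"the": 3124631, "in": 1774988, "united states": 43698, "seville": 120})
by_threshold = decide_function_words(counts, top_k=3)
with_lexicon = decide_function_words(counts, top_k=3,
                                     known_function_words={"the", "in", "of"})
print(sorted(by_threshold), sorted(with_lexicon))
```

The example shows why the lexicon filter helps: with the threshold alone, the frequent bi-gram "united states" is kept as a Function Word, while the filtered variant keeps only "the" and "in".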
3. Feature Value Calculator
select fifteen different morphological features of text and calculate
their values for n-grams over a big corpus, for example
where the n-grams appear, i.e., the beginning/middle/end of sentences
how frequently the n-grams appear in the corpus
how the n-grams combine with Function Words, punctuation, etc.
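The slide names only the kinds of features, not the fifteen actual ones, so the following Python sketch computes a handful of hypothetical features of those kinds (position in the sentence, frequency, adjacency to Function Words). The function `ngram_features` and every feature name are assumptions made for illustration.

```python
def ngram_features(ngram, sentences, function_words):
    """Compute a few illustrative features for one n-gram over tokenized sentences."""
    target = ngram.split()
    n = len(target)
    total = begin = mid = end = after_fw = before_fw = 0
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] != target:
                continue
            total += 1
            begin += (i == 0)                           # starts the sentence
            end += (i + n == len(tokens))               # ends the sentence
            mid += (i > 0 and i + n < len(tokens))      # strictly inside
            after_fw += (i > 0 and tokens[i - 1] in function_words)
            before_fw += (i + n < len(tokens) and tokens[i + n] in function_words)
    if total == 0:
        return None  # the n-gram never occurs in the given sentences
    return {
        "frequency": total,
        "begin_ratio": begin / total,
        "mid_ratio": mid / total,
        "end_ratio": end / total,
        "preceded_by_fw_ratio": after_fw / total,
        "followed_by_fw_ratio": before_fw / total,
    }

sentences = [["japan", "is", "an", "asian", "country"],
             ["japan", "is", "a", "peaceful", "country"]]
function_words = {"is", "an", "a"}
japan = ngram_features("japan", sentences, function_words)
asian_country = ngram_features("asian country", sentences, function_words)
print(japan)
print(asian_country)
```

Each n-gram is thus turned into a fixed-length numeric vector, which is the form a classifier such as an SVM or a decision tree expects.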
4. Classifier Learner (1/2)
construct frequency-range-wise classification models
Reason
training consumes a large amount of time if all n-grams are used as
training examples
a randomly picked sample does not represent the entire dataset
assume that n-grams with the same frequency share the same kind of
morphological features (over the corpus)
4. Classifier Learner (2/2)
construct frequency-range-wise classification models
Method
collect range-based n-grams
X(i,j) = {x | x ∈ N ∧ i ≤ frq(x) ≤ j}
where N = all n-grams in the corpus, x = an n-gram, frq(x) = frequency of x
select a threshold number of n-grams from each range as training n-grams
calculate features for the selected n-grams of each range
learn one classification model per range from its training n-grams
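The range-based collection X(i,j) and the per-range sampling can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the uniform random sampling inside each range, and the toy counts are hypothetical, and the frequency ranges are the ones reported in the result tables of this deck. The model-learning step itself is left out.

```python
import random
from collections import Counter

# Frequency ranges as reported in the result tables of this deck.
FREQUENCY_RANGES = [(1, 1), (2, 2), (3, 4), (5, 9), (10, 14)]

def ngrams_in_range(ngram_counts, low, high):
    """X(low, high): all n-grams x with low <= frq(x) <= high."""
    return sorted(x for x, frq in ngram_counts.items() if low <= frq <= high)

def sample_training_ngrams(ngram_counts, ranges, per_range, seed=0):
    """Select up to per_range training n-grams from every frequency range."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    selected = {}
    for low, high in ranges:
        candidates = ngrams_in_range(ngram_counts, low, high)
        selected[(low, high)] = rng.sample(candidates, min(per_range, len(candidates)))
    return selected

# Toy counts, invented for illustration.
counts = Counter({"a": 1, "b": 2, "c": 3, "d": 4, "e": 7, "f": 12, "g": 40})
training = sample_training_ngrams(counts, FREQUENCY_RANGES, per_range=1)
print(ngrams_in_range(counts, 3, 4), training)
```

One classifier would then be trained per key of `training`, and a test n-gram would be routed to the model whose range contains its frequency.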
Experiment
check whether LiCord can identify Content Words language
independently
analyzed languages − English, Vietnamese, and Indonesian
training resources − Wikipedia Pages & Wikipedia Titles
+ve: when the n-gram (text segment) exists as a Wikipedia Title,
e.g., [Seville], [official motto]
-ve: otherwise,
e.g., [“NO8DO” is], [is the]
classification algorithms − Support Vector Machine and C4.5
(a tree-based algorithm)
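The labeling rule above amounts to a set-membership test. A minimal Python sketch, assuming case-insensitive matching (the slide does not say how titles are normalized) and using a made-up two-title set in place of a real Wikipedia title dump:

```python
def label_ngrams(ngrams, wikipedia_titles):
    """Label an n-gram positive (True) if it matches a Wikipedia title, else negative."""
    # Assumption: titles and n-grams are compared case-insensitively.
    titles = {title.casefold() for title in wikipedia_titles}
    return {ngram: ngram.casefold() in titles for ngram in ngrams}

# Hypothetical title set; a real run would load the full list of Wikipedia titles.
titles = ["Seville", "Official motto"]
labels = label_ngrams(["Seville", "official motto", "is the"], titles)
print(labels)
```

The resulting boolean labels, paired with the feature values of each n-gram, form the training examples for the SVM or C4.5 learner.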
Language Independent Content Word Finding (1/2)
testing method − check whether the test n-grams are Content Words
Table: CW finding accuracy %
Frequency Range   English   Indonesian   Vietnamese
(1,1)             76.68     90.56        90.30
(2,2)             83.00     93.20        94.15
(3,4)             84.37     94.23        94.76
(5,9)             83.87     95.89        93.97
(10,14)           87.09     96.15        94.95
Average           83.25     93.80        93.54
Language Independent Content Word Finding (2/2)
Table: Newly discovered Content Word finding accuracy %
Frequency Range   English   Indonesian   Vietnamese
(1,1)             27.90     11.34        10.63
(2,2)             45.00     18.54        25.00
(3,4)             52.11     24.45        27.56
(5,9)             50.34     25.56        30.88
(10,14)           61.90     29.89        35.13
Average           47.45     21.95        22.50
finding − checking the morphological features of n-grams across a
large number of sentences in a big corpus can produce a machine
learning model that finds Content Words
Conclusion
language-independent Content Word finding is a requirement in
present-day text mining
we propose a supervised Machine Learning technique to classify
text segments as Content Words
experimental results show that the proposed method can serve as a
Content Word finder
Questions & Suggestions
Md-Mizanur Rahoman, mizan@nii.ac.jp
Experiment 1 (1/2)
purpose − check whether LiCord can identify NEs (Named Entities) and
act like a sentence parser
identifying NEs − executed on some test sentences and compared with
Wikifier and Spotlight
Table: Comparison of LiCord with Wikifier
System      Recall
Wikifier    33.33%
LiCord      90.47%
Table: Comparison of LiCord with Spotlight
System      Recall
Spotlight   83.33%
LiCord      91.66%
Experiment 1 (2/2)
acting as parser − executed on some test sentences and compared with
the Stanford parser for Content Words
Table: Comparison of LiCord with Parser
Language   Recall
English    92.30%
finding − checking the morphological features of n-grams across a
large number of sentences in a big corpus can support word
segmentation