SlideShare a Scribd company logo
1 of 10
Download to read offline
Rupak Bhattacharyya et al. (Eds) : ACER 2013,
pp. 109–118, 2013. © CS & IT-CSCP 2013 DOI : 10.5121/csit.2013.3211
DOMAIN KEYWORD EXTRACTION
TECHNIQUE: A NEW WEIGHTING METHOD
BASED ON FREQUENCY ANALYSIS
Rakhi Chakraborty
Department of Computer Science & Engineering, Global Institute Of
Management and Technology, Nadia, India
rakhi.chakraborty84@yahoo.in
ABSTRACT
On-line text documents rapidly increase in size with the growth of World Wide Web. To manage
such a huge amount of texts,several text miningapplications came into existence. Those
applications such as search engine, text categorization, summarization, and topic detection are
based on feature extraction.It is extremely time consuming and difficult task to extract keyword
or feature manually.So an automated process that extracts keywords or features needs to be
established.This paper proposes a new domain keyword extraction technique that includes a
new weighting method on the base of the conventional TF•IDF. Term frequency-Inverse
document frequency is widely used to express the documentsfeature weight, which can’t reflect
the division of terms in the document, and then can’t reflect the significance degree and the
difference between categories. This paper proposes a new weighting method to which a new
weight is added to express the differences between domains on the base of original TF•IDF.The
extracted feature can represent the content of the text better and has a better distinguished
ability.
.KEYWORDS
Text mining,Feature extraction,weighting method, Term Frequency Inverse Document
Frequency (TF•IDF), Domain keyword extraction.
1. INTRODUCTION
For rapidly development of computer network technology various aspects of electronic
documents also grows rapidly and most of the enterprise information saved as a text form. Hence
text mining has become an increasingly popular and also important field in the research of data
mining. Text mining is different from the traditional data mining has been pointed out by Han. J,
and Kamer. M[1]. The conventional data mining defines the relationship, things and structured
data as the research target. While the text mining defines the text database as the research
targetwhich consists of a large number of documents from the various data sources, including
research papers, news articles, books, journals, reports, patent specifications, conference paper, e-
mail messages, web pages and so on. Text mining is aninfantile interdisciplinary field which
draws on information retrieval, data mining, machine learning, statistics and computational
linguistics, Fig.1 shows that how the text mining is interconnected with other. Text mining would
like to solve problems such as the uncertainty and ambiguity in the text information.Text mining
or knowledge discovery from text (KDT) mentioned for the first time by Feldman et al. [2] which
is deals with the machine supported analysis of text. Text mining uses techniques from
information retrieval, information extraction as well as natural language processing (NLP) and
connects them with the algorithms and methods of KDD, data mining, machine learning and
statistics.
110 Computer Science & Information Technology (CS & IT)
Figure1. Interdisciplinary domain
2. AUTOMATED DOMAIN KEYWORD EXTRACTION TECHNIQUE
Keyword extraction is an extremely time consuming and difficult task, when it is done manually.
Due to large volume of published news articles, it is almost impossible to extract keywords
manually.Forestablish an automated process that extract keywords from news
articles,anunsupervised keyword extraction technique namely Automated Domain keyword
Extraction Technique is introduced in that research paper. Here a new weighting method is also
introduced on the base of the conventional TF•IDF [3] [4].
2.1. Keywords
Keywords are a set of significant words in an article that gives high-level narrative of its contents
to readers. To produce a short summary of news articles identifying the keyword from a large
number of online news data is very useful. Keyword extraction technique is used to extract main
features in studies such as text classification, text categorization, information retrieval,topic
detection and document summarization.
2.2. Different Methods forextracting keyword
The main methods of the keyword extraction are TF*IDF (Term Frequency Inverse Document
Frequency), mutual information, information gain, NGL coefficient, chi-square and so on. TFIDF
and mutual information are the conventional methods. It’s a old weighting method but making it
popular by recent algorithm Salton, G. & Buckley, C. [5]
The significant effect has been proved in the practical applications using TFIDF formula to
acquire single text keyword. Mutual information is commonly used in statistical language models
to evaluate the correlated degree between strings. Bigger mutual information between strings
indicates the stronger correlation in the viewpoint of statistics. But the small mutual information
does not always means that there is weaker correlation between strings and in computing the
string requires minimum number.
Data Mining
Statistics
Text
Mining
Information
Retrieval
Web
Mining
Computational
linguistics
Computer Science & Information Technology (CS & IT) 111
2.3. Single Text Keyword Vs Domain keywords
TF-IDF (Term Frequency-Inverse Document Frequency) weightingmodel [6]is a statistical model
that evaluates the degree of importance of a word in a single document, but it is not suitable for
extracting domain keywords.The keywords of single text are difficult to accurately reflect text
domain knowledge and user interests, which will cause the Web more difficult to provide high-
quality personalized services for readers.
Reader want extract multi-texts keywords to reflect the domain knowledge of texts, in order to
provide automatically text clustering, classification and summarization and topic detection.
Figure2.Relation between single text keyword and domain keyword
2.4 Conventional TF-IDF Formula
The conventional, TF-IDF weight scheme invented byBerger, A et al [7] which is as follows:
TF: ܶ‫ܨ‬ሺ‫,ݐ‬ ݀௜ሻ =
௡೟,೔
∑ ௡ೖ,೔
|೅|
ೖసభ
Where, TF (t, di) =term frequency of word t in document di
nt,i = number of accurances of term t in di
nk,i = number of accurances of all terms in di
IDF: ‫ܨܦܫ‬௧ = ݈‫݃݋‬
ெ
௠೟ା଴.଴ଵ
Where, M = total number of documents in the corpus
mt = total number of documents in the corpus where term t appears.
TF-IDF: ‫ݓ‬ሺ‫,ݐ‬ ݀௜ሻ = ܶ‫ܨ‬ሺ‫,ݐ‬ ݀௜ሻ × ‫ܨܦܫ‬௧
Where, w (t, di) = weight of term t in document di.
TF-IDF value is composed of two components: TF and IDF values. The rationale of TF value is
that more frequent words in a document are more important than less frequent words. TF value of
a particular word in a given document is the number of occurrences in the document. This count
is usually normalized to prevent a bias towards longer documents to give a measure of the
importance of the termti within the particular documentdj, like a TF equation given in above.The
second component of TF-IDF value, IDF, represents rarity across the whole collection. This value
is obtained by dividing the number of all documents by the number of documents containing the
Domain keyword
The keyword
of text j
The keyword
of text i
The keyword
of text k
112 Computer Science & Information Technology (CS & IT)
term, and then taking the logarithm of that quotient, like an IDF equation given above. For
example, a word, ‘today’ appears in many documents and this word is weighted as low IDF value.
Thus it is regarded as a meaningless word.
2.5 Problem of Conventional TF-IDF Weighting
TF-IDF is generally accepted as an effective way of feature extraction for a single document.
TF•IDF is based on the assumption that
The term has a high frequency in the document, i.e., a large TF value;
The term has a low document frequency in the whole document collection, i.e., a small
DF value.
It is embodied in the following two aspects:
If the TF of a term in a document is low but high in certain category (not all the
document), the term can also represent the feature of the text very well.
According to TF-IDF, the terms which have low document frequency in the whole
document collection can represent the feature of the text. But for a highly important term
of a certain category should have high document frequency in that category.
The two aspects neglect the frequency in the terms of a certain category. Because of the problems
we have to add a weight to the original TF•IDF. The added weight considers the frequency of the
term, which is in a particular category in the whole text collection, rather than simply consider the
frequency of the term which is in the other documents of the whole text collection.
2.6. Block Diagram of Domain Keyword Extraction Technique
Computer Science & Information Technology (CS & IT) 113
2.7. Process of Domain Keyword Extraction Technique
2.7.1. Problem Definition
In that research work, features are also referred as keywords, a set of significant words or terms in
a text which gives high-level description of its contents. The problem to be solved in thispaper is
to extract significant features for each news domain.
Let, D = {d1, d2 ,. , dM} be a set of news documents that belong to various news domain. Tj = {tj1,
tj2 , …,tjn} be a set of n terms extracted from a single document dj. Then T, a set of terms
extracted from a document set that belong to a particular domain is a union of T1, T2, … , TC. M
= Total number of documents in the whole corpus. C = Total number of documents belong to a
particular domain or category. mt = The number of documents that term t occurrences in the
documents set D called document frequency. ct = The number of documents that term t
occurrences in the same domain documents set is called document domain frequency.
To extract significant keywords from T, weighting the term of each document in a domain is
done. Then collect n% terms of each document. After that count term frequency of each term
from term collection and sort the terms by this weight and extract top-N terms with high weight.
Then a word list namely ‘keyword list ’is formed with these terms for that domain.
2.7.2 Document Preprocessing
• Remove Stop words
The irrelevantset of words is even appearing frequently. For example; a, the, of, with, for etc…
(I.e. articles, preposition, pronouns)
• Stemming
Identify groups of words where the words in a group are small syntactic variants of one another
and collect only the common word stem per group. For example, “development”, “developed”
and “developing” will be all treated as “develop”.
Stop-list, stemming reduces the amount of dimensions and enhances the relevancy between word
and document or categories. So, we remove stop words and word stemming.
2.7.3 TDDF Formula
The propose TDDF (TF-IDF based Document domain frequency) formula for the domain
keyword extraction as follows:
ܹሺ‫,ݐ‬ ‫,ܦ‬ ‫ܥ‬ሻ = ൫ܹௗ
௧
௜
ሺ‫,ݐ‬ ݀௜ሻ + 0.01൯ ×
ܿ௧
‫ܥ‬
Where, ܹௗ
௧
௜
ሺ‫,ݐ‬ ݀௜ሻ = ‫݂ݐ‬ሺ‫,ݐ‬ ݀௜ሻ × log
ெ
௠೟
,is the weight of word t in document di
‫݂ݐ‬ሺ‫,ݐ‬ ݀௜ሻ , is the normalized term frequency of word t in document di
log
ெ
௠೟
, is the inverse document frequency of t
௖೟
஼
, is word common possession rate
114 Computer Science & Information Technology (CS & IT)
ܹሺ‫,ݐ‬ ‫,ܦ‬ ‫ܥ‬ሻ , weight of term t in document di .with respect to the domain in
which document di belongs
C = total number of documents in a particular domain in which di belongs
ct = total number of documents containing term t in a particular domain in
which dibelongs
The original TF-IDFis TF multiplies by IDF, where TF and IDF are short for term frequency and
inverse document frequency respectively. The addition of a weight to the original TF-IDF which
is done to overcome the problem of TF•IDF, it is known as common word possession rate. This
common word possession rate considers the frequency of the term, which is in a particular
category in the whole text collection, rather than simply consider the frequency of the term which
is in the other documents of the whole text collection.
Common word possession ratereflecting the possibility of word t becomes the domain keyword.
Higher c means that the word t becomes keyword with greater possibility, vice versa.
The 0.01 in ൫ܹௗ
௧
௜
ሺ‫,ݐ‬ ݀௜ሻ + 0.01൯ is to prevent the word t frequency from becoming zero in di
which causing the denominator is also zero.
2.7.4 Table Term Frequency
In this step, we calculate a threshold value. Then we collect those terms, whose weight is above
the threshold value.
Here, the threshold is the most important n% terms from each document according to TDDF
value calculated in the previous step.
Then we count term frequencies from the term collected. We name this term frequency as “Table
Term Frequency” because the terms collected are stored in a temporary table.
2.7.5 Generate Keyword List
From the table where we stored terms with these TTF, collect top n% terms and ranking them
according to the highest table term frequency.
Those are the keyword list for that domain.
2.8. Experiments and Result
Step 1
The first step to extract keywords is downloading news documents from Internet portal site.
Internet portal site provides news pages by domain, such as flash, politics, business, society,
life/culture, world, IT/science, entertainment, column, English, magazine and special. Here sports
news is choosing for experiment. About 22 news documents for sports are taken. The news pages
of HTML is written in a fixed structure, Using this structure of HTML page and extract pure
news article and stored into text files.
Computer Science & Information Technology (CS & IT) 115
Step 2
After preprocessing each document,i.e. stop-word removing and stemming,then store the each
term and its occurrences against its document id into relational database.
Step 3
Then take a “dictionary” table in relational database to store all the terms exist in the document
corpus and its document domain frequency.
116 Computer Science & Information Technology (CS & IT)
Step 4
Then calculate the weight by using TDDF formula.
Step 5
Take top high-weighted term which are the above threshold value and store them in a table called
“ntopterms” table.
Computer Science & Information Technology (CS & IT) 117
Step 6
Then count the term frequency from the collected term and rank them according to the high term
frequency whose value are above the threshold value.
Those are the “KEYWORD LIST” for sports domain.
3. CONCLUSIONS
In that research workkeyword extracting technique that can extract domain keywords is
proposed.These keyword extracting techniques can be used to extracting main features from
specified document set and applied to document classification. The domain keyword of news
document is one of the basic elements of text classification, clustering, summarization; topic
detection etc.The experiment shows that the proposed TDDF formula can extract “multi-text”
domain keyword more effectively. The quality and quantity of domain keyword can be flexibly
controlled by “Word common possession rate”.
Further experimental work is needed to test the generality of this result, although news articles
can be considered as a representative of various types of documents.
ACKNOWLEDGEMENTS
I would like to express my warmest gratitude for the inspiration, encouragement and assistance
that I received from my esteemed guidesMr. Apurba Paul, throughout the research work. It is
because of his continuous guidance encouragement and valuable advices at every aspect and
strata of the problem from the embryonic to the developmental stage, that research has seen the
light of this day.
118 Computer Science & Information Technology (CS & IT)
REFERENCES
[1] Han.J, Kamer.M. .Data Mining Concepts and Techniques.BeiJing:Higher education press, 2001. 285-
295.
[2] R. Feldman and I. Dagan.Kdt - knowledge discovery in texts. In Proc.of the First Int. Conf. on
Knowledge Discovery (KDD), pages 112–117,1995.
[3] Robertson, S. E., “Term specificity [letter to the editor]”, Journal of Documentation, Vol. 28,
1972, pp. 164-165.
[4] Stephen Robertson, “Understanding inverse document frequency: on theoretical arguments for IDF”,
Journal of Documentation, Vol. 60, No.5, 2004, pp 503-520
[5] Salton, G. & Buckley, C. (1988). Term-weighingapproaches in automatic text retrieval.In
Information Processing & Management, 24(5): 513-523.
[6] Thorsten, J., 1996. Probabilistic Analysis of the Rocchio Algorithmwith TFIDF for Text
Categorization.Proceedingsof 14th International Conference on Machine Learning.McLean. Virginia,
USA, p.78-85.
[7] Berger, A et al (2000). Bridging the Lexical Chasm: Statistical Approaches to Answer Finding. In
Proc. Int. Conf. Research and Development in Information Retrieval, 192-199.
AUTHOR
RakhiChakrabortyis Currently working as an Assistant Professor in Global Institute of
Management & Technology,Krishnanagar, Kolkata, India. Prior toshe obtained her
Bachelor degreeand M.tech degree in Computer Science and Engineering in the West
Bengal university of Technology.Her areas of interest is in Data Mining and computer
architecture

More Related Content

What's hot

Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Primya Tamil
 
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...IJERA Editor
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)KU Leuven
 
Optimisation towards Latent Dirichlet Allocation: Its Topic Number and Collap...
Optimisation towards Latent Dirichlet Allocation: Its Topic Number and Collap...Optimisation towards Latent Dirichlet Allocation: Its Topic Number and Collap...
Optimisation towards Latent Dirichlet Allocation: Its Topic Number and Collap...IJECEIAES
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
The vector space model
The vector space modelThe vector space model
The vector space modelpkgosh
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningIOSR Journals
 
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...IRJET Journal
 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewINFOGAIN PUBLICATION
 
Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ...
Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ...Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ...
Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ...IJECEIAES
 
An Evaluation of Preprocessing Techniques for Text Classification
An Evaluation of Preprocessing Techniques for Text ClassificationAn Evaluation of Preprocessing Techniques for Text Classification
An Evaluation of Preprocessing Techniques for Text ClassificationIJCSIS Research Publications
 
Tdm recent trends
Tdm recent trendsTdm recent trends
Tdm recent trendsKU Leuven
 
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...csandit
 
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibEl Habib NFAOUI
 
Performance analysis on secured data method in natural language steganography
Performance analysis on secured data method in natural language steganographyPerformance analysis on secured data method in natural language steganography
Performance analysis on secured data method in natural language steganographyjournalBEEI
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)KU Leuven
 
Document Classification Using KNN with Fuzzy Bags of Word Representation
Document Classification Using KNN with Fuzzy Bags of Word RepresentationDocument Classification Using KNN with Fuzzy Bags of Word Representation
Document Classification Using KNN with Fuzzy Bags of Word Representationsuthi
 
Novelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articlesNovelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articlescsandit
 

What's hot (20)

Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Boolean,vector space retrieval Models
Boolean,vector space retrieval Models
 
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)
 
Optimisation towards Latent Dirichlet Allocation: Its Topic Number and Collap...
Optimisation towards Latent Dirichlet Allocation: Its Topic Number and Collap...Optimisation towards Latent Dirichlet Allocation: Its Topic Number and Collap...
Optimisation towards Latent Dirichlet Allocation: Its Topic Number and Collap...
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
The vector space model
The vector space modelThe vector space model
The vector space model
 
Hc3612711275
Hc3612711275Hc3612711275
Hc3612711275
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern Mining
 
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A Review
 
Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ...
Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ...Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ...
Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ...
 
An Evaluation of Preprocessing Techniques for Text Classification
An Evaluation of Preprocessing Techniques for Text ClassificationAn Evaluation of Preprocessing Techniques for Text Classification
An Evaluation of Preprocessing Techniques for Text Classification
 
Tdm recent trends
Tdm recent trendsTdm recent trends
Tdm recent trends
 
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...
 
Ir 02
Ir   02Ir   02
Ir 02
 
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
 
Performance analysis on secured data method in natural language steganography
Performance analysis on secured data method in natural language steganographyPerformance analysis on secured data method in natural language steganography
Performance analysis on secured data method in natural language steganography
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
 
Document Classification Using KNN with Fuzzy Bags of Word Representation
Document Classification Using KNN with Fuzzy Bags of Word RepresentationDocument Classification Using KNN with Fuzzy Bags of Word Representation
Document Classification Using KNN with Fuzzy Bags of Word Representation
 
Novelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articlesNovelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articles
 

Similar to DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENCY ANALYSIS

6.domain extraction from research papers
6.domain extraction from research papers6.domain extraction from research papers
6.domain extraction from research papersEditorJST
 
Indian Language Text Representation and Categorization Using Supervised Learn...
Indian Language Text Representation and Categorization Using Supervised Learn...Indian Language Text Representation and Categorization Using Supervised Learn...
Indian Language Text Representation and Categorization Using Supervised Learn...ijbuiiir1
 
A Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text miningA Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text miningIJSRD
 
Arabic text categorization algorithm using vector evaluation method
Arabic text categorization algorithm using vector evaluation methodArabic text categorization algorithm using vector evaluation method
Arabic text categorization algorithm using vector evaluation methodijcsit
 
Multi label classification of
Multi label classification ofMulti label classification of
Multi label classification ofijaia
 
A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationNinad Samel
 
A Novel Method for Keyword Retrieval using Weighted Standard Deviation: “D4 A...
A Novel Method for Keyword Retrieval using Weighted Standard Deviation: “D4 A...A Novel Method for Keyword Retrieval using Weighted Standard Deviation: “D4 A...
A Novel Method for Keyword Retrieval using Weighted Standard Deviation: “D4 A...idescitation
 
Feature selection, optimization and clustering strategies of text documents
Feature selection, optimization and clustering strategies of text documentsFeature selection, optimization and clustering strategies of text documents
Feature selection, optimization and clustering strategies of text documentsIJECEIAES
 
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...ijnlc
 
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESFINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESkevig
 
Using Class Frequency for Improving Centroid-based Text Classification
Using Class Frequency for Improving Centroid-based Text ClassificationUsing Class Frequency for Improving Centroid-based Text Classification
Using Class Frequency for Improving Centroid-based Text ClassificationIDES Editor
 
A novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching techniqueA novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching techniqueeSAT Journals
 
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIRA NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIRcscpconf
 
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...IRJET Journal
 
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...IJERA Editor
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGcscpconf
 
Paper id 25201435
Paper id 25201435Paper id 25201435
Paper id 25201435IJRAT
 
LEARNING CONTEXT FOR TEXT.pdf
LEARNING CONTEXT FOR TEXT.pdfLEARNING CONTEXT FOR TEXT.pdf
LEARNING CONTEXT FOR TEXT.pdfIJDKP
 

Similar to DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENCY ANALYSIS (20)

6.domain extraction from research papers
6.domain extraction from research papers6.domain extraction from research papers
6.domain extraction from research papers
 
Indian Language Text Representation and Categorization Using Supervised Learn...
Indian Language Text Representation and Categorization Using Supervised Learn...Indian Language Text Representation and Categorization Using Supervised Learn...
Indian Language Text Representation and Categorization Using Supervised Learn...
 
A Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text miningA Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text mining
 
Arabic text categorization algorithm using vector evaluation method
Arabic text categorization algorithm using vector evaluation methodArabic text categorization algorithm using vector evaluation method
Arabic text categorization algorithm using vector evaluation method
 
Multi label classification of
Multi label classification ofMulti label classification of
Multi label classification of
 
A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorization
 
A Novel Method for Keyword Retrieval using Weighted Standard Deviation: “D4 A...
A Novel Method for Keyword Retrieval using Weighted Standard Deviation: “D4 A...A Novel Method for Keyword Retrieval using Weighted Standard Deviation: “D4 A...
A Novel Method for Keyword Retrieval using Weighted Standard Deviation: “D4 A...
 
Text mining
Text miningText mining
Text mining
 
Feature selection, optimization and clustering strategies of text documents
Feature selection, optimization and clustering strategies of text documentsFeature selection, optimization and clustering strategies of text documents
Feature selection, optimization and clustering strategies of text documents
 
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
ON THE RELEVANCE OF QUERY EXPANSION USING PARALLEL CORPORA AND WORD EMBEDDING...
 
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESFINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 
Using Class Frequency for Improving Centroid-based Text Classification
Using Class Frequency for Improving Centroid-based Text ClassificationUsing Class Frequency for Improving Centroid-based Text Classification
Using Class Frequency for Improving Centroid-based Text Classification
 
A novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching techniqueA novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching technique
 
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIRA NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR
 
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
 
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
 
Paper id 25201435
Paper id 25201435Paper id 25201435
Paper id 25201435
 
LEARNING CONTEXT FOR TEXT.pdf
LEARNING CONTEXT FOR TEXT.pdfLEARNING CONTEXT FOR TEXT.pdf
LEARNING CONTEXT FOR TEXT.pdf
 

More from cscpconf

ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR
ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR
ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR cscpconf
 
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFCATION
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFCATION4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFCATION
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFCATIONcscpconf
 
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...cscpconf
 
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIES
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIESPROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIES
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIEScscpconf
 
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGIC
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGICA SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGIC
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGICcscpconf
 
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS cscpconf
 
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS cscpconf
 
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTIC
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTICTWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTIC
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTICcscpconf
 
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAIN
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAINDETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAIN
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAINcscpconf
 
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...cscpconf
 
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEM
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEMIMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEM
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEMcscpconf
 
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...cscpconf
 
AUTOMATED PENETRATION TESTING: AN OVERVIEW
AUTOMATED PENETRATION TESTING: AN OVERVIEWAUTOMATED PENETRATION TESTING: AN OVERVIEW
AUTOMATED PENETRATION TESTING: AN OVERVIEWcscpconf
 
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORK
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORKCLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORK
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORKcscpconf
 
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...cscpconf
 
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATA
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATAPROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATA
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATAcscpconf
 
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCH
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCHCHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCH
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCHcscpconf
 
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...cscpconf
 
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGE
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGESOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGE
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGEcscpconf
 
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXT
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXTGENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXT
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXTcscpconf
 

More from cscpconf (20)

ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR
ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR
ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR
 
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFCATION
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFCATION4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFCATION
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFCATION
 
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE...
 
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIES
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIESPROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIES
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIES
 
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGIC
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGICA SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGIC
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGIC
 
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS
 
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS
 
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTIC
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTICTWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTIC
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTIC
 
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAIN
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAINDETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAIN
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAIN
 
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...
 
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEM
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEMIMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEM
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEM
 
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...
 
AUTOMATED PENETRATION TESTING: AN OVERVIEW
AUTOMATED PENETRATION TESTING: AN OVERVIEWAUTOMATED PENETRATION TESTING: AN OVERVIEW
AUTOMATED PENETRATION TESTING: AN OVERVIEW
 
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORK
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORKCLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORK
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORK
 
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...
 
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATA
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATAPROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATA
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATA
 
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCH
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCHCHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCH
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCH
 
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...
 
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGE
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGESOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGE
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGE
 
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXT
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXTGENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXT
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXT
 

Recently uploaded

Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,Virag Sontakke
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...jaredbarbolino94
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxsocialsciencegdgrohi
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...M56BOOKSTORE PRODUCT/SERVICE
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
 

Recently uploaded (20)

Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 

DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENCY ANALYSIS

  • 1. Rupak Bhattacharyya et al. (Eds) : ACER 2013, pp. 109–118, 2013. © CS & IT-CSCP 2013 DOI : 10.5121/csit.2013.3211 DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENCY ANALYSIS Rakhi Chakraborty Department of Computer Science & Engineering, Global Institute Of Management and Technology, Nadia, India rakhi.chakraborty84@yahoo.in ABSTRACT On-line text documents rapidly increase in size with the growth of World Wide Web. To manage such a huge amount of texts,several text miningapplications came into existence. Those applications such as search engine, text categorization, summarization, and topic detection are based on feature extraction.It is extremely time consuming and difficult task to extract keyword or feature manually.So an automated process that extracts keywords or features needs to be established.This paper proposes a new domain keyword extraction technique that includes a new weighting method on the base of the conventional TF•IDF. Term frequency-Inverse document frequency is widely used to express the documentsfeature weight, which can’t reflect the division of terms in the document, and then can’t reflect the significance degree and the difference between categories. This paper proposes a new weighting method to which a new weight is added to express the differences between domains on the base of original TF•IDF.The extracted feature can represent the content of the text better and has a better distinguished ability. .KEYWORDS Text mining,Feature extraction,weighting method, Term Frequency Inverse Document Frequency (TF•IDF), Domain keyword extraction. 1. INTRODUCTION For rapidly development of computer network technology various aspects of electronic documents also grows rapidly and most of the enterprise information saved as a text form. Hence text mining has become an increasingly popular and also important field in the research of data mining. Text mining is different from the traditional data mining has been pointed out by Han. J, and Kamer. M[1]. The conventional data mining defines the relationship, things and structured data as the research target. While the text mining defines the text database as the research targetwhich consists of a large number of documents from the various data sources, including research papers, news articles, books, journals, reports, patent specifications, conference paper, e- mail messages, web pages and so on. Text mining is aninfantile interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics, Fig.1 shows that how the text mining is interconnected with other. Text mining would like to solve problems such as the uncertainty and ambiguity in the text information.Text mining or knowledge discovery from text (KDT) mentioned for the first time by Feldman et al. [2] which is deals with the machine supported analysis of text. Text mining uses techniques from information retrieval, information extraction as well as natural language processing (NLP) and connects them with the algorithms and methods of KDD, data mining, machine learning and statistics.
  • 2. 110 Computer Science & Information Technology (CS & IT) Figure1. Interdisciplinary domain 2. AUTOMATED DOMAIN KEYWORD EXTRACTION TECHNIQUE Keyword extraction is an extremely time consuming and difficult task, when it is done manually. Due to large volume of published news articles, it is almost impossible to extract keywords manually.Forestablish an automated process that extract keywords from news articles,anunsupervised keyword extraction technique namely Automated Domain keyword Extraction Technique is introduced in that research paper. Here a new weighting method is also introduced on the base of the conventional TF•IDF [3] [4]. 2.1. Keywords Keywords are a set of significant words in an article that gives high-level narrative of its contents to readers. To produce a short summary of news articles identifying the keyword from a large number of online news data is very useful. Keyword extraction technique is used to extract main features in studies such as text classification, text categorization, information retrieval,topic detection and document summarization. 2.2. Different Methods forextracting keyword The main methods of the keyword extraction are TF*IDF (Term Frequency Inverse Document Frequency), mutual information, information gain, NGL coefficient, chi-square and so on. TFIDF and mutual information are the conventional methods. It’s a old weighting method but making it popular by recent algorithm Salton, G. & Buckley, C. [5] The significant effect has been proved in the practical applications using TFIDF formula to acquire single text keyword. Mutual information is commonly used in statistical language models to evaluate the correlated degree between strings. Bigger mutual information between strings indicates the stronger correlation in the viewpoint of statistics. But the small mutual information does not always means that there is weaker correlation between strings and in computing the string requires minimum number. Data Mining Statistics Text Mining Information Retrieval Web Mining Computational linguistics
  • 3. Computer Science & Information Technology (CS & IT) 111 2.3. Single Text Keyword Vs Domain keywords TF-IDF (Term Frequency-Inverse Document Frequency) weightingmodel [6]is a statistical model that evaluates the degree of importance of a word in a single document, but it is not suitable for extracting domain keywords.The keywords of single text are difficult to accurately reflect text domain knowledge and user interests, which will cause the Web more difficult to provide high- quality personalized services for readers. Reader want extract multi-texts keywords to reflect the domain knowledge of texts, in order to provide automatically text clustering, classification and summarization and topic detection. Figure2.Relation between single text keyword and domain keyword 2.4 Conventional TF-IDF Formula The conventional, TF-IDF weight scheme invented byBerger, A et al [7] which is as follows: TF: ܶ‫ܨ‬ሺ‫,ݐ‬ ݀௜ሻ = ௡೟,೔ ∑ ௡ೖ,೔ |೅| ೖసభ Where, TF (t, di) =term frequency of word t in document di nt,i = number of accurances of term t in di nk,i = number of accurances of all terms in di IDF: ‫ܨܦܫ‬௧ = ݈‫݃݋‬ ெ ௠೟ା଴.଴ଵ Where, M = total number of documents in the corpus mt = total number of documents in the corpus where term t appears. TF-IDF: ‫ݓ‬ሺ‫,ݐ‬ ݀௜ሻ = ܶ‫ܨ‬ሺ‫,ݐ‬ ݀௜ሻ × ‫ܨܦܫ‬௧ Where, w (t, di) = weight of term t in document di. TF-IDF value is composed of two components: TF and IDF values. The rationale of TF value is that more frequent words in a document are more important than less frequent words. TF value of a particular word in a given document is the number of occurrences in the document. This count is usually normalized to prevent a bias towards longer documents to give a measure of the importance of the termti within the particular documentdj, like a TF equation given in above.The second component of TF-IDF value, IDF, represents rarity across the whole collection. This value is obtained by dividing the number of all documents by the number of documents containing the Domain keyword The keyword of text j The keyword of text i The keyword of text k
  • 4. 112 Computer Science & Information Technology (CS & IT) term, and then taking the logarithm of that quotient, like an IDF equation given above. For example, a word, ‘today’ appears in many documents and this word is weighted as low IDF value. Thus it is regarded as a meaningless word. 2.5 Problem of Conventional TF-IDF Weighting TF-IDF is generally accepted as an effective way of feature extraction for a single document. TF•IDF is based on the assumption that The term has a high frequency in the document, i.e., a large TF value; The term has a low document frequency in the whole document collection, i.e., a small DF value. It is embodied in the following two aspects: If the TF of a term in a document is low but high in certain category (not all the document), the term can also represent the feature of the text very well. According to TF-IDF, the terms which have low document frequency in the whole document collection can represent the feature of the text. But for a highly important term of a certain category should have high document frequency in that category. The two aspects neglect the frequency in the terms of a certain category. Because of the problems we have to add a weight to the original TF•IDF. The added weight considers the frequency of the term, which is in a particular category in the whole text collection, rather than simply consider the frequency of the term which is in the other documents of the whole text collection. 2.6. Block Diagram of Domain Keyword Extraction Technique
  • 5. Computer Science & Information Technology (CS & IT) 113 2.7. Process of Domain Keyword Extraction Technique 2.7.1. Problem Definition In that research work, features are also referred as keywords, a set of significant words or terms in a text which gives high-level description of its contents. The problem to be solved in thispaper is to extract significant features for each news domain. Let, D = {d1, d2 ,. , dM} be a set of news documents that belong to various news domain. Tj = {tj1, tj2 , …,tjn} be a set of n terms extracted from a single document dj. Then T, a set of terms extracted from a document set that belong to a particular domain is a union of T1, T2, … , TC. M = Total number of documents in the whole corpus. C = Total number of documents belong to a particular domain or category. mt = The number of documents that term t occurrences in the documents set D called document frequency. ct = The number of documents that term t occurrences in the same domain documents set is called document domain frequency. To extract significant keywords from T, weighting the term of each document in a domain is done. Then collect n% terms of each document. After that count term frequency of each term from term collection and sort the terms by this weight and extract top-N terms with high weight. Then a word list namely ‘keyword list ’is formed with these terms for that domain. 2.7.2 Document Preprocessing • Remove Stop words The irrelevantset of words is even appearing frequently. For example; a, the, of, with, for etc… (I.e. articles, preposition, pronouns) • Stemming Identify groups of words where the words in a group are small syntactic variants of one another and collect only the common word stem per group. For example, “development”, “developed” and “developing” will be all treated as “develop”. Stop-list, stemming reduces the amount of dimensions and enhances the relevancy between word and document or categories. So, we remove stop words and word stemming. 2.7.3 TDDF Formula The propose TDDF (TF-IDF based Document domain frequency) formula for the domain keyword extraction as follows: ܹሺ‫,ݐ‬ ‫,ܦ‬ ‫ܥ‬ሻ = ൫ܹௗ ௧ ௜ ሺ‫,ݐ‬ ݀௜ሻ + 0.01൯ × ܿ௧ ‫ܥ‬ Where, ܹௗ ௧ ௜ ሺ‫,ݐ‬ ݀௜ሻ = ‫݂ݐ‬ሺ‫,ݐ‬ ݀௜ሻ × log ெ ௠೟ ,is the weight of word t in document di ‫݂ݐ‬ሺ‫,ݐ‬ ݀௜ሻ , is the normalized term frequency of word t in document di log ெ ௠೟ , is the inverse document frequency of t ௖೟ ஼ , is word common possession rate
  • 6. 114 Computer Science & Information Technology (CS & IT) ܹሺ‫,ݐ‬ ‫,ܦ‬ ‫ܥ‬ሻ , weight of term t in document di .with respect to the domain in which document di belongs C = total number of documents in a particular domain in which di belongs ct = total number of documents containing term t in a particular domain in which dibelongs The original TF-IDFis TF multiplies by IDF, where TF and IDF are short for term frequency and inverse document frequency respectively. The addition of a weight to the original TF-IDF which is done to overcome the problem of TF•IDF, it is known as common word possession rate. This common word possession rate considers the frequency of the term, which is in a particular category in the whole text collection, rather than simply consider the frequency of the term which is in the other documents of the whole text collection. Common word possession ratereflecting the possibility of word t becomes the domain keyword. Higher c means that the word t becomes keyword with greater possibility, vice versa. The 0.01 in ൫ܹௗ ௧ ௜ ሺ‫,ݐ‬ ݀௜ሻ + 0.01൯ is to prevent the word t frequency from becoming zero in di which causing the denominator is also zero. 2.7.4 Table Term Frequency In this step, we calculate a threshold value. Then we collect those terms, whose weight is above the threshold value. Here, the threshold is the most important n% terms from each document according to TDDF value calculated in the previous step. Then we count term frequencies from the term collected. We name this term frequency as “Table Term Frequency” because the terms collected are stored in a temporary table. 2.7.5 Generate Keyword List From the table where we stored terms with these TTF, collect top n% terms and ranking them according to the highest table term frequency. Those are the keyword list for that domain. 2.8. Experiments and Result Step 1 The first step to extract keywords is downloading news documents from Internet portal site. Internet portal site provides news pages by domain, such as flash, politics, business, society, life/culture, world, IT/science, entertainment, column, English, magazine and special. Here sports news is choosing for experiment. About 22 news documents for sports are taken. The news pages of HTML is written in a fixed structure, Using this structure of HTML page and extract pure news article and stored into text files.
  • 7. Computer Science & Information Technology (CS & IT) 115 Step 2 After preprocessing each document,i.e. stop-word removing and stemming,then store the each term and its occurrences against its document id into relational database. Step 3 Then take a “dictionary” table in relational database to store all the terms exist in the document corpus and its document domain frequency.
  • 8. 116 Computer Science & Information Technology (CS & IT) Step 4 Then calculate the weight by using TDDF formula. Step 5 Take top high-weighted term which are the above threshold value and store them in a table called “ntopterms” table.
  • 9. Computer Science & Information Technology (CS & IT) 117 Step 6 Then count the term frequency from the collected term and rank them according to the high term frequency whose value are above the threshold value. Those are the “KEYWORD LIST” for sports domain. 3. CONCLUSIONS In that research workkeyword extracting technique that can extract domain keywords is proposed.These keyword extracting techniques can be used to extracting main features from specified document set and applied to document classification. The domain keyword of news document is one of the basic elements of text classification, clustering, summarization; topic detection etc.The experiment shows that the proposed TDDF formula can extract “multi-text” domain keyword more effectively. The quality and quantity of domain keyword can be flexibly controlled by “Word common possession rate”. Further experimental work is needed to test the generality of this result, although news articles can be considered as a representative of various types of documents. ACKNOWLEDGEMENTS I would like to express my warmest gratitude for the inspiration, encouragement and assistance that I received from my esteemed guidesMr. Apurba Paul, throughout the research work. It is because of his continuous guidance encouragement and valuable advices at every aspect and strata of the problem from the embryonic to the developmental stage, that research has seen the light of this day.
  • 10. 118 Computer Science & Information Technology (CS & IT) REFERENCES [1] Han.J, Kamer.M. .Data Mining Concepts and Techniques.BeiJing:Higher education press, 2001. 285- 295. [2] R. Feldman and I. Dagan.Kdt - knowledge discovery in texts. In Proc.of the First Int. Conf. on Knowledge Discovery (KDD), pages 112–117,1995. [3] Robertson, S. E., “Term specificity [letter to the editor]”, Journal of Documentation, Vol. 28, 1972, pp. 164-165. [4] Stephen Robertson, “Understanding inverse document frequency: on theoretical arguments for IDF”, Journal of Documentation, Vol. 60, No.5, 2004, pp 503-520 [5] Salton, G. & Buckley, C. (1988). Term-weighingapproaches in automatic text retrieval.In Information Processing & Management, 24(5): 513-523. [6] Thorsten, J., 1996. Probabilistic Analysis of the Rocchio Algorithmwith TFIDF for Text Categorization.Proceedingsof 14th International Conference on Machine Learning.McLean. Virginia, USA, p.78-85. [7] Berger, A et al (2000). Bridging the Lexical Chasm: Statistical Approaches to Answer Finding. In Proc. Int. Conf. Research and Development in Information Retrieval, 192-199. AUTHOR RakhiChakrabortyis Currently working as an Assistant Professor in Global Institute of Management & Technology,Krishnanagar, Kolkata, India. Prior toshe obtained her Bachelor degreeand M.tech degree in Computer Science and Engineering in the West Bengal university of Technology.Her areas of interest is in Data Mining and computer architecture