AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K... (csandit)
The goal of document clustering algorithms is to create clusters that are internally coherent but clearly different from each other. The useful expressions in documents are often accompanied by a large amount of noise caused by unnecessary words, so it is indispensable to eliminate this noise and keep only the useful information.
Keyphrase extraction systems for Arabic are a recent development; many text mining applications can use them to improve their results. Keyphrases are phrases that capture the main topics discussed in a document, offering a brief and precise summary of its content. Keyphrase extraction can therefore be a good way to remove the noise present in documents.
In this paper, we propose a new method to address this problem, particularly for Arabic documents, Arabic being one of the most complex languages, using a new keyphrase extraction algorithm based on the suffix tree data structure (KpST). To evaluate our approach, we conduct an experimental study on Arabic document clustering using the most popular hierarchical approach, the agglomerative hierarchical algorithm, with seven linkage techniques and a variety of distance functions and similarity measures. The obtained results show that our keyphrase extraction approach improves the clustering results.
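The agglomerative pipeline the abstract describes can be made concrete with a short sketch. The linkage variants and the toy distance matrix below are illustrative assumptions only; the paper's seven linkage techniques and distance measures are not reproduced here.

```python
# Minimal agglomerative hierarchical clustering with pluggable linkage.
# A sketch of the general technique, not the authors' implementation.

def agglomerative(dist, n_clusters, linkage="single"):
    """dist: symmetric distance matrix (list of lists);
    returns a list of clusters, each a set of row indices."""
    clusters = [{i} for i in range(len(dist))]

    def cluster_dist(a, b):
        pair = [dist[i][j] for i in a for j in b]
        if linkage == "single":    # distance of the nearest members
            return min(pair)
        if linkage == "complete":  # distance of the farthest members
            return max(pair)
        return sum(pair) / len(pair)  # average linkage

    while len(clusters) > n_clusters:
        # find the closest pair of clusters and merge them
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] |= clusters.pop(j)
    return clusters

# Toy distance matrix over four "documents": 0/1 are close, 2/3 are close.
D = [[0, 1, 9, 9],
     [1, 0, 9, 9],
     [9, 9, 0, 2],
     [9, 9, 2, 0]]
print(agglomerative(D, 2))  # two clusters: {0, 1} and {2, 3}
```

Swapping the `linkage` argument is all it takes to compare merge strategies over the same distance matrix, which is essentially the experimental grid the abstract describes.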
A new keyphrases extraction method based on suffix tree data structure for ar... (ijma)
Document clustering is a branch of a larger area of scientific study known as data mining; it is an unsupervised classification used to find structure in a collection of unlabeled data. The useful information in documents can be accompanied by a large amount of noise words when the full-text representation is used, which negatively affects the result of the clustering process. There is therefore a great need to eliminate the noise words and keep only the useful information in order to enhance the quality of the clustering results. This problem occurs, to different degrees, in any language, such as English, European languages, Hindi, Chinese, and Arabic. To overcome it, this paper proposes a new and efficient keyphrase extraction method based on the suffix tree data structure (KpST); the extracted keyphrases are then used in the clustering process instead of the full-text representation. The proposed keyphrase extraction method is language-independent and may therefore be applied to any language. In this investigation we deal with Arabic, which is one of the most complex languages. To evaluate the method, we conduct an experimental study on Arabic documents using the most popular hierarchical clustering approach, the agglomerative hierarchical algorithm, with seven linkage techniques and a variety of distance functions and similarity measures. The obtained results show that our keyphrase extraction method increases the quality of the clustering results. We also study the effect of stemming the test dataset before clustering it with the same clustering techniques and similarity/distance measures.
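The KpST algorithm itself is not specified in the abstract. As a rough illustration of the underlying idea (a suffix structure over a text exposes every repeated word sequence, which are natural keyphrase candidates), the sketch below counts repeated n-grams by brute force. It is a stand-in for the suffix tree approach, not the authors' method, and the sample sentence is invented.

```python
from collections import Counter

def repeated_phrases(text, max_len=3, min_count=2):
    """Count every word n-gram (n <= max_len) and keep the repeated ones.
    A suffix tree exposes the same repeated-substring statistics in
    linear time; this brute-force version only shows what it computes."""
    words = text.lower().split()
    counts = Counter(
        tuple(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    )
    return {" ".join(p): c for p, c in counts.items() if c >= min_count}

phrases = repeated_phrases(
    "suffix tree methods build a suffix tree over the text"
)
print(phrases)  # {'suffix': 2, 'tree': 2, 'suffix tree': 2}
```

The repeated multi-word sequences ("suffix tree" here) are exactly the kind of candidate keyphrases a document representation could keep in place of the full text.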
SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISON (IJCSEA Journal)
With the massive growth of modern information retrieval systems (IRS), retrieval in natural languages becomes more difficult, and search in Arabic, as a natural language, is not yet good enough. This paper builds a similarity thesaurus for Arabic using two mechanisms, a full-word mechanism and a stemmed mechanism, and then compares them.
The comparison made by this study shows that the similarity thesaurus with the stemmed mechanism obtains better results than the traditional full-word mechanism, and that the similarity thesaurus improves recall and precision over a traditional information retrieval system.
A comparative analysis of particle swarm optimization and k means algorithm f... (ijnlc)
The volume of digitized text documents on the web has been increasing rapidly. With such a huge collection of data on the web, there is a need to group (cluster) documents for speedy information retrieval. Document clustering is the grouping of documents such that the documents within each group are similar to each other and dissimilar to documents of other groups. The quality of a clustering result depends greatly on the representation of the text and on the clustering algorithm. This paper presents a comparative analysis of three algorithms, K-means, Particle Swarm Optimization (PSO), and a hybrid PSO+K-means algorithm, for clustering text documents using WordNet. The common way of representing a text document is a bag of terms; this representation is often unsatisfactory because it does not exploit semantics. In this paper, texts are instead represented in terms of the synsets corresponding to each word, so the bag-of-terms representation is enriched with synonyms from WordNet. K-means, PSO, and the hybrid PSO+K-means algorithm are applied to the clustering of text in the Nepali language. Experimental evaluation is performed using intra-cluster similarity and inter-cluster similarity.
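The two evaluation criteria just mentioned, intra-cluster and inter-cluster similarity, can be sketched directly. The cosine measure and the toy document vectors below are illustrative assumptions; a good clustering scores high on the first number and low on the second.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def intra_inter(clusters):
    """clusters: list of clusters, each a list of document vectors.
    Returns (avg within-cluster similarity, avg between-cluster similarity)."""
    intra = [cosine(a, b)
             for c in clusters for i, a in enumerate(c) for b in c[i + 1:]]
    inter = [cosine(a, b)
             for i, c in enumerate(clusters) for d in clusters[i + 1:]
             for a in c for b in d]
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Fabricated 2-d "document vectors": one cluster near the x-axis,
# one near the y-axis.
clusters = [[[1.0, 0.0], [0.9, 0.1]],
            [[0.0, 1.0], [0.1, 0.9]]]
intra, inter = intra_inter(clusters)
print(intra, inter)  # intra is close to 1, inter is close to 0
```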
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi... (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double-blind peer-reviewed international journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes high-quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high-quality technical notes are invited for publication.
Dictionary based concept mining: an application for Turkish (csandit)
In this study, a dictionary-based method is used to extract expressive concepts from documents. So far there have been many studies concerning concept mining in English, but this area of study is still immature for Turkish, an agglutinative language. We used a dictionary instead of WordNet, the lexical database grouping words into synsets that is widely used for concept extraction. Dictionaries are rarely used in the domain of concept mining, but given that dictionary entries carry synonyms, hypernyms, hyponyms, and other relationships in their definition texts, the success rate for determining concepts has been high. This concept extraction method is applied to documents collected from different corpora.
Efficient multi-document summary generation using neural network (INFOGAIN PUBLICATION)
Over the last few years, online information has been growing tremendously on the World Wide Web and on users' desktops, and it has therefore gained much more attention in the field of automatic text summarization. Text mining has become a significant research field, as it produces valuable data from large amounts of unstructured text. Summarization systems make it possible to pick out the important keywords of a text, so the consumer spends less time reading the whole document; the main objective of a summarization system is to generate a new form that expresses the key meaning of the contained text. This paper studies various existing techniques and the need for novel multi-document summarization schemes, and is motivated by the rising need to provide a high-quality summary in a very short period of time. In the proposed system, users can quickly and easily access well-formed summaries that express the key meaning of the contained text. The primary focus of this paper lies with the f_β-optimal merge function, a recently presented function that uses the weighted harmonic mean to find a balance between precision and recall. The proposed system uses bisecting K-means clustering to improve the running time and neural networks to improve the accuracy of the summary generated by the NEWSUM algorithm.
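The f_β score referred to above is the standard weighted harmonic mean of precision and recall; a minimal version makes the β trade-off explicit (how the paper's merge function applies it is not detailed in the abstract):

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.
    beta > 1 weights recall more heavily; beta < 1 favours precision;
    beta = 1 gives the familiar F1 score."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.5, 1.0))          # F1 = 0.666...
print(f_beta(0.5, 1.0, beta=2))  # recall-weighted: 0.833...
```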
International Journal of Computational Engineering Research (IJCER) (ijceronline)
International Journal of Computational Engineering Research (IJCER) is an international, English-language, monthly online journal. It publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
SEMI-SUPERVISED BOOTSTRAPPING APPROACH FOR NAMED ENTITY RECOGNITION (kevig)
The aim of Named Entity Recognition (NER) is to identify references to named entities in unstructured documents and to classify them into pre-defined semantic categories. NER often benefits from added background knowledge in the form of gazetteers. However, using such a collection does not deal with name variants and cannot resolve the ambiguities involved in identifying entities in context and associating them with predefined categories. We present a semi-supervised NER approach that starts by identifying named entities with a small set of training data. Using the identified named entities, word and context features are used to define a pattern; the pattern of each named entity category is then used as a seed pattern to identify the named entities in the test set. Pattern scoring and a tuple value score enable the generation of new patterns to identify the named entity categories. We evaluated the proposed system for English with tagged (IEER) and untagged (CoNLL 2003) named entity corpora, and for Tamil with documents from the FIRE corpus, and obtained an average F-measure of 75% for both languages.
Novelty detection via topic modeling in research articles (csandit)
In today's world, redundancy is a vital problem faced in almost all domains. Novelty detection is the identification of new or unknown data or signals that a machine learning system was not aware of during training. The problem becomes more intense when it comes to research articles: a method for identifying novelty in each section of an article is highly desirable for determining the novel idea proposed in a paper. Since research articles are semi-structured, detecting novel information in them requires more accurate systems. Topic models provide a useful means of processing them and a simple way to analyze them. This work compares the most predominantly used topic model, Latent Dirichlet Allocation, with the hierarchical Pachinko Allocation Model. The results obtained favour the hierarchical Pachinko Allocation Model when used for document retrieval.
Semi-Supervised Keyphrase Extraction on Scientific Article using Fact-based S... (TELKOMNIKA JOURNAL)
Most scientific publishers encourage authors to provide keyphrases for their published articles, which increases the need to automate keyphrase extraction. This is not a trivial task, since keyphrase characteristics may overlap with those of non-keyphrases, and to date the accuracy of automatic keyphrase extraction approaches is still considerably low. In response to this gap, this paper makes two contributions. First, a feature called fact-based sentiment is proposed; it is expected to strengthen keyphrase characteristics since, according to manual observation, most keyphrases are mentioned in neutral-to-positive sentiment. Second, a combination of supervised and unsupervised approaches is proposed to take the benefits of both: it enables automatic hidden-pattern detection while keeping candidate importances comparable to each other. According to the evaluation, fact-based sentiment is quite effective for representing keyphraseness, and the semi-supervised approach is considerably effective at extracting keyphrases from scientific articles.
Text mining aims to discover new, previously unknown or hidden data by automatically extracting information from various written resources. Applying knowledge discovery methods to unstructured text is known as Knowledge Discovery in Text, or text data mining, also simply called text mining. Most of the techniques used in text mining are based on the statistical study of a term, either a word or a phrase. Different text mining algorithms have been used in previous work. For example, the Single-Link algorithm and Self-Organizing Maps (SOM) provide an approach for visualizing high-dimensional data and are very useful tools for processing textual data based on projection methods, while genetic and sequential algorithms provide the capability for multiscale representation of datasets and are fast to compute, with less CPU time, based on Isolet-reduced subsets in unsupervised feature selection. We propose a vector space model and a concept-based analysis algorithm to improve text clustering quality, so that a better text clustering result may be achieved. We expect good behavior from the proposed algorithm in terms of toughness and constancy with respect to the formation of the neural network.
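The vector space model mentioned above can be sketched concisely: documents become TF-IDF weight vectors compared by cosine similarity. The three toy documents are invented for illustration; this is a minimal sketch of the standard model, not the proposed concept-based algorithm.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document to a sparse {term: tf * idf} vector."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    n = len(docs)
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in tokenized
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["arabic text clustering",
        "arabic text mining",
        "neural network summarization"]
vecs = tfidf_vectors(docs)
# documents 0 and 1 share terms; document 2 shares none
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

These pairwise similarities are exactly what a clustering algorithm consumes, which is why the quality of the representation drives the quality of the clusters.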
A Document Exploring System on LDA Topic Model for Wikipedia Articles (ijma)
A large amount of digital text is generated every day, and effectively searching, managing, and exploring this text data has become a main task. In this paper, we first present an introduction to text mining and the LDA topic model. We then explain in depth how to apply the LDA topic model to a text corpus through experiments on Simple Wikipedia documents. The experiments include all the necessary steps: data retrieval, pre-processing, fitting the model, and an application as a document exploring system. The results of the experiments show the LDA topic model working effectively for clustering documents and finding similar documents. Furthermore, the document exploring system could be a useful research tool for students and researchers.
Keyphrase Extraction using Neighborhood Knowledge (IJMTST Journal)
This paper focuses on keyphrase extraction for news articles, because the news article is one of the most popular document genres on the web and most news articles have no author-assigned keyphrases. Existing methods for single-document keyphrase extraction usually use only the information contained in the specified document. This paper proposes to use a small number of nearest-neighbor documents to provide additional knowledge and so improve single-document keyphrase extraction. Experimental results demonstrate the effectiveness and robustness of the proposed approach. Based on experiments conducted on several documents and keyphrases, we consider the top 10 keyphrases to be the most suitable.
The complexity of Arabic morphology is a core challenge in Arabic information retrieval; the highly inflected nature of the language makes Arabic a very complex environment for information retrieval specialists. This paper explains the importance of stemming in information retrieval by developing a method for searching for information needs in Arabic text. The newly developed method is based on a light stemmer to increase keyword matching between the query and the related documents in the test collection.
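A light stemmer of the kind described strips frequent clitics (articles, conjunctions, common suffixes) rather than reducing words to roots. The sketch below handles only a handful of affixes; the affix lists are illustrative assumptions, not the paper's, and real light stemmers (e.g. Light10) use longer, carefully ordered lists.

```python
# Toy Arabic light stemmer: strip one common prefix and one common
# suffix, keeping at least min_len characters of the stem.
# Affix lists are illustrative, not exhaustive.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ية", "ة"]

def light_stem(word, min_len=3):
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word

# "الكتاب" (the book) -> "كتاب" (book): the definite article is
# stripped, so query and document forms now match.
print(light_stem("الكتاب"))
```

Stemming both the query terms and the indexed terms this way is what increases the matching rate the abstract refers to.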
A Novel Approach for Keyword extraction in learning objects using text mining (IJSRD)
Keyword extraction and concept finding in learning objects are very important subjects in today's e-learning environment. Keywords are a subset of words that carry useful information about the content of a document, and keyword extraction is the process used to obtain the important keywords from documents. In the proposed system, a decision tree algorithm is used for the feature selection process with the WordNet dictionary. WordNet is a lexical database of English that is used to find the similarity between candidate words; the words with the highest similarity are taken as keywords.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
International Journal of Engineering Research and Development (IJERD Editor)
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
Answer extraction and passage retrieval for... (Waheeb Ahmed)
Question Answering systems (QASs) perform the task of retrieving, from a collection of documents, the text portions that contain the answer to a user's question. These QASs use a variety of linguistic tools that can only deal with small fragments of text. Therefore, to retrieve the answer-bearing documents from a large collection, QASs employ Information Retrieval (IR) techniques to reduce the document collection to a tractable amount of relevant text. In this paper, we propose a passage retrieval model that performs this task with better performance for the purposes of Arabic QASs. We first segment each of the top five ranked documents returned by the IR module into passages; we then compute the similarity score between the user's question terms and each passage, and the top five passages (those with the highest similarity scores) are retrieved. Finally, answer extraction techniques are applied to extract the final answer. Our method achieved an average precision of 87.25%, recall of 86.2%, and F1-measure of 87%.
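The passage-retrieval step just described (segment the top-ranked documents into passages, score each against the question terms, keep the top five) can be sketched as follows. The term-overlap scoring function and the sample documents are simplifying assumptions; the paper's similarity measure is not specified in the abstract.

```python
def segment(doc, size=2):
    """Split a document (a list of sentences) into passages of `size` sentences."""
    return [" ".join(doc[i:i + size]) for i in range(0, len(doc), size)]

def score(question, passage):
    """Toy similarity: fraction of question terms that appear in the passage."""
    q = set(question.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q)

def top_passages(question, docs, k=5):
    """Rank all passages from all documents and keep the k best."""
    passages = [p for doc in docs for p in segment(doc)]
    return sorted(passages, key=lambda p: score(question, p), reverse=True)[:k]

# Two fabricated "retrieved documents", each a list of sentences.
docs = [
    ["the nile flows through egypt", "it is the longest river in africa"],
    ["rainfall in the desert is rare", "camels store fat in their humps"],
]
print(top_passages("which river flows through egypt", docs, k=1))
```

The answer extraction stage would then operate only on the passages this function returns, which is what keeps the linguistic tools within their small-fragment limits.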
Arabic text categorization algorithm using vector evaluation method (ijcsit)
Text categorization is the process of grouping documents into categories based on their contents. This process makes information retrieval easier, and it has become more important due to the huge amount of textual information available online. The main problem in text categorization is how to improve classification accuracy. Although Arabic text categorization is a new and promising field, there has been little research in it. This paper proposes a new method for Arabic text categorization using vector evaluation. The proposed method uses a categorized corpus of Arabic documents; the weights of the tested document's words are calculated to determine the document's keywords, which are then compared with the keywords of the corpus categories to determine the tested document's best category.
TEXT MINING AND CLASSIFICATION OF PRODUCT REVIEWS USING STRUCTURED SU... (csandit)
Text mining and text classification are two prominent and challenging tasks in the field of machine learning. Text mining refers to the process of deriving high-quality, relevant information from text, while text classification deals with the categorization of text documents into different classes. The real challenge in these areas is to address problems like handling large text corpora, similarity of words in text documents, and the association of text documents with a subset of class categories. The feature extraction and classification of such text documents require an efficient machine learning algorithm that performs automatic text classification. This paper describes the classification of product review documents as a multi-label classification scenario and addresses the problem using a Structured Support Vector Machine. The work also explains the flexibility and performance of the proposed approach for efficient text classification.
In recent years the growth of digital data has been increasing dramatically, and knowledge discovery and data mining have attracted immense attention given the emerging need to turn such data into useful information and knowledge. Keyword extraction is an essential task in natural language processing (NLP) that facilitates mapping documents to a concise set of representative single- and multi-word phrases. This paper investigates the use of Word2Vec and a decision tree for keyword extraction from textual documents, using the SemEval 2010 dataset as the main input. After pre-processing operations are applied to the dataset, the words are represented as vectors with the Word2Vec technique. The method is based on the word similarity between candidate keywords, computed both from the collected keywords of each label and from one sample of the same label. An appropriate threshold is determined, and the candidates whose similarity exceeds this threshold are passed to the decision tree in order to assign an appropriate classification to the text document.
Several similarity measures were used in the classification process, and the efficiency and accuracy of the algorithm were measured using precision, recall, and F-score. The obtained results indicate that using a vector representation for each keyword is an effective way to identify the most similar words, increasing the chance of recognizing the correct classification of a document. With Word2Vec CBOW, the F-score was 64% with the Gini method and the WordNet lemmatizer; with Word2Vec SG, the F-score was 82% with the Gini index and English Porter stemming, the highest ratio across all our experiments.
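The similarity-threshold step described above can be illustrated without a trained model. The 3-dimensional vectors below are fabricated stand-ins for real Word2Vec embeddings (which have hundreds of dimensions), and the downstream decision-tree step is deliberately omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Fabricated "embeddings"; stand-ins for Word2Vec output.
embeddings = {
    "cluster":  [0.9, 0.1, 0.0],
    "grouping": [0.8, 0.2, 0.1],
    "banana":   [0.0, 0.1, 0.9],
}

def similar_keywords(candidate, keywords, threshold=0.8):
    """Keep the stored keywords whose embedding similarity to the
    candidate exceeds the threshold; in the paper, such scores are
    then fed to a decision tree for the final classification."""
    v = embeddings[candidate]
    return [k for k in keywords if cosine(v, embeddings[k]) >= threshold]

print(similar_keywords("cluster", ["grouping", "banana"]))  # ['grouping']
```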
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...iosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Dictionary based concept mining an application for turkishcsandit
In this study, a dictionary-based method is used to extract expressive concepts from documents.
So far, there have been many studies concerning concept mining in English, but this area of
study for Turkish, an agglutinative language, is still immature. We used dictionary instead of
WordNet, a lexical database grouping words into synsets that is widely used for concept
extraction. The dictionaries are rarely used in the domain of concept mining, but taking into
account that dictionary entries have synonyms, hypernyms, hyponyms and other relationships in
their meaning texts, the success rate has been high for determining concepts. This concept
extraction method is implemented on documents, that are collected from different corpora.
8 efficient multi-document summary generation using neural networkINFOGAIN PUBLICATION
From last few years online information is growing tremendously on World Wide Web or on user’s desktops and thus online information gains much more attention in the field of automatic text summarization. Text mining has become a significant research field as it produces valuable data from unstructured and large amount of texts. Summarization systems provide the possibility of searching the important keywords of the texts and so the consumer will expend less time on reading the whole document. Main objective of summarization system is to generate a new form which expresses the key meaning of the contained text. This paper study on various existing techniques with needs of novel Multi-Document summarization schemes. This paper is motivated by arising need to provide high quality summary in very short period of time. In proposed system, user can quickly and easily access correctly-developed summaries which expresses the key meaning of the contained text. The primary focus of this paper lies with thef_β-optimal merge function, a function recently presented here, that uses the weighted harmonic mean to discover a harmony in the middle of precision and recall. Proposed system utilizes Bisect K-means clustering to improve the time and Neural Networks to improve the accuracy of summary generated by NEWSUM algorithm.
International Journal of Computational Engineering Research(IJCER) ijceronline
International Journal of Computational Engineering Research(IJCER) is an intentional online Journal in English monthly publishing journal. This Journal publish original research work that contributes significantly to further the scientific knowledge in engineering and Technology.
SEMI-SUPERVISED BOOTSTRAPPING APPROACH FOR NAMED ENTITY RECOGNITION (kevig)
The aim of Named Entity Recognition (NER) is to identify references to named entities in unstructured documents and to classify them into pre-defined semantic categories. NER often benefits from added background knowledge in the form of gazetteers. However, using such a collection does not deal with name variants and cannot resolve the ambiguities involved in identifying entities in context and associating them with predefined categories. We present a semi-supervised NER approach that starts by identifying named entities with a small set of training data. Using the identified named entities, word and context features are used to define a pattern. The pattern of each named entity category is used as a seed pattern to identify the named entities in the test set. Pattern scoring and tuple value scoring enable the generation of new patterns to identify the named entity categories. We have evaluated the proposed system for English with tagged (IEER) and untagged (CoNLL 2003) named entity corpora, and for Tamil with documents from the FIRE corpus, and it yields an average f-measure of 75% for both languages.
Novelty detection via topic modeling in research articlescsandit
In today's world, redundancy is the most vital problem faced in almost all domains. Novelty
detection is the identification of new or unknown data or signals that a machine learning system
is not aware of during training. The problem becomes more intense when it comes to research
articles. A method of identifying novelty in each section of an article is highly required for
determining the novel idea proposed in a research paper. Since research articles are semi-structured,
detecting novelty of information from them requires more accurate systems. Topic
models provide a useful means to process them and a simple way to analyze them. This
work compares the most predominantly used topic model, Latent Dirichlet Allocation, with the
hierarchical Pachinko Allocation Model. The results obtained favour the
hierarchical Pachinko Allocation Model when used for document retrieval.
Semi-Supervised Keyphrase Extraction on Scientific Article using Fact-based S... (TELKOMNIKA JOURNAL)
Most scientific publishers encourage authors to provide keyphrases for their published articles.
Hence, the need to automate keyphrase extraction has increased. However, it is not a trivial task,
considering that keyphrase characteristics may overlap with non-keyphrases'. To date, the accuracy of
automatic keyphrase extraction approaches is still considerably low. In response to this gap, this paper
proposes two contributions. First, a feature called fact-based sentiment is proposed. It is expected to
strengthen keyphrase characteristics since, according to manual observation, most keyphrases are
mentioned in neutral-to-positive sentiment. Second, a combination of supervised and unsupervised
approaches is proposed to take the benefits of both. It enables automatic hidden pattern
detection while keeping candidate importance comparable across candidates. According to the evaluation, fact-based
sentiment is quite effective for representing keyphraseness, and the semi-supervised approach is
considerably effective for extracting keyphrases from scientific articles.
Text mining endeavours to discover new, previously unknown or hidden information by automatically extracting
it from various written resources. Applying knowledge discovery methods to
unstructured text is known as Knowledge Discovery in Text or Text Data Mining, and is also called Text Mining.
Most of the techniques used in Text Mining are founded on the statistical study of a term, either a word or a
phrase. Different algorithms have been used in previous Text Mining methods. For example,
the Single-Link algorithm and Self-Organizing Maps (SOM) introduce an approach for visualizing
high-dimensional data and are a very useful tool for processing textual data based on the projection method.
Genetic and sequential algorithms provide the capability for multiscale representation of datasets and
are fast to compute with less CPU time, based on the Isolet-reduced subsets in unsupervised feature
selection. We propose the Vector Space Model and a concept-based analysis algorithm that will
improve text clustering quality, so that a better text clustering result may be achieved. The proposed
algorithm behaves well in terms of toughness and constancy with respect to the formation of a
Neural Network.
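As a rough illustration of the Vector Space Model this abstract builds on, TF-IDF weighting plus cosine similarity can be sketched in a few lines; this is the generic formulation, not the paper's exact concept-based algorithm:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Vector space model: one sparse TF-IDF vector (dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency of each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: (c / len(toks)) * math.log(n / df[t])
                     for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Clustering then amounts to grouping documents whose pairwise cosine similarity is high.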
A Document Exploring System on LDA Topic Model for Wikipedia Articles (ijma)
A large amount of digital text information is generated every day. Effectively searching, managing and
exploring this text data has become a main task. In this paper, we first present an introduction to text
mining and the LDA topic model. Then we explain in depth how to apply the LDA topic model to a text corpus by
conducting experiments on Simple Wikipedia documents. The experiments include all the necessary steps of data
retrieval, pre-processing, fitting the model, and an application as a document exploring system. The results of
the experiments show the LDA topic model working effectively for clustering documents and finding
similar documents. Furthermore, the document exploring system could be a useful research tool for
students and researchers.
Keyphrase Extraction using Neighborhood Knowledge (IJMTST Journal)
This paper focuses on keyphrase extraction for news articles, because the news article is one of the most popular document genres on the web and most news articles have no author-assigned keyphrases. Existing methods for single-document keyphrase extraction usually make use of only the information contained in the specified document. This paper proposes to use a small number of nearest-neighbor documents to provide more knowledge to improve single-document keyphrase extraction. Experimental results demonstrate the good effectiveness and robustness of the proposed approach. According to experiments conducted on several documents and keyphrases, we consider the top 10 keyphrases to be the most suitable.
The complex morphology of Arabic is a core challenge of
Arabic information retrieval. This limitation makes
the Arabic language a very complex environment for
information retrieval specialists, due to the highly
inflected nature of Arabic. This paper explains the
importance of stemming in information retrieval
through a method developed to search for information needs in Arabic text. The newly developed
method is based on a light stemmer to increase keyword matching
between the query and the related
documents in the test collection.
A Novel Approach for Keyword Extraction in Learning Objects using Text Mining (IJSRD)
Keyword extraction and concept finding in learning objects are very important subjects in today's e-learning environment. Keywords are a subset of words that contain useful information about the content of the document. Keyword extraction is a process used to get the important keywords from documents. In the proposed system, a Decision Tree algorithm is used for the feature selection process using the WordNet dictionary. WordNet is a lexical database of English which is used to find similarity among the candidate words. The words having the highest similarity are taken as keywords.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
International Journal of Engineering Research and Development (IJERD Editor)
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
Answer extraction and passage retrieval for … (Waheeb Ahmed)
Question Answering systems (QASs) perform the task of
retrieving text portions, from a collection of documents, that
contain the answer to the user's questions. These QASs use a
variety of linguistic tools that are able to deal with small
fragments of text. Therefore, to retrieve the documents which
contain the answer from a large document collection, QASs
employ Information Retrieval (IR) techniques to reduce the
document collection to a treatable amount of
relevant text. In this paper, we propose a passage
retrieval model that does this task with better performance for
the purpose of Arabic QASs. We first segment each of the top five
ranked documents returned by the IR module into passages.
Then, we compute the similarity score between the user's
question terms and each passage. The top five passages (with
the highest similarity scores) are retrieved. Finally,
Answer Extraction techniques are applied to extract the final
answer. Our method achieved an average precision of
87.25%, recall of 86.2% and F1-measure of 87%.
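The passage retrieval step described above (segment the top-ranked documents into passages, score each against the question terms, keep the top five) can be sketched as follows; the term-overlap score and the sentence-based segmentation are simplified stand-ins for the paper's similarity measure:

```python
def retrieve_passages(question, documents, top_k=5, passage_len=3):
    """Split each document into passages of `passage_len` sentences and
    rank them by overlap with the question's terms."""
    q_terms = set(question.lower().split())
    passages = []
    for doc in documents:
        sents = [s.strip() for s in doc.split('.') if s.strip()]
        for i in range(0, len(sents), passage_len):
            passages.append('. '.join(sents[i:i + passage_len]))

    def score(passage):
        # Fraction of question terms that appear in the passage.
        p_terms = set(passage.lower().split())
        return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0

    return sorted(passages, key=score, reverse=True)[:top_k]
```

An answer-extraction module would then operate only on the returned top-k passages.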
Arabic text categorization algorithm using vector evaluation method (ijcsit)
Text categorization is the process of grouping documents into categories based on their contents. This
process is important for making information retrieval easier, and it has become more important due to the huge
amount of textual information available online. The main problem in text categorization is how to improve the
classification accuracy. Although Arabic text categorization is a new and promising field, there is little
research in it. This paper proposes a new method for Arabic text categorization using vector
evaluation. The proposed method uses a categorized corpus of Arabic documents; the weights of the
tested document's words are then calculated to determine the document keywords, which are compared with
the keywords of the corpus categories to determine the tested document's best category.
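A simplified sketch of the final comparison step (document keywords against each category's keyword list); the top-10 cutoff and plain overlap count are illustrative assumptions, not the paper's exact vector evaluation:

```python
from collections import Counter

def best_category(doc_terms, category_keywords):
    """Pick the category whose keyword list overlaps most with the
    document's most frequent terms.

    doc_terms: list of (already weighted/stemmed) terms from the document.
    category_keywords: dict mapping category name -> list of keywords.
    """
    # Take the document's top terms as its keywords (hypothetical cutoff).
    top_terms = {t for t, _ in Counter(doc_terms).most_common(10)}
    scores = {cat: len(top_terms & set(kws))
              for cat, kws in category_keywords.items()}
    return max(scores, key=scores.get)
```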
TEXT MINING AND CLASSIFICATION OF PRODUCT REVIEWS USING STRUCTURED SU... (csandit)
Text mining and Text classification are the two prominent and challenging tasks in the field of
Machine learning. Text mining refers to the process of deriving high quality and relevant
information from text, while Text classification deals with the categorization of text documents
into different classes. The real challenge in these areas is to address the problems like handling
large text corpora, similarity of words in text documents, and association of text documents with
a subset of class categories. The feature extraction and classification of such text documents
require an efficient machine learning algorithm which performs automatic text classification.
This paper describes the classification of product review documents as a multi-label
classification scenario and addresses the problem using Structured Support Vector Machine.
The work also explains the flexibility and performance of the proposed approach for
efficient text classification.
In recent years the growth of digital data has been increasing dramatically; knowledge discovery and data mining have attracted immense attention with the coming need for turning such data into useful information and knowledge. Keyword extraction is considered an essential task in natural language processing (NLP) that facilitates mapping documents to a concise set of representative single- and multi-word phrases. This paper investigates the use of Word2Vec and a Decision Tree for keyword extraction from textual documents. The Sem-Eval (2010) dataset is used as the main input for the proposed study. After applying pre-processing operations on the dataset, the words are represented by vectors with the Word2Vec technique. The method is based on word similarity between candidate keywords, drawing both on the collected keywords for each label and on one sample from the same label. An appropriate threshold has been determined, and the percentages that exceed this threshold are exported to the Decision Tree so that an appropriate classification can be made for the text document.
Some similarity measurements were used for the classification process. The efficiency and accuracy of the algorithm were measured in the classification process using precision, recall and F-score rates. The obtained results indicated that using a vector representation for each keyword is an effective way to identify the most similar words, so that the opportunity to recognize the correct classification of the document increases. When using Word2Vec CBOW, the F-score was 64% with the Gini method and the WordNet Lemmatizer. Meanwhile, when using Word2Vec SG, the F-score was 82% with the Gini index and English Porter stemming, which is the highest ratio across all our experiments.
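The similarity-threshold filter described above might look like the following sketch; it assumes the word vectors are already available (e.g. from a trained Word2Vec model) and uses a hypothetical threshold of 0.7:

```python
import math

def similar_above_threshold(candidate_vecs, label_vecs, threshold=0.7):
    """Keep candidate keywords whose cosine similarity to any label
    keyword vector exceeds the threshold.

    candidate_vecs: dict mapping word -> embedding (list of floats).
    label_vecs: list of embeddings for the label's known keywords.
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    return [word for word, vec in candidate_vecs.items()
            if any(cos(vec, lv) >= threshold for lv in label_vecs)]
```

The surviving candidates would then be passed on as features to the Decision Tree classifier.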
PATENT DOCUMENT SUMMARIZATION USING CONCEPTUAL GRAPHS (kevig)
In this paper, a methodology to mine concepts from documents and use these concepts to generate an
objective summary of the claims section of patent documents is proposed. The Conceptual Graph (CG)
formalism as proposed by Sowa (Sowa 1984) is used in this work for representing the concepts and their
relationships. Automatic identification of concepts and conceptual relations from text documents is a
challenging task. In this work the focus is on the analysis of patent documents, mainly on the claims
section of the documents. There are several complexities in the writing style of these documents, as
they are technical as well as legal. It is observed that the general in-depth parsers available in the open
domain fail to parse the claims-section sentences in patent documents. The failure of in-depth parsers
has motivated us to develop a methodology to extract CGs using other resources. Thus, in the present work,
shallow parsing, NER and machine learning techniques are used for extracting concepts and conceptual
relationships from sentences in the claims section of patent documents. This paper therefore discusses i)
generation of the CG, a semantic network, and ii) generation of an abstractive summary of the claims section of
the patent. The aim is to generate a summary which is 30% of the whole claims section. Here we use
Restricted Boltzmann Machines (RBMs), a deep learning technique, for automatically extracting CGs. We
have tested our methodology using a corpus of 5000 patent documents from the electronics domain. The results
obtained are encouraging and are comparable with state-of-the-art systems.
Clustering the results of a search helps the user to get an overview of the information returned. In this paper, we
look upon the clustering task as cataloguing the search results. By catalogue we mean a structured label
list that can help the user to understand the labels and search results. Cluster labelling is crucial, because
meaningless or confusing labels may mislead users into checking the wrong clusters for the query and losing extra
time. Additionally, labels should reflect the contents of the documents within the cluster accurately. To be able
to label clusters effectively, a new cluster labelling method is introduced. More emphasis was given to
producing comprehensible and accurate cluster labels in addition to the discovery of document clusters. We
also present a new metric that is employed to assess the success of cluster labelling. We adopt a comparative
evaluation strategy to derive the relative performance of the proposed method with respect to two
prominent search result clustering methods: Suffix Tree Clustering and Lingo.
We perform the experiments using the publicly available datasets Ambient and ODP-239.
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION cscpconf
The internet has caused a humongous growth in the amount of data available to the common
man. Summaries of documents can help find the right information and are particularly effective
when the document base is very large. Keywords are closely associated to a document as they
reflect the document's content and act as indexes for the given document. In this work, we
present a method to produce extractive summaries of documents in the Kannada language. The
algorithm extracts key words from pre-categorized Kannada documents collected from online
resources. We combine GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse
Document Frequency) methods along with TF (Term Frequency) for extracting key words and
later use these for summarization. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
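The GSS coefficient and the combined TF x IDF x GSS score used for key word extraction can be sketched from document counts as follows; the multiplicative combination shown here is one plausible reading of the abstract, and the exact formula in the paper may differ:

```python
def gss_coefficient(n_tc, n_t_nc, n_nt_c, n_nt_nc):
    """GSS(t, c) = P(t, c) * P(~t, ~c) - P(t, ~c) * P(~t, c),
    estimated from document counts:
      n_tc    - docs in category c containing term t
      n_t_nc  - docs outside c containing t
      n_nt_c  - docs in c not containing t
      n_nt_nc - docs outside c not containing t
    """
    n = n_tc + n_t_nc + n_nt_c + n_nt_nc
    return (n_tc / n) * (n_nt_nc / n) - (n_t_nc / n) * (n_nt_c / n)

def keyword_score(tf, idf, gss):
    """Combined score: term frequency weighted by IDF and the
    category-sensitive GSS coefficient (assumed multiplicative)."""
    return tf * idf * gss
```

Terms with the highest combined scores become the keywords that drive sentence selection for the summary.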
Taxonomy extraction from automotive natural language requirements using unsup... (ijnlc)
In this paper we present a novel approach to semi-automatically learn concept hierarchies from natural
language requirements of the automotive industry. The approach is based on the distributional hypothesis
and the special characteristics of domain-specific German compounds. We extract taxonomies by using
clustering techniques in combination with general thesauri. Such a taxonomy can be used to support
requirements engineering in early stages by providing a common system understanding and an agreed-upon
terminology. This work is part of an ontology-driven requirements engineering process, which builds
on top of the taxonomy. Evaluation shows that this taxonomy extraction approach outperforms common
hierarchical clustering techniques.
Automatically finding domain-specific key terms in a given set of research papers is a challenging task, and assigning research papers to a particular area of research is a concern for many people, including students, professors and researchers. A domain classification of papers facilitates that search process: having a list of domains in a research field, we try to find out to which domain(s) a given paper is most related. Besides, reading and processing a whole paper takes a long time, and using domain knowledge requires much human effort, e.g., manually labeling a large corpus. In this paper, we use the abstract and keywords of a research paper as the seed terms to identify similar terms from a domain corpus, which are then filtered by checking their appearance in the research papers. Experiments show that the TF-IDF measure and the classification step make this method assign papers to domains more precisely. The results show that our approach can extract the terms effectively, while being domain independent.
Feature selection, optimization and clustering strategies of text documents (IJECEIAES)
Clustering is one of the most researched areas of data mining applications in the contemporary literature. The need for efficient clustering is observed across wide sectors including consumer segmentation, categorization, shared filtering, document management, and indexing. Research on the clustering task has to be performed prior to its adaptation to the text environment. Conventional approaches typically emphasized quantitative information, where the selected features are numbers. Efforts have also been put forward for achieving efficient clustering in the context of categorical information, where the selected features can assume nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. Further, this paper also details prominent models proposed for clustering, along with the pros and cons of each model. In addition, it also focuses on various recent developments in the clustering task in social networks and associated environments.
In the current age of information technology, textual documents are increasing spontaneously over the internet: e-mail, web pages, offline and online reports, journals and articles, all stored in electronic database formats. Millions of new text files are created every day, but without proper classification people miss vast amounts of information that would be useful for several challenges in daily life. Maintaining and accessing those documents is very difficult without adequate ordering, and classification without any prior information is called clustering. K-means and other older clustering algorithms are unfit to perform as might be expected on natural language, because of the high dimensionality of texts and the presence of logical structure clues within them; novel segmentation techniques have taken advantage of advances in generative topic modeling algorithms, specifically designed to spot questions within text and compute word-topic distributions. Considering those challenges, the current thesis proposes a semantic document clustering framework, developed on the Python platform and tested at each step. The preprocessing steps include tag elimination, stop-word removal according to the Oxford dictionary, and lemmatization with the help of the semantic information and synsets available in WordNet for each word of the raw text. Considering the limitations of the K-means algorithm and other older algorithms, the COBB conceptual clustering algorithm is applied to the preprocessed data. Cluster quality and accuracy are among the most significant contributions of this research. To ensure the accuracy of the clusters, the F-measure method was selected to evaluate the clusters and report their accuracy; F-measure returns the accuracy of the clusters and also ensures the purity of the clustering process.
The framework was tested on 20 samples of 20 different articles, with the minimum accuracy taken as the accuracy of the clusters, and the developed system returned 71.42% accuracy. Several challenges, such as synonyms, high dimensionality, extracting core semantics from texts, and assigning appropriate descriptions to the generated clusters, need further experimentation. This research works to find an accurate way to cluster text documents based on semantic meaning with the help of the WordNet database.
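The preprocessing pipeline described above (tokenize, drop stop words, then lemmatize) can be sketched with a lookup table standing in for the WordNet-based lemmatizer used in the thesis:

```python
def preprocess(text, stop_words, lemma_map):
    """Simplified preprocessing: tokenize, strip punctuation, drop stop
    words, then lemmatize via a lookup table.

    lemma_map is a stand-in for a WordNet lemmatizer: it maps inflected
    forms to their lemmas and leaves unknown words unchanged.
    """
    tokens = [t.strip('.,;:!?').lower() for t in text.split()]
    return [lemma_map.get(t, t)
            for t in tokens
            if t and t not in stop_words]
```

In the real framework the `lemma_map` lookup would be replaced by WordNet synset queries, but the shape of the pipeline is the same.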
Text Mining is the technique that helps users find useful information in a large amount of text documents on the web or in a database. Most popular text mining and classification methods have adopted term-based approaches, alongside pattern-based methods for describing user preferences. This review paper analyses how text mining works at three levels, i.e. the sentence level, document level and feature level. We review the related work that has been done previously and also demonstrate the problems that arise when text mining is done at the feature level. This paper presents a technique for text mining on compound sentences.
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu... (IJORCS)
The growth of the World Wide Web has imposed great challenges for researchers in improving search efficiency over the internet. Nowadays, web document clustering has become an important research topic for providing the most relevant documents among the huge volumes of results returned in response to a simple query. In this paper, we first propose a novel approach to precisely define clusters based on maximal frequent item sets (MFIs) obtained by the Apriori algorithm, and afterwards utilize the same MFI-based similarity measure for hierarchical document clustering. By considering maximal frequent item sets, the dimensionality of the document set is decreased. Secondly, privacy preservation for open web documents is provided by avoiding duplicate documents; thereby we can protect the privacy of individual copyrights of documents. This can be achieved using an equivalence relation.
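A minimal sketch of an MFI-based similarity: two documents are compared through the maximal frequent item sets they contain rather than through raw terms. The Apriori mining step is assumed to have already produced `mfis`, and the Jaccard combination here is an illustrative choice, not necessarily the paper's exact measure:

```python
def mfi_similarity(doc_items_a, doc_items_b, mfis):
    """Similarity of two documents via the maximal frequent item sets
    (MFIs) each one contains.

    doc_items_a / doc_items_b: term lists for the two documents.
    mfis: list of term lists, the maximal frequent item sets mined
          from the whole collection (e.g. by Apriori).
    """
    a = {frozenset(m) for m in mfis if set(m) <= set(doc_items_a)}
    b = {frozenset(m) for m in mfis if set(m) <= set(doc_items_b)}
    union = a | b
    # Jaccard overlap of the MFI sets supported by each document.
    return len(a & b) / len(union) if union else 0.0
```

Because documents are represented by the (few) MFIs they support instead of all their terms, the effective dimensionality drops sharply.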
Similar to AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW KEYPHRASES EXTRACTION ALGORITHM
ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR cscpconf
The progressive development of Synthetic Aperture Radar (SAR) systems has diversified the exploitation of the images generated by these systems in different geoscience applications. Detection and monitoring of surface deformations, produced by various phenomena, have benefited from this evolution and have been realized by interferometry (InSAR) and differential interferometry (DInSAR) techniques. Nevertheless, spatial and temporal decorrelation of the interferometric couples used strongly limits the precision of the analysis results of these techniques. In this context, we propose in this work a methodological approach to surface deformation detection and analysis by differential interferograms, in order to show the limits of this technique according to noise quality and level. The detectability model is generated from the deformation signatures, by simulating a linear fault merged into the image couples of the ERS1/ERS2 sensors acquired over a region of the Algerian south.
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFICATION (cscpconf)
A novel trajectory-guided, concatenative approach for synthesizing high-quality, real-sample rendered video is proposed. The automated lip-reading seeks to model the real image sample sequence, preserved in the library under the video data, that is closest to the HMM-predicted trajectory. The object trajectory is obtained by projecting the face patterns into a KDA feature space. The approach identifies a speaker's face by synthesizing the identity surface of a subject's face from a small sample of patterns which sparsely cover the view sphere. A KDA algorithm is used for lip-reading image discrimination; the fundamental lip feature vector is then reduced to a low dimension using the 2D-DCT. The dimensionality of the mouth area set is reduced using PCA to obtain the Eigen-lips approach proposed by [33]. The subjective performance results of the cost function under the automatic lip-reading model did not illustrate the superior performance of the method.
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE... (cscpconf)
Universities offer a software engineering capstone course to simulate a real-world working environment in which students can work in a team for a fixed period to deliver a quality product. The objective of the paper is to report on our experience in moving from a Waterfall process to an Agile process in conducting the software engineering capstone project. We present the capstone course designs for both the Waterfall-driven and Agile-driven methodologies, highlighting the structure, deliverables and assessment plans. To evaluate the improvement, we conducted a survey in two different sections taught by two different instructors, assessing students' experience in moving from the traditional Waterfall model to an Agile-like process. Twenty-eight students filled in the survey, which consisted of eight multiple-choice questions and an open-ended question to collect feedback from students. The survey results show that students were able to attain hands-on experience simulating a real-world working environment. The results also show that the Agile approach helped students to produce an overall better design and avoid mistakes they had made in the initial design completed in the first phase of the capstone project. In addition, they were able to decide on their team capabilities and training needs, and thus learn the required technologies earlier, which is reflected in the final product quality.
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIES (cscpconf)
Using social media in education provides learners with an informal way of communication. Informal communication tends to remove barriers and hence promotes student engagement. This paper presents our experience in using three different social media technologies in teaching a software project management course. We conducted surveys at the end of every semester to evaluate students' satisfaction and engagement. Results show that using social media enhances students' engagement and satisfaction. However, familiarity with the tool is an important factor in student satisfaction.
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGIC (cscpconf)
Using a computer to answer questions has been a human dream since the beginning of the digital era. Question-answering systems are referred to as intelligent systems that can provide responses to the questions asked by a user, based on certain facts or rules stored in a knowledge base; they can generate answers to questions asked in natural language. One of the first main ideas of fuzzy logic was to work on the problem of computer understanding of natural language. This survey paper therefore provides an overview of what Question Answering is, its system architecture, and its possible relationship and
differences with fuzzy logic, as well as the previous related research with respect to the approaches that were followed. At the end, the survey provides an analytical discussion of the proposed QA models, alone or combined with fuzzy logic, and their main contributions and limitations.
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS cscpconf
Human beings generate different speech waveforms while speaking the same word at different times. Also, different human beings have different accents and generate significantly varying speech waveforms for the same word. There is a need to measure the distances between various words, which facilitates the preparation of pronunciation dictionaries. A new algorithm called Dynamic Phone Warping (DPW) is presented in this paper. It uses dynamic programming techniques for global alignment and shortest-distance measurement. The DPW algorithm can be used to enhance the pronunciation dictionaries of well-known languages like English, or to build pronunciation dictionaries for lesser-known, sparse languages. Precision measurement experiments show 88.9% accuracy.
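Global alignment by dynamic programming, as DPW uses, reduces to an edit-distance recurrence over phone sequences. The unit substitution and insertion/deletion costs below are an assumption; the paper may weight specific phone confusions differently:

```python
def dpw_distance(phones_a, phones_b, sub_cost=1, indel_cost=1):
    """Global-alignment distance between two phone sequences via the
    standard dynamic-programming edit-distance recurrence."""
    m, n = len(phones_a), len(phones_b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    # Aligning against an empty sequence costs one indel per phone.
    for i in range(1, m + 1):
        d[i][0] = i * indel_cost
    for j in range(1, n + 1):
        d[0][j] = j * indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if phones_a[i - 1] == phones_b[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + indel_cost,      # deletion
                          d[i][j - 1] + indel_cost,      # insertion
                          d[i - 1][j - 1] + cost)        # match/substitute
    return d[m][n]
```

Two pronunciations of the same word then map to a small distance, while unrelated words map to larger ones, which is what makes the measure usable for building pronunciation dictionaries.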
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS cscpconf
In education, the use of electronic (E) examination systems is not a novel idea, as Eexamination systems have been used to conduct objective assessments for the last few years. This research deals with randomly designed E-examinations and proposes an E-assessment system that can be used for subjective questions. This system assesses answers to subjective questions by finding a matching ratio for the keywords in instructor and student answers. The matching ratio is achieved based on semantic and document similarity. The assessment system is composed of four modules: preprocessing, keyword expansion, matching, and grading. A survey and case study were used in the research design to validate the proposed system. The examination assessment system will help instructors to save time, costs, and resources, while increasing efficiency and improving the productivity of exam setting and assessments.
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTICcscpconf
African Buffalo Optimization (ABO) is one of the most recent swarms intelligence based metaheuristics. ABO algorithm is inspired by the buffalo’s behavior and lifestyle. Unfortunately, the standard ABO algorithm is proposed only for continuous optimization problems. In this paper, the authors propose two discrete binary ABO algorithms to deal with binary optimization problems. In the first version (called SBABO) they use the sigmoid function and probability model to generate binary solutions. In the second version (called LBABO) they use some logical operator to operate the binary solutions. Computational results on two knapsack problems (KP and MKP) instances show the effectiveness of the proposed algorithm and their ability to achieve good and promising solutions.
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAINcscpconf
In recent years, many malware writers have relied on Dynamic Domain Name Services (DDNS) to maintain their Command and Control (C&C) network infrastructure to ensure a persistence presence on a compromised host. Amongst the various DDNS techniques, Domain Generation Algorithm (DGA) is often perceived as the most difficult to detect using traditional methods. This paper presents an approach for detecting DGA using frequency analysis of the character distribution and the weighted scores of the domain names. The approach’s feasibility is demonstrated using a range of legitimate domains and a number of malicious algorithmicallygenerated domain names. Findings from this study show that domain names made up of English characters “a-z” achieving a weighted score of < 45 are often associated with DGA. When a weighted score of < 45 is applied to the Alexa one million list of domain names, only 15% of the domain names were treated as non-human generated.
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...cscpconf
The amount of piracy in the streaming digital content in general and the music industry in specific is posing a real challenge to digital content owners. This paper presents a DRM solution to monetizing, tracking and controlling online streaming content cross platforms for IP enabled devices. The paper benefits from the current advances in Blockchain and cryptocurrencies. Specifically, the paper presents a Global Music Asset Assurance (GoMAA) digital currency and presents the iMediaStreams Blockchain to enable the secure dissemination and tracking of the streamed content. The proposed solution provides the data owner the ability to control the flow of information even after it has been released by creating a secure, selfinstalled, cross platform reader located on the digital content file header. The proposed system provides the content owners’ options to manage their digital information (audio, video, speech, etc.), including the tracking of the most consumed segments, once it is release. The system benefits from token distribution between the content owner (Music Bands), the content distributer (Online Radio Stations) and the content consumer(Fans) on the system blockchain.
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEMcscpconf
This paper discusses the importance of verb suffix mapping in Discourse translation system. In
discourse translation, the crucial step is Anaphora resolution and generation. In Anaphora
resolution, cohesion links like pronouns are identified between portions of text. These binders
make the text cohesive by referring to nouns appearing in the previous sentences or nouns
appearing in sentences after them. In Machine Translation systems, to convert the source
language sentences into meaningful target language sentences the verb suffixes should be
changed as per the cohesion links identified. This step of translation process is emphasized in
the present paper. Specifically, the discussion is on how the verbs change according to the
subjects and anaphors. To explain the concept, English is used as the source language (SL) and
an Indian language Telugu is used as Target language (TL)
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...cscpconf
In this paper, based on the definition of conformable fractional derivative, the functional
variable method (FVM) is proposed to seek the exact traveling wave solutions of two higherdimensional
space-time fractional KdV-type equations in mathematical physics, namely the
(3+1)-dimensional space–time fractional Zakharov-Kuznetsov (ZK) equation and the (2+1)-
dimensional space–time fractional Generalized Zakharov-Kuznetsov-Benjamin-Bona-Mahony
(GZK-BBM) equation. Some new solutions are procured and depicted. These solutions, which
contain kink-shaped, singular kink, bell-shaped soliton, singular soliton and periodic wave
solutions, have many potential applications in mathematical physics and engineering. The
simplicity and reliability of the proposed method is verified.
AUTOMATED PENETRATION TESTING: AN OVERVIEWcscpconf
The using of information technology resources is rapidly increasing in organizations,
businesses, and even governments, that led to arise various attacks, and vulnerabilities in the
field. All resources make it a must to do frequently a penetration test (PT) for the environment
and see what can the attacker gain and what is the current environment's vulnerabilities. This
paper reviews some of the automated penetration testing techniques and presents its
enhancement over the traditional manual approaches. To the best of our knowledge, it is the
first research that takes into consideration the concept of penetration testing and the standards
in the area.This research tackles the comparison between the manual and automated
penetration testing, the main tools used in penetration testing. Additionally, compares between
some methodologies used to build an automated penetration testing platform.
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORKcscpconf
Since the mid of 1990s, functional connectivity study using fMRI (fcMRI) has drawn increasing
attention of neuroscientists and computer scientists, since it opens a new window to explore
functional network of human brain with relatively high resolution. BOLD technique provides
almost accurate state of brain. Past researches prove that neuro diseases damage the brain
network interaction, protein- protein interaction and gene-gene interaction. A number of
neurological research paper also analyse the relationship among damaged part. By
computational method especially machine learning technique we can show such classifications.
In this paper we used OASIS fMRI dataset affected with Alzheimer’s disease and normal
patient’s dataset. After proper processing the fMRI data we use the processed data to form
classifier models using SVM (Support Vector Machine), KNN (K- nearest neighbour) & Naïve
Bayes. We also compare the accuracy of our proposed method with existing methods. In future,
we will other combinations of methods for better accuracy.
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...cscpconf
In order to treat and analyze real datasets, fuzzy association rules have been proposed. Several
algorithms have been introduced to extract these rules. However, these algorithms suffer from
the problems of utility, redundancy and large number of extracted fuzzy association rules. The
expert will then be confronted with this huge amount of fuzzy association rules. The task of
validation becomes fastidious. In order to solve these problems, we propose a new validation
method. Our method is based on three steps. (i) We extract a generic base of non redundant
fuzzy association rules by applying EFAR-PN algorithm based on fuzzy formal concept analysis.
(ii) we categorize extracted rules into groups and (iii) we evaluate the relevance of these rules
using structural equation model.
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATAcscpconf
In many applications of data mining, class imbalance is noticed when examples in one class are
overrepresented. Traditional classifiers result in poor accuracy of the minority class due to the
class imbalance. Further, the presence of within class imbalance where classes are composed of
multiple sub-concepts with different number of examples also affect the performance of
classifier. In this paper, we propose an oversampling technique that handles between class and
within class imbalance simultaneously and also takes into consideration the generalization
ability in data space. The proposed method is based on two steps- performing Model Based
Clustering with respect to classes to identify the sub-concepts; and then computing the
separating hyperplane based on equal posterior probability between the classes. The proposed
method is tested on 10 publicly available data sets and the result shows that the proposed
method is statistically superior to other existing oversampling methods.
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCHcscpconf
Data collection is an essential, but manpower intensive procedure in ecological research. An
algorithm was developed by the author which incorporated two important computer vision
techniques to automate data cataloging for butterfly measurements. Optical Character
Recognition is used for character recognition and Contour Detection is used for imageprocessing.
Proper pre-processing is first done on the images to improve accuracy. Although
there are limitations to Tesseract’s detection of certain fonts, overall, it can successfully identify
words of basic fonts. Contour detection is an advanced technique that can be utilized to
measure an image. Shapes and mathematical calculations are crucial in determining the precise
location of the points on which to draw the body and forewing lines of the butterfly. Overall,
92% accuracy were achieved by the program for the set of butterflies measured.
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...cscpconf
Smart cities utilize Internet of Things (IoT) devices and sensors to enhance the quality of the city
services including energy, transportation, health, and much more. They generate massive
volumes of structured and unstructured data on a daily basis. Also, social networks, such as
Twitter, Facebook, and Google+, are becoming a new source of real-time information in smart
cities. Social network users are acting as social sensors. These datasets so large and complex
are difficult to manage with conventional data management tools and methods. To become
valuable, this massive amount of data, known as 'big data,' needs to be processed and
comprehended to hold the promise of supporting a broad range of urban and smart cities
functions, including among others transportation, water, and energy consumption, pollution
surveillance, and smart city governance. In this work, we investigate how social media analytics
help to analyze smart city data collected from various social media sources, such as Twitter and
Facebook, to detect various events taking place in a smart city and identify the importance of
events and concerns of citizens regarding some events. A case scenario analyses the opinions of
users concerning the traffic in three largest cities in the UAE
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGEcscpconf
The anonymity of social networks makes it attractive for hate speech to mask their criminal
activities online posing a challenge to the world and in particular Ethiopia. With this everincreasing
volume of social media data, hate speech identification becomes a challenge in
aggravating conflict between citizens of nations. The high rate of production, has become
difficult to collect, store and analyze such big data using traditional detection methods. This
paper proposed the application of apache spark in hate speech detection to reduce the
challenges. Authors developed an apache spark based model to classify Amharic Facebook
posts and comments into hate and not hate. Authors employed Random forest and Naïve Bayes
for learning and Word2Vec and TF-IDF for feature selection. Tested by 10-fold crossvalidation,
the model based on word2vec embedding performed best with 79.83%accuracy. The
proposed method achieve a promising result with unique feature of spark for big data.
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXTcscpconf
This article presents Part of Speech tagging for Nepali text using General Regression Neural
Network (GRNN). The corpus is divided into two parts viz. training and testing. The network is
trained and validated on both training and testing data. It is observed that 96.13% words are
correctly being tagged on training set whereas 74.38% words are tagged correctly on testing
data set using GRNN. The result is compared with the traditional Viterbi algorithm based on
Hidden Markov Model. Viterbi algorithm yields 97.2% and 40% classification accuracies on
training and testing data sets respectively. GRNN based POS Tagger is more consistent than the
traditional Viterbi decoding technique.
2. Computer Science & Information Technology (CS & IT) 244
Traditional document clustering algorithms use the full text of documents to generate feature vectors. Such methods often produce unsatisfactory results because documents contain a great deal of noisy information. There is always a need to summarize information into a compact form that can be easily absorbed. The challenge is to extract the essence of a collection of text documents and present it in a compact form that identifies its topic(s).
In this paper, we propose to use Keyphrases extraction to tackle these issues when clustering documents. Keyphrases extraction from free-text documents is becoming increasingly important as the uses for such technology expand. Keyphrases extraction plays a vital role in the tasks of indexing, summarization, clustering [1] and categorization, and more recently in improving search results and in ontology learning. Keyphrases extraction is the process of identifying the set of words or phrases that best describe a document. In this work, we present our novel Keyphrases extraction algorithm based on the Suffix Tree data structure (KpST), which extracts the important Keyphrases from Arabic documents.
The outputs of the proposed approach are the inputs of the most popular family of hierarchical algorithms used in our experimental study on Arabic documents clustering: Agglomerative Hierarchical algorithms with seven linkage techniques, using a wide variety of distance functions and similarity measures, such as the Euclidean Distance, Cosine Similarity, Jaccard Coefficient and the Pearson Correlation Coefficient [2][3], in order to test their effectiveness on Arabic documents clustering.
The remainder of this paper is organized as follows. The next section discusses our novel approach to extracting Keyphrases from Arabic documents using a suffix tree algorithm. Section 3 presents the similarity measures and their semantics. Section 4 explains the experimental settings, dataset, evaluation approach, results and analysis. Section 5 concludes and discusses our future work.
2. ARABIC KEYPHRASE EXTRACTION USING SUFFIX TREE ALGORITHM
2.1. Keyphrase Extraction
Keyphrases are widely used in large document collections. They describe the content of single documents and provide a kind of semantic metadata that is useful for a variety of purposes. Many Keyphrases extractors view the problem as a classification problem and therefore need training documents (i.e., documents whose Keyphrases are known in advance). Other systems view Keyphrases extraction as a ranking problem. In the latter approach, the words or phrases of a document are ranked based on their importance, and phrases with high importance (usually located at the beginning of the list) are recommended as possible Keyphrases for the document.
From the observation of human-assigned Keyphrases, we conclude that good Keyphrases of a
document should satisfy the following properties:
- Understandable. The Keyphrases are understandable to people. This indicates that the extracted Keyphrases should be grammatical.
- Relevant. The Keyphrases are semantically relevant to the document theme.
- Good coverage. The Keyphrases should cover the whole document well.
Keyphrases extraction algorithms fall into two categories: Keyphrases extraction from single documents, which is often posed as a supervised learning task [19], and Keyphrases extraction from a set of documents, which is an unsupervised learning task that tries to discover the topics rather than learn from examples. As an example of an unsupervised Keyphrases extraction approach, graph-based ranking [20] regards Keyphrases extraction as a ranking task, where a document is represented by a term graph based on term relatedness, and a graph-based ranking algorithm is then used to assign an importance score to each term. Existing methods usually use term co-occurrences within a specified window size in the given document as an approximation of term relatedness [20].
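A minimal sketch of such a graph-based ranker, assuming a PageRank-style update over an undirected co-occurrence graph; the function name, window size and damping factor are illustrative assumptions, not taken from [20]:

```python
from collections import defaultdict

def textrank_terms(tokens, window=4, damping=0.85, iters=30):
    """PageRank-style ranking over an undirected term co-occurrence graph:
    two terms are linked if they co-occur within `window` tokens."""
    neighbors = defaultdict(set)
    for i, term in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != term:
                neighbors[term].add(other)
                neighbors[other].add(term)
    score = {t: 1.0 for t in neighbors}
    for _ in range(iters):
        score = {t: (1 - damping) + damping *
                 sum(score[u] / len(neighbors[u]) for u in neighbors[t])
                 for t in score}
    # Highest-scoring terms first; the top terms are keyphrase candidates.
    return sorted(score.items(), key=lambda kv: -kv[1])
```

Terms that co-occur with many distinct terms accumulate higher scores, approximating the "importance" ranking described above.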
In this paper, a novel unsupervised Keyphrases extraction approach based on generalized Suffix
Tree construction for Arabic documents is presented and described.
2.2. Suffix Tree
A Suffix Tree is a data structure that allows efficient string matching and querying. It has been studied and used extensively, and has been applied to fundamental string problems such as finding the longest repeated substring, string comparison, and text compression [24]. The Suffix Tree commonly deals with strings as sequences of characters, or with documents as sequences of words. A Suffix Tree of a string is simply a compact tree of all the suffixes of that string. The Suffix Tree was first used by Zamir et al. [25] in a clustering algorithm named Suffix Tree Clustering (STC), a linear-time clustering algorithm based on identifying the shared phrases that are common to some document snippets in order to group them into one cluster. A phrase in our context is an ordered sequence of one or more words. STC has two logical steps: document "cleaning", and identifying the Suffix Tree Document Model.
2.2.1. Document’s “Cleaning”
In this step, each snippet is processed for Arabic stop-word removal (e.g., فانهوانوالذي وھذافكانستكون…). Stop words have high frequency and low discriminative power and should be filtered out in an IR system. They are functional, general and common words of the language that usually do not contribute to the semantics of the documents and add no real value. Many information retrieval systems (IRS) attempt to exclude stop words from the list of features in order to reduce the feature space and increase their performance. In addition, since we deal specifically with Arabic snippets, we propose in this step to also remove Latin words and special characters (e.g. $, #,...).
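The cleaning step can be sketched as follows; the stop-word set here is a tiny illustrative sample, not the full list used in the paper:

```python
import re

# Illustrative, non-exhaustive Arabic stop-word sample (assumption); a real
# system would load a complete list.
ARABIC_STOPWORDS = {"في", "من", "على", "ان", "كان", "هذا", "الذي", "ايضا"}

def clean_snippet(snippet):
    """'Cleaning' step: keep only Arabic words that are not stop words.
    Latin words and special characters (e.g. $, #) are dropped."""
    arabic_word = re.compile(r"^[\u0600-\u06FF]+$")  # Arabic Unicode block
    return [token for token in snippet.split()
            if arabic_word.match(token) and token not in ARABIC_STOPWORDS]
```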
2.2.2. Suffix Tree Document Model
The Suffix Tree treats documents as a set of phrases (sentences), not just as a set of words. A sentence has a specific, semantic meaning (the words in the sentence are ordered). The Suffix Tree Document Model (STDM) considers a document d = w1w2…wm as a string consisting of words wi (i = 1, 2, …, m), not characters. A revised definition of the suffix tree can be presented as follows: a Generalized Suffix Tree for a set S of n strings, each of length $m_k$, is a rooted directed tree with exactly $\sum_k m_k$ leaves, each marked by a two-number index (k, l) where k ranges from 1 to n and l ranges from 1 to $m_k$. Each internal node, other than the root, has at least two children, and each edge is labeled with a nonempty substring of words of a string in S. No two edges out of a node can have edge labels beginning with the same word. For any leaf (i, j), the concatenation of the edge labels on the path from the root to leaf (i, j) exactly spells out the suffix of $S_i$ that starts at position j; that is, it spells out $S_i[j \ldots m_i]$ [12]. Figure 1 shows an example of the Suffix Tree generated for a set of three Arabic strings, or three documents (Document 1: "القط يأكل الجبن", Document 2: "الفار ياكل الجبن ايضا", Document 3: "القط ياكل الفار ايضا"); Figure 2 shows the same example in the English language (Document 1: "cat ate cheese", Document 2: "mouse ate cheese too" and Document 3: "cat ate mouse too").
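The word-level generalized suffix tree over these three example documents can be sketched with a naive quadratic-time trie (a real implementation would use a linear-time construction such as Ukkonen's algorithm; all names here are illustrative):

```python
def build_word_suffix_trie(docs):
    """Naive word-level generalized suffix trie: O(n^2) insertion of every
    word suffix of every document. Each node maps a word to a pair
    (child_node, doc_ids), where doc_ids is the set of documents whose
    suffixes pass through that node."""
    root = {}
    for doc_id, doc in enumerate(docs):
        words = doc.split()
        for start in range(len(words)):
            node = root
            for w in words[start:]:
                child, ids = node.setdefault(w, ({}, set()))
                ids.add(doc_id)
                node = child
    return root

def shared_phrases(node, prefix=(), min_docs=2):
    """Yield (phrase, doc_ids) for every phrase occurring in at least
    min_docs documents, as in Suffix Tree Clustering's shared phrases."""
    for word, (child, ids) in node.items():
        if len(ids) >= min_docs:
            phrase = prefix + (word,)
            yield " ".join(phrase), ids
            yield from shared_phrases(child, phrase, min_docs)
```

On the English example, shared phrases such as "cat ate" (documents 1 and 3) and "ate cheese" (documents 1 and 2) fall out directly from the trie.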
Figure 3. Description of Our Novel Approach for Arabic Keyphrases Extraction
A Suffix Tree of a document d is a compact tree containing all suffixes of d. The tree consists of nodes, leaves and labels. The label of a node is defined as the concatenation, in order, of the substrings labeling the edges of the path from the root to that node. Each node has a score, and nodes can be ranked according to their scores.
The score depends on:
a. The length of the node's label, in words.
b. The number of occurrences of each word in the document (term frequency).
Each node of the suffix tree is scored as:
where |B| is the number of documents in B, |P| is the number of words making up the phrase P, wi represents the words in P, and TFIDF is the Term Frequency-Inverse Document Frequency of each word wi in P.
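A hypothetical instantiation of such a score, combining the quantities named above (the phrase length |P|, the TFIDF of each word wi, and the corpus B of |B| documents); the exact combination is an assumption, not the paper's formula:

```python
import math

def tfidf(word, doc_words, corpus):
    """Plain TF-IDF of a word in one document relative to the corpus B."""
    tf = doc_words.count(word) / len(doc_words)
    df = sum(1 for d in corpus if word in d)  # documents containing the word
    return tf * math.log(len(corpus) / df)

def node_score(phrase_words, doc_words, corpus):
    """Hypothetical node score: phrase length |P| normalized by the corpus
    size |B|, times the summed TFIDF of the words wi in the phrase P.
    The exact combination is an assumption, not the paper's formula."""
    return (len(phrase_words) / len(corpus)) * sum(
        tfidf(w, doc_words, corpus) for w in phrase_words)
```

Under any such combination, longer phrases made of discriminative (high-TFIDF) words outrank short phrases of corpus-wide common words.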
Figure 3 summarizes the different steps of our novel technique for extracting Keyphrases from each Arabic document.
3. SIMILARITY MEASURES
In this section, we present the five similarity measures that were tested in [2] and in our works [3][6]; we include these five measures in our work to perform the Arabic text document clustering.
3.1. Euclidean Distance
Euclidean distance is widely used in clustering problems, including text clustering. It satisfies the four conditions of a true metric. It is also the default distance measure used with the K-means algorithm.
Measuring the distance between text documents: given two documents $d_a$ and $d_b$ represented by their term vectors $\vec{t_a}$ and $\vec{t_b}$ respectively, the Euclidean distance of the two documents is defined as:

$D_E(\vec{t_a}, \vec{t_b}) = \left( \sum_{t=1}^{m} |w_{t,a} - w_{t,b}|^2 \right)^{1/2}$   (2)

where the term set is $T = \{t_1, \ldots, t_m\}$. As mentioned previously, we use the tfidf value as term weights, that is:

$w_{t,a} = tfidf(d_a, t)$   (3)
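As a sketch, the Euclidean distance above can be computed over sparse tfidf vectors represented as dictionaries; the representation and function name are illustrative, not prescribed by the paper:

```python
import math

def euclidean_distance(wa, wb):
    """Euclidean distance over the union of term sets; wa and wb map
    term -> tfidf weight, and absent terms have weight 0."""
    terms = set(wa) | set(wb)
    return math.sqrt(sum((wa.get(t, 0.0) - wb.get(t, 0.0)) ** 2
                         for t in terms))
```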
3.2. Cosine Similarity
Cosine similarity is one of the most popular similarity measure applied to text documents, such as
in numerous information retrieval applications [7] and clustering too [8].
Given two documents $\vec{t_a}$ and $\vec{t_b}$, their cosine similarity is:

$SIM_C(\vec{t_a}, \vec{t_b}) = \frac{\vec{t_a} \cdot \vec{t_b}}{|\vec{t_a}| \times |\vec{t_b}|}$   (4)

where $\vec{t_a}$ and $\vec{t_b}$ are m-dimensional vectors over the term set $T = \{t_1, \ldots, t_m\}$. Each dimension represents a term with its weight in the document, which is non-negative. As a result, the cosine similarity is non-negative and bounded in $[0,1]$. An important property of the cosine similarity is its independence of document length. For example, combining two identical copies of a document d to get a new pseudo-document $d_0$, the cosine similarity between d and $d_0$ is 1, which means that these two documents are regarded as identical.
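A sketch of the cosine similarity on the same sparse dictionary representation (an assumption, as before), which also exhibits the length-independence property just described:

```python
import math

def cosine_similarity(wa, wb):
    """Cosine similarity: dot product divided by the product of norms.
    wa and wb map term -> non-negative weight."""
    dot = sum(wa[t] * wb.get(t, 0.0) for t in wa)
    norm_a = math.sqrt(sum(v * v for v in wa.values()))
    norm_b = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (norm_a * norm_b)
```

Scaling one document's weights by any positive constant leaves the result unchanged, so a document and its doubled copy have similarity 1.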
3.3. Jaccard Coefficient
The Jaccard coefficient, which is sometimes referred to as the Tanimoto coefficient, measures
similarity as the intersection divided by the union of the objects. For text document, the Jaccard
coefficient compares the sum weight of shared terms to the sum weight of terms that are present
in either of the two documents but are not the shared terms. The formal definition is:
$SIM_J(\vec{t_a}, \vec{t_b}) = \frac{\vec{t_a} \cdot \vec{t_b}}{|\vec{t_a}|^2 + |\vec{t_b}|^2 - \vec{t_a} \cdot \vec{t_b}}$   (5)

The Jaccard coefficient is a similarity measure and ranges between 0 and 1. It is 1 when $\vec{t_a} = \vec{t_b}$ and 0 when $\vec{t_a}$ and $\vec{t_b}$ are disjoint. The corresponding distance measure is $D_J = 1 - SIM_J$, and we will use $D_J$ in subsequent experiments.
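The Jaccard coefficient can be sketched the same way (dictionary representation assumed):

```python
def jaccard_coefficient(wa, wb):
    """Extended Jaccard (Tanimoto): dot(a, b) / (|a|^2 + |b|^2 - dot(a, b)).
    wa and wb map term -> non-negative weight."""
    dot = sum(wa[t] * wb.get(t, 0.0) for t in wa)
    sq_a = sum(v * v for v in wa.values())
    sq_b = sum(v * v for v in wb.values())
    return dot / (sq_a + sq_b - dot)
```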
3.4. Pearson Correlation Coefficient
Pearson's correlation coefficient is another measure of the extent to which two vectors are related. There are different forms of the Pearson correlation coefficient formula. Given the term set $T = \{t_1, \ldots, t_m\}$, a commonly used form is:

$SIM_P(\vec{t_a}, \vec{t_b}) = \frac{m \sum_{t=1}^{m} w_{t,a} \times w_{t,b} - TF_a \times TF_b}{\sqrt{\left[ m \sum_{t=1}^{m} w_{t,a}^2 - TF_a^2 \right] \left[ m \sum_{t=1}^{m} w_{t,b}^2 - TF_b^2 \right]}}$   (6)

where $TF_a = \sum_{t=1}^{m} w_{t,a}$ and $TF_b = \sum_{t=1}^{m} w_{t,b}$.
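A sketch of this form of the Pearson coefficient over an explicit term set (names and representation are illustrative):

```python
import math

def pearson_correlation(wa, wb, terms):
    """Pearson correlation over the term set T of size m; wa and wb map
    term -> weight, and absent terms count as 0."""
    m = len(terms)
    a = [wa.get(t, 0.0) for t in terms]
    b = [wb.get(t, 0.0) for t in terms]
    tf_a, tf_b = sum(a), sum(b)
    num = m * sum(x * y for x, y in zip(a, b)) - tf_a * tf_b
    den = math.sqrt((m * sum(x * x for x in a) - tf_a ** 2)
                    * (m * sum(y * y for y in b) - tf_b ** 2))
    return num / den
```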
This is also a similarity measure. However, unlike the other measures, it ranges from $-1$ to $+1$, and it is 1 when $\vec{t_a} = \vec{t_b}$. In subsequent experiments we use the corresponding distance measure, which is $D_P = 1 - SIM_P$ when $SIM_P \geq 0$ and $D_P = |SIM_P|$ when $SIM_P < 0$.
3.5. Averaged Kullback-Leibler Divergence
In information theory based clustering, a document is considered as a probability distribution of
terms. The similarity of two documents is measured as the distance between the two
corresponding probability distributions. The Kullback-Leibler divergence (KL divergence), also
called the relative entropy, is a widely applied measure for evaluating the differences between
two probability distributions. Given two distributions P and Q, the KL divergence from
distribution P to distribution Q is defined as:
$D_{KL}(P \| Q) = \sum P \log\left(\frac{P}{Q}\right)$   (7)
In the document scenario, the divergence between two distributions of words is:
$D_{KL}(\vec{t_a} \| \vec{t_b}) = \sum_{t=1}^{m} w_{t,a} \times \log\left(\frac{w_{t,a}}{w_{t,b}}\right)$   (8)
However, unlike the previous measures, the KL divergence is not symmetric, i.e.
$D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$
. Therefore it is not a true metric. As a result, we use the averaged KL
divergence instead, which is defined as:
$D_{AvgKL}(P \| Q) = \pi_1 D_{KL}(P \| M) + \pi_2 D_{KL}(Q \| M)$   (9)

where $\pi_1 = \frac{P}{P+Q}$, $\pi_2 = \frac{Q}{P+Q}$, and $M = \pi_1 P + \pi_2 Q$.

For documents, the averaged KL divergence can be computed with the following formula:
\[ D_{AvgKL}(\vec{t}_a \,\|\, \vec{t}_b) = \sum_{t=1}^{m} \left( \pi_1 \times D(w_{t,a} \,\|\, w_t) + \pi_2 \times D(w_{t,b} \,\|\, w_t) \right) \tag{10} \]

where \( \pi_1 = \frac{w_{t,a}}{w_{t,a} + w_{t,b}} \), \( \pi_2 = \frac{w_{t,b}}{w_{t,a} + w_{t,b}} \), and \( w_t = \pi_1 \times w_{t,a} + \pi_2 \times w_{t,b} \).
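Equation (10) can be transcribed directly as follows, assuming the weight vectors are first normalized into probability distributions; the small epsilon guard and the function name are our additions:

```python
from math import log

def avg_kl(a, b, eps=1e-12):
    """Averaged KL divergence between two term-weight vectors (equation (10))."""
    sa, sb = sum(a), sum(b)
    p = [x / sa for x in a]   # normalize into probability distributions
    q = [y / sb for y in b]
    d = 0.0
    for pt, qt in zip(p, q):
        pi1 = pt / (pt + qt + eps)   # pi_1 for this term
        pi2 = qt / (pt + qt + eps)   # pi_2 for this term
        wt = pi1 * pt + pi2 * qt     # averaged weight w_t
        if pt > 0:
            d += pi1 * pt * log(pt / (wt + eps))
        if qt > 0:
            d += pi2 * qt * log(qt / (wt + eps))
    return d
```

As the text notes, this measure is symmetric: `avg_kl(a, b)` and `avg_kl(b, a)` agree, and the divergence of a vector from itself is (numerically) zero.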
The average weighting between two vectors ensures symmetry, that is, the divergence from
document i to document j is the same as the divergence from document j to document i. The
averaged KL divergence has recently been applied to clustering text documents, such as in the
family of the Information Bottleneck clustering algorithms [9], to good effect.
4. EXPERIMENTS AND RESULTS
In our experiments (Figure 4), we used Agglomerative Hierarchical algorithms as the clustering methods for Arabic documents. The similarity measures do not fit directly into these algorithms, which expect distance values in which smaller values indicate greater similarity [3]. The Euclidean distance and the Averaged KL Divergence are distance measures, while the Cosine Similarity, Jaccard coefficient
and Pearson coefficient are similarity measures. We apply a simple transformation to convert each similarity measure into a distance value. Because both the Cosine Similarity and the Jaccard coefficient are bounded in \( [0,1] \) and monotonic, we take \( D = 1 - SIM \) as the corresponding distance value. For the Pearson coefficient, which ranges from −1 to +1, we take \( D = 1 - SIM \) when \( SIM \geq 0 \) and \( D = -SIM \) when \( SIM < 0 \).
For the testing dataset, we experimented with the different similarity measures in two cases: in the first, we apply the proposed method above to extract Keyphrases for all the documents in the dataset and then cluster them; in the second, we cluster the original documents. Moreover, each experiment was run many times, and the results are the values averaged over those runs. Each run used a different initial seed set; in total we had 35 experiments for the Agglomerative Hierarchical algorithm using the 7 cluster-merging techniques described in the next section.
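The experimental loop can be sketched as follows. This is a hypothetical illustration with random stand-in vectors rather than our actual Tf-Idf document matrix, and SciPy's seven linkage methods (`average` = UPGMA, `weighted` = WPGMA, `centroid` = UPGMC, `median` = WPGMC) stand in for the seven merging schemes:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
docs = rng.random((20, 50))           # 20 documents x 50 term weights (stand-in data)
dist = pdist(docs, metric="cosine")   # cosine distance = 1 - cosine similarity

for method in ["single", "complete", "average", "weighted",
               "centroid", "median", "ward"]:
    Z = linkage(dist, method=method)                  # build the dendrogram
    labels = fcluster(Z, t=4, criterion="maxclust")   # cut it into 4 flat clusters
```

Note that SciPy's `centroid`, `median`, and `ward` methods are, strictly speaking, defined for Euclidean distances; passing a precomputed cosine distance matrix, as the experiments here effectively do, is an approximation.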
4.1. Agglomerative Hierarchical Techniques
Agglomerative algorithms are usually classified according to the inter-cluster similarity measure they use. The most popular of these are [11][12]:
• Single Linkage: minimum distance criterion
• Complete Linkage: maximum distance criterion
• Group Average: average distance criterion
• Centroid: centroid distance criterion
• Ward: minimizes the variance of the merged cluster.
Jain and Dubes (1988) presented a general formula, first proposed by Lance and Williams (1967), that covers most of the commonly referenced hierarchical clustering methods, called SAHN (Sequential, Agglomerative, Hierarchical, and Nonoverlapping) clustering. The distance between an existing cluster \( k \), with \( n_k \) objects, and a newly formed cluster \( (r,s) \), with \( n_r \) and \( n_s \) objects, is given as:

\[ d_{(r,s) \to k} = \alpha_r d_{r \to k} + \alpha_s d_{s \to k} + \beta d_{r \to s} + \gamma \left| d_{r \to k} - d_{s \to k} \right| \tag{15} \]

The values of the parameters are given in Table 1.
The distance criteria listed above are defined as follows:

\[ d_{A \to B} = \min_{i \in A,\, j \in B} (d_{ij}) \tag{11} \]
\[ d_{A \to B} = \max_{i \in A,\, j \in B} (d_{ij}) \tag{12} \]
\[ d_{A \to B} = \operatorname{average}_{i \in A,\, j \in B} (d_{ij}) = \frac{1}{n_A n_B} \sum_{i \in A} \sum_{j \in B} d_{ij} \tag{13} \]
\[ d_{A \to B} = \left\| C_A - C_B \right\| \tag{14} \]

where \( C_A \) and \( C_B \) are the centroids of clusters \( A \) and \( B \).
Table 1. The values of the parameters of the general formula of hierarchical clustering SAHN

Clustering Method | α_r | α_s | β | γ
Single Link | 1/2 | 1/2 | 0 | −1/2
Complete Link | 1/2 | 1/2 | 0 | 1/2
Unweighted Pair Group Method Average (UPGMA) | n_r/(n_r+n_s) | n_s/(n_r+n_s) | 0 | 0
Weighted Pair Group Method Average (WPGMA) | 1/2 | 1/2 | 0 | 0
Unweighted Pair Group Method Centroid (UPGMC) | n_r/(n_r+n_s) | n_s/(n_r+n_s) | −n_r n_s/(n_r+n_s)² | 0
Weighted Pair Group Method Centroid (WPGMC) | 1/2 | 1/2 | −1/4 | 0
Ward’s Method | (n_r+n_k)/(n_r+n_s+n_k) | (n_s+n_k)/(n_r+n_s+n_k) | −n_k/(n_r+n_s+n_k) | 0

Figure 4. Description of Our Experiments
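The Table 1 coefficients plug directly into equation (15). A small illustrative helper (the function name and method keys are ours) shows how each scheme updates the distance from the merged cluster (r,s) to an existing cluster k:

```python
def lance_williams(d_rk, d_sk, d_rs, n_r=1, n_s=1, n_k=1, method="single"):
    """Lance-Williams update (equation (15)) with the Table 1 coefficients."""
    n_rs = n_r + n_s
    params = {  # (alpha_r, alpha_s, beta, gamma) from Table 1
        "single":   (0.5, 0.5, 0.0, -0.5),
        "complete": (0.5, 0.5, 0.0, 0.5),
        "upgma":    (n_r / n_rs, n_s / n_rs, 0.0, 0.0),
        "wpgma":    (0.5, 0.5, 0.0, 0.0),
        "upgmc":    (n_r / n_rs, n_s / n_rs, -(n_r * n_s) / n_rs ** 2, 0.0),
        "wpgmc":    (0.5, 0.5, -0.25, 0.0),
        "ward":     ((n_r + n_k) / (n_rs + n_k),
                     (n_s + n_k) / (n_rs + n_k),
                     -n_k / (n_rs + n_k), 0.0),
    }
    a_r, a_s, beta, gamma = params[method]
    return a_r * d_rk + a_s * d_sk + beta * d_rs + gamma * abs(d_rk - d_sk)
```

With `d_rk = 2` and `d_sk = 5`, the single-link update returns 2 (the minimum) and the complete-link update returns 5 (the maximum), as the criteria in the bullet list require.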
4.2. Dataset
The testing dataset [13] (Corpus of Contemporary Arabic (CCA)) is composed of 12 several
categories, each latter contains documents from websites and from radio Qatar. A summary of the
testing dataset is shown in Table 2. To illustrate the benefits of our proposed approach, we
extracted the appropriate Keyphrases from the Arabic documents in our testing dataset using this
approach, and we ranked terms by their weighting schemes Tfidf and use them in our experiments.
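The Tf-Idf ranking step can be sketched as follows, over a toy, hypothetical corpus (in the actual experiments the weights are computed over the extracted Keyphrases):

```python
from collections import Counter
from math import log

docs = [["economy", "market", "growth"],
        ["market", "trade"],
        ["health", "medicine", "market"]]

n_docs = len(docs)
df = Counter(t for doc in docs for t in set(doc))  # document frequency per term

def tfidf(doc):
    """Tf-Idf weight of each term in one document."""
    tf = Counter(doc)
    return {t: (c / len(doc)) * log(n_docs / df[t]) for t, c in tf.items()}

# Rank the first document's terms by weight, highest first.
weights = tfidf(docs[0])
ranked = sorted(weights, key=weights.get, reverse=True)
```

A term such as "market", which occurs in every document, gets an idf of log(1) = 0 and therefore ranks last: it carries no discriminating power, which is exactly the noise the Keyphrases extraction step aims to remove.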
Table 2. Number of texts and number of Terms in each category of the testing dataset
4.3. Results
The quality of the clustering result was evaluated using two evaluation measures: purity and entropy, which are widely used to evaluate the performance of unsupervised learning algorithms [14], [15]. On the one hand, the higher the purity value (ideally P(cluster) = 1), the better the quality of the cluster; on the other hand, the smaller the entropy value (ideally E(cluster) = 0), the better the quality of the cluster.
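The two measures can be sketched as follows, under the common formulation in which each cluster is represented by the list of gold-standard category labels of its documents (the function names are ours):

```python
from collections import Counter
from math import log2

def purity(clusters):
    """Fraction of documents that share their cluster's majority category."""
    total = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / total

def entropy(clusters):
    """Size-weighted entropy of the category distribution inside each cluster."""
    total = sum(len(c) for c in clusters)
    h = 0.0
    for c in clusters:
        probs = [count / len(c) for count in Counter(c).values()]
        h -= (len(c) / total) * sum(p * log2(p) for p in probs)
    return h
```

A perfect clustering such as `[["econ", "econ"], ["sport"]]` has purity 1 and entropy 0, while a fully mixed cluster `[["econ", "sport"]]` has purity 0.5 and entropy 1.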
The goal of these experiments is to evaluate our proposed approach for extracting Keyphrases from Arabic documents; the obtained results are then compared with our previous work [16]. Tables 3-5 show the average purity and entropy results for each similarity/distance measure with the document clustering algorithms cited above.
The overall entropy and purity values for our experiments using the Agglomerative Hierarchical algorithm with 7 schemes for merging the clusters are shown in Tables 3, 4, 5 and 6.
Tables 3 and 5 summarize the entropy scores obtained in all the experiments. We note that the scores in the first table are generally worse than those in the second, which were obtained using the extracted Keyphrases; in both tables, the Agglomerative Hierarchical algorithm performs well using the COMPLETE, UPGMA and WPGMA schemes, and the Ward function, with the Cosine Similarity, Jaccard and Pearson Correlation measures. The same behavior can be observed in the purity score tables.
Text Categories | Number of Texts | Number of Terms
Economics | 29 | 67,478
Education | 10 | 25,574
Health and Medicine | 32 | 40,480
Interviews | 24 | 58,408
Politics | 9 | 46,291
Recipes | 9 | 4,973
Religion | 19 | 111,199
Science | 45 | 104,795
Sociology | 30 | 85,688
Spoken | 7 | 5,605
Sports | 3 | 8,290
Tourist and Travel | 61 | 46,093
The results obtained above (shown in the different tables) lead us to conclude that:
• For the Agglomerative Hierarchical algorithm, the use of the COMPLETE, UPGMA [16] and WPGMA schemes, and the Ward function, as linkage techniques yields good results.
• The Cosine Similarity, Jaccard and Pearson Correlation measures perform better relative to the other measures.
• The overall entropy values shown in the different tables prove that the extracted Keyphrases can make the documents’ topics salient and improve the clustering performance [16].
4.4. Discussion
The above conclusions show that using the extracted Keyphrases, instead of the full-text representation, performs best when clustering the Arabic documents we investigated. Another issue must be mentioned: our experiments show improvements in both clustering quality and running time. In the following, we make a few brief comments on the behavior of all the tested linkage techniques:
The COMPLETE linkage technique is non-local: the entire structure of the clustering can influence merge decisions. This results in a preference for compact clusters with small diameters, but also causes sensitivity to outliers.
The Ward function minimizes the variance of the merged cluster, where the variance is a measure of how far a set of data is spread out; so the Ward function is also a non-local linkage technique.
With the two techniques described above, a single document far from the center can increase the diameters of candidate merge clusters dramatically and completely change the final clustering. That is why these techniques produce better results than the UPGMA [18] and WPGMA schemes, and better than all the other tested linkage techniques, whose merge criteria rely only on local information: they pay attention solely to the area where the two clusters come closest to each other, while more distant parts of the clusters, and the clusters’ overall structure, are not taken into account.
5. CONCLUSIONS
To conclude, this investigation found that, using the full-text representation of Arabic documents, the Cosine Similarity, Jaccard and Pearson Correlation measures have comparable effectiveness and perform better relative to the other measures, for all the techniques cited above, in finding more coherent clusters.
Furthermore, our experiments with different linkage techniques lead us to conclude that COMPLETE, UPGMA, WPGMA and Ward produce more efficient results than the other linkage techniques. A closer look at those results shows that the Ward technique is the best in all cases, although the other techniques are often not much worse.
Instead of using the full-text representation of Arabic documents, we use our novel approach based on the Suffix Tree data structure as a Keyphrases extraction technique to eliminate the noise in the documents and select the most salient sentences. Furthermore, Keyphrases extraction can help us overcome the varying-length problem of diverse documents.
In our experiments using the extracted Keyphrases, we remark that, again, the Cosine Similarity, Jaccard and Pearson Correlation measures have comparable effectiveness and produce more coherent clusters than the Euclidean Distance and the averaged KL Divergence; on the other hand, the best results are obtained when using COMPLETE, UPGMA, WPGMA and Ward as linkage techniques.
Finally, we believe that the novel Keyphrases extraction approach and the comparative study presented in this paper should be very useful in supporting research on any Arabic Text Mining application; the main contribution of this paper is threefold:
1. Our experiments show improvements in clustering quality and time when we use the Keyphrases extracted with our novel approach,
2. The Cosine Similarity, Jaccard and Pearson Correlation measures are quite similar in finding more coherent clusters,
3. The Ward technique is more effective than the other linkage techniques at producing more coherent clusters using the Agglomerative Hierarchical algorithm.
REFERENCES
[1] Turney, P. D. (2000) "Learning Algorithms for Keyphrase Extraction", Information Retrieval, 2, 303-336.
[2] A. Huang, "Similarity Measures for Text Document Clustering", NZCSRSC 2008, April 2008, Christchurch, New Zealand.
[3] H. Froud, A. Lachkar, S. Ouatik, and R. Benslimane (2010). "Stemming and Similarity Measures for Arabic Documents Clustering", 5th International Symposium on I/V Communications and Mobile Networks (ISIVC), IEEE Xplore.
[4] P. Berkhin, "Survey of Clustering Data Mining Techniques", Technical report, Accrue Software, San Jose, California, 2002. http://citeseer.nj.nec.com/berkhin02survey.html.
[5] K. Sathiyakumari, G. Manimekalai, V. Preamsudha, M. Phil Scholar, "A Survey on Various Approaches in Document Clustering", Int. J. Comp. Tech. Appl., Vol. 2 (5), 1534-1539, IJCTA, Sept-Oct 2011.
[6] H. Froud, A. Lachkar, S. Alaoui Ouatik, "A Comparative Study of Root-Based and Stem-Based Approaches for Measuring the Similarity Between Arabic Words for Arabic Text Mining Applications", Advanced Computing: An International Journal (ACIJ), Vol. 3, No. 6, November 2012.
[7] R. B. Yates and B. R. Neto, "Modern Information Retrieval", Addison-Wesley, New York, 1999.
[8] B. Larsen and C. Aone, "Fast and Effective Text Mining Using Linear-time Document Clustering", In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
[9] N. Z. Tishby, F. Pereira, and W. Bialek, "The Information Bottleneck Method", In Proceedings of the 37th Allerton Conference on Communication, Control and Computing, 1999.
[10] Khoja, S. and Garside, R., "Stemming Arabic Text", Computing Department, Lancaster University, Lancaster, 1999.
[11] Daniel Müllner, "Modern Hierarchical, Agglomerative Clustering Algorithms", Lecture Notes in Computer Science, Springer, 2011.
[12] Teknomo, Kardi (2009). Hierarchical Clustering Tutorial. http://people.revoledu.com/kardi/tutorial/clustering/
[13] L. Al-Sulaiti, E. Atwell, "The Design of a Corpus of Contemporary Arabic", University of Leeds.
[14] Y. Zhao and G. Karypis, "Evaluation of Hierarchical Clustering Algorithms for Document Datasets", In Proceedings of the International Conference on Information and Knowledge Management, 2002.
[15] Y. Zhao and G. Karypis, "Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering", Machine Learning, 55(3), 2004.
[16] H. Froud, A. Lachkar, S. Alaoui Ouatik, "Arabic Text Summarization Based on Latent Semantic Analysis to Enhance Arabic Documents Clustering", International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol. 3, No. 1, January 2013.
[17] H. Froud, A. Lachkar, "Agglomerative Hierarchical Clustering Techniques for Arabic Documents", submitted to the Second International Workshop on Natural Language Processing (NLP 2013), KTO Karatay University, June 07-09, 2013, Konya, Turkey.
[18] Michael Steinbach, George Karypis, and Vipin Kumar, "A Comparison of Document Clustering Techniques", In KDD Workshop on Text Mining, 2002.
[19] P. Turney, "Coherent Keyphrase Extraction via Web Mining", Technical Report ERB-1057, Institute for Information Technology, National Research Council of Canada, 1999.
[20] Rada Mihalcea and Paul Tarau (2004). "TextRank: Bringing Order into Texts", In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.
[21] Weiner, P., "Linear Pattern Matching Algorithms", In Proceedings of the 14th Annual Symposium on Foundations of Computer Science (FOCS'73), 1-11, 1973.
[22] Landau, G. M. and Vishkin, U., "Fast Parallel and Serial Approximate String Matching", Journal of Algorithms, 10, 157-169, 1989.
[23] Ehrenfeucht, A. and Haussler, D., "A New Distance Metric on Strings Computable in Linear Time", Discrete Applied Math, 40, 1988.
[24] Rodeh, M., Pratt, V. R. and Even, S., "Linear Algorithm for Data Compression via String Matching", Journal of the ACM, 28(1):16-24, 1981.
[25] O. Zamir and O. Etzioni, "Grouper: A Dynamic Clustering Interface to Web Search Results", Computer Networks, 31(11-16), pp. 1361-1374, 1999.
Authors
Hanane Froud is a PhD student in the Laboratory of Information Science and Systems (LSIS), Ecole Nationale des Sciences Appliquées (ENSA-FEZ), Sidi Mohamed Ben Abdellah University (USMBA), Fez, Morocco. She has presented several papers at national and international conferences.
Issam SAHMOUDI received his Computer Engineering degree from Ecole Nationale des Sciences Appliquées, Fez (ENSA-FEZ). He is now a PhD student in the Laboratory of Information Science and Systems (LSIS) at ENSA, Sidi Mohamed Ben Abdellah University (USMBA), Fez, Morocco. His current research interests include Arabic Text Mining applications: Arabic web document clustering and browsing, Arabic information retrieval systems, and Arabic web mining.
Pr. Abdelmonaime LACHKAR received his PhD degree from USMBA, Morocco, in 2004. He is a Professor and the Computer Engineering Program Coordinator at ENSA, Fez, and the Head of the Systems Architecture and Multimedia Team (LSIS Laboratory) at Sidi Mohamed Ben Abdellah University, Fez, Morocco. His current research interests include Arabic Natural Language Processing (ANLP), Arabic web document clustering and categorization, Arabic information retrieval systems, Arabic text summarization, Arabic ontology development and usage, and Arabic Semantic Search Engines (SSEs).