The internet comprises a massive amount of information in the form of zillions of web pages. This information can be categorized into the surface web and the deep web. Existing search engines can effectively make use of surface web information, but the deep web remains largely unexploited. Machine learning techniques have commonly been employed to access deep web content.
Within machine learning, topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between words with multiple meanings. Clustering is one of the key solutions for organizing deep web databases. In this paper, we cluster deep web databases based on the relevance found among deep web forms by employing a generative probabilistic model called Latent Dirichlet Allocation (LDA) to model content representative of deep web databases. This is implemented after preprocessing the set of web pages to extract page contents and form contents. Further, we derive the distributions of "topics per document" and "words per topic" using Gibbs sampling. Experimental results show that the proposed method clearly outperforms the existing clustering methods.
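To make the Gibbs sampling step concrete, the following is a minimal sketch of a collapsed Gibbs sampler for LDA in Python/NumPy; the toy documents, vocabulary size, and hyperparameters are assumptions for illustration, not values from the paper.

```python
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of documents, each a list of word ids in [0, V).
    Returns theta (topics per document) and phi (words per topic)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))              # topic counts per document
    nkw = np.zeros((K, V))              # word counts per topic
    nk = np.zeros(K)                    # total word count per topic
    z = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):      # initialise counts from random assignments
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]             # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k             # resample and restore counts
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    theta = (ndk + alpha) / (ndk.sum(1, keepdims=True) + K * alpha)   # topics per document
    phi = (nkw + beta) / (nk[:, None] + V * beta)                     # words per topic
    return theta, phi

# toy usage: 4 documents over a vocabulary of 6 word ids, 2 topics
theta, phi = lda_gibbs([[0, 1, 2], [0, 1, 1], [3, 4, 5], [4, 5, 5]], V=6, K=2)
```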
Clustering of Deep Web Pages: A Comparative Study (ijcsit)
The internet has a massive amount of information, stored in the form of zillions of web pages. The information that can be retrieved by search engines is huge, and it constitutes the 'surface web'. But the remaining information, which is not indexed by search engines – the 'deep web' – is much bigger in size than the surface web and remains largely unexploited.
Several machine learning techniques have been employed to access deep web content. Within machine learning, topic models provide a simple way to analyze large volumes of unlabeled text. A 'topic' is a cluster of words that frequently occur together; topic models can connect words with similar meanings and distinguish between words with multiple meanings. In this paper, we cluster deep web databases employing several methods and then perform a comparative study. In the first method, we apply Latent Semantic Analysis (LSA) over the dataset. In the second method, we use a generative probabilistic model called Latent Dirichlet Allocation (LDA) to model content representative of deep web databases. Both techniques are implemented after preprocessing the set of web pages to extract page contents and form contents. Further, we apply a modified version of Latent Dirichlet Allocation (LDA) to the dataset. Experimental results show that the proposed method outperforms the existing clustering methods.
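As an illustration of the LSA-based variant, the sketch below applies TF-IDF, truncated SVD and k-means with scikit-learn; the placeholder page/form texts and cluster count are assumed, not taken from the paper's dataset.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

# placeholder form/page contents; in practice these come from the preprocessing step
pages = [
    "book title author isbn price search",
    "author title publisher book catalogue",
    "flight departure arrival airline fare",
    "airline ticket fare departure city",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(pages)
lsa = TruncatedSVD(n_components=2, random_state=0)           # project into a latent semantic space
reduced = Normalizer(copy=False).fit_transform(lsa.fit_transform(tfidf))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(labels)   # e.g. [0 0 1 1]: book-search forms vs. flight-search forms
```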
This document presents a general framework for building classifiers and clustering models using hidden topics to deal with short and sparse text data. It analyzes hidden topics from a large universal dataset using LDA. These topics are then used to enrich both the training data and new short text data by combining them with the topic distributions. This helps reduce data sparseness and improves classification and clustering accuracy for short texts like web snippets. The framework is also applied to contextual advertising by matching web pages and ads based on their hidden topic similarity.
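A rough sketch of that enrichment idea, assuming a background corpus, scikit-learn's LDA implementation, and toy snippets (none of which come from the original framework):

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical corpora: a large "universal" background corpus and short snippets to enrich
universal_corpus = ["long background document about travel flights hotels airlines fares",
                    "long background document about books authors publishers novels prices"]
snippets = ["cheap flights", "new novel"]

vec = CountVectorizer(stop_words="english")
X_universal = vec.fit_transform(universal_corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_universal)

X_snippets = vec.transform(snippets)                 # sparse bag-of-words features of the short texts
topic_features = lda.transform(X_snippets)           # hidden-topic distributions inferred from the background model
enriched = hstack([X_snippets, csr_matrix(topic_features)])   # combined representation used for training
```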
Effective Feature Selection for Mining Text Data with Side-Information (IJTET Journal)
Abstract— Many text documents contain side-information. Many web documents carry meta-data corresponding to different kinds of attributes, such as the source or other information related to the origin of the document. Data such as location, ownership, or even temporal information may be considered side-information. This additional information can be used for text clustering: it can either improve the quality of the representation for the mining process or add noise to it. When the side-information is noisy, mining with it is risky and can reduce clustering quality, whereas informative side-information can improve it. In the existing system, the Gini index is used as the feature selection method to filter informative side-information from text documents. It is effective to a certain extent, but the number of remaining features is still large. Feature selection methods are therefore important for handling the high dimensionality of data in effective text categorization. In the proposed system, a novel feature selection method is introduced to improve document clustering and classification accuracy while reducing the number of selected features. To improve the accuracy and purity of document clustering with lower time complexity, a new method called Effective Feature Selection (EFS) is introduced. This three-stage procedure includes feature subset selection, feature ranking, and feature re-ranking.
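For concreteness, one common formulation of the Gini index used for term ranking is sketched below; the scoring function and toy data are assumptions for illustration and do not reproduce the EFS stages themselves.

```python
from collections import Counter, defaultdict

def gini_index(docs, labels):
    """Rank terms by Gini index: sum over classes of P(class | term)^2.
    docs: list of token lists; labels: one class label per document."""
    term_class = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):            # document frequency per class, not raw counts
            term_class[term][label] += 1
    scores = {}
    for term, counts in term_class.items():
        total = sum(counts.values())
        scores[term] = sum((c / total) ** 2 for c in counts.values())
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = [["price", "book", "author"], ["book", "isbn"], ["flight", "fare"], ["fare", "airline"]]
labels = ["books", "books", "travel", "travel"]
print(gini_index(docs, labels)[:3])         # most class-discriminative terms first
```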
This document provides an overview of text mining and web mining. It defines data mining and describes the common data mining tasks of classification, clustering, association rule mining and sequential pattern mining. It then discusses text mining, defining it as the process of analyzing unstructured text data to extract meaningful information and structure. The document outlines the seven practice areas of text mining as search/information retrieval, document clustering, document classification, web mining, information extraction, natural language processing, and concept extraction. It provides brief descriptions of the problems addressed within each practice area.
The document discusses multidimensional databases. It defines multidimensional databases as systems designed to efficiently store and retrieve large volumes of related data that can be viewed from different perspectives or dimensions. It provides an example using automobile sales data that can be analyzed based on dimensions like model, color, dealership, and time. Multidimensional databases allow for interactive analysis of data from multiple angles, unlike relational databases that are slower for such analyses.
IRJET - Concept Extraction from Ambiguous Text Document using K-Means (IRJET Journal)
This document discusses using a K-means clustering algorithm to extract concepts from ambiguous text documents. It involves preprocessing the text by tokenizing, removing stop words, and stemming words. The words are then represented as vectors and dimensionality reduction using PCA is applied. Finally, K-means clustering is used to group similar words into clusters to identify the overall concepts in the document without reading the entire text. The aim is to help users understand the key topics in a document in a time-efficient manner without having to read the full text.
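A minimal sketch of such a pipeline, assuming NLTK's Porter stemmer and scikit-learn for TF-IDF, PCA and k-means; the sample sentences and cluster count are illustrative only.

```python
from nltk.stem import PorterStemmer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def preprocess(text):
    # tokenize on whitespace and stem; stop words are dropped by the vectorizer below
    return " ".join(stemmer.stem(tok) for tok in text.lower().split())

sentences = [
    "banks offer loans and savings accounts",
    "the river bank was flooded after the rain",
    "interest rates on loans are rising",
    "heavy rain caused the river to overflow",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(preprocess(s) for s in sentences)
reduced = PCA(n_components=2).fit_transform(tfidf.toarray())     # dimensionality reduction
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(labels)   # sentences grouped by underlying concept (finance vs. flooding)
```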
The growing number of datasets published on the Web as linked data brings opportunities for high data availability, but as the data grows, the challenges of querying it also grow. It is very difficult to search linked data using structured query languages, so keyword query searching is used for linked data instead. In this paper, we propose different approaches to keyword query routing through which the efficiency of keyword search can be improved greatly. By routing keywords to the relevant data sources, the processing cost of keyword search queries can be greatly reduced. We contrast and compare four models – keyword level, element level, set level, and query expansion using semantic and linguistic analysis – which are used for keyword query routing in keyword search.
A Document Exploring System on LDA Topic Model for Wikipedia Articles (ijma)
A large amount of digital text is generated every day, and effectively searching, managing, and exploring this text data has become a major task. In this paper, we first present an introduction to text mining and the LDA topic model. We then explain in detail how to apply the LDA topic model to a text corpus through experiments on Simple Wikipedia documents. The experiments include all necessary steps: data retrieval, pre-processing, fitting the model, and an application in the form of a document exploring system. The results show the LDA topic model working effectively for clustering documents and finding similar documents. Furthermore, the document exploring system could be a useful research tool for students and researchers.
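A compact sketch of that workflow with gensim, assuming the articles are already tokenized (the toy token lists below are placeholders): build the dictionary and corpus, fit LDA, and rank documents by the Hellinger distance between their topic distributions.

```python
from gensim import corpora, matutils, models

texts = [["lion", "habitat", "africa", "savanna"],
         ["tiger", "habitat", "asia", "forest"],
         ["python", "programming", "language", "code"]]   # placeholder tokenized articles

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20, random_state=1)

# topic distribution of each document, then similarity by Hellinger distance
dists = [matutils.sparse2full(lda.get_document_topics(bow, minimum_probability=0), lda.num_topics)
         for bow in corpus]
query = dists[0]
ranked = sorted(range(len(dists)), key=lambda i: matutils.hellinger(query, dists[i]))
print(ranked)   # documents ordered from most to least similar to the first article
```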
Exploiting Wikipedia and Twitter for Text Mining Applications (IRJET Journal)
This document discusses exploiting Wikipedia and Twitter for text mining applications. It explores using Wikipedia's category-article structure for text classification, subjectivity analysis, and keyword extraction. It evaluates classifying tweets as relevant/irrelevant to entities or brands and classifying tweets into topical dimensions like workplace or innovation. Features used include relatedness scores between tweet text and Wikipedia categories, topic modeling scores, and Twitter-specific features. Experimental results show the Wikipedia framework based on its category-article structure outperforms standard text mining techniques.
Semantically indexed hypermedia linking information disciplines (unyil96)
This document discusses semantic indexing as a way to facilitate access to information in large hypertexts and the semantic web. Semantic indexing represents semantic knowledge about a domain through a controlled vocabulary of index terms with semantic relationships. This allows indirect navigation between information items based on queries to the semantic index space. Reasoning over the semantic relationships can provide intelligent navigation support through techniques like expanding queries, filtering options, and ranking results based on semantic distance in the index.
Challenging Issues and Similarity Measures for Web Document Clustering (IOSR Journals)
This document discusses challenging issues and similarity measures for web document clustering. It begins with an introduction to text mining and document clustering. It then reviews related work on similarity approaches and measures. Some key challenging issues in web document clustering are discussed, such as measuring semantic similarity between words and evaluating cluster validity. Various types of similarity measures are also described, including string-based measures like Jaro-Winkler distance and corpus-based measures like latent semantic analysis. The conclusion states that accurate clustering requires a precise definition of similarity between document pairs and discusses different similarity measures that can be used.
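Two of the measure families mentioned above can be illustrated briefly: a corpus-based cosine similarity over TF-IDF vectors and a simple set-based Jaccard similarity (the example documents are placeholders).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_a = "web document clustering groups similar pages together"
doc_b = "clustering of web pages groups similar documents"

# corpus-based measure: cosine similarity in TF-IDF space
tfidf = TfidfVectorizer().fit_transform([doc_a, doc_b])
cos = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

# set-based measure: Jaccard similarity over word sets
a, b = set(doc_a.split()), set(doc_b.split())
jaccard = len(a & b) / len(a | b)

print(f"cosine={cos:.2f}, jaccard={jaccard:.2f}")
```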
A Soft Set-based Co-occurrence for Clustering Web User Transactions (TELKOMNIKA JOURNAL)
This document proposes a soft set-based approach for clustering web user transactions that achieves lower computational complexity and higher clustering purity than previous rough set approaches. Unlike rough set approaches that use similarity, the proposed approach uses a co-occurrence approach based on soft set theory. The soft set representation of web user transactions allows modeling them as a binary-valued information system. The approach is evaluated against two previous rough set-based approaches, demonstrating substantially lower computational complexity and higher cluster purity.
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval (Mauro Dragoni)
The presentation provides an overview of what an ontology is and how it can be used for representing information and retrieving data, with a particular focus on the linguistic resources available for supporting this kind of task. It surveys semantic-based retrieval approaches, highlighting the pros and cons of semantic approaches with respect to classic ones. Use cases are presented and discussed.
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ... (IJERA Editor)
This document discusses document classification using a k-nearest neighbors algorithm with dynamic attribute weighting and bootstrap sampling. It begins with an introduction to text mining and document classification. It then describes k-nearest neighbors classification and how bootstrap sampling can be used to improve k-NN by assigning different weightings to attributes. The document evaluates this approach and compares its performance to traditional k-NN classification.
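A baseline sketch of k-NN document classification over TF-IDF features with scikit-learn; the bootstrap-based attribute weighting described above is not reproduced, and the labels and texts are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

train_texts = ["stock markets fell sharply today",
               "the team won the championship game",
               "central bank raises interest rates",
               "star striker scores twice in final"]
train_labels = ["finance", "sports", "finance", "sports"]

# k-NN over TF-IDF vectors with cosine distance as the neighbourhood metric
knn = make_pipeline(TfidfVectorizer(stop_words="english"),
                    KNeighborsClassifier(n_neighbors=3, metric="cosine"))
knn.fit(train_texts, train_labels)
print(knn.predict(["bank shares rise after rate decision"]))   # -> ['finance']
```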
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING (dannyijwest)
Social networks have become one of the most popular platforms for allowing users to communicate and share their interests without being at the same geographical location. The great and rapid growth of social media sites such as Facebook, LinkedIn, and Twitter has produced a huge amount of user-generated content. Thus, improving information quality and integrity has become a great challenge for all social media sites, which aim to let users get the desired content or be linked to the best relation using improved search/link techniques. Introducing semantics into social networks widens the representation of the social network. In this paper, a new model of social networks based on semantic tag ranking is introduced. This model is based on the concept of multi-agent systems. In the proposed model, the representation of social links is extended by the semantic relationships found in the vocabularies known as tags in most social networks. The proposed model for the social media engine is based on enhanced Latent Dirichlet Allocation (E-LDA) as a semantic indexing algorithm, combined with Tag Rank as the social network ranking algorithm. The improvement in the E-LDA phase is achieved by optimizing the LDA algorithm with optimal parameters; a filter is then introduced to enhance the final indexing output. In the ranking phase, using Tag Rank on top of the indexing phase improves the ranking output. Simulation results of the proposed model show improvements in both indexing and ranking output.
This document summarizes compressed full-text indexes. Compressed full-text indexes aim to provide fast substring search over large text collections while using space close to the compressed text size. The document discusses how text compression and regularities in text indexing structures are related. It covers major self-indexes, which are compressed indexes that can also reproduce text portions, replacing the text. These self-indexes exploit text compressibility to achieve compact structures for efficient search.
This document provides an overview of information retrieval models, including vector space models, TF-IDF, Doc2Vec, and latent semantic analysis. It begins with basic concepts in information retrieval like document indexing and relevance scoring. Then it discusses vector space models and how documents and queries are represented as vectors. TF-IDF weighting is explained as assigning higher weight to rare terms. Doc2Vec is introduced as an extension of word2vec to learn document embeddings. Latent semantic analysis uses singular value decomposition to project documents to a latent semantic space. Implementation details and examples are provided for several models.
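For example, a Doc2Vec retrieval sketch with gensim 4.x (toy tokenized documents; the inferred query vector is compared against the learned document embeddings):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [["information", "retrieval", "ranks", "documents"],
        ["topic", "models", "discover", "latent", "themes"],
        ["search", "engines", "index", "web", "pages"]]
tagged = [TaggedDocument(words=d, tags=[i]) for i, d in enumerate(docs)]

# learn small document embeddings; the tiny corpus and parameters are for illustration only
model = Doc2Vec(tagged, vector_size=32, min_count=1, epochs=50, seed=1)
query_vec = model.infer_vector(["ranking", "web", "documents", "search"])
print(model.dv.most_similar([query_vec], topn=2))   # most similar training documents
```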
International Journal of Engineering Research and Development (IJERD), IJERD Editor
This document presents a novel approach for clustering textual information in emails using text data mining techniques. It discusses using k-means clustering and a vector space model to group similar emails based on word patterns and frequencies. The methodology involves preprocessing emails, applying a Porter stemmer, calculating term frequencies, and using k-means to form clusters. Clusters will contain emails with similar content, allowing users to more easily process emails based on priority. This clustering approach could reduce the time users spend filtering through emails one by one.
This document discusses methods for measuring semantic similarity between words. It begins by discussing how traditional lexical similarity measurements do not consider semantics. It then discusses several existing approaches that measure semantic similarity using web search engines and text snippets. These approaches calculate word co-occurrence statistics from page counts and analyze lexical patterns extracted from snippets. Pattern clustering is used to group semantically similar patterns. The approaches are evaluated using datasets and metrics like precision and recall. Finally, the document proposes a new method that combines page count statistics, lexical pattern extraction and clustering, and support vector machines to measure semantic similarity.
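The page-count component of such measures is straightforward to illustrate; the sketch below computes pointwise mutual information from hypothetical hit counts (the counts and index size are made up):

```python
import math

N = 1_000_000_000          # assumed number of indexed pages
hits_p = 42_000_000        # pages containing "car"        (hypothetical counts)
hits_q = 31_000_000        # pages containing "automobile"
hits_pq = 9_000_000        # pages containing both terms

# pointwise mutual information from co-occurrence statistics
pmi = math.log2((hits_pq / N) / ((hits_p / N) * (hits_q / N)))
print(f"PMI(car, automobile) = {pmi:.2f}")   # higher values suggest semantic relatedness
```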
AN ELABORATION OF TEXT CATEGORIZATION AND AUTOMATIC TEXT CLASSIFICATION THROU... (cseij)
This document discusses text categorization and automatic text classification. It begins by explaining how digitization has increased the need to organize and classify text for more efficient information retrieval. It then provides mathematical and graphical models for text mining, text categorization, and automatic text classification. These models are intended to improve understanding of these techniques and shorten response times for text retrieval. The document defines key terms and concepts related to text categorization and mining, and explains knowledge engineering and machine learning approaches to automatic text classification.
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR... (IJDKP)
As existing computer search engines struggle to understand the meaning of natural language, semantically
enriched metadata may improve interest-based search engine capabilities and user satisfaction.
This paper presents an enhanced version of the ecosystem focusing on semantic topic metadata detection
and enrichments. It is based on a previous paper, a semantic metadata enrichment software ecosystem
(SMESE). Through text analysis approaches for topic detection and metadata enrichment, this paper
proposes an algorithm to enhance search engine capabilities and consequently help users find content
according to their interests. It presents the design, implementation and evaluation of the SATD (Scalable
Annotation-based Topic Detection) model and algorithm using metadata from the web, linked open data,
concordance rules, and bibliographic record authorities. It includes a prototype of a semantic engine using
keyword extraction, classification and concept extraction that allows generating semantic topics by text,
and multimedia document analysis using the proposed SATD model and algorithm.
The performance of the proposed ecosystem is evaluated using a number of prototype simulations by
comparing them to existing enriched metadata techniques (e.g., AlchemyAPI, DBpedia, Wikimeta, Bitext,
AIDA, TextRazor). It was noted that the SATD algorithm supports more attributes than the other algorithms. The
results show that the enhanced platform and its algorithm enable greater understanding of documents
related to user interests.
Visualization approaches in text mining emphasize making large amounts of data easily accessible and identifying patterns within the data. Common visualization tools include simple concept graphs, histograms, line graphs, and circle graphs. These tools allow users to quickly explore relationships within text data and gain insights that may not be apparent from raw text alone. Architecturally, visualization tools are layered on top of text mining systems' core algorithms and allow for modular integration of different visualization front ends.
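As a small example of this kind of tool, the sketch below plots a term-frequency histogram for a toy corpus with matplotlib; the texts are placeholders.

```python
from collections import Counter

import matplotlib.pyplot as plt

corpus = ["topic models cluster words", "words with similar topics cluster together",
          "clustering groups similar documents"]
freqs = Counter(word for text in corpus for word in text.split())
top = freqs.most_common(8)

# simple histogram of the most frequent terms in the corpus
plt.bar([w for w, _ in top], [c for _, c in top])
plt.ylabel("frequency")
plt.title("Most frequent terms")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```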
Sci Know Mine 2013: What can we learn from topic modeling on 350M academic do... (William Gunn)
This document discusses topic modeling on 350 million documents from Mendeley. It describes how topic modeling can be used to categorize documents into topics and subcategories, though categorization is imperfect and topics change over time. It also discusses how topic modeling and metrics can help with fact discovery and reproducibility of research to build more robust datasets.
Novelty detection via topic modeling in research articles (csandit)
In today’s world, redundancy is one of the most vital problems faced in almost all domains. Novelty detection is the identification of new or unknown data or signals that a machine learning system was not aware of during training. The problem becomes more intense when it comes to research articles: a method of identifying novelty in each section of an article is highly desirable for determining the novel idea proposed in a research paper. Since research articles are semi-structured, detecting novel information in them requires more accurate systems. Topic models provide a useful means to process them and a simple way to analyze them. This work compares the most predominantly used topic model, Latent Dirichlet Allocation, with the hierarchical Pachinko Allocation Model. The results obtained favor the hierarchical Pachinko Allocation Model when used for document retrieval.
In this paper we present an approach to Word Sense Disambiguation based on topic modeling (LDA). Our approach consists of two steps: first, a binary classifier decides whether the most frequent sense applies or not, and then another classifier handles the cases where the most frequent sense does not apply. An exhaustive evaluation is performed on the Spanish corpus Ancora to analyze the performance of our two-step system and the impact of the context and the different parameters of the system. Our best experiment reaches an accuracy of 74.53, which is 6 points above the highest baseline. All the software developed for these experiments has been made freely available to enable reproducibility and allow re-use of the software.
Experience aware Item Recommendation in Evolving Review Communities (Subhabrata Mukherjee)
Experience aware Item Recommendation in Evolving Review Communities
Subhabrata Mukherjee, Hemank Lamba and Gerhard Weikum
IEEE International Conference in Data Mining (ICDM) 2015
Joint Author Sentiment Topic Model, Subhabrata Mukherjee, Gaurab Basu and Sachindra Joshi, In Proc. of the SIAM International Conference in Data Mining (SDM 2014), Pennsylvania, USA, Apr 24-26, 2014 [http://people.mpi-inf.mpg.de/~smukherjee/jast.pdf]
Bayesian Nonparametric Topic Modeling: Hierarchical Dirichlet Processes (JinYeong Bak)
These are presentation slides from the Machine Learning Summer School in Korea.
http://prml.yonsei.ac.kr/
The talk covers the Dirichlet distribution, the Dirichlet process, and HDP.
Boston Dataswap: Topic Modeling by Alice Oh
This document provides an overview of topic modeling and computational social science. It begins with defining topic modeling and its applications in natural language processing. The document then discusses how topic models can be used to analyze large corpora of text data like Twitter conversations to discover latent topics and emotion dynamics. Examples are given of applying topic models with domain knowledge to model different emotions and analyzing emotion transitions between users in conversations. The document highlights some open research questions around defining influence between users based on emotional expressions.
Topic modeling of Emergency Department Triage notes for characterising pain-r... (Karin Verspoor)
Background
Pain is a feature of approximately 70% of all Emergency Department (ED) presentations. It has been demonstrated that mandating recording of a patient’s feeling of pain can improve service delivery for ED patients. However, there is a substantial group of patients (approximately 21% of ED visits in our 12-month sample) for which there exists an inconsistency between pain score and the Australian Triage Scale (ATS) score assigned by the nurse; where a patient reports high levels of pain but they are assigned a lower-urgency triage category. It has been unclear until now whether this “inconsistent” group of patients has been receiving optimal care.
Methods
To better understand the characteristics in this inconsistent group, we performed topic modeling of the clinical notes collected during ED triage assessments. We divided the notes into two subgroups, according to whether or not the patient’s self-reported level of pain was consistent with the triage urgency recorded in the ATS score. We performed topic modeling of these two subgroups separately, using the implementation of Latent Dirichlet Allocation (LDA) in the Mallet toolkit. We have experimented with several representations of the notes, including unigrams (tokens), bigrams, and the medical concepts contained in each note, as determined with the MetaMap medical concept recognition tool. An ED nurse reviewed the topics generated in each case and assigned a descriptor to them.
Results
When considering the token-based representation of the notes, the labels in the consistent group relate to road trauma, cardiac pain, change of consciousness, ongoing chest pain, limb injury, renal illness and pain due to illness. In the inconsistent group, we find topics related to ongoing conditions (including postoperative complications or worsening abdominal pain), urinary and respiratory problems, infections and injury-related complications.
When considering the concept-based representation of the notes, the labels in the consistent set denote gastrointestinal diseases, neurological illness, dizziness, chest pain, testicular pain, shortness of breath and trauma. The labels in the inconsistent set denote different issues caused by trauma and distress due to pain, infection and urinary condition. This includes injuries in several body parts like in the limbs and back. The latter topic containing body parts appears to have been enabled by the abstraction of individual terms into concepts.
Conclusions
Topic modeling of Emergency Department data shows substantial promise for helping to characterise particular subpopulations of interest, and incorporating pre-processing of clinical notes to capture variation in clinical terminology appears to have value. While this initial work has focused on the pain-related chief complaints, we have also recently begun to explore temporal characteristics of the data through analysis of how derived topics change over time.
1) ChronoSAGE is a topic modeling method that presents research trends in an academic paper corpus in a chronological and readable manner.
2) It does this by extracting the trending words for each time epoch and displaying them chronologically using a modified version of the SAGE topic modeling algorithm.
3) An evaluation shows ChronoSAGE outperforms LDA and can extract chronological topic trends as top words for each time period, providing an analysis of trends over time that traditional topic modeling cannot.
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks (Leonardo Di Donato)
Experimental work done regarding the use of Topic Modeling for the implementation and the improvement of some common tasks of Information Retrieval and Word Sense Disambiguation.
First of all, it describes the scenario, the pre-processing pipeline realized and the framework used. We then discuss the investigation of several hyperparameter configurations for the LDA algorithm.
The work continues with the retrieval of relevant documents, mainly through two different approaches: inferring the topic distribution of a held-out document (or query) and comparing it to the collection's documents to retrieve similar ones, or through an approach driven by probabilistic querying. The last part of this work is devoted to the investigation of the word sense disambiguation task.
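A sketch of the first retrieval approach under assumed data: infer the topic distribution of a held-out query with a fitted LDA model (scikit-learn here, not the framework used in the work) and rank collection documents by Jensen-Shannon distance between distributions.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

collection = ["flights hotels travel booking fares",
              "novels authors fiction publishing books",
              "airline tickets departure arrival travel"]
query = "cheap airline travel fares"

vec = CountVectorizer()
X = vec.fit_transform(collection)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

doc_topics = lda.transform(X)                          # topic mixtures of the collection documents
query_topics = lda.transform(vec.transform([query]))[0]   # inferred mixture of the held-out query
ranking = np.argsort([jensenshannon(query_topics, d) for d in doc_topics])
print(ranking)   # documents ordered by topical closeness to the query
```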
Semi-supervised Bibliographic Element Segmentation with Latent Permutations (Tomonari Masada)
The document describes a semi-supervised approach to segmenting bibliographic elements in document references using latent permutations. It introduces an approach that models each document as a mixture of topics, and arranges the topic assignments using a latent permutation to capture element order. The approach is evaluated on synthetic datasets, achieving around 82% accuracy on noisy data with random element shuffling.
Exploring Generative Models of Tripartite Graphs for Recommendation in Social... (Charalampos Chelmis)
This document discusses generative models for tripartite graphs in social media that model users, resources, and tags. It presents three models:
1) The User-Concept model that models users based on their tag usage but ignores resources and social aspects.
2) The User-Resource model that models resources as vocabulary terms but ignores tags and social aspects.
3) The User-Resource-Concept model that models both resources and users' interests using a topic-based representation and models the social generation of annotations.
The models are evaluated on their ability to predict tags/resources for new users, recommend social ties, and compare to baseline similarity metrics, with the ensemble approach achieving the best performance.
StreamGrid: Summarization of large-scale Events using Topic Modeling and Temp... (Symeon Papadopoulos)
The document summarizes the StreamGrid approach for event summarization using social media data. StreamGrid uses topic modeling to identify main topics within a dataset, creates a timeline for each topic, and builds a structure to relate topics and time. It then generates summaries for specific time frames by selecting important messages from active topics while minimizing redundancy. An experimental study on tweets from the 2013 Sundance Film Festival found that StreamGrid outperformed baselines in covering multiple aspects but could be improved by integrating popularity.
Generalization of Tensor Factorization and Applications (Kohei Hayashi)
This document presents two tensor factorization methods: Exponential Family Tensor Factorization (ETF) and Full-Rank Tensor Completion (FTC). ETF generalizes Tucker decomposition by allowing for different noise distributions in the tensor and handles mixed discrete and continuous values. FTC completes missing tensor values without reducing dimensionality by kernelizing Tucker decomposition. The document outlines these methods and their motivations, discusses Tucker decomposition, and provides an example applying ETF to anomaly detection in time series sensor data.
Konstantin Vorontsov - BigARTM: Open Source Library for Regularized Multimoda... (AIST)
The document describes BigARTM, an open source library for regularized multimodal topic modeling of large collections. It discusses probabilistic topic modeling and how additive regularization of topic models (ARTM) handles ill-posed inverse problems in topic modeling. ARTM allows various regularizers to be combined. BigARTM provides a parallel implementation for improved time and memory performance. Experiments show how ARTM can combine regularizers and be used for classification and multi-language topic modeling. Multimodal topic modeling binds topics to terms, authors, images, links and other modalities.
Topic Modeling via Tensor Factorization: Use Case for Apache REEF Framework (DataWorks Summit)
This document summarizes the Cloud and Information Services Lab (CISL) at Microsoft. It describes CISL's work on the Apache REEF (Retainable Evaluator Execution Framework) project, which provides an application development framework for distributed computing on YARN and other resource managers. REEF aims to simplify development of big data applications with an event-based programming model and by handling communication, storage, and fault tolerance. The document outlines REEF's logical abstraction, architecture, and programming model. It also provides examples of how tasks can be configured and submitted in REEF and how data can be broadcast between tasks.
Text modeling with R, Python, and Spark (Frank Evans)
This presentation was given to ODSC Boston in Feb. 2016. It outlines how to perform text clustering in R, as well Latent Dirichlet Allocation Topic Modeling on both small text data in Python and large text data using Apache Spark. Code from slides: https://github.com/frankdevans/odsc_meetup_text_processing
Linked Data Generation for the University Data From Legacy Database (dannyijwest)
The web was developed to share information among users over the internet as hyperlinked documents. If someone wants to collect data from the web, they have to search and crawl through documents to fulfil their needs. The concept of Linked Data creates a breakthrough at this stage by enabling links within the data itself. So, besides the web of connected documents, a new web has developed for both humans and machines: the web of connected data, known simply as the Linked Data Web. Since it is a very new domain, very little work has been done so far, especially on publishing legacy data within a university domain as Linked Data.
Here are the key points about using content-based filtering techniques:
- Content-based filtering relies on analyzing the content or description of items to recommend items similar to what the user has liked in the past. It looks for patterns and regularities in item attributes/descriptions to distinguish highly rated items.
- The item content/descriptions are analyzed automatically by extracting information from sources like web pages, or entered manually from product databases.
- It focuses on objective attributes about items that can be extracted algorithmically, like text analysis of documents.
- However, personal preferences and what makes an item appealing are often subjective qualities not easily extracted algorithmically, like writing style or taste.
- So while content-based filtering can capture these objective attributes, it may miss the subjective qualities that make an item appealing (a minimal sketch of the basic idea follows below).
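A minimal sketch of that idea, assuming TF-IDF item descriptions and a user profile built from liked items (all names and data are invented):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical item descriptions and the indices of items the user liked
items = ["space opera novel with interstellar war",
         "romantic comedy set in paris",
         "hard science fiction about mars colonisation",
         "cookbook of french pastry recipes"]
liked = [0, 2]

tfidf = TfidfVectorizer(stop_words="english")
item_vecs = tfidf.fit_transform(items)
profile = np.asarray(item_vecs[liked].mean(axis=0))           # user profile from liked-item content
scores = cosine_similarity(profile, item_vecs).ravel()
scores[liked] = -1                                             # do not re-recommend already liked items
print(items[int(scores.argmax())])                             # most content-similar unseen item
```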
Semantic Query Optimisation with Ontology Simulation (dannyijwest)
The Semantic Web is, without a doubt, gaining momentum in both industry and academia. The word "semantic" refers to "meaning": a semantic web is a web of meaning. In this fast-changing and results-oriented practical world, gone are the days when an individual had to struggle to find information on the Internet and knowledge management was the major issue. The semantic web has a vision of linking, integrating and analysing data from various data sources and forming a new information stream; hence, a web of databases connected with each other and machines interacting with other machines to yield results that are user-oriented and accurate. With the emergence of the Semantic Web framework, the naive approach of searching for information on the syntactic web has become cliché. This paper proposes an optimised semantic search of keywords, exemplified by simulating an ontology of Indian universities with a proposed algorithm that enables effective semantic retrieval of information that is easy to access and time saving.
Maximum Spanning Tree Model on Personalized Web Based Collaborative Learning ... (ijcseit)
Web 3.0 is an evolving extension of the current web environment. Information in Web 3.0 can be collaborated on and communicated when queried. The Web 3.0 architecture provides an excellent learning experience to students. Web 3.0 is 3D, media-centric and semantic. Web-based learning has been on a high in recent days. Web 3.0 has intelligent agents as tutors to collect and disseminate answers to students' queries. The learner's fully interactive queries determine the customization of the intelligent tutor. This paper analyses the attributes of the Web 3.0 learning environment, and a maximum spanning tree model for personalized web-based collaborative learning is designed.
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERINGIJDKP
Web mining faces huge problems due to duplicate and near-duplicate web pages. Detecting near-duplicates is very difficult in a large collection of data like the internet. The presence of these web pages plays an important role in performance degradation while integrating data from heterogeneous sources. These pages either increase the index storage space or increase the serving costs. Detecting these pages has many potential applications; for example, it may indicate plagiarism or copyright infringement. This paper concerns detecting, and optionally removing, duplicate and near-duplicate documents which are used to perform clustering of documents. We demonstrated our approach in the web news articles domain. The experimental results show that our algorithm outperforms existing approaches in terms of similarity measures. The near-duplicate and duplicate document identification has resulted in reduced memory in repositories.
Data Mining in Multi-Instance and Multi-Represented Objectsijsrd.com
This document discusses multi-instance learning and data mining. It begins by introducing multi-instance learning, where training data consists of labeled bags containing unlabeled instances. Each web page is treated as a bag and its linked pages are instances. Classification algorithms are adapted to handle this type of data representation. The document then evaluates algorithms for web index recommendation as a multi-instance problem. It compares algorithms that do and do not account for multi-instance characteristics. Finally, it discusses approaches for identifying multi-instance outliers based on single-instance outlier detection methods.
This document discusses web document clustering using a hybrid approach in data mining. It begins with an abstract describing the huge amount of data on the internet and need to organize web documents into clusters. It then discusses requirements for document clustering like scalability, noise tolerance, and ability to present concise cluster summaries. Different existing document clustering approaches are described, including text-based and link-based approaches. The proposed approach uses a concept-based mining model along with hierarchical agglomerative clustering and link-based algorithms to cluster web documents based on both their content and hyperlinks. This hybrid approach aims to provide more relevant clustered documents to users than previous methods.
Abstract— Relationships exist between objects in Wikipedia. The emphasis is on determining relationships between pairs of objects in Wikipedia whose pages can be regarded as separate objects. Two classes of relationships between two objects exist in Wikipedia: an explicit relationship is illustrated by a single link between the two pages for the objects, and an implicit relationship is illustrated by a link structure containing the two pages. Some previously proposed techniques for determining relationships are cohesion-based techniques, which underestimate objects with higher degree values, even though such objects can be significant in constituting relationships in Wikipedia. The other techniques are inadequate for determining implicit relationships because they use only one or two of the following three important factors: the distance, the connectivity and the co-citation.
1. The document proposes techniques to improve search performance by matching schemas between structured and unstructured data sources.
2. It involves constructing schema mappings using named entities and schema structures. It also uses strategies to narrow the search space to relevant documents.
3. The techniques were shown to improve search accuracy and reduce time/space complexity compared to existing methods.
An Incremental Method For Meaning Elicitation Of A Domain OntologyAudrey Britton
This document describes MELIS (Meaning Elicitation and Lexical Integration System), a tool developed to support the annotation of data sources with conceptual and lexical information during the ontology generation process in MOMIS (Mediator envirOnment for Multiple Information Systems). MELIS improves upon MOMIS by enabling more automated annotation of data sources and by extracting relationships between lexical elements using lexical and domain knowledge to provide a richer semantics. The document outlines the MOMIS architecture integrated with MELIS and explains how MELIS supports key steps in the ontology creation process, including source annotation, common thesaurus generation, and relationship extraction.
Discovering Resume Information using linked data dannyijwest
In spite of there being different web applications to create and collect resumes, these web applications mainly lack a common standard data model, data sharing, and data reusing. Although different web applications provide the same quality of resume information, internally there are many differences in terms of data structure and storage, which make it difficult for computers to process and analyse the information from different sources. The concept of Linked Data has enabled the web to share data among different data sources and to discover any kind of information, while resolving issues like heterogeneity, interoperability, and data reusing between different data sources and allowing machine-processable data on the web.
A web content mining application for detecting relevant pages using Jaccard ...IJECEIAES
The tremendous growth in the availability of enormous text data from a variety of sources creates a slew of concerns and obstacles to discovering meaningful information. This advancement of technology in the digital realm has resulted in the dispersion of texts over millions of web sites. Unstructured texts are densely packed with textual information. The discovery of valuable and intriguing relationships in unstructured texts demands more computer processing. So, text mining has developed into an attractive area of study for obtaining organized and useful data. One of the purposes of this research is to discuss text pre-processing of automobile marketing domains in order to create a structured database. Regular expressions were used to extract data from unstructured vehicle advertisements, resulting in a well-organized database. We manually develop unique rule-based ways of extracting structured data from unstructured web pages. As a result of the information retrieved from these advertisements, a systematic search for certain noteworthy qualities is performed. There are numerous approaches for query recommendation, and it is vital to understand which one should be employed. Additionally, this research attempts to determine the optimal value similarity for query suggestions based on user-supplied parameters by comparing MySQL pattern matching and Jaccard similarity.
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACHijcsit
The huge volume of text documents available on the internet has made it difficult to find valuable information for specific users. In fact, the need for efficient applications to extract knowledge of interest from textual documents is vitally important. This paper addresses the problem of responding to user queries by fetching the most relevant documents from a clustered set of documents. For this purpose, a cluster-based information retrieval framework is proposed in order to design and develop a system for analysing and extracting useful patterns from text documents. In this approach, a preprocessing step is first performed to find frequent and high-utility patterns in the data set. Then a Vector Space Model (VSM) is applied to represent the dataset. The system was implemented through two main phases. In phase 1, the clustering analysis process is designed and implemented to group documents into several clusters, while in phase 2, an information retrieval process is implemented to rank clusters according to the user queries in order to retrieve the relevant documents from specific clusters deemed relevant to the query. The results are then evaluated using Recall and Precision (P@5, P@10) of the retrieved results: P@5 was 0.660 and P@10 was 0.655.
Extraction and Retrieval of Web based Content in Web EngineeringIRJET Journal
The document discusses a proposed architecture for parallelizing natural language processing (NLP) operations and web content crawling using Apache Hadoop and MapReduce. The system extracts keywords and key phrases from online articles using NLP techniques like part-of-speech tagging in a Hadoop cluster. Evaluation of the system showed improved storage capacity, faster data processing, shorter search times and accurate information retrieval from large datasets stored in HBase.
Sentimental classification analysis of polarity multi-view textual data using...IJECEIAES
The data and information available in most community environments is complex in nature. Sentimental data resources may possibly consist of textual data collected from multiple information sources with different representations and usually handled by different analytical models. These types of data resource characteristics can form multi-view polarity textual data. However, knowledge creation from this type of sentimental textual data requires considerable analytical efforts and capabilities. In particular, data mining practices can provide exceptional results in handling textual data formats. Besides, in the case of the textual data exists as multi-view or unstructured data formats, the hybrid and integrated analysis efforts of text data mining algorithms are vital to get helpful results. The objective of this research is to enhance the knowledge discovery from sentimental multi-view textual data which can be considered as unstructured data format to classify the polarity information documents in the form of two different categories or types of useful information. A proposed framework with integrated data mining algorithms has been discussed in this paper, which is achieved through the application of X-means algorithm for clustering and HotSpot algorithm of association rules. The analysis results have shown improved accuracies of classifying the sentimental multi-view textual data into two categories through the application of the proposed framework on online polarity user-reviews dataset upon a given topics.
Wide Web content is not explored by any standard search engines[1]. This content is “masked”
and is referred to as the deep web[2].
The deep web is significantly gigantic. According to a July 2000 white paper [3], the deep web is 500 times larger than the surface web and, indeed, continues to grow at this scale [3]. From the surveys presented in [4], [5] we can get a clear understanding of the surface web. There are links buried far down on sites thriving on the surface web which direct us to the news and history of the deep web. It is quite common to come across deep web pages whose links terminate with the extension '.onion'. Such web pages must be accessed through the 'Tor' browser, which connects to the servers hosting the repository and obtains access consent [6].
Today, the 60 largest deep web sites contain around 750 terabytes of data, surpassing the size of the entire surface web 40 times. 95% of the deep web is publicly accessible, free of cost. Search engines such as Google index well over a trillion pages on the World Wide Web, yet there is information on the web that common search engines do not explore. Most of this constitutes databases of information that need to be searched directly from the specific website. A small pocket of the deep web is filled with hyper-secret communities who flock there to escape identification by authorities [7].
1.2 Topic Modeling
Topic models provide a simple way to analyze large volumes of unlabeled text. A topic consists
of a cluster of words that frequently occur together. Using contextual clues, topic models can
connect words with similar meanings and distinguish between uses of words with multiple
meanings. Topic models express the semantic information of words and documents by
‘topics’[8].
1.3 Clustering
Clustering is a division of data into groups of similar objects where each group consists of objects
that are similar between themselves and dissimilar to objects of other groups.
From the machine learning perspective, clustering can be viewed as unsupervised learning of
concepts. We can cluster images, patterns, shopping items, feet, words, documents and many other kinds of objects. Clustering has a wide scope of application in data mining, text mining, information retrieval, statistical computational linguistics, and corpus-based computational lexicography.
Clustering has the following advantages:
1. Clustering improves precision and recall in information retrieval.
2. It organizes the results provided by search engines.
3. It generates document taxonomies.
4. It also generates ontologies and helps in classifying collocations.
A good clustering algorithm needs to have the characteristics mentioned below:
1. Scalability
2. Ability to handle high dimensionality
3. Ability to discover clusters of arbitrary shape
Considering the above factors and the enormous number of documents, our analysis shows that Latent Dirichlet Allocation outperforms other existing clustering algorithms in efficiency [9].
2. RELATED WORK
Several deep web clustering approaches exist. Here, we briefly discuss the motivation for our work. The approach in [11] proposes to cluster deep web pages by extracting the page content and
the form content of a deep web page. Considering the extracted parts as words in a document,
clustering is done. The results prove to be better than those of several previous models. However,
for some datasets, words were clustered under irrelevant topics.
3. PROPOSED METHOD
A topic consists of a cluster of words that frequently occur together. Topic models can connect
words with similar meanings and distinguish between uses of words with multiple meanings. A
variety of topic models have been used to analyze the content of documents and classify the
documents. For example, Latent Semantic Analysis compares documents by representing them as
vectors, and computing their dot product. In our paper, we have used another such algorithm:
Latent Dirichlet Allocation [10].
3.1 Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a generative probabilistic topic model for collections of
discrete data. The generative model assumes that each document is generated by the following process (a minimal sketch of this process is given after the steps). For each document:
1. Firstly, a distribution over topics is chosen randomly.
2. Now, for each word to be generated in the document,
a. A topic is chosen from the distribution created in Step 1.
b. Next, a word is chosen randomly from the corresponding distribution over
vocabulary for the topic.
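As an illustration, the following is a minimal sketch of this generative process in Python using NumPy. The toy vocabulary, the number of topics K and the Dirichlet hyperparameters α and β are assumptions made only for the example; they are not the settings used in our experiments.

import numpy as np

rng = np.random.default_rng(0)

vocab = ["flight", "hotel", "car", "book", "music", "movie", "job", "auto"]  # toy vocabulary (assumed)
K = 3                     # number of topics (assumed)
alpha, beta = 0.1, 0.01   # symmetric Dirichlet hyperparameters (assumed)

# Per-topic distributions over the vocabulary, drawn once for the whole corpus.
phi = rng.dirichlet([beta] * len(vocab), size=K)

def generate_document(n_words):
    # Step 1: choose a distribution over topics for this document.
    theta = rng.dirichlet([alpha] * K)
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)            # Step 2a: choose a topic from the document's topic distribution
        w = rng.choice(len(vocab), p=phi[z])  # Step 2b: choose a word from that topic's distribution over the vocabulary
        words.append(vocab[w])
    return words

print(generate_document(10))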
The Dirichlet prior on the per-document topic distribution is given by

p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}    (1)
The joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is

p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)    (2)
α - hyperparameter on the mixing proportions (K-vector, or scalar if symmetric).
β - hyperparameter on the mixture components (V-vector, or scalar if symmetric).
θ_m - parameter notation for p(z | d = m), the topic mixture proportion for document m. One proportion for each document, Θ = {θ_m}_{m=1}^{M} (an M × K matrix).
φ_k - parameter notation for p(t | z = k), the mixture component of topic k. One component for each topic, Φ = {φ_k}_{k=1}^{K} (a K × V matrix).
N_m - document length (document-specific), here modeled with a Poisson distribution with constant parameter ξ.
z_{m,n} - mixture indicator that chooses the topic for the n-th word in document m.
w_{m,n} - term indicator for the n-th word in document m.
In order to retrieve topics in a corpus, this generative process is reversed. The two distributions:
distribution over topics and distribution over vocabulary for the topics are discovered by the
reverse process. This process uses Gibbs Sampling, which is commonly used as a means of
statistical inference. It is a randomized algorithm (i.e. an algorithm that makes use of random
numbers, and hence may produce different results each time it is run). Gibbs sampling generates
a Markov chain of samples, each of which is correlated with nearby samples.
3.2 Parsing
LDA is a document clustering algorithm. To give input to LDA as documents, we need to parse
the input XML/HTML files. As mentioned earlier, the Page Contents and the Form Contents are
extracted from the web pages.
1. Page Content: Values and textual content visible on the web page.
2. Form Content: The values of the attributes in the form.
Most deep web pages are blocked by forms (e.g. login forms). To take advantage of this, we take into consideration the form attribute values, which can reveal important information about the type of a deep web page. Thus, this approach covers a large portion of the deep web databases for clustering.
3.3 Visiting Inner Links
For some web pages, it is possible that there is very little information related to the topic it comes
under. However, the webpage may contain links to other related sites, or even its home page. If
we were able to visit the inner HTML links in the webpage, we could gather further information
about the topic the webpage talks about. For this purpose, we perform parsing individually on all
the links inside the webpage. This way, we also gather more relevant test data than what is
already available.
However, it is also possible to have irrelevant links inside the webpage. For example, there might be a quick link to a search engine, or advertisements. Since the web pages discovered by visiting the inner links are not as reliable, we must give them lower weightage than the original data. We
achieve this by increasing the frequency of the words in the original webpage, i.e., we repeat the
words in the original webpage for a specific number of times.
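The following minimal sketch shows this weighting step under our assumptions; the repetition factor and the function name are illustrative only, not the exact code used in our tool.

# Give the original (base) webpage higher weightage than pages found via its inner links
# by repeating the base page's words a fixed number of times in the output document.
N_REPEAT = 5  # assumed repetition factor for the base webpage

def build_weighted_document(base_words, inner_link_words, n_repeat=N_REPEAT):
    # base_words: words parsed from the original deep web interface page
    # inner_link_words: words parsed from pages reached through inner HTML links
    weighted = base_words * n_repeat       # repeating increases the word frequencies of the base page
    return weighted + inner_link_words     # inner-link words are added once, with lower weightage

doc = build_weighted_document(["flight", "departure", "city"], ["airline", "news", "advertisement"])
print(doc.count("flight"), doc.count("airline"))   # 5 versus 1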
3.4 Gibbs sampling
Gibbs sampling belongs to the class of sampling methods known as Markov Chain Monte Carlo (MCMC). We use it to sample from the posterior distribution p(Z | W), given the training data W, i.e., the words observed in the documents.
Gibbs sampling is commonly used as a means of statistical inference, especially Bayesian
inference. It is a randomized algorithm (i.e. an algorithm that makes use of random numbers, and
hence may produce different results each time it is run), and is an alternative to deterministic
algorithms for statistical inference such as variational Bayes or the expectation-maximization
algorithm (EM). Gibbs sampling generates a Markov chain of samples, each of which is
correlated with nearby samples.
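For illustration, a simplified collapsed Gibbs sampler for LDA is sketched below; the symmetric hyperparameters are assumed values, the update rule is the standard collapsed Gibbs sampling formula (see [13]), and the sketch is not the exact sampler used in our implementation.

import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    # docs: list of documents, each a list of word ids in the range [0, V)
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # document-topic counts
    nkw = np.zeros((K, V))           # topic-word counts
    nk = np.zeros(K)                 # total words assigned to each topic
    z = [[rng.integers(K) for _ in d] for d in docs]   # random initial topic assignments
    for m, d in enumerate(docs):
        for n, w in enumerate(d):
            k = z[m][n]; ndk[m, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for m, d in enumerate(docs):
            for n, w in enumerate(d):
                k = z[m][n]
                ndk[m, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1     # remove the current assignment
                # Full conditional p(z_{m,n} = k | rest), up to a normalizing constant.
                p = (ndk[m] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())               # resample a topic for this word
                z[m][n] = k
                ndk[m, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw

The 'topics per document' distribution Θ and the 'words per topic' distribution Φ can then be estimated by normalizing the returned count matrices after adding the corresponding hyperparameters.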
4. IMPLEMENTATION
Fig 2: Module Diagram
We split the implementation into two parts:
1. Parsing
2. LDA
4.1 Parsing
Input: Deep web interface webpage (HTML/XML)
Output: Document with Form Contents and Page Contents as words.
4.1.1 Algorithm
1. From the input HTML/XML, gather all the textual content. Textual content refers to all the
text that is visible on the webpage.
2. For each <form> element, gather all the values of the attributes for each of the child tags. For
example, in the following we wish to extract the "Search by Departure City" and "departure
city".
<form>
<attrgroup>
<attr name="Search by Departure City" ename="departure city">
…
…
</attrgroup>
</form>
3. The words extracted from Step 1 and Step 2 constitute the extracted words. Write the
extracted words into the output document.
a. If the webpage is the original webpage interface, repeat the extracted word for a fixed
number of times (say N), and write to the output document.
b. If the webpage is among the ones discovered by visiting the inner links, write them to
the document as it is.
4. Remove all the stop-words. Stop-words are certain regularly used words which do not convey any meaning on their own, like 'as', 'the', 'of', 'a', 'on', etc.
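A minimal sketch of this parsing algorithm is given below; it assumes well-formed XML input and uses the ElementTree API [14]. The input file name, the stop-word set and the repetition factor are illustrative assumptions, not the exact values used in our tool.

import xml.etree.ElementTree as ET

def parse_interface(path, stop_words, is_base_page=True, n_repeat=5):
    root = ET.parse(path).getroot()            # assumes a well-formed XML interface file
    words = []
    # Step 1: gather all the textual content of the page.
    for text in root.itertext():
        words.extend(text.split())
    # Step 2: for each <form> element, gather the attribute values of its child tags.
    for form in root.iter("form"):
        for child in form.iter():
            for value in child.attrib.values():
                words.extend(value.split())
    # Step 3a: repeat the extracted words for the original interface page to give it more weightage.
    if is_base_page:
        words = words * n_repeat
    # Step 4: remove stop-words.
    return [w for w in words if w.lower() not in stop_words]

stops = {"as", "the", "of", "a", "on"}                 # a tiny illustrative stop-word set
print(parse_interface("departure_form.xml", stops))    # "departure_form.xml" is a hypothetical input file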
4.1.2 Output Format
Google App Engine has been used for creating the web-tool for this project. The parser code
written in Python takes in multiple HTML/XML files for input and gives the output after
extracting the required words. Further, in order to remove grammatical constructs like 'a', 'the', 'on', we include a file of stop-words. The words obtained from the App Engine (Python code) are checked against this file, and matching words are removed.
The input format for LDA is as follows:
<Number of documents: N>
<Document 1 - space separated words>
<Document 2 - space separated words>
…
…
<Document N - space separated words>
The Python parser code produces its output file in the above-mentioned format. This output file is then given as input to the LDA code.
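As a small sketch, a helper that writes the parsed documents in this format could look as follows; the output file name is an assumption.

def write_lda_input(documents, path="lda_input.txt"):
    # documents: a list of word lists, one list per parsed web page
    with open(path, "w") as f:
        f.write("%d\n" % len(documents))        # <Number of documents: N>
        for words in documents:
            f.write(" ".join(words) + "\n")     # <Document i - space separated words>

write_lda_input([["flight", "departure", "city"], ["book", "author", "title"]])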
4.2 Gibbs LDA
Input:
1. A single file constituting the set of documents received after the parsing stage; multiple web pages are parsed and given as input to LDA.
2. Number of topics to be discovered.
Output: Clusters, i.e., a set of topics with the most frequently occurring words in each topic.
4.2.1 Algorithm
As described earlier, we feed the Gibbs sampled data to our LDA model to extract topics from
the given corpus.
5. EXPERIMENTAL RESULTS
In order to evaluate the performance of our approach, we tested it over the TEL-8 dataset mentioned in [8], from the UIUC Web integration repository [12]. The repository contains web search forms of 447 sources originally classified into 8 domains. The term TEL-8 refers to eight different web source domains belonging to three major categories (Travel, Entertainment and Living). The Travel group is related to car rentals, hotels and airfares. The Entertainment group has books, movies and music records interfaces. Finally, the Living group contains jobs and automobiles related query interfaces.
5.1 Performance measure
To evaluate the performance of our method, we calculated the precision of the outputs generated. In a classification task, the precision for a class is the number of true positives divided by the total number of elements labeled as belonging to the positive class.
The definitions of the related terms are shown below:
1. TP / True Positive: case was positive and predicted positive
2. TN / True Negative: case was negative and predicted negative
3. FP / False Positive: case was negative but predicted positive
Precision = True Positives / (True Positives + False Positives) [13]
We tried various combinations of these 8 domains, and created 4 small datasets and 2 large
datasets. We compared the approach used in [8] and our approach over the same dataset, and
found that our method performs well in most cases.
Fig 3: Individual topic distribution analysis. 'Small' refers to a small dataset, and 'Large' refers to a large dataset. Several topics are discovered within each dataset.
Though this analysis shows that our method performs better, it does not give the complete picture. For a better overall picture of the total topic distribution in each of the datasets, we use the overall precision for each dataset.
We use the same formula for precision as above, and average it over the number of topics discovered. Therefore, we get an averaged precision value for each dataset.
Averaged Precision = Sum of all precision values / Number of topics discovered
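As a small illustration, the per-topic precision and the averaged precision for one dataset can be computed as follows; the true-positive and false-positive counts are hypothetical.

def precision(tp, fp):
    # Precision = True Positives / (True Positives + False Positives)
    return tp / (tp + fp)

# Hypothetical (true positive, false positive) counts for the topics discovered in one dataset.
topic_counts = [(18, 2), (15, 5), (12, 3)]
per_topic = [precision(tp, fp) for tp, fp in topic_counts]

# Averaged Precision = sum of all precision values / number of topics discovered.
averaged_precision = sum(per_topic) / len(per_topic)
print(per_topic, averaged_precision)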
Fig 4: Performance evaluation for datasets
The above chart further supports the claim that the new method of visiting inner HTML links works better.
6. SUMMARY AND CONCLUSION
The requirement for an efficient and scalable clustering algorithm for deep web pages is fulfilled
by our approach. The results show that sampling of data is useful for clustering heterogeneous
deep Web sources. Our proposed Gibbs-LDA based technique is more efficient than the existing
techniques.
It gives more accurate results when the links in the web pages are parsed and their web page content is also taken into account. As LDA produces soft clusters, it assigns a probability to each document for each cluster. Hence, our tool is suitable for scenarios where the sources are sparsely distributed over the web.
In order to increase the weightage of the content in the base webpage over the content parsed from its inner links, we have increased the frequency of all the words that appear in the base webpage, which increases the space complexity. New methods could be proposed in the future to reduce this space requirement.
ACKNOWLEDGEMENTS
We would like to thank our Project guide, Dr. E. Sivasankar and Ms.Selvi Chandran, Research
Scholar, CSE, NIT-Trichy. Their input, guidance and encouragement made this project possible
and a success.
REFERENCES
[1] The Deep Web: Surfacing Hidden Value. Accessible at http://brightplanet.com, (2000).
[2] Madhavan, J., Cohen, S., Dong, X. L., Halevy, A. Y., Jeffery, S. R., Ko, D. and Yu, C. (2007), "Web-scale data integration", Conference on Innovative Data Systems Research (CIDR), 342-350.
[3] Tor: The Second-Generation Onion Router. Accessible at www.onion-router.net/Publications/tor-design.pdf
[4] Vincenzo Ciancaglini, Marco Balduzzi, Max Goncharov, and Robert McArdle, "Deepweb and Cybercrime: It's Not All About TOR", Trend Micro.
[5] David M. Blei (2012), "Probabilistic topic models", Communications of the ACM, Vol. 55, No. 4.
[6] Estivill-Castro, Vladimir (2002), "Why so many clustering algorithms: a position paper", ACM SIGKDD Explorations Newsletter 4.1.
[7] Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003), "Latent Dirichlet Allocation", Journal of Machine Learning Research.
[8] Umara Noor, Ali Daud, Ayesha Manzoor (2013), "Latent Dirichlet Allocation based Semantic Clustering of Heterogeneous Deep Web Sources", In: Intelligent Networking and Collaborative Systems (INCoS), 132-138.
[9] Unsupervised Learning. Accessible at http://mlg.eng.cam.ac.uk/zoubin/papers/ul.pdf, (2004).
[10] Thanaa M. Ghanem and Walid G. Aref (2004), "Databases deepen the web", IEEE Computer, 37(1), 116-117.
[11] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet Wiener (2004), "A large-scale study of the evolution of web pages", In Proceedings of the 12th International World Wide Web Conference, 669-678.
[12] Steve Lawrence and C. Lee Giles (1999), "Accessibility of information on the web", Nature, 400(6740), 107-109.
[13] A Theoretical and Practical Implementation Tutorial on Topic Modeling and Gibbs Sampling. Accessible at http://u.cs.biu.ac.il/~89-680/darling-lda.pdf, (2011).
[14] The ElementTree XML API. Accessible at https://docs.python.org/2/library/xml.etree.elementtree.html
[15] The Bayesian Approach. Accessible at https://www.cs.cmu.edu/~scohen/psnlp-lecture6.pdf
[16] Google App Engine: Platform as a Service. Accessible at https://cloud.google.com/appengine/docs
[17] Stop Words. Accessible at http://www.webconfs.com/stop-words.php
[18] The UIUC Web integration repository. Accessible at http://metaquerier.cs.uiuc.edu/repository
[19] Precision and recall. Accessible at en.wikipedia.org/wiki/Precision_and_recall
[20] Clustering: Overview and K-means algorithm. Accessible at http://www.cs.princeton.edu/courses/archive/spr11/cos435/Notes/clustering_intro_kmeans_topost.pdf