On the use of side information for mining text data

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 6, JUNE 2014

Abstract
 In many text mining applications, side-information is available along with
the text documents. Such side-information may be of different kinds, such
as document provenance information, the links in the document, user-access
behavior from web logs, or other non-textual attributes which are
embedded into the text document. Such attributes may contain a
tremendous amount of information for clustering purposes.
 However, the relative importance of this side-information may be difficult
to estimate, especially when some of the information is noisy. In such
cases, it can be risky to incorporate side-information into the mining
process, because it can either improve the quality of the representation for
the mining process, or can add noise to the process.
 Therefore, we need a principled way to perform the mining process, so as
to maximize the advantages from using this side information. In this
paper, we design an algorithm which combines classical partitioning
algorithms with probabilistic models in order to create an effective
clustering approach.
 We then show how to extend the approach to the classification problem.
We present experimental results on a number of real data sets in order to
illustrate the advantages of using such an approach.
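
The idea of combining a partitioning step with a probabilistic model of the side-information can be illustrated with a small sketch. This is a toy illustration under stated assumptions, not the algorithm from the paper: the scoring rule (cosine similarity plus a weighted, Laplace-smoothed estimate of P(cluster | attribute)), the `weight` parameter, the function names, and the sample documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-count mappings."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroids_of(texts, labels, k):
    """Sum term counts per cluster; cosine ignores scale, so sums suffice."""
    sums = [Counter() for _ in range(k)]
    for text, label in zip(texts, labels):
        sums[label].update(text)
    return sums


def attribute_posteriors(sides, labels, k):
    """Estimate P(cluster | attribute) from current labels (Laplace-smoothed)."""
    counts = defaultdict(lambda: [1.0] * k)
    for attrs, label in zip(sides, labels):
        for a in attrs:
            counts[a][label] += 1.0
    return {a: [c / sum(cs) for c in cs] for a, cs in counts.items()}


def cluster(texts, sides, seeds, weight, iterations=10):
    """Alternate a text-based partitioning step with a side-information step.

    weight is an assumed knob: 0.0 ignores side-information entirely,
    larger values trust it more (and risk amplifying noisy attributes).
    """
    k = len(seeds)
    cents = [Counter(texts[i]) for i in seeds]
    labels = [max(range(k), key=lambda j: cosine(t, cents[j])) for t in texts]
    for _ in range(iterations):
        cents = centroids_of(texts, labels, k)
        post = attribute_posteriors(sides, labels, k)
        new_labels = []
        for text, attrs in zip(texts, sides):
            def score(j):
                side = (sum(post[a][j] for a in attrs) / len(attrs)
                        if attrs else 1.0 / k)
                return cosine(text, cents[j]) + weight * side
            new_labels.append(max(range(k), key=score))
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Hypothetical data: document 4 has ambiguous text but a telling author.
texts = [
    {"cluster": 2, "report": 1},
    {"cluster": 1, "mining": 1},
    {"soccer": 2, "report": 1},
    {"soccer": 1, "match": 1},
    {"report": 1},
]
sides = [{"author:ann"}, {"author:ann"},
         {"author:bob"}, {"author:bob"}, {"author:bob"}]

text_only = cluster(texts, sides, seeds=[0, 2], weight=0.0)
with_side = cluster(texts, sides, seeds=[0, 2], weight=2.0)
print(text_only, with_side)
```

In this toy data the last document contains only a word shared by both topics, so text alone leaves it in the first cluster, while a sufficiently large `weight` moves it to the cluster of the other documents by the same author. The same knob shows the risk noted above: if the author attribute were noisy, trusting it heavily would degrade the clusters.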

Existing System
 The term text analytics describes a set of linguistic, statistical, and machine
learning techniques that model and structure the information content of
textual sources for business intelligence, exploratory data analysis, research, or
investigation.
 Stop words are words which are filtered out prior to, or after, processing of
natural language data (text). There is no single definitive list of stop words
that all tools use, and such a filter is not always applied. Some tools
specifically avoid removing them in order to support phrase search.
 In most cases, morphological variants of words have similar semantic
interpretations and can be considered as equivalent for the purpose of IR
applications. For this reason, a number of so-called stemming algorithms, or
stemmers, have been developed, which attempt to reduce a word to its stem or
root form. Thus, the key terms of a query or document are represented by
stems rather than by the original words.
 This not only means that different variants of a term can be conflated to a
single representative form – it also reduces the dictionary size, that is, the
number of distinct terms needed for representing a set of documents. A
smaller dictionary size results in a saving of storage space and processing time.
 Classification systems are used in many different areas. Store shelves use a
classification system that sorts products, a filing cabinet uses one that sorts
files, and a library uses one that sorts books by genre. What other examples of
classification have you seen?
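
The stop-word filtering and stemming steps described in this section, and their effect on dictionary size, can be sketched as follows. This is a minimal sketch, not a production pipeline: the stop-word list, the suffix-stripping rule in `toy_stem` (a deliberately crude stand-in for a real stemmer such as Porter's), and the sample documents are all assumptions made for illustration.

```python
import re

# Illustrative stop-word list (assumption); each real tool uses its own.
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "an", "and", "of", "to", "in"}

# Suffixes removed by the toy stemmer, checked in this order (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def dictionary(docs, stem=False, filter_stop_words=False):
    """Return the set of distinct terms needed to represent the documents."""
    vocab = set()
    for doc in docs:
        tokens = tokenize(doc)
        if filter_stop_words:
            tokens = remove_stop_words(tokens)
        if stem:
            tokens = [toy_stem(t) for t in tokens]
        vocab.update(tokens)
    return vocab


docs = [
    "The user clusters the documents",
    "Clustering of a document is connected to the links",
    "Users clustered linked documents",
]
raw = dictionary(docs)
reduced = dictionary(docs, stem=True, filter_stop_words=True)
print(len(raw), "terms before,", len(reduced), "terms after")
```

On these three sample sentences the variants "clusters", "clustering", and "clustered" all collapse to the single stem "cluster", which is the conflation effect described above; the dictionary shrinks accordingly.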

Proposed System
 A comparative analysis is performed between the URL and the
document. Supporting links are crawled by analyzing the
URL.
 Web mining is the application of data mining techniques to
discover patterns from the Web. According to the analysis
target, web mining can be divided into three types: Web usage
mining, Web content mining, and Web structure mining.
 Any group of words can be chosen as the stop words for a
given purpose. For some search engines, these are some of
the most common, short function words, such as 'the', 'is',
'at', 'which', and 'on'.
 In this case, stop words can cause problems when searching
for phrases that include them, particularly in names such as
'The Who', 'The The', or 'Take That'. Other search engines
remove some of the most common words, including lexical
words.
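
The phrase-search problem with names such as 'The Who' can be reproduced with a small sketch. The whitespace tokenizer, the stop-word list, and the function names below are assumptions chosen for illustration; they do not describe how any particular search engine works.

```python
# Short function words from the text above, used as an illustrative list.
STOP_WORDS = {"the", "is", "at", "which", "on"}


def tokens(text, drop_stop_words):
    """Lowercase and split on whitespace, optionally dropping stop words."""
    words = text.lower().split()
    if drop_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return words


def phrase_match(query, document, drop_stop_words):
    """True if the query tokens occur as a contiguous run in the document."""
    q = tokens(query, drop_stop_words)
    d = tokens(document, drop_stop_words)
    if not q:
        return False  # the whole query was stop words; nothing left to match
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


concert = "the who played at leeds"
unrelated = "who is at the door"

# Without filtering, only the document that contains the phrase matches.
exact_hit = phrase_match("The Who", concert, drop_stop_words=False)
exact_miss = phrase_match("The Who", unrelated, drop_stop_words=False)

# With filtering, the query shrinks to "who" and the unrelated text matches too.
filtered_false_hit = phrase_match("The Who", unrelated, drop_stop_words=True)

# A query made only of stop words disappears entirely.
lost_query = phrase_match("The The", "the the released an album",
                          drop_stop_words=True)
print(exact_hit, exact_miss, filtered_false_hit, lost_query)
```

The sketch shows both failure modes: filtering turns 'The Who' into the single word 'who', which produces a false match, and it reduces 'The The' to an empty query. This is one reason some tools keep stop words when phrase search must be supported.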

System Architecture
• HARDWARE REQUIREMENTS:
Processor : Core 2 Duo
Speed : 2.2 GHz
RAM : 2 GB
Hard Disk : 160 GB
• SOFTWARE REQUIREMENTS:
Platform : .NET (Visual Studio 2010), ASP.NET, .NET Framework 4.0
Database : SQL Server 2008 R2
