A Comprehensive Study On Data Mining Process With Distribution
Kethavarapu Uma Pavan Kumar1
umapawan.dwh@gmail.com--9014998455
Muppaneni Satyanvesh2
anvesh.chowdaryit123@gmail.com
1 M.Tech CSE, Nalanda Institute of Engineering and Technology,
Sattenapalli, Guntur (Andhra Pradesh)
2 B.Tech IT, EVM College of Engineering and Technology,
Narsaraopet, Guntur (Andhra Pradesh)
ABSTRACT
This paper presents the concept of distributed data mining and the
advantages of a combined architecture that uses distribution as the backbone
of the mining process. To demonstrate distributed data mining, we present an
architecture with various levels of users, data, and networks. The
architecture uses DMQL (Distributed Mining Query Language) and a converter
mechanism. Once the required data has been retrieved, a security layer in the
architecture guards against malicious software or data. The entire
architecture relies on cloud computing, which involves migrating the data
mining process across different kinds of networks.
KEYWORDS:-
Distribution, Mining, DMQL (Distributed Mining Query Language),
Cloud Computing, Converter.
INTRODUCTION TO DISTRIBUTION:
Distribution plays a vital role in handling data that is shared among a
variety of users. With distribution it is possible to integrate data held at
different locations and to broadcast common data to many kinds of users.
Nowadays distribution is used mostly with general networks, databases,
operating systems, data warehouses, and data mining.
In the case of networks, the individual networks simply send and receive
information. If we apply the property of distribution to such a network, it
becomes a distributed system. A distributed system lets the user run
applications under a single system image. Because of this, the user can
interact with any system in the communication network and has the impression
of using one common system even though he is interacting with different
systems.
In the case of databases, a single system, known as the server, takes
responsibility for all the systems connected to it. The problem with this
kind of centralized database is its lack of reliability and availability: if
the server fails, the data becomes unreachable. If the property of
distribution is integrated with ordinary databases, this lack can be
overcome. Distributed databases now play an important role in data
management tasks such as transactions, concurrency control, and query
optimization.
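The availability argument above can be illustrated with a minimal sketch. It
assumes the same data is replicated at two sites and that a read simply moves
on to the next copy when one site is down. The `Replica` class, the
`read_with_failover` function, and the site names are hypothetical and are
not part of the architecture described in this paper.

```python
class Replica:
    """A hypothetical in-memory copy of the data held at one site."""

    def __init__(self, name, rows, online=True):
        self.name = name
        self.rows = dict(rows)
        self.online = online

    def read(self, key):
        if not self.online:
            raise ConnectionError(f"{self.name} is unreachable")
        return self.rows[key]


def read_with_failover(replicas, key):
    """Try each replica in turn and return the first successful read."""
    for replica in replicas:
        try:
            return replica.read(key), replica.name
        except ConnectionError:
            continue  # this site is down, try the next copy
    raise ConnectionError("no replica is reachable")


# Assumed sample data: one site is offline, the other still serves the read.
rows = {"order-1": "shipped"}
sites = [Replica("guntur", rows, online=False), Replica("hyderabad", rows)]
value, served_by = read_with_failover(sites, "order-1")
```

With a single centralized server, the same failure would make the read
impossible; with a second copy, the request is still answered.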
A general-purpose operating system may also be integrated with the property
of distribution to gain the benefits of versatile process management, memory
management, and communication maintained with the help of distributed
algorithms.
A data warehouse is a large data repository that provides uniform data to the
user by integrating data in various formats from different sources such as
XML, ERP systems, and flat files (Excel worksheets, COBOL files, documents).
Distributing the data warehouse yields further benefits for users who are
scattered geographically: users located on different continents can all
obtain the data they require.
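As a rough illustration of integrating formats into uniform data, the sketch
below reads one XML source and one flat (CSV) source into the same record
shape. The sample data, the field names `product` and `units`, and the helper
functions are assumptions made for this example only.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed sample sources in two different formats.
XML_SOURCE = """<sales>
  <sale product="pen" units="10"/>
  <sale product="ink" units="4"/>
</sales>"""

CSV_SOURCE = "product,units\npaper,25\npen,5\n"


def rows_from_xml(text):
    """Convert each <sale> element into a uniform dictionary record."""
    root = ET.fromstring(text)
    return [{"product": s.get("product"), "units": int(s.get("units"))}
            for s in root.findall("sale")]


def rows_from_csv(text):
    """Convert each CSV line into the same dictionary shape."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"product": r["product"], "units": int(r["units"])} for r in reader]


# The "warehouse" here is simply the combined list of uniform records.
warehouse = rows_from_xml(XML_SOURCE) + rows_from_csv(CSV_SOURCE)
total_pens = sum(r["units"] for r in warehouse if r["product"] == "pen")
```

Once every source has been converted to the same shape, a single query such
as the pen total can run over all of them without knowing where each record
came from.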
DATA MINING:
Mining is the process of searching large databases for the required data. It
is also known as knowledge discovery in databases. Mining makes it possible
to extract the most interesting patterns from a repository, which may be a
database or a data warehouse. Retrieving only the required data while
avoiding unnecessary data is the central task in searching for knowledge.
The mining process reduces the complexity of the search by providing a
number of algorithms. Fast access to the required data is the most striking
advantage of the mining process. Searching for required data in databases
and data warehouses is done by online analytical processing (OLAP); the same
can also be done through data mining.
Using OLAP requires functional knowledge on the part of the user. For
example, if a company's CEO wants to see the previous season's sales, the CEO
must know the season information as well as the products whose sales
information he wants. OLAP therefore has limitations when generating reports
to user requirements: the user needs to understand the context of the query.
This limitation can be overcome by using the data mining process.
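The difference can be sketched in a few lines. In the OLAP-style function the
caller must name the season and the product in advance, while the mining-style
function discovers which products are frequently bought together without being
told what to look for. The sales records, the function names, and the support
threshold are hypothetical, and simple pair counting stands in here for a full
mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sales records: (season, set of products bought together).
SALES = [
    ("winter", {"pen", "ink"}),
    ("winter", {"pen", "ink", "paper"}),
    ("summer", {"pen", "paper"}),
    ("summer", {"ink", "pen"}),
]


def olap_style_count(season, product):
    """The user must already know which season and product to ask about."""
    return sum(1 for s, basket in SALES if s == season and product in basket)


def frequent_pairs(min_support):
    """Count every product pair and keep those bought together often enough."""
    counts = Counter()
    for _, basket in SALES:
        counts.update(combinations(sorted(basket), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


winter_pens = olap_style_count("winter", "pen")
patterns = frequent_pairs(min_support=3)
```

Here `patterns` contains only the pair that appears in at least three
baskets, a result the user obtained without supplying any query context.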
COMBINING MINING PROCESS
WITH DISTRIBUTION:-
Distribution concentrates on sharing data from various locations, and it
makes it possible to replicate the same data to multiple systems at various
client sites, so remote login, remote access, and remote computation are all
carried out through distribution. The backbone of distribution is basically
a LAN (Local Area Network), and it may range from a LAN to cloud computing.
The mining process is meant to retrieve the data most closely related to the
user's query. The basic mining process works in a limited environment, such
as a single server with one or more databases or data warehouses. If we
implement the mining process on top of distribution, it yields distributed
search patterns, and those patterns are more valuable and more useful than
those of the normal mining process.
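One possible way to picture mining on top of distribution is sketched below:
each site summarizes its own transactions, and a coordinator merges the
summaries into global counts. This is an illustrative assumption, not the
mechanism defined in this paper; the site names, the data, and the threshold
of 4 are all hypothetical.

```python
from collections import Counter

# Hypothetical transaction logs held at three separate sites.
SITE_DATA = {
    "site_a": [["pen", "ink"], ["pen", "paper"]],
    "site_b": [["pen", "ink"], ["ink"]],
    "site_c": [["paper"], ["pen", "ink", "paper"]],
}


def local_counts(transactions):
    """Runs at one site: count how many transactions contain each item."""
    counts = Counter()
    for items in transactions:
        counts.update(set(items))
    return counts


def merge_counts(per_site_counts):
    """Runs at the coordinator: add up the summaries sent by every site."""
    total = Counter()
    for counts in per_site_counts:
        total.update(counts)
    return total


global_counts = merge_counts(local_counts(t) for t in SITE_DATA.values())
frequent = sorted(item for item, n in global_counts.items() if n >= 4)
```

No single site sees an item four times in this data, so the frequent items
only emerge once the per-site summaries are combined.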
In the reference architecture, data is gathered from various sources such as
databases, data warehouses, flat files, and ERP systems, and that data is
made available to the distributed network environment. Users may range from
ordinary end users to the MDs and CEOs of the company. The user first sends
a request to the DMQL (Data Mining Query Language) interface, which is very
similar to SQL (Structured Query Language). Because users differ, a converter
is required to serve the different kinds of users. The converter transforms
the user's query according to the requirements of the mining process and then
searches all the available sources to find the most interesting patterns.
The cloud represents the different topologies of the network, and it
facilitates the integration of these many kinds of networks so that the
interesting patterns can be filtered from the available sources. The
architecture includes a security mechanism that keeps malicious programs,
code, software, or data out of the system by means of antispyware and other
mechanisms.
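The converter step can be sketched roughly as follows, assuming a user request
is a small dictionary and each source needs its own search routine. The
sources, the `convert` and `run` functions, and the request shape are
hypothetical; the sketch does not reproduce DMQL syntax or the security layer.

```python
# Hypothetical sources, each exposing rows in its own shape.
DATABASE = [{"product": "pen", "region": "south", "units": 10}]
FLAT_FILE = ["ink,south,4", "pen,north,7"]


def convert(user_query):
    """Turn a user-level request into one search function per source."""
    product = user_query["product"]

    def search_database():
        return [r for r in DATABASE if r["product"] == product]

    def search_flat_file():
        rows = []
        for line in FLAT_FILE:
            name, region, units = line.split(",")
            if name == product:
                rows.append({"product": name, "region": region,
                             "units": int(units)})
        return rows

    return [search_database, search_flat_file]


def run(user_query):
    """Run every converted search and merge the results into one list."""
    results = []
    for search in convert(user_query):
        results.extend(search())
    return results


found = run({"product": "pen"})
```

The user states the request once; the converter is the only component that
needs to know how each source stores its data.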
COMPLEXITIES:-
• Combining the mining process with distribution is somewhat difficult.
• Gathering the required data from an unlimited number of sources is also
a tough task.
• Converting between various formats is also complex.
• Integrating different kinds of data and presenting that data in the
format the user requested is not easy either.
REQUIREMENTS
• Cloud computing
• Databases, data warehouses, flat files, or ERP systems
• Converters
• DMQL interface
• Users of various levels
TYPES OF MINING PROCESS
The mining process supports various forms of search, including text mining,
web mining, web content mining, web usage mining, spatial mining, and
multimedia data mining. Depending on the requirement, the corresponding
mining process can be used.
ADDED BENEFITS IN MINING
PROCESS IN CASE OF
DISTRIBUTION
The main benefit of combining distribution with the mining process is the
availability and reliability of the data, presented as a single system image.
REAL TIME USAGE OF DATA
MINING WITH DISTRIBUTION
General mining processes, such as search engines like Google and other
browsing techniques used on the internet and other public networks, involve
distribution by default. Almost every general net-based search process
therefore follows the distribution mechanism to retrieve the required
pattern from bulk sources.
PROS AND CONS
PROS
• The main purpose of distributed data mining is to extract interesting
patterns from a variety of sources, which is not possible in the normal
mining process.
• Serving various levels of users is possible through distributed data
mining.
• Working with various kinds of data formats is also possible.
CONS
• With this architecture it is not possible to locate exactly where the
data resides.
• Unnecessary data may creep into the results of the user's query.
• Sometimes the converter may not be able to transform the source data
into the format the user requires.
CONCLUSION:-
Finally, we conclude that distributed data mining enables the user to obtain
interesting patterns from multiple sources, and that it also handles data in
various formats and integrates that data into a common format. We also
conclude that the architecture provides the required information by
combining various sources, and that this processing may involve some
complexities and other problems in retrieving the exact data.