DISTRIBUTED DATABASE SYSTEM 1
CSIT Dept’s SGBAU Amravati.
Practical No: 1
Aim: Study of Rapid Miner Tools.
Tool: Rapid Miner 7.6.001
Theory:
Rapid Miner provides data mining and machine learning procedures including:
data loading and transformation (Extract, transform, load, a.k.a. ETL), data
preprocessing and visualization, modelling, evaluation, and deployment. Rapid Miner
is written in the Java programming language. It uses learning schemes and attribute
evaluators from the Weka machine learning environment and statistical modelling
schemes from the R project.
Rapid Miner can define analytical steps (similar to R) and be used for
analyzing data generated by high throughput instruments such as those used in
genotyping, proteomics, and mass spectrometry. It can be used for text mining,
multimedia mining, feature engineering, data stream mining, development of
ensemble methods, and distributed data mining. Rapid Miner functionality can be
extended with additional plugins.
Rapid Miner provides a GUI to design an analytical pipeline (the "operator
tree"). The GUI generates an XML (Extensible Markup Language) file that defines the
analytical processes the user wishes to apply to the data. Alternatively, the engine can
be called from other programs or used through its API, and individual functions can be
called from the command line.
Rapid Miner is open-source and is offered free of charge as a Community
Edition released under the GNU AGPL.
Rapid Miner is a leading open-source data mining solution worldwide, owing
to the combination of its leading-edge technologies and its functional range.
Applications of Rapid Miner cover a wide range of real-world data mining tasks.
Using Rapid Miner one can explore data, simplify the construction of analysis
processes, and evaluate different approaches. One can try to find the best combination
of preprocessing and learning steps, or let Rapid Miner do that automatically.
More than 400 data mining operators can be used and almost arbitrarily
combined. The setup is described by XML files which can easily be created with a
graphical user interface (GUI). This XML-based scripting language turns Rapid Miner
into an integrated development environment (IDE) for machine learning and data
mining. Rapid Miner follows the concept of rapid prototyping, leading very quickly to
the desired results. Furthermore, Rapid Miner can be used as a Java data mining
library.
The development of most of the Rapid Miner concepts started in 2001 at the
Artificial Intelligence Unit of the University of Dortmund. Several members of the
unit started to implement and realize these concepts, which led to a first version of
Rapid Miner in 2002. Since 2004, the open-source version of Rapid Miner (GPL) has
been hosted on SourceForge. Since then, a large number of suggestions and extensions
by external developers have also been incorporated into Rapid Miner. Today, both the
open-source version and a closed-source version of Rapid Miner are maintained by Rapid-I.
Although Rapid Miner is completely free and open-source, it offers a wide
range of methods and possibilities not covered by other data mining suites, both open-
source and proprietary ones.
Features of Rapid miner:
• Freely available open-source knowledge discovery environment
• 100% pure Java (runs on every major platform and operating system)
• KD processes are modelled as simple operator trees, which is both intuitive and
powerful
• Operator trees or sub trees can be saved as building blocks for later re-use
• Internal XML representation ensures standardized interchange format of data
mining experiments
• Simple scripting language allowing for automatic large-scale experiments
• Multi-layered data view concept ensures efficient and transparent data
handling
• Flexibility in using Rapid Miner:
• Graphical user interface (GUI) for interactive prototyping
• Command line mode (batch mode) for automated large-scale applications
• Java API (application programming interface) to ease usage of Rapid Miner
from your own programs
• Simple plug-in and extension mechanisms, a broad variety of plugins already
exists and you can easily add your own
• Powerful plotting facility offering a large set of sophisticated high-
dimensional visualization techniques for data and models
• More than 400 machine learning, evaluation, in- and output, pre- and post-
processing, and visualization operators plus numerous meta optimization
schemes
• Rapid Miner has been successfully applied to a wide range of applications where
its rapid prototyping abilities demonstrated their usefulness, including text
mining, multimedia mining, feature engineering, data stream mining and
tracking drifting concepts, development of ensemble methods, and distributed
data mining.
Homepage:
Fig 1.1 Operator and Parameter in Rapid Miner
Fig 1.2 Repository Windows
Conclusion: Hence, we have studied the Rapid Miner Tool.
Practical No. 2
Aim:- Demonstration of Pre-processing on given Data set using Rapid Miner.
Tool:- Rapid Miner 5.3.000
Theory:-
Data pre-processing describes any type of processing performed on raw
data to prepare it for another processing procedure. Commonly used as a
preliminary data mining practice, data pre-processing transforms the data into a
format that will be more easily and effectively processed for the purpose of the user --
for example, in a neural network. There are a number of different tools and methods
used for pre-processing, including: sampling, which selects a representative subset
from a large population of data; transformation, which manipulates raw data to
produce a single input; denoising, which removes noise from data; normalization,
which organizes data for more efficient access; and feature extraction, which pulls out
specified data that is significant in some particular context.
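As an illustration, the sampling and normalization steps mentioned above can be sketched in a few lines of plain Python (a toy example independent of Rapid Miner; the function names are our own):

```python
import random

def min_max_normalize(values):
    """Rescale a numeric column to the [0, 1] range (a common normalization)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for constant columns
    return [(v - lo) / span for v in values]

def sample(rows, k, seed=42):
    """Select a representative random subset of k rows from a larger population."""
    rng = random.Random(seed)
    return rng.sample(rows, k)

incomes = [25_000, 40_000, 55_000, 80_000, 120_000]
print(min_max_normalize(incomes))  # smallest value -> 0.0, largest -> 1.0
```

In a real pipeline these steps would be operators in the Rapid Miner process tree; here they are shown directly so the arithmetic is visible.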
Procedure:-
1. Create .arff file
2. Go to Repository → Import CSV file → follow the Data Import Wizard steps →
name the data repository.
3. Study the graphical and statistical output of the example.
Output:-
Fig 2.1 Pre-processing on given Data set
Fig 2.2 Text view
Fig 2.3 Decision Tree view
Conclusion:- Thus, we have learned pre-processing on a given data set using Rapid Miner.
Practical No. 3
Aim:-Demonstration of DBSCAN clustering algorithm using Rapid Miner.
Tool:- Rapid Miner 5.3.000
Theory:-
DBSCAN's definition of a cluster is based on the notion of density
reachability. Basically, a point q is directly density-reachable from a point p if it is not
farther away than a given distance epsilon (i.e. it is part of its epsilon-neighborhood)
and if p is surrounded by sufficiently many points such that one may consider p and q
to be part of a cluster. q is called density-reachable (note the distinction from "directly
density-reachable") from p if there is a sequence p(1),…,p(n) of points with p(1) = p
and p(n) = q where each p(i+1) is directly density-reachable from p(i).
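The definitions above translate almost directly into code. The following is a minimal, illustrative Python sketch of DBSCAN (a toy implementation, not Rapid Miner's operator; `eps` and `min_pts` stand for epsilon and the minimum neighborhood size):

```python
from math import dist  # Python 3.8+

def region_query(points, i, eps):
    """Indices of points in the epsilon-neighborhood of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:         # i is not a core point
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                         # expand via density-reachability
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster          # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(region_query(points, j, eps)) >= min_pts:
                seeds.extend(region_query(points, j, eps))  # j is a core point too
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=2))  # -> [0, 0, 0, 1, 1, -1]
```

The chain of directly density-reachable points p(1), …, p(n) from the theory corresponds to the `seeds` queue being extended whenever a newly reached point is itself a core point.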
Procedure:-
1. Go to repository → Go to modeling → Go to clustering and segmentation.
2. Drag & drop the DBSCAN operator onto the main process.
3. Drag & drop the selected DB onto the main process.
4. Connect the respective nodes.
5. Run the program.
Output:-
Fig 3.1 DBSCAN clustering algorithm
Fig 3.2 Text View of DBSCAN
Fig 3.3 Graph View of DBSCAN
Conclusion:- Thus, we have learned DBSCAN clustering algorithm using Rapid
Miner.
Practical No. 4
Aim:- Demonstration of decision tree using Rapid Miner.
Tool:- Rapid Miner 5.3.000
Theory:-
Decision tree induction is the learning of decision trees from class-labeled
training tuples. A decision tree is a flowchart-like tree structure, where each internal
node (non-leaf node) denotes a test on an attribute, each branch represents an outcome
of the test, and each leaf node (or terminal node) holds a class label. The topmost
node in a tree is the root node.
A decision tree for the concept buys_computer indicates whether a customer
at All Electronics is likely to purchase a computer. Each leaf node represents a class
(either buys_computer = yes or buys_computer = no).
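As an illustration, the buys_computer tree described above can be written as nested conditionals, one `if` per internal node and one `return` per leaf. The attribute names and branch values below follow the classic All Electronics textbook example and should be read as assumptions:

```python
def buys_computer(age, student, credit_rating):
    """Classify a customer with the classic buys_computer decision tree.
    Attribute values (youth/middle_aged/senior, etc.) are textbook assumptions."""
    if age == "youth":                 # internal node: test on age
        return "yes" if student == "yes" else "no"
    if age == "middle_aged":           # this branch ends in a leaf directly
        return "yes"
    # age == "senior": test on credit_rating
    return "yes" if credit_rating == "fair" else "no"

print(buys_computer("youth", "yes", "fair"))  # -> yes
```

Each path from the root to a leaf corresponds to one classification rule.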
Procedure:-
1. Go to repository → Go to sample → Go to data → Go to Golf.
2. Drag & drop the selected DB onto the main process.
3. Select operator → Modelling → Classification and Regression → Tree
Induction → Decision Tree → drag and drop onto the main process.
4. Connect the respective nodes.
5. Run the program.
6. Study the diagrammatic classification output of the given example.
Output:-
Fig 4.1 Decision tree using Rapid Miner
Fig 4.2 Text View of Decision Tree
Fig 4.3 Tree View of Decision Tree
Conclusion:-Thus, we have learned decision trees using Rapid Miner.
Practical No. 5
Aim:- Demonstration of Naïve Bayes classification on a given data set using Rapid
Miner.
Tool :- Rapid Miner 5.3.000
Theory:-
A Naive Bayes classifier is a simple probabilistic classifier based on applying
Bayes' theorem (from Bayesian statistics) with strong (naive) independence
assumptions. A more descriptive term for the underlying probability model would be
'independent feature model'. In simple terms, a Naive Bayes classifier assumes that
the presence (or absence) of a particular feature of a class (i.e. attribute) is unrelated
to the presence (or absence) of any other feature. For example, a fruit may be
considered to be an apple if it is red, round, and about 4 inches in diameter. Even if
these features depend on each other or upon the existence of the other features, a
Naive Bayes classifier considers all of these properties to independently contribute to
the probability that this fruit is an apple.
The advantage of the Naive Bayes classifier is that it only requires a small
amount of training data to estimate the means and variances of the variables necessary
for classification. Because independent variables are assumed, only the variances of
the variables for each label need to be determined and not the entire covariance
matrix.
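The estimation described above, per-class means and variances only and no covariance matrix, can be sketched in plain Python. The height/weight toy data below is invented purely for illustration:

```python
from math import pi, sqrt, exp

def gaussian(x, mean, var):
    """Gaussian probability density for a single feature value."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def fit(rows, labels):
    """Per class, estimate only each feature's mean and variance
    (the independence assumption spares us the full covariance matrix)."""
    model = {}
    for c in set(labels):
        cols = list(zip(*[r for r, l in zip(rows, labels) if l == c]))
        stats = []
        for col in cols:
            m = sum(col) / len(col)
            v = sum((x - m) ** 2 for x in col) / len(col) or 1e-9
            stats.append((m, v))
        model[c] = (stats, labels.count(c) / len(labels))
    return model

def predict(model, row):
    def score(c):
        stats, prior = model[c]
        p = prior
        for x, (m, v) in zip(row, stats):
            p *= gaussian(x, m, v)  # naive step: features multiplied independently
        return p
    return max(model, key=score)

X = [[6.0, 180], [5.9, 190], [5.5, 110], [5.4, 120]]  # (height, weight), invented
y = ["m", "m", "f", "f"]
print(predict(fit(X, y), [5.8, 170]))  # -> m
```

Note how little training data is needed: two examples per class already yield usable means and variances.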
Procedure:-
1. Go to repository → Go to sample → Go to data → Go to Weight.
2. Drag & drop the selected DB onto the main process.
3. Select operator → Modelling → Naïve Bayes → drag and drop on main
process.
4. Connect the respective nodes.
5. Run the program.
6. Study the classification output of the given example.
Output:-
Fig 5.1 Naïve Bayes classification
Fig 5.2 Text View of Naïve Bayes classification
Fig 5.3 Plot View of Naïve Bayes classification
Conclusion:-
Thus, we have learned Naïve Bayes classification using Rapid Miner.
Practical No. 6
Aim:- Demonstration of k-means clustering algorithm using Rapid Miner.
Tool:- Rapid Miner 7.6.001
Theory:-
This operator performs clustering using the k-means algorithm. Clustering is
concerned with grouping together objects that are similar to each other and dissimilar
to the objects belonging to other clusters; it is a technique for extracting information
from unlabeled data. k-means clustering is an exclusive clustering algorithm, i.e. each
object is assigned to precisely one of a set of clusters. Objects in one cluster are
similar to each other, and the similarity between objects is based on a measure of the
distance between them.
Clustering can be very useful in many different scenarios, e.g. in a marketing
application we may be interested in finding clusters of customers with similar buying
behavior.
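The exclusive, distance-based assignment described above can be illustrated in a few lines of Python (a toy sketch, not the Rapid Miner operator; the points and centroids are invented):

```python
from math import dist  # Euclidean distance, Python 3.8+

def assign(points, centroids):
    """Exclusive clustering: each object goes to exactly one cluster,
    the one whose centroid is nearest by Euclidean distance."""
    return [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points]

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
print(assign(points, centroids=[(1, 1), (8, 8)]))  # -> [0, 0, 1, 1]
```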
Procedure:-
1. Go to repository → Go to sample → Go to data → Go to DB.
2. Drag & drop the selected DB onto the main process.
3. Select operator → Modelling → Clustering and Segmentation → Select
k-Means → drag and drop on main process.
4. Connect the respective nodes.
5. Run the program.
6. Study the clustering diagrammatic representation of the output of the given example.
Output :-
Fig 6.1 k-means clustering algorithm
Fig 6.2 Text View of k-means clustering algorithm
Fig 6.3 Graph View of k-means clustering algorithm
Fig 6.4 Centroid Plot View of k-means clustering algorithm
Conclusion:-Thus, we have learned k-means clustering algorithm using Rapid Miner.
Practical No. 7
Aim:- Demonstration of Market Basket analysis using Association rule mining in
Rapid Miner.
Tool:- Rapid Miner 7.6.001
Theory:-
These models build upon the association rule mining framework, but provide
additional analytic capabilities beyond simple associations. The first model allows
mining a transactional database for negative patterns represented as dissociation item
sets and dissociation rules. The second model, of substitutive item sets, filters items
and item sets that can be used interchangeably as substitutes, i.e., item sets that appear
in the transactional database in very similar contexts. Finally, the third model, of
recommendation rules, uses an additional item set interestingness measure, namely
coverage, to construct a set of recommended items using a greedy search procedure.
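The support and confidence arithmetic underlying all of these models can be shown with a brute-force Python sketch (toy one-item rules only; this is not FP-Growth, and not the dissociation or substitutive models themselves; the basket data is invented):

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rules(transactions, min_support=0.5, min_conf=0.7):
    """Enumerate one-item -> one-item rules passing both thresholds."""
    items = set().union(*transactions)
    out = []
    for a, b in combinations(sorted(items), 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, transactions)  # support of the whole rule
            if s >= min_support and s / support({lhs}, transactions) >= min_conf:
                out.append((lhs, rhs, s))
    return out

baskets = [{"bread", "milk"}, {"bread", "milk", "beer"},
           {"bread", "milk"}, {"beer", "diapers"}]
print(rules(baskets))  # -> [('bread', 'milk', 0.75), ('milk', 'bread', 0.75)]
```

Real market basket analysis replaces this enumeration with FP-Growth, which the procedure below uses as an operator.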
Procedure:-
1. Go to repository → Go to sample → Go to data → Go to Weight.
2. Drag & drop the selected DB onto the main process.
3. Select operator → Modelling → Generate Data → Discretize → Nominal to
Binominal → FP-Growth → Multiply → Create Dissociation Rules → Generate
Current Selection Context → Multiply → Create Substitutive Sets → Create
Recommendation Sets → Multiply.
4. Drag and drop onto the main process.
5. Connect the respective nodes.
6. Run the program.
7. Study the diagrammatic output of the given example.
Output:-
Fig 7.1 Market Basket analysis using Association rule
Fig 7.2 Data Table Of Market Basket analysis using Association rule
Conclusion:-
Thus, we have learned Market Basket analysis using Association rule mining
in Rapid Miner.
Practical No: 8
Aim: Study of KNIME Analytical Platform.
Tool: KNIME 3.4.1
Theory:
KNIME:
KNIME (Konstanz Information Miner) is an open-source data analytics, reporting and
integration platform. It has been used in pharmaceutical research, but is also used in
other areas like CRM customer data analysis, business intelligence and financial data
analysis. It is based on the Eclipse platform and, through its modular API, is
easily extensible. Custom nodes and types can be implemented in KNIME within
hours, thus extending KNIME to comprehend and provide first-tier support for highly
domain-specific data formats.
1) Technical Specification:
• Released in 2004.
• Latest version available is KNIME 2.9.
• Licensed under the GNU General Public License.
• Compatible with Linux, OS X, Windows.
• Written in Java.
• www.knime.org
2) General Features:
• Knime, pronounced “naim”, is a nicely designed data mining tool that runs
inside the Eclipse development environment.
• It is a modular data exploration platform that enables the user to visually
create data flows (often referred to as pipelines), selectively execute some or
all analysis steps, and later investigate the results through interactive views on
data and models.
• The Knime base version already incorporates over 100 processing nodes for
data I/O, pre-processing and cleansing, modelling, analysis and data mining as
well as various interactive views, such as scatter plots, parallel coordinates and
others.
3) Specification:
• Integration of the Chemistry Development Kit with additional nodes for the
processing of chemical structures, compounds, etc.
• Specialized for enterprise reporting, business intelligence, and data mining.
Advantages:
• It integrates all analysis modules of the well-known Weka data mining
environment, and additional plugins allow R scripts to be run, offering access
to a vast library of statistical routines.
• It is easy to try out because it requires no installation besides downloading and
unarchiving.
• The one aspect of KNIME that truly sets it apart from other data mining
packages is its ability to interface with programs that allow for the
visualization and analysis of molecular data.
Limitations:
• Has only limited error measurement methods.
• Has no wrapper method for descriptor selection.
• Does not have an automatic facility for parameter optimization of machine
learning/statistical methods.
Homepage:
Fig: Main Window of KNIME
Conclusion: Hence we have studied the KNIME Analytical Platform.
Practical No. 9
Aim:-Demonstration of pre-processing on a given data set using KNIME analytical
platform.
Tool:-KNIME 3.4.1
Theory:-
Data pre-processing is an important step in the data mining process. The
phrase "garbage in, garbage out" is particularly applicable to data mining and machine
learning projects. Data-gathering methods are often loosely controlled, resulting
in out-of-range values (e.g., Income: −100), impossible data combinations, missing
values, etc. Analyzing data that has not been carefully screened for such problems can
produce misleading results. Thus, the representation and quality of the data come
first, before any analysis is run.
If there is much irrelevant and redundant information present, or noisy and unreliable
data, then knowledge discovery during the training phase is more difficult. Data
preparation and filtering steps can take a considerable amount of processing time. Data
pre-processing includes cleaning, normalization, transformation, feature
extraction and selection, etc. The product of data pre-processing is the final training
set. Kotsiantis et al. present a well-known algorithm for each step of data pre-
processing.
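The screening steps mentioned above, dropping out-of-range values such as Income: −100 and then imputing missing values, can be sketched in plain Python (a toy illustration, not a KNIME node; the income figures are invented):

```python
def clean(incomes):
    """Drop impossible values, then impute missing ones with the column mean."""
    in_range = [x for x in incomes if x is None or x >= 0]  # -100 is impossible
    known = [x for x in in_range if x is not None]
    mean = sum(known) / len(known)
    return [mean if x is None else x for x in in_range]

print(clean([30_000, -100, None, 50_000]))  # -> [30000, 40000.0, 50000]
```

The "garbage in, garbage out" point is visible here: without the first filter, the impossible value −100 would have contaminated the imputed mean as well.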
Procedure:-
1. Go to File → Import KNIME workflow → Browse for the .rar file → select the
file → click Next.
2. See your project in KNIME explorer.
3. Run the project.
Output:-
Figure 9.1 Pre-processing in KNIME
Conclusion:-
Thus, we have learned pre-processing on a given data set using the KNIME analytical
platform.
Practical No. 10
Aim:- Demonstration of decision tree learning and predicting using KNIME
analytical platform.
Tool:-KNIME 3.4.1
Theory:-
Decision tree learning uses a decision tree as a predictive model which maps
observations about an item (represented in the branches) to conclusions about the
item's target value (represented in the leaves). It is one of the predictive modelling
approaches used in statistics, data mining and machine learning. Tree models where
the target variable can take a finite set of values are called classification trees; in these
tree structures, leaves represent class labels and branches represent conjunctions of
features that lead to those class labels. Decision trees where the target variable can
take continuous values (typically real numbers) are called regression trees.
In decision analysis, a decision tree can be used to visually and explicitly
represent decisions and decision making. In data mining, a decision tree describes
data (but the resulting classification tree can be an input for decision making). This
page deals with decision trees in data mining.
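Classification trees are typically grown by choosing, at each internal node, the attribute test that most reduces the entropy of the class labels. A small illustrative Python sketch of that criterion (the weather-style rows below are invented):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy reduction from splitting on one attribute (higher = better test)."""
    n = len(rows)
    split = {}
    for r, l in zip(rows, labels):
        split.setdefault(r[attr], []).append(l)
    remainder = sum(len(ls) / n * entropy(ls) for ls in split.values())
    return entropy(labels) - remainder

rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(info_gain(rows, labels, "outlook"))  # -> 1.0, a perfect split
```

A branch whose labels become pure (entropy 0) ends in a leaf carrying that class label, exactly as the theory above describes.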
Procedure:-
1. Go to File → Import KNIME workflow → Browse for the .rar file → select the
file → click Next.
2. See your project in KNIME explorer.
3. Run the project.
Output:-
Figure 10.1 Decision tree learning and predicting using KNIME analytical platform
Conclusion:-
Thus, we have learned the decision tree learning and predicting using KNIME
analytical platform.
Practical No. 11
Aim:-Demonstration of k-means clustering algorithm using KNIME analytical
platform.
Tool:-KNIME 3.4.1
Theory:-
K-means clustering intends to partition n objects into k clusters in which each
object belongs to the cluster with the nearest mean. This method produces exactly k
different clusters of the greatest possible distinction. The best number of clusters k,
leading to the greatest separation (distance), is not known a priori and must be
computed from the data. The objective of k-means clustering is to minimize total
intra-cluster variance.
Algorithm:-
1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign objects to their closest cluster center according to the Euclidean
distance function.
4. Calculate the centroid or mean of all objects in each cluster.
5. Repeat steps 2, 3 and 4 until the same points are assigned to each cluster in
consecutive rounds.
K-means is a relatively efficient method. However, we need to specify the number
of clusters in advance, and the final results are sensitive to initialization and often
terminate at a local optimum. Unfortunately, there is no global theoretical method to
find the optimal number of clusters. A practical approach is to compare the outcomes
of multiple runs with different k and choose the best one based on a predefined
criterion. In general, a large k probably decreases the error but increases the risk of
overfitting.
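The five steps above can be written out in plain Python (a toy sketch, not the KNIME k-Means node; the random initialization is seeded so that runs are reproducible):

```python
import random
from math import dist

def k_means(points, k, iters=100, seed=0):
    """Steps 2-5 of the algorithm: random centers, assign, recompute, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)               # step 2: k random centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                            # step 3: nearest center wins
            clusters[min(range(k), key=lambda i: dist(p, centroids[i]))].append(p)
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]    # step 4: centroid = mean
        if new == centroids:                        # step 5: stop when stable
            break
        centroids = new
    return centroids

pts = [(1, 1), (2, 1), (1, 2), (9, 9), (8, 9), (9, 8)]
print(sorted(k_means(pts, 2)))  # two centroids, one per blob
```

Re-running with a different `seed` demonstrates the sensitivity to initialization mentioned above; on well-separated blobs like these the result is stable, but on harder data it need not be.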
Procedure:-
1. Go to File → Import KNIME workflow → Browse for the .rar file → select the
file → click Next.
2. See your project in KNIME explorer.
3. Run the project.
Output:-
Figure 11.1 k-means clustering algorithm using KNIME analytical platform
Figure 11.2 Data Table of k-means clustering algorithm using KNIME analytical platform
Conclusion:- Thus, we have learned the k-means clustering algorithm using KNIME
analytical platform.
Practical No. 12
Aim:- Study of Orange mining tool.
Tool:- Orange 3.5
Theory:- Orange is an open-source data visualization, machine learning and data
mining toolkit. It features a visual programming front-end for explorative data
analysis and interactive data visualization, and can also be used as a Python library.
Orange is a component-based visual programming software package for data
visualization, machine learning, data mining and data analysis.
Orange components are called widgets and they range from simple data visualization,
subset selection and pre-processing, to empirical evaluation of learning algorithms
and predictive modeling.
Visual programming is implemented through an interface in which workflows are
created by linking predefined or user-designed widgets, while advanced users can use
Orange as a Python library. Orange is an open-source software package released under
the GPL. Versions up to 3.0 include core components in C++ with wrappers in Python,
available on GitHub. From version 3.0 onwards, Orange uses common Python open-
source libraries for scientific computing, such as NumPy, SciPy and scikit-learn, while
its graphical user interface operates within the cross-platform Qt framework. Orange3
is maintained in a separate GitHub repository.
The default installation includes a number of machine learning, pre-processing and
data visualization algorithms in 6 widget sets (data, visualize, classify, regression,
evaluate and unsupervised). Additional functionalities are available as add-ons
(bioinformatics, data fusion and text-mining).
Orange is supported on macOS, Windows and Linux and can also be installed from
the Python Package Index repository (pip install Orange). As of 2016 the stable
version is 3.3 and runs with Python 3, while the legacy version 2.7 that runs with
Python 2.7 is still available for data manipulation and widget alteration.
Features
Orange consists of a canvas interface onto which the user places widgets and creates a
data analysis workflow. Widgets offer basic functionalities such as reading the data,
showing a data table, selecting features, training predictors, comparing learning
algorithms, visualizing data elements, etc. The user can interactively explore
visualizations or feed the selected subset into other widgets.
Fig: Classification Tree widget in Orange 3.0
• Canvas: graphical front-end for data analysis
• Widgets:
o Data: widgets for data input, data filtering, sampling, imputation,
feature manipulation and feature selection
o Visualize: widgets for common visualization (box plot, histograms,
scatter plot) and multivariate visualization (mosaic display, sieve
diagram).
o Classify: a set of supervised machine learning algorithms for
classification
o Regression: a set of supervised machine learning algorithms for
regression
o Evaluate: cross-validation, sampling-based procedures, reliability
estimation and scoring of prediction methods
o Unsupervised: unsupervised learning algorithms for clustering (k-
means, hierarchical clustering) and data projection techniques
(multidimensional scaling, principal component analysis,
correspondence analysis).
Add-ons:
• Associate: widgets for mining frequent item sets and association rule learning
• Bioinformatics: widgets for gene set analysis, enrichment, and access to
pathway libraries
• Data fusion: widgets for fusing different data sets, collective matrix
factorization, and exploration of latent factors
• Educational: widgets for teaching machine learning concepts, such as k-
means clustering, polynomial regression, stochastic gradient descent, ...
• Geo: widgets for working with geospatial data
• Image analytics: widgets for working with images and ImageNet
embeddings
• Network: widgets for graph and network analysis
• Text mining: widgets for natural language processing and text mining
• Time series: widgets for time series analysis and modeling
Fig 12.1 Main Page of Orange mining tool
Conclusion:-Thus, we have studied Orange mining tool.
Practical No: 13
Aim: Demonstration of Data visualization using Orange.
Tool: Orange 3.5
Theory:
Linear projection is a method for explorative data analysis.
Signals
Inputs:
• Data: An input data set
• Data Subset: A subset of data instances
Outputs:
• Selected Data: A data subset that the user has manually selected in the
projection.
Steps to demonstrate Linear Projection using Orange
• Choose the axes that are displayed in the projection and the other available axes.
• Set the color of the displayed dots (you will get colored dots for discrete values
and grey-scale dots for continuous ones). Set opacity, shape and size to differentiate
between instances.
• Set jittering to prevent the dots from overlapping (especially for discrete
attributes).
• Use the select, zoom, pan and zoom-to-fit options to explore the graph. Manual
selection of data instances works as a non-angular/free-hand selection tool.
Double click to move the projection. Scroll in or out to zoom.
• When the box is ticked (Auto commit is on), the widget will communicate the
changes automatically. Alternatively, click Commit.
• Save Image saves the created image to your computer in .svg or .png format.
Output:
Fig 13.1 Data visualization using Orange
Fig 13.2 Scatter Plot of Data visualization using Orange
Fig 13.3 Classification tree of Data visualization using Orange
Conclusion:
Hence, Data visualization using Orange has been demonstrated.
Practical No: 14
Aim: Demonstration of classification using Orange Mining Tool.
Tool: Orange 3.5
Theory:
Linear projection is a method for explorative data analysis.
Signals
Inputs:
• Data: An input data set
• Data Subset: A subset of data instances
Outputs:
• Selected Data: A data subset that the user has manually selected in the
projection.
Steps to demonstrate Linear Projection using Orange
• Choose the axes that are displayed in the projection and the other available axes.
• Set the color of the displayed dots (you will get colored dots for discrete values
and grey-scale dots for continuous ones). Set opacity, shape and size to differentiate
between instances.
• Set jittering to prevent the dots from overlapping (especially for discrete
attributes).
• Use the select, zoom, pan and zoom-to-fit options to explore the graph. Manual
selection of data instances works as a non-angular/free-hand selection tool.
Double click to move the projection. Scroll in or out to zoom.
• When the box is ticked (Auto commit is on), the widget will communicate the
changes automatically. Alternatively, click Commit.
• Save Image saves the created image to your computer in .svg or .png format.
Output:
Fig 14.1 classification using Orange Mining Tool
Fig 14.2 Point Data View of classification using Orange Mining Tool
Conclusion:
Hence, we have demonstrated classification using Orange Mining Tool.
Practical No: 15
Aim: Demonstration of text mining using Orange
Tool: Orange3.5
Theory:
Linear projection is a method for explorative data analysis.
Signals
Inputs:
• Data: An input data set
• Data Subset: A subset of data instances
Outputs:
• Selected Data: A data subset that the user has manually selected in the
projection.
Steps to demonstrate Linear Projection using Orange
• Choose the axes that are displayed in the projection and the other available axes.
• Set the color of the displayed dots (you will get colored dots for discrete values
and grey-scale dots for continuous ones). Set opacity, shape and size to differentiate
between instances.
• Set jittering to prevent the dots from overlapping (especially for discrete
attributes).
• Use the select, zoom, pan and zoom-to-fit options to explore the graph. Manual
selection of data instances works as a non-angular/free-hand selection tool.
Double click to move the projection. Scroll in or out to zoom.
• When the box is ticked (Auto commit is on), the widget will communicate the
changes automatically. Alternatively, click Commit.
• Save Image saves the created image to your computer in .svg or .png format.
Fig 15.1 Text mining scenario
Fig 15.2 Query Window of Wikipedia
Output:
Fig 15.3 Data Table of text mining using Orange
Conclusion:
Hence, we have demonstrated text mining using Orange Mining Tool.
Practical No: 16
Aim: Demonstration of Linear Projection using Orange Mining Tool.
Tool: Orange 3.5
Theory:
Linear projection is a method for explorative data analysis.
Signals
Inputs:
• Data: An input data set
• Data Subset: A subset of data instances
Outputs:
• Selected Data: A data subset that the user has manually selected in the
projection.
Steps to demonstrate Linear Projection using Orange
 Choose which axes are displayed in the projection; the other available axes
can be added to or removed from the display.
 Set the color of the displayed dots (you will get colored dots for discrete values
and grey-scale dots for continuous ones). Set opacity, shape and size to
differentiate between instances.
 Set jittering to prevent the dots from overlapping (especially useful for
discrete attributes).
 Use the select, zoom, pan and zoom-to-fit options to explore the graph.
Manual selection of data instances works as a free-hand selection tool.
Double-click to move the projection; scroll in or out to zoom.
 When the box is ticked (Auto commit is on), the widget communicates
changes automatically. Alternatively, click Commit.
 Save Image saves the created image to your computer in .svg or .png format.
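Behind the widget, a linear projection simply maps each n-dimensional instance onto two axis vectors via dot products. A minimal sketch of that idea (illustrative code, not Orange's API; the axes here are hand-picked for the example):

```python
def project(point, axis_x, axis_y):
    """Map an n-dimensional point to 2-D via dot products with two axes."""
    x = sum(p * a for p, a in zip(point, axis_x))
    y = sum(p * a for p, a in zip(point, axis_y))
    return (x, y)

# Three 3-D instances projected onto two hand-picked axis vectors.
instances = [(1.0, 0.0, 2.0), (0.0, 1.0, 1.0), (2.0, 2.0, 0.0)]
axis_x = (1.0, 0.0, 0.0)   # first attribute only
axis_y = (0.0, 0.5, 0.5)   # equal blend of the other two attributes
points_2d = [project(p, axis_x, axis_y) for p in instances]
print(points_2d)  # [(1.0, 1.0), (0.0, 1.0), (2.0, 1.0)]
```

Dragging an axis in the widget amounts to changing `axis_x`/`axis_y` and recomputing every dot's 2-D position.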
Output:
Fig 16.1 Main window of Linear Projection using Orange Mining Tool
Fig 16.2 Paint Data view of Linear Projection using Orange Mining Tool
Fig 16.3 Linear Projection using Orange Mining Tool
Fig 16.4 Rank Table of Linear Projection using Orange Mining Tool
Conclusion: Hence Linear Projection using Orange Mining Tool has been
demonstrated.
Practical No: 17
Aim: Study of Net Tools Spider.
Tool: Net Tools Spider
Theory:
Net Tools Spider:
A web spider is a software program that searches the Internet for information. The
basic process of a web spider is to download a web page and search it for links to
other web pages. It then repeats this behaviour on each new page that it finds. By
repeating this process, a web spider can find all of the pages within a web site and,
in principle, all of the pages on the Internet.
Web spiders can have many purposes. The most common spiders are used by search
engines like Google, Yahoo and AltaVista. Their web spiders search the Internet for
web pages and then create indexes of all the words found on the pages. This allows us
to search the Internet quickly and easily.
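The download-page/extract-links/repeat loop described above can be sketched in Python. Here an in-memory dict stands in for the Web so the example runs offline; a real spider would fetch each URL over HTTP (e.g. with urllib) instead of reading from the dict:

```python
from html.parser import HTMLParser

# Toy "web site": URL -> HTML page (stands in for real HTTP downloads).
SITE = {
    "/index.html": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "/a.html": '<a href="/b.html">B</a>',
    "/b.html": '<a href="/index.html">home</a>',
}

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

def crawl(start):
    """Breadth-first spider: download a page, extract links, repeat on new pages."""
    seen, queue = {start}, [start]
    while queue:
        url = queue.pop(0)
        parser = LinkExtractor()
        parser.feed(SITE[url])          # a real spider would download here
        for link in parser.links:
            if link not in seen:        # never revisit a page
                seen.add(link)
                queue.append(link)
    return sorted(seen)

print(crawl("/index.html"))  # ['/a.html', '/b.html', '/index.html']
```

The `seen` set is what keeps a spider from looping forever on sites whose pages link back to each other.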
Net Tools Spider is a multi-functional web spider that supports:
 Web Site Downloading : spidering web sites and saving the web pages and
files that it finds to your hard drive.
 Web Mining : spidering web sites and extracting pieces of information to be
used for other purposes.
 Link Checking : spidering web sites and searching for broken links.
 Web Site Searching : spidering web sites and searching for files that contain
certain keywords.
Today, most Internet users limit their searches to the Web, so we'll limit this
discussion to search engines that focus on the contents of Web pages.
Before a search engine can tell you where a file or document is, it must be found. To
find information on the hundreds of millions of Web pages that exist, a search engine
employs special software robots, called spiders, to build lists of the words found on
Web sites. When a spider is building its lists, the process is called Web crawling.
(There are some disadvantages to calling part of the Internet the World Wide Web -- a
large set of arachnid-centric names for tools is one of them.) In order to build and
maintain a useful list of words, a search engine's spiders have to look at a lot of pages.
How does any spider start its travels over the Web? The usual starting points are lists
of heavily used servers and very popular pages. The spider will begin with a popular
site, indexing the words on its pages and following every link found within the site. In
this way, the spidering system quickly begins to travel, spreading out across the most
widely used portions of the Web.
Google began as an academic search engine. In the paper that describes how the
system was built, Sergey Brin and Lawrence Page give an example of how quickly
their spiders can work. They built their initial system to use multiple spiders, usually
three at one time. Each spider could keep about 300 connections to Web pages open at
a time. At its peak performance, using four spiders, their system could crawl over 100
pages per second, generating around 600 kilobytes of data each second.
Keeping everything running quickly meant building a system to feed necessary
information to the spiders. The early Google system had a server dedicated to
providing URLs to the spiders. Rather than depending on an Internet service provider
for the domain name server (DNS) that translates a server's name into an address,
Google had its own DNS, in order to keep delays to a minimum.
When the Google spider looked at an HTML page, it took note of two things:
 The words within the page
 Where the words were found
Words occurring in the title, subtitles, meta tags and other positions of relative
importance were noted for special consideration during a subsequent user search. The
Google spider was built to index every significant word on a page, leaving out the
articles "a," "an" and "the." Other spiders take different approaches.
These different approaches usually attempt to make the spider operate faster, allow
users to search more efficiently, or both. For example, some spiders will keep track of
the words in the title, sub-headings and links, along with the 100 most frequently used
words on the page and each word in the first 20 lines of text. Lycos is said to use this
approach to spidering the Web.
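The indexing step described above — record every significant word and where it was found — amounts to building an inverted index. A minimal sketch (illustrative code; the page texts are made up for the example):

```python
# Words the Google spider was said to leave out of its index.
ARTICLES = {"a", "an", "the"}

def build_index(pages):
    """Map each significant word to the set of pages it occurs on."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            if word not in ARTICLES:
                index.setdefault(word, set()).add(url)
    return index

pages = {
    "p1": "the spider crawls the web",
    "p2": "a web spider indexes words",
}
index = build_index(pages)
print(sorted(index["spider"]))  # ['p1', 'p2']
print(sorted(index["web"]))     # ['p1', 'p2']
```

A search for a word is then just a dictionary lookup, which is why a search engine can answer queries without re-reading any pages.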
Other systems, such as AltaVista, go in the other direction, indexing every single
word on a page, including "a," "an," "the" and other "insignificant" words. The push
to completeness in this approach is matched by other systems in the attention given to
the unseen portion of the Web page, the meta tags.
Mining Web:
 Project Summary
 After running Web Miner, the result will be seen as follows:
Conclusion: In this practical, Net Tools Spider is studied and a website is mined
up to two levels.

More Related Content

What's hot

Transaction management DBMS
Transaction  management DBMSTransaction  management DBMS
Transaction management DBMSMegha Patel
 
Testing Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnitTesting Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnitEric Wendelin
 
MULTI THREADING IN JAVA
MULTI THREADING IN JAVAMULTI THREADING IN JAVA
MULTI THREADING IN JAVAVINOTH R
 
13. Query Processing in DBMS
13. Query Processing in DBMS13. Query Processing in DBMS
13. Query Processing in DBMSkoolkampus
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALADATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALASaikiran Panjala
 
Data management with ado
Data management with adoData management with ado
Data management with adoDinesh kumar
 
JDBC,Types of JDBC,Resultset, statements,PreparedStatement,CallableStatements...
JDBC,Types of JDBC,Resultset, statements,PreparedStatement,CallableStatements...JDBC,Types of JDBC,Resultset, statements,PreparedStatement,CallableStatements...
JDBC,Types of JDBC,Resultset, statements,PreparedStatement,CallableStatements...Pallepati Vasavi
 
java interface and packages
java interface and packagesjava interface and packages
java interface and packagesVINOTH R
 
management of distributed transactions
management of distributed transactionsmanagement of distributed transactions
management of distributed transactionsNilu Desai
 
JDBC – Java Database Connectivity
JDBC – Java Database ConnectivityJDBC – Java Database Connectivity
JDBC – Java Database ConnectivityInformation Technology
 
Looping statement in vb.net
Looping statement in vb.netLooping statement in vb.net
Looping statement in vb.netilakkiya
 
Java Tutorial For Beginners - Step By Step | Java Basics | Java Certification...
Java Tutorial For Beginners - Step By Step | Java Basics | Java Certification...Java Tutorial For Beginners - Step By Step | Java Basics | Java Certification...
Java Tutorial For Beginners - Step By Step | Java Basics | Java Certification...Edureka!
 
Sql a practical introduction
Sql   a practical introductionSql   a practical introduction
Sql a practical introductionHasan Kata
 
Abstract class and interface
Abstract class and interfaceAbstract class and interface
Abstract class and interfaceMazharul Sabbir
 
Networking in Java
Networking in JavaNetworking in Java
Networking in JavaTushar B Kute
 

What's hot (20)

Transaction management DBMS
Transaction  management DBMSTransaction  management DBMS
Transaction management DBMS
 
JAVA AWT
JAVA AWTJAVA AWT
JAVA AWT
 
Testing Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnitTesting Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnit
 
MULTI THREADING IN JAVA
MULTI THREADING IN JAVAMULTI THREADING IN JAVA
MULTI THREADING IN JAVA
 
13. Query Processing in DBMS
13. Query Processing in DBMS13. Query Processing in DBMS
13. Query Processing in DBMS
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALADATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
 
Data management with ado
Data management with adoData management with ado
Data management with ado
 
JDBC,Types of JDBC,Resultset, statements,PreparedStatement,CallableStatements...
JDBC,Types of JDBC,Resultset, statements,PreparedStatement,CallableStatements...JDBC,Types of JDBC,Resultset, statements,PreparedStatement,CallableStatements...
JDBC,Types of JDBC,Resultset, statements,PreparedStatement,CallableStatements...
 
java interface and packages
java interface and packagesjava interface and packages
java interface and packages
 
management of distributed transactions
management of distributed transactionsmanagement of distributed transactions
management of distributed transactions
 
JDBC – Java Database Connectivity
JDBC – Java Database ConnectivityJDBC – Java Database Connectivity
JDBC – Java Database Connectivity
 
Java rmi
Java rmiJava rmi
Java rmi
 
Looping statement in vb.net
Looping statement in vb.netLooping statement in vb.net
Looping statement in vb.net
 
Java Tutorial For Beginners - Step By Step | Java Basics | Java Certification...
Java Tutorial For Beginners - Step By Step | Java Basics | Java Certification...Java Tutorial For Beginners - Step By Step | Java Basics | Java Certification...
Java Tutorial For Beginners - Step By Step | Java Basics | Java Certification...
 
Sql a practical introduction
Sql   a practical introductionSql   a practical introduction
Sql a practical introduction
 
MYSQL - PHP Database Connectivity
MYSQL - PHP Database ConnectivityMYSQL - PHP Database Connectivity
MYSQL - PHP Database Connectivity
 
Abstract class and interface
Abstract class and interfaceAbstract class and interface
Abstract class and interface
 
Networking in Java
Networking in JavaNetworking in Java
Networking in Java
 
Jdbc ppt
Jdbc pptJdbc ppt
Jdbc ppt
 
Active database
Active databaseActive database
Active database
 

Similar to Distributed Database practicals

Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...AboutYouGmbH
 
Eclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectEclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectMatthew Gerring
 
Performance Evaluation of Open Source Data Mining Tools
Performance Evaluation of Open Source Data Mining ToolsPerformance Evaluation of Open Source Data Mining Tools
Performance Evaluation of Open Source Data Mining Toolsijsrd.com
 
B040101007012
B040101007012B040101007012
B040101007012ijceronline
 
Distributed Framework for Data Mining As a Service on Private Cloud
Distributed Framework for Data Mining As a Service on Private CloudDistributed Framework for Data Mining As a Service on Private Cloud
Distributed Framework for Data Mining As a Service on Private CloudIJERA Editor
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
OpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo ProjectOpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo ProjectBYOUNG GON KIM
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado BlascoDSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado BlascoDeltares
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsMichael Häusler
 
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationDenodo
 
Monitoring federation open stack infrastructure
Monitoring federation open stack infrastructureMonitoring federation open stack infrastructure
Monitoring federation open stack infrastructureFernando Lopez Aguilar
 
Dynamic Resource Allocation Algorithm using Containers
Dynamic Resource Allocation Algorithm using ContainersDynamic Resource Allocation Algorithm using Containers
Dynamic Resource Allocation Algorithm using ContainersIRJET Journal
 
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)Amazon Web Services
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJANicolas Poggi
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onDony Riyanto
 
grid mining
grid mininggrid mining
grid miningARNOLD
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezBig Data Spain
 
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET Journal
 

Similar to Distributed Database practicals (20)

Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
 
Eclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectEclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science Project
 
Performance Evaluation of Open Source Data Mining Tools
Performance Evaluation of Open Source Data Mining ToolsPerformance Evaluation of Open Source Data Mining Tools
Performance Evaluation of Open Source Data Mining Tools
 
B040101007012
B040101007012B040101007012
B040101007012
 
Distributed Framework for Data Mining As a Service on Private Cloud
Distributed Framework for Data Mining As a Service on Private CloudDistributed Framework for Data Mining As a Service on Private Cloud
Distributed Framework for Data Mining As a Service on Private Cloud
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
OpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo ProjectOpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo Project
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado BlascoDSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data Applications
 
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
 
Monitoring federation open stack infrastructure
Monitoring federation open stack infrastructureMonitoring federation open stack infrastructure
Monitoring federation open stack infrastructure
 
Dynamic Resource Allocation Algorithm using Containers
Dynamic Resource Allocation Algorithm using ContainersDynamic Resource Allocation Algorithm using Containers
Dynamic Resource Allocation Algorithm using Containers
 
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
AWS re:Invent 2016: Automating Workflows for Analytics Pipelines (DEV401)
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
grid mining
grid mininggrid mining
grid mining
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier Dominguez
 
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOP
 

More from Vrushali Lanjewar

Best performance evaluation metrics for image Classification.docx
Best performance evaluation metrics for image Classification.docxBest performance evaluation metrics for image Classification.docx
Best performance evaluation metrics for image Classification.docxVrushali Lanjewar
 
Studies based on Deep learning in recent years.pptx
Studies based on Deep learning in recent years.pptxStudies based on Deep learning in recent years.pptx
Studies based on Deep learning in recent years.pptxVrushali Lanjewar
 
Comparison of thresholding methods
Comparison of thresholding methodsComparison of thresholding methods
Comparison of thresholding methodsVrushali Lanjewar
 
Software Engineering Testing & Research
Software Engineering Testing & Research Software Engineering Testing & Research
Software Engineering Testing & Research Vrushali Lanjewar
 
Real Time Embedded System
Real Time Embedded SystemReal Time Embedded System
Real Time Embedded SystemVrushali Lanjewar
 
Performance Anaysis for Imaging System
Performance Anaysis for Imaging SystemPerformance Anaysis for Imaging System
Performance Anaysis for Imaging SystemVrushali Lanjewar
 
Advance Computer Architecture
Advance Computer ArchitectureAdvance Computer Architecture
Advance Computer ArchitectureVrushali Lanjewar
 
Wireless Communication Network Communication
Wireless Communication Network CommunicationWireless Communication Network Communication
Wireless Communication Network CommunicationVrushali Lanjewar
 
Cryptographic protocols
Cryptographic protocolsCryptographic protocols
Cryptographic protocolsVrushali Lanjewar
 

More from Vrushali Lanjewar (13)

Best performance evaluation metrics for image Classification.docx
Best performance evaluation metrics for image Classification.docxBest performance evaluation metrics for image Classification.docx
Best performance evaluation metrics for image Classification.docx
 
Studies based on Deep learning in recent years.pptx
Studies based on Deep learning in recent years.pptxStudies based on Deep learning in recent years.pptx
Studies based on Deep learning in recent years.pptx
 
Word art1
Word art1Word art1
Word art1
 
My Dissertation 2016
My Dissertation 2016My Dissertation 2016
My Dissertation 2016
 
Comparison of thresholding methods
Comparison of thresholding methodsComparison of thresholding methods
Comparison of thresholding methods
 
Software Engineering Testing & Research
Software Engineering Testing & Research Software Engineering Testing & Research
Software Engineering Testing & Research
 
Real Time Embedded System
Real Time Embedded SystemReal Time Embedded System
Real Time Embedded System
 
Performance Anaysis for Imaging System
Performance Anaysis for Imaging SystemPerformance Anaysis for Imaging System
Performance Anaysis for Imaging System
 
Advance Computer Architecture
Advance Computer ArchitectureAdvance Computer Architecture
Advance Computer Architecture
 
Wireless Communication Network Communication
Wireless Communication Network CommunicationWireless Communication Network Communication
Wireless Communication Network Communication
 
Pmgdisha
PmgdishaPmgdisha
Pmgdisha
 
Cryptographic protocols
Cryptographic protocolsCryptographic protocols
Cryptographic protocols
 
Distributed system
Distributed systemDistributed system
Distributed system
 

Recently uploaded

INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEroselinkalist12
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
DATA ANALYTICS PPT definition usage example
DATA ANALYTICS PPT definition usage exampleDATA ANALYTICS PPT definition usage example
DATA ANALYTICS PPT definition usage examplePragyanshuParadkar1
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfROCENODodongVILLACER
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 

Recently uploaded (20)

INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
DATA ANALYTICS PPT definition usage example
DATA ANALYTICS PPT definition usage exampleDATA ANALYTICS PPT definition usage example
DATA ANALYTICS PPT definition usage example
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Distributed Database practicals

DISTRIBUTED DATABASE SYSTEM
CSIT Dept's SGBAU Amravati.

Practical No: 1

Aim: Study of the Rapid Miner tool.

Tool: Rapid Miner 7.6.001

Theory:

Rapid Miner provides data mining and machine learning procedures including data loading and transformation (extract, transform, load, a.k.a. ETL), data preprocessing and visualization, modelling, evaluation, and deployment. Rapid Miner is written in the Java programming language. It uses learning schemes and attribute evaluators from the Weka machine learning environment and statistical modelling schemes from the R Project.

Rapid Miner can define analytical steps (similar to R) and be used for analyzing data generated by high-throughput instruments such as those used in genotyping, proteomics, and mass spectrometry. It can be used for text mining, multimedia mining, feature engineering, data stream mining, development of ensemble methods, and distributed data mining. Rapid Miner's functionality can be extended with additional plugins.

Rapid Miner provides a GUI to design an analytical pipeline (the "operator tree"). The GUI generates an XML (Extensible Markup Language) file that defines the analytical processes the user wishes to apply to the data. Alternatively, the engine can be called from other programs or used as an API; individual functions can also be called from the command line.

Rapid Miner is open-source and is offered free of charge as a Community Edition released under the GNU AGPL.

Rapid Miner is a leading open-source data mining solution due to the combination of its leading-edge technologies and its functional range. Applications of Rapid Miner cover a wide range of real-world data mining tasks. Using Rapid Miner one can explore data, simplify the construction of analysis processes, and evaluate different approaches. Try to find the best combination of preprocessing and learning steps, or let Rapid Miner do that automatically.
More than 400 data mining operators can be used and almost arbitrarily combined. A setup is described by XML files which can easily be created with a graphical user interface (GUI). This XML-based scripting language turns Rapid Miner into an integrated development environment (IDE) for machine learning and data mining. Rapid Miner follows the concept of rapid prototyping, leading very quickly to the desired results. Furthermore, Rapid Miner can be used as a Java data mining library.

The development of most of the Rapid Miner concepts started in 2001 at the Artificial Intelligence Unit of the University of Dortmund. Several members of the unit started to implement and realize these concepts, which led to a first version of Rapid Miner in 2002. Since 2004, the open-source version of Rapid Miner (GPL) has been hosted by SourceForge. Since then, a large number of suggestions and extensions by external developers have also been embedded into Rapid Miner. Today, both the open-source version and a closed-source version of Rapid Miner are maintained by Rapid-I.

Although Rapid Miner is totally free and open-source, it offers a huge number of methods and possibilities not covered by other data mining suites, both open-source and proprietary ones.
Features of Rapid Miner:

• Freely available open-source knowledge discovery environment
• 100% pure Java (runs on every major platform and operating system)
• KD processes are modelled as simple operator trees, which is both intuitive and powerful
• Operator trees or subtrees can be saved as building blocks for later re-use
• Internal XML representation ensures a standardized interchange format for data mining experiments
• Simple scripting language allowing for automatic large-scale experiments
• Multi-layered data view concept ensures efficient and transparent data handling
• Flexibility in using Rapid Miner:
  o Graphical user interface (GUI) for interactive prototyping
  o Command line mode (batch mode) for automated large-scale applications
  o Java API (application programming interface) to ease usage of Rapid Miner from your own programs
• Simple plug-in and extension mechanisms; a broad variety of plugins already exists, and you can easily add your own
• Powerful plotting facility offering a large set of sophisticated high-dimensional visualization techniques for data and models
• More than 400 machine learning, evaluation, in- and output, pre- and post-processing, and visualization operators, plus numerous meta optimization schemes
• Rapid Miner has been successfully applied to a wide range of applications where its rapid prototyping abilities demonstrated their usefulness, including text mining, multimedia mining, feature engineering, data stream mining and tracking drifting concepts, development of ensemble methods, and distributed data mining.
Homepage:

Fig 1.1 Operator and Parameter in Rapid Miner
Fig 1.2 Repository Window

Conclusion: Hence, we have studied the Rapid Miner tool.
Practical No. 2

Aim:- Demonstration of pre-processing on a given data set using Rapid Miner.

Tool:- Rapid Miner 5.3.000

Theory:-

Data pre-processing describes any type of processing performed on raw data to prepare it for another processing procedure. Commonly used as a preliminary data mining practice, data pre-processing transforms the data into a format that will be more easily and effectively processed for the purpose of the user -- for example, in a neural network.

There are a number of different tools and methods used for pre-processing, including: sampling, which selects a representative subset from a large population of data; transformation, which manipulates raw data to produce a single input; denoising, which removes noise from data; normalization, which organizes data for more efficient access; and feature extraction, which pulls out specified data that is significant in some particular context.

Procedure:-

1. Create the .arff file.
2. Go to Repository → Select import .CSV file → follow the data import wizard steps → Name the data repository.
3. Learn the graphical statistical output of the example.

Output:-

Fig 2.1 Pre-processing on the given data set
Fig 2.2 Text view
Fig 2.3 Decision Tree view

Conclusion:- Thus, we have performed pre-processing on the given data set using Rapid Miner.
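Two of the pre-processing steps described above (handling missing values and normalization) can also be sketched outside any tool in a few lines of plain Python. This is an illustrative sketch only, not RapidMiner code; the `income` attribute and its values are made up.

```python
# Illustrative pre-processing sketch (not RapidMiner itself):
# mean-imputation of missing values followed by min-max normalization.

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

income = [30.0, None, 50.0, 70.0]   # hypothetical attribute with a gap
clean = impute_mean(income)          # [30.0, 50.0, 50.0, 70.0]
scaled = min_max_normalize(clean)    # [0.0, 0.5, 0.5, 1.0]
print(scaled)
```

The same effect is obtained in RapidMiner by chaining the corresponding pre-processing operators in the operator tree instead of writing code.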
Practical No. 3

Aim:- Demonstration of the DBSCAN clustering algorithm using Rapid Miner.

Tool:- Rapid Miner 5.3.000

Theory:-

DBSCAN's definition of a cluster is based on the notion of density reachability. Basically, a point q is directly density-reachable from a point p if it is not farther away than a given distance epsilon (i.e. it is part of p's epsilon-neighborhood) and if p is surrounded by sufficiently many points, such that one may consider p and q to be part of a cluster. q is called density-reachable (note the distinction from "directly density-reachable") from p if there is a sequence p(1), …, p(n) of points with p(1) = p and p(n) = q, where each p(i+1) is directly density-reachable from p(i).

Procedure:-

1. Go to repository → Go to modelling → Go to clustering and segmentation.
2. Drag and drop the DBSCAN operator onto the main process.
3. Drag and drop the selected data set onto the main process.
4. Connect the respective nodes.
5. Run the program.

Output:-

Fig 3.1 DBSCAN clustering algorithm
Fig 3.2 Text View of DBSCAN
Fig 3.3 Graph View of DBSCAN

Conclusion:- Thus, we have learned the DBSCAN clustering algorithm using Rapid Miner.
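The density-reachability idea above can be made concrete with a compact pure-Python sketch of DBSCAN. This is an illustrative reimplementation, not RapidMiner's operator; the points, `eps` and `min_pts` values are made-up examples.

```python
# Minimal DBSCAN sketch: core points expand clusters through their
# epsilon-neighborhoods; points reachable from no core point are noise.
import math

def region_query(points, i, eps):
    """Indices of all points within eps of point i (its eps-neighborhood)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return a cluster label per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:   # not a core point: tentatively noise
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                   # expand via density-reachability
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster    # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is itself a core point
                seeds.extend(j_neighbors)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=2))   # [0, 0, 0, 1, 1, -1]
```

The last point is labelled -1 (noise) because its epsilon-neighborhood contains too few points, exactly the situation the theory above describes.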
Practical No. 4

Aim:- Demonstration of a decision tree using Rapid Miner.

Tool:- Rapid Miner 5.3.000

Theory:-

Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flow-chart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.

A decision tree for the concept "buys computer" indicates whether a customer at All Electronics is likely to purchase a computer. Each leaf node represents a class (either buys_computer = yes or buys_computer = no).

Procedure:-

1. Go to repository → Go to sample → Go to data → Go to deals (Golf).
2. Drag and drop the selected data set onto the main process.
3. Select operator → Modelling → Select classification and regression → Select tree induction → Select decision tree → drag and drop onto the main process.
4. Connect the respective nodes.
5. Run the program.
6. Learn the diagrammatic representation of the classification output of the given example.

Output:-

Fig 4.1 Decision tree using Rapid Miner
Fig 4.2 Text View of the decision tree
Fig 4.3 Tree View of the decision tree

Conclusion:- Thus, we have learned decision trees using Rapid Miner.
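The "test on an attribute" at each internal node is usually chosen by the information gain criterion. The sketch below (plain Python, illustrative only; not RapidMiner's operator) computes the gain of one attribute on a small made-up "buys computer"-style data set.

```python
# Information gain = entropy of the class labels minus the weighted
# entropy remaining after splitting on an attribute.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attr, target):
    """Entropy reduction from splitting rows on attribute attr."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

data = [                                 # hypothetical training tuples
    {"age": "youth", "buys": "no"},
    {"age": "youth", "buys": "no"},
    {"age": "middle", "buys": "yes"},
    {"age": "middle", "buys": "yes"},
    {"age": "senior", "buys": "yes"},
    {"age": "senior", "buys": "no"},
]
print(round(information_gain(data, "age", "buys"), 3))   # 0.667
```

Tree induction repeats this computation at every node, picking the attribute with the highest gain as the node's test.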
Practical No. 5

Aim:- Demonstration of Naïve Bayes classification on a given data set using Rapid Miner.

Tool:- Rapid Miner 5.3.000

Theory:-

A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model."

In simple terms, a Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class (i.e. attribute) is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a Naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.

The advantage of the Naive Bayes classifier is that it only requires a small amount of training data to estimate the means and variances of the variables necessary for classification. Because the variables are assumed independent, only the variances of the variables for each label need to be determined, and not the entire covariance matrix.

Procedure:-

1. Go to repository → Go to sample → Go to data → Go to deals (Weight).
2. Drag and drop the selected data set onto the main process.
3. Select operator → Modelling → Select Naive Bayes → drag and drop onto the main process.
4. Connect the respective nodes.
5. Run the program.
6. Learn the diagrammatic representation of the classification output of the given example.

Output:-

Fig 5.1 Naïve Bayes classification
Fig 5.2 Text View of Naïve Bayes classification
Fig 5.3 Plot View of Naïve Bayes classification

Conclusion:- Thus, we have learned Naïve Bayes classification using Rapid Miner.
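The theory above notes that only per-class means and variances are needed, never a full covariance matrix. A minimal Gaussian Naive Bayes sketch in plain Python makes that concrete. This is illustrative only (not RapidMiner's operator), and the tiny data set and class names are made up.

```python
# Per class: estimate mean/variance of each feature and a prior;
# predict by comparing log-posteriors under the independence assumption.
import math
from collections import defaultdict

def fit(X, y):
    """Per-class mean/variance per feature, plus class priors."""
    stats, priors = {}, {}
    by_class = defaultdict(list)
    for xs, label in zip(X, y):
        by_class[label].append(xs)
    for label, rows in by_class.items():
        n = len(rows)
        priors[label] = n / len(X)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        stats[label] = (means, variances)
    return stats, priors

def predict(stats, priors, xs):
    """Class with the highest log-posterior under independence."""
    def log_post(label):
        means, variances = stats[label]
        lp = math.log(priors[label])
        for x, m, v in zip(xs, means, variances):
            lp += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return lp
    return max(stats, key=log_post)

X = [[1.0, 1.2], [0.9, 1.0], [3.0, 3.1], [3.2, 2.9]]   # hypothetical data
y = ["small", "small", "large", "large"]
model = fit(X, y)
print(predict(*model, [3.1, 3.0]))
```

Note that each feature contributes its own independent Gaussian term to the log-posterior; that sum of per-feature terms is exactly the "naive" independence assumption.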
Practical No. 6

Aim:- Demonstration of the k-means clustering algorithm using Rapid Miner.

Tool:- Rapid Miner 7.6.001

Theory:-

This operator performs clustering using the k-means algorithm. Clustering is concerned with grouping together objects that are similar to each other and dissimilar to the objects belonging to other clusters; it is a technique for extracting information from unlabeled data. k-means clustering is an exclusive clustering algorithm, i.e. each object is assigned to precisely one of a set of clusters. Objects in one cluster are similar to each other, and the similarity between objects is based on a measure of the distance between them. Clustering can be very useful in many different scenarios; e.g. in a marketing application we may be interested in finding clusters of customers with similar buying behavior.

Procedure:-

1. Go to repository → Go to sample → Go to data → Go to deals (DB).
2. Drag and drop the selected data set onto the main process.
3. Select operator → Modelling → Clustering and segmentation → Select k-Means → drag and drop onto the main process.
4. Connect the respective nodes.
5. Run the program.
6. Learn the diagrammatic representation of the clustering output of the given example.

Output:-

Fig 6.1 k-means clustering algorithm
Fig 6.2 Text View of the k-means clustering algorithm
Fig 6.3 Graph View of the k-means clustering algorithm
Fig 6.4 Centroid Plot View of the k-means clustering algorithm

Conclusion:- Thus, we have learned the k-means clustering algorithm using Rapid Miner.
Practical No. 7

Aim:- Demonstration of market basket analysis using association rule mining in Rapid Miner.

Tool:- Rapid Miner 7.6.001

Theory:-

These models build upon the association rule mining framework, but provide additional analytic capabilities beyond simple associations. The first model allows mining a transactional database for negative patterns, represented as dissociation item sets and dissociation rules. The second model, substitutive item sets, filters items and item sets that can be used interchangeably as substitutes, i.e., item sets that appear in the transactional database in very similar contexts. Finally, the third model, recommendation rules, uses an additional item set interestingness measure, namely coverage, to construct a set of recommended items using a greedy search procedure.

Procedure:-

1. Go to repository → Go to sample → Go to data → Go to deals (Weight).
2. Drag and drop the selected data set onto the main process.
3. Select operator → Modelling → Generate Data → Discretize → Nominal to Binominal → FP-Growth → Multiply → Create Dissociation Rules → Generate Current Selection Context → Multiply → Create Substitutive Sets → Create Recommendation Sets → Multiply.
4. Drag and drop onto the main process.
5. Connect the respective nodes.
6. Run the program.
7. Learn the diagrammatic representation of the output of the given example.

Output:-

Fig 7.1 Market basket analysis using association rules
Fig 7.2 Data table of market basket analysis using association rules

Conclusion:- Thus, we have learned market basket analysis using association rule mining in Rapid Miner.
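All of the models above build on the two basic association-rule measures, support and confidence. A plain-Python sketch (illustrative only, not RapidMiner's FP-Growth operator) over a made-up transactional database:

```python
# Support = fraction of transactions containing an itemset;
# confidence of A => B = support(A union B) / support(A).

transactions = [            # hypothetical market-basket database
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))        # 0.5
print(confidence({"bread"}, {"milk"}))
```

FP-Growth enumerates the frequent itemsets efficiently, but every rule it reports is scored with exactly these two quantities (plus extra measures such as coverage for the recommendation-rule model).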
Practical No: 8

Aim: Study of the KNIME Analytical Platform.

Tool: KNIME 3.4.1

Theory:

KNIME, the Konstanz Information Miner, is an open-source data analytics, reporting and integration platform. It has been used in pharmaceutical research, but is also used in other areas like CRM customer data analysis, business intelligence and financial data analysis. It is based on the Eclipse platform and, through its modular API, is easily extensible. Custom nodes and types can be implemented in KNIME within hours, thus extending KNIME to comprehend and provide first-tier support for highly domain-specific data formats.

1) Technical specification:

• Released in 2004.
• The latest version available is KNIME 2.9.
• Licensed under the GNU General Public License.
• Compatible with Linux, OS X, and Windows.
• Written in Java.
• www.knime.org

2) General features:

• KNIME, pronounced "naim", is a nicely designed data mining tool that runs inside the Eclipse development environment.
• It is a modular data exploration platform that enables the user to visually create data flows (often referred to as pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models.
• The KNIME base version already incorporates over 100 processing nodes for data I/O, pre-processing and cleansing, modelling, analysis and data mining, as well as various interactive views, such as scatter plots, parallel coordinates and others.

3) Specification:

• Integration of the Chemistry Development Kit with additional nodes for the processing of chemical structures, compounds, etc.
• Specialized for enterprise reporting, business intelligence, and data mining.

Advantages:

• It integrates all analysis modules of the well-known Weka data mining environment, and additional plugins allow R scripts to be run, offering access to a vast library of statistical routines.
• It is easy to try out because it requires no installation besides downloading and unarchiving.
• The one aspect of KNIME that truly sets it apart from other data mining packages is its ability to interface with programs that allow for the visualization and analysis of molecular data.

Limitations:

• Has only limited error measurement methods.
• Has no wrapper method for descriptor selection.
• Does not have an automatic facility for parameter optimization of machine learning/statistical methods.

Homepage:

Fig: Main window of KNIME

Conclusion: Hence, we have studied the KNIME Analytical Platform.
Practical No. 9

Aim:- Demonstration of pre-processing on a given data set using the KNIME analytical platform.

Tool:- KNIME 3.4.1

Theory:-

Data pre-processing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: −100), impossible data combinations, missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of the data come first and foremost before running an analysis.

If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data pre-processing includes cleaning, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set. Kotsiantis et al. present a well-known algorithm for each step of data pre-processing.

Procedure:-

1. Go to File → Import KNIME workflow → Browse for the .rar file → select the file → click Next.
2. See your project in the KNIME Explorer.
3. Run the project.

Output:-

Figure 9.1 Pre-processing in KNIME

Conclusion:- Thus, we have learned pre-processing on a given data set using the KNIME analytical platform.
Practical No. 10

Aim:- Demonstration of decision tree learning and prediction using the KNIME analytical platform.

Tool:- KNIME 3.4.1

Theory:-

Decision tree learning uses a decision tree as a predictive model which maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modelling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.

In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data (but the resulting classification tree can be an input for decision making). This practical deals with decision trees in data mining.

Procedure:-

1. Go to File → Import KNIME workflow → Browse for the .rar file → select the file → click Next.
2. See your project in the KNIME Explorer.
3. Run the project.

Output:-

Figure 10.1 Decision tree learning and prediction using KNIME

Conclusion:- Thus, we have learned decision tree learning and prediction using the KNIME analytical platform.
Practical No. 11

Aim:- Demonstration of the k-means clustering algorithm using the KNIME analytical platform.

Tool:- KNIME 3.4.1

Theory:-

K-means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean. This method produces exactly k different clusters of greatest possible distinction. The best number of clusters k, leading to the greatest separation (distance), is not known a priori and must be computed from the data. The objective of k-means clustering is to minimize total intra-cluster variance.

Algorithm:-

1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign objects to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid, or mean, of all objects in each cluster.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.

K-means is a relatively efficient method. However, we need to specify the number of clusters in advance, the final results are sensitive to initialization, and the algorithm often terminates at a local optimum. Unfortunately, there is no global theoretical method to find the optimal number of clusters. A practical approach is to compare the outcomes of multiple runs with different k and choose the best one based on a predefined criterion. In general, a large k probably decreases the error but increases the risk of overfitting.

Procedure:-

1. Go to File → Import KNIME workflow → Browse for the .rar file → select the file → click Next.
2. See your project in the KNIME Explorer.
3. Run the project.

Output:-

Figure 11.1 k-means clustering algorithm using KNIME
Figure 11.2 Data table of the k-means clustering algorithm using KNIME

Conclusion:- Thus, we have learned the k-means clustering algorithm using the KNIME analytical platform.
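The k-means steps listed above can be sketched directly in plain Python (illustrative only; not the KNIME node). To keep the run reproducible, the initial centers are fixed rather than chosen at random, the sketch assumes no cluster ever becomes empty, and the 2-D points are made up.

```python
# Lloyd's iteration: assign each point to its nearest center (step 3),
# then recompute each center as its cluster's centroid (step 4).
import math

def kmeans(points, centers, rounds=10):
    """Run a fixed number of assignment/update rounds; return final state."""
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda i: math.dist(p, centers[i]))
            clusters[i].append(p)
        centers = [tuple(sum(coord) / len(pts) for coord in zip(*pts))
                   for pts in clusters]
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(points, centers=[(0, 0), (9, 9)])
print(centers)
```

On this data the two centers converge to the centroids of the two obvious groups after the first round; a real run would also test for the convergence condition of step 5 instead of using a fixed round count.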
Practical No. 12

Aim:- Study of the Orange mining tool.

Tool:- Orange 3.5

Theory:-

Orange is an open-source data visualization, machine learning and data mining toolkit. It features a visual programming front-end for explorative data analysis and interactive data visualization, and can also be used as a Python library.

Orange is a component-based visual programming software package for data visualization, machine learning, data mining and data analysis. Orange components are called widgets, and they range from simple data visualization, subset selection and pre-processing to empirical evaluation of learning algorithms and predictive modelling. Visual programming is implemented through an interface in which workflows are created by linking predefined or user-designed widgets, while advanced users can use Orange as a Python library.

Orange is an open-source software package released under the GPL. Versions up to 3.0 include core components in C++ with wrappers in Python and are available on GitHub. From version 3.0 onwards, Orange uses common Python open-source libraries for scientific computing, such as numpy, scipy and scikit-learn, while its graphical user interface operates within the cross-platform Qt framework. Orange 3 has a separate GitHub repository.

The default installation includes a number of machine learning, pre-processing and data visualization algorithms in 6 widget sets (data, visualize, classify, regression, evaluate and unsupervised). Additional functionalities are available as add-ons (bioinformatics, data fusion and text mining). Orange is supported on macOS, Windows and Linux, and can also be installed from the Python Package Index repository (pip install Orange). As of 2016 the stable version is 3.3 and runs with Python 3, while the legacy version 2.7, which runs with Python 2.7, is still available for data manipulation and widget alteration.

Features:

Orange consists of a canvas interface onto which the user places widgets and creates a data analysis workflow. Widgets offer basic functionalities such as reading the data, showing a data table, selecting features, training predictors, comparing learning algorithms, visualizing data elements, etc. The user can interactively explore visualizations or feed the selected subset into other widgets.

Fig: Classification Tree widget in Orange 3.0

• Canvas: graphical front-end for data analysis
• Widgets:
  o Data: widgets for data input, data filtering, sampling, imputation, feature manipulation and feature selection
  o Visualize: widgets for common visualization (box plot, histograms, scatter plot) and multivariate visualization (mosaic display, sieve diagram)
  o Classify: a set of supervised machine learning algorithms for classification
  o Regression: a set of supervised machine learning algorithms for regression
  o Evaluate: cross-validation, sampling-based procedures, reliability estimation and scoring of prediction methods
  o Unsupervised: unsupervised learning algorithms for clustering (k-means, hierarchical clustering) and data projection techniques (multidimensional scaling, principal component analysis, correspondence analysis)

Add-ons:

• Associate: widgets for mining frequent item sets and association rule learning
• Bioinformatics: widgets for gene set analysis, enrichment, and access to pathway libraries
• Data fusion: widgets for fusing different data sets, collective matrix factorization, and exploration of latent factors
• Educational: widgets for teaching machine learning concepts, such as k-means clustering, polynomial regression, stochastic gradient descent, ...
• Geo: widgets for working with geospatial data
• Image analytics: widgets for working with images and ImageNet embeddings
• Network: widgets for graph and network analysis
• Text mining: widgets for natural language processing and text mining
• Time series: widgets for time series analysis and modelling
Fig 12.1 Main page of the Orange mining tool

Conclusion:- Thus, we have studied the Orange mining tool.
Practical No: 13

Aim: Demonstration of data visualization using Orange.

Tool: Orange 3.5

Theory:

A linear projection method with explorative data analysis.

Signals

Inputs:
• Data: an input data set
• Data Subset: a subset of data instances

Outputs:
• Selected Data: a data subset that the user has manually selected in the projection

Steps to demonstrate data visualization using Orange:

• Choose which of the available axes are displayed in the projection.
• Set the color of the displayed dots (you will get colored dots for discrete values and grey-scale dots for continuous ones). Set opacity, shape and size to differentiate between instances.
• Set jittering to prevent the dots from overlapping (especially for discrete attributes).
• Use the select, zoom, pan and zoom-to-fit options for exploring the graph. Manual selection of data instances works as a non-angular/free-hand selection tool. Double-click to move the projection; scroll in or out to zoom.
• When the box is ticked (Auto commit is on), the widget will communicate the changes automatically. Alternatively, click Commit.
• Save Image saves the created image to your computer in .svg or .png format.

Output:

Fig 13.1 Data visualization using Orange
Fig 13.2 Scatter plot of data visualization using Orange
Fig 13.3 Classification tree of data visualization using Orange

Conclusion: Hence, data visualization using Orange has been demonstrated.
Practical No: 14

Aim: Demonstration of classification using the Orange mining tool.

Tool: Orange 3.5

Theory:

The projection widget, its signals (Data and Data Subset inputs; Selected Data output) and the exploration steps are the same as described in Practical 13.

Output:

Fig 14.1 Classification using the Orange mining tool
Fig 14.2 Paint Data view of classification using the Orange mining tool

Conclusion: Hence, we have demonstrated classification using the Orange mining tool.
Practical No: 15

Aim: Demonstration of text mining using Orange.

Tool: Orange 3.5

Theory:

The projection widget, its signals (Data and Data Subset inputs; Selected Data output) and the exploration steps are the same as described in Practical 13.

Output:

Fig 15.1 Text mining scenario
Fig 15.2 Query window of Wikipedia
Fig 15.3 Data table of text mining using Orange

Conclusion: Hence, we have demonstrated text mining using the Orange mining tool.
Practical No: 16

Aim: Demonstration of Linear Projection using Orange Mining Tool.

Tool: Orange 3.5

Theory:
Linear Projection is a linear projection method for explorative data analysis.

Signals
Inputs:
• Data: an input data set.
• Data Subset: a subset of data instances.
Outputs:
• Selected Data: the data subset that the user has manually selected in the projection.

Steps to demonstrate Linear Projection using Orange:
• Choose the axes that are displayed in the projection from the list of available axes.
• Set the color of the displayed dots (colored dots for discrete values, grey-scale dots for continuous values). Set opacity, shape and size to differentiate between instances.
• Set jittering to prevent the dots from overlapping (especially useful for discrete attributes).
• Use the select, zoom, pan and zoom-to-fit options to explore the graph. Manual selection of data instances works as a free-hand selection tool. Double-click to move the projection; scroll to zoom in or out.
• When the Auto commit box is ticked, the widget communicates changes automatically; otherwise, click Commit.
• Save Image saves the created image to your computer in .svg or .png format.
Output:
Fig 16.1 Main window of Linear Projection using Orange Mining Tool
Fig 16.2 Paint Data view of Linear Projection using Orange Mining Tool
Fig 16.3 Linear Projection using Orange Mining Tool
Fig 16.4 Rank table of Linear Projection using Orange Mining Tool

Conclusion: Hence, Linear Projection using Orange Mining Tool has been demonstrated.
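A linear projection of the kind this widget draws places one anchor per attribute around a circle and maps each instance to the weighted sum of those anchors. The following is a minimal sketch of that mapping, assuming evenly spaced anchors and attribute values that are already normalised (this is an illustration of the general idea, not Orange's exact implementation):

```python
import math

# Four attributes, each assigned an anchor point on the unit circle.
n_attrs = 4
anchors = [(math.cos(2 * math.pi * i / n_attrs),
            math.sin(2 * math.pi * i / n_attrs)) for i in range(n_attrs)]

def project(instance):
    """Map an n-dimensional instance to 2-D: the sum of anchor
    vectors weighted by the (assumed normalised) attribute values."""
    x = sum(v * ax for v, (ax, ay) in zip(instance, anchors))
    y = sum(v * ay for v, (ax, ay) in zip(instance, anchors))
    return x, y

# An instance with weight only on the first attribute lands on that anchor.
print(project([1.0, 0.0, 0.0, 0.0]))  # -> (1.0, 0.0)
```

Moving an anchor (as one does interactively in the widget) changes the weight of that attribute in the 2-D picture, which is what makes the projection useful for exploration.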
Practical No: 17

Aim: Study of Net Tools Spider.

Tool: Net Tools Spider

Theory:
A web spider is a software program that searches the Internet for information. The basic process of a web spider is to download a web page and search it for links to other web pages, then repeat this behaviour on each new page it finds. By repeating this process, a web spider can find all of the pages within a web site, and ultimately all of the pages on the Internet.

Web spiders serve many purposes. The most common spiders are used by search engines such as Google, Yahoo and AltaVista. Their web spiders search the Internet for web pages and then create indexes of all the words found on those pages, which allows us to search the Internet quickly and easily.

Net Tools Spider is a multi-functional web spider that supports:
• Web Site Downloading: spidering web sites and saving the web pages and files that it finds to your hard drive.
• Web Mining: spidering web sites and extracting pieces of information to be used for other purposes.
• Link Checking: spidering web sites and searching for broken links.
• Web Site Searching: spidering web sites and searching for files that contain certain keywords.

Today, most Internet users limit their searches to the Web, so this discussion is limited to search engines that focus on the contents of Web pages. Before a search engine can tell you where a file or document is, it must be found. To find information on the hundreds of millions of Web pages that exist, a search engine employs special software robots, called spiders, to build lists of the words found on Web sites. When a spider is building its lists, the process is called Web crawling. (There are some disadvantages to calling part of the Internet the World Wide Web; a large set of arachnid-centric names for tools is one of them.)
In order to build and maintain a useful list of words, a search engine's spiders have to look at a lot of pages.
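The download-extract-repeat process described above can be sketched as a breadth-first crawl. The snippet below uses only the Python standard library; to stay self-contained it crawls two invented in-memory "pages" instead of the live web (a real crawler would supply a `fetch` function that downloads URLs, e.g. with `urllib.request`):

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first search over the link graph, visiting each URL once."""
    seen, queue, pages = {start_url}, deque([start_url]), []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        pages.append(url)
        for link in extract_links(fetch(url)):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# Two invented in-memory "pages" standing in for a web site.
site = {"/index": '<a href="/about">About</a><a href="/index">Home</a>',
        "/about": "<p>No links here.</p>"}
print(crawl("/index", lambda url: site.get(url, "")))  # ['/index', '/about']
```

The `seen` set is what keeps the spider from revisiting pages, and `max_pages` (or a depth limit) is what keeps a real crawl bounded.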
How does a spider start its travels over the Web? The usual starting points are lists of heavily used servers and very popular pages. The spider begins with a popular site, indexing the words on its pages and following every link found within the site. In this way, the spidering system quickly spreads out across the most widely used portions of the Web.

Google began as an academic search engine. In the paper that describes how the system was built, Sergey Brin and Lawrence Page give an example of how quickly their spiders can work. They built their initial system to use multiple spiders, usually three at one time. Each spider could keep about 300 connections to Web pages open at a time. At its peak performance, using four spiders, their system could crawl over 100 pages per second, generating around 600 kilobytes of data each second.

Keeping everything running quickly meant building a system to feed necessary information to the spiders. The early Google system had a server dedicated to providing URLs to the spiders. Rather than depending on an Internet service provider for the domain name server (DNS) that translates a server's name into an address, Google had its own DNS, in order to keep delays to a minimum.

When the Google spider looked at an HTML page, it took note of two things:
• the words within the page;
• where the words were found.
Words occurring in the title, subtitles, meta tags and other positions of relative importance were noted for special consideration during a subsequent user search. The Google spider was built to index every significant word on a page, leaving out the articles "a," "an" and "the." Other spiders take different approaches, usually attempting to make the spider operate faster, allow users to search more efficiently, or both.
For example, some spiders will keep track of the words in the title, sub-headings and links, along with the 100 most frequently used words on the page and each word in the first 20 lines of text. Lycos is said to use this approach to spidering the Web. Other systems, such as AltaVista, go in the other direction, indexing every single word on a page, including "a," "an," "the" and other "insignificant" words. The push to completeness in this approach is matched by other systems in the attention given to the unseen portion of the Web page, the meta tags.
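The indexing strategies described above (recording the words on a page and where they were found, while skipping "a", "an" and "the") can be sketched as a tiny inverted index. The page names and texts below are invented for illustration:

```python
# Minimal inverted index: each word maps to the (page, position)
# pairs where it occurs, skipping the articles "a", "an" and "the".
STOP = {"a", "an", "the"}

def index_page(inverted, url, text):
    """Add every significant word of `text` to the index, with its position."""
    for pos, word in enumerate(text.lower().split()):
        if word in STOP:
            continue
        inverted.setdefault(word, []).append((url, pos))

inverted = {}
index_page(inverted, "page1", "The spider crawls the web")
index_page(inverted, "page2", "A web index lists words")

print(inverted["web"])  # [('page1', 4), ('page2', 1)]
```

Recording positions is what lets a search engine weight words found in titles or headings more heavily than words found deep in the body text.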
Mining Web:
• Project Summary
• After running Web Miner, the result is seen as follows.

Conclusion: In this practical, Net Tools Spider was studied and a website was mined up to two levels.