Rapid Miner is an open-source data mining software tool. It provides functionality for data loading, preprocessing, transformation, data mining, modeling, evaluation, and deployment. Rapid Miner uses learning schemes and attribute evaluators from Weka and statistical modeling schemes from R. It can be used for tasks such as text mining, feature engineering, and distributed data mining. Rapid Miner includes a graphical user interface for designing analytical workflows from operators, and it can also be used as an API or invoked from the command line.
Distributed Database Practicals
DISTRIBUTED DATABASE SYSTEM
CSIT Dept., SGBAU Amravati
Practical No: 1
Aim: Study of Rapid Miner Tools.
Tool: Rapid Miner 7.6.001
Theory:
Rapid Miner provides data mining and machine learning procedures including: data loading and transformation (extract, transform, load, a.k.a. ETL), data preprocessing and visualization, modelling, evaluation, and deployment. Rapid Miner is written in the Java programming language. It uses learning schemes and attribute evaluators from the Weka machine learning environment and statistical modelling schemes from the R Project.
Rapid Miner can define analytical steps (similar to R) and be used for analyzing data generated by high-throughput instruments such as those used in genotyping, proteomics, and mass spectrometry. It can be used for text mining, multimedia mining, feature engineering, data stream mining, development of ensemble methods, and distributed data mining. Rapid Miner functionality can be extended with additional plugins.
Rapid Miner provides a GUI to design an analytical pipeline (the "operator tree"). The GUI generates an XML (Extensible Markup Language) file that defines the analytical processes the user wishes to apply to the data. Alternatively, the engine can be called from other programs or used as an API, and individual functions can be called from the command line.
Rapid Miner is open-source and is offered free of charge as a Community
Edition released under the GNU AGPL.
Rapid Miner is a leading open-source data mining solution worldwide, owing to the combination of its leading-edge technology and its functional range, and its applications cover a wide range of real-world data mining tasks. Using Rapid Miner one can explore data, simplify the construction of analysis processes, and evaluate different approaches, trying to find the best combination of preprocessing and learning steps manually or letting Rapid Miner do so automatically.
More than 400 data mining operators can be used and almost arbitrarily combined. The setup is described by XML files which can easily be created with a graphical user interface (GUI). This XML-based scripting language turns Rapid Miner into an integrated development environment (IDE) for machine learning and data mining. Rapid Miner follows the concept of rapid prototyping, leading very quickly to the desired results. Furthermore, Rapid Miner can be used as a Java data mining library.
The development of most of the Rapid Miner concepts started in 2001 at the Artificial Intelligence Unit of the University of Dortmund. Several members of the unit started to implement and realize these concepts, which led to a first version of Rapid Miner in 2002. Since 2004, the open-source version of Rapid Miner (GPL) has been hosted on SourceForge. Since then, a large number of suggestions and extensions by external developers have also been incorporated into Rapid Miner. Today, both the open-source version and a closed-source version of Rapid Miner are maintained by Rapid-I. Although Rapid Miner is completely free and open-source, it offers many methods and possibilities not covered by other data mining suites, whether open-source or proprietary.
Features of Rapid Miner:
• Freely available open-source knowledge discovery environment
• 100% pure Java (runs on every major platform and operating system)
• KD processes are modelled as simple operator trees, which is both intuitive and powerful
• Operator trees or subtrees can be saved as building blocks for later reuse
• Internal XML representation ensures a standardized interchange format for data mining experiments
• Simple scripting language allowing for automatic large-scale experiments
• Multi-layered data view concept ensures efficient and transparent data handling
• Flexibility in using Rapid Miner:
• Graphical user interface (GUI) for interactive prototyping
• Command line mode (batch mode) for automated large-scale applications
• Java API (application programming interface) to ease usage of Rapid Miner from your own programs
• Simple plug-in and extension mechanisms; a broad variety of plugins already exists and you can easily add your own
• Powerful plotting facility offering a large set of sophisticated high-dimensional visualization techniques for data and models
• More than 400 machine learning, evaluation, input and output, pre- and post-processing, and visualization operators, plus numerous meta-optimization schemes
• Rapid Miner has been successfully applied to a wide range of applications where its rapid prototyping abilities demonstrated their usefulness, including text mining, multimedia mining, feature engineering, data stream mining and tracking drifting concepts, development of ensemble methods, and distributed data mining.
Homepage:
Fig 1.1 Operator and Parameter in Rapid Miner
Fig 1.2 Repository Windows
Conclusion: Hence, we have studied the Rapid Miner Tool.
Practical No. 2
Aim:- Demonstration of pre-processing on a given data set using Rapid Miner.
Tool:- Rapid Miner 5.3.000
Theory:-
Data pre-processing describes any type of processing performed on raw data to prepare it for another processing procedure. Commonly used as a preliminary data mining practice, data pre-processing transforms the data into a format that will be more easily and effectively processed for the user's purpose, for example as input to a neural network. There are a number of different tools and methods used for pre-processing, including: sampling, which selects a representative subset from a large population of data; transformation, which manipulates raw data to produce a single input; denoising, which removes noise from data; normalization, which organizes data for more efficient access; and feature extraction, which pulls out specified data that is significant in some particular context.
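The steps named above can also be sketched outside Rapid Miner. The following minimal Python example (illustrative only; the table and column names are hypothetical, and pandas/scikit-learn are assumed to be installed) shows sampling, imputation, transformation, and normalization on a tiny data set:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw data with a missing value in each column.
df = pd.DataFrame({
    "age": [25, 32, None, 47, 51, 38],
    "income": [30000, 54000, 41000, None, 98000, 62000],
})

sample = df.sample(frac=0.5, random_state=42)  # sampling: a representative subset
df = df.fillna(df.mean(numeric_only=True))     # cleaning: impute missing values
df["log_income"] = np.log(df["income"])        # transformation: derive a single input
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])  # normalization
print(df)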
Procedure:-
1. Create a .arff file.
2. Go to Repository → Import CSV File → Data Import Wizard → follow the steps → name the data repository.
3. Study the graphical statistical output of the example.
Fig 2.2 Text view
Fig 2.3 Decision Tree view
Conclusion:- Thus, we have learned pre-processing on a given data set using Rapid Miner.
Practical No. 3
Aim:-Demonstration of DBSCAN clustering algorithm using Rapid Miner.
Tool:- Rapid Miner 5.3.000
Theory:-
DBSCAN's definition of a cluster is based on the notion of density reachability. Basically, a point q is directly density-reachable from a point p if it is not farther away than a given distance epsilon (i.e. it is part of p's epsilon-neighborhood) and if p is surrounded by sufficiently many points that one may consider p and q to be part of a cluster. q is called density-reachable (note the distinction from "directly density-reachable") from p if there is a sequence p(1), …, p(n) of points with p(1) = p and p(n) = q where each p(i+1) is directly density-reachable from p(i).
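As a cross-check outside Rapid Miner, the same algorithm can be run in a few lines of Python with scikit-learn. This is only an illustrative sketch; the data set and the eps/min_samples values are arbitrary assumptions:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape that density-based clustering handles well.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps is the epsilon-neighborhood radius; min_samples is the density threshold.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; -1 marks noise points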
Procedure:-
1. Go to Operators → Modeling → Clustering and Segmentation.
2. Drag & drop the DBSCAN operator onto the main process.
3. Drag & drop the selected data set (DB) onto the main process.
4. Connect the respective nodes.
5. Run the program.
Output:-
Fig 3.1 DBSCAN clustering algorithm
Fig 3.2 Text View of DBSCAN
Fig 3.3 Graph View of DBSCAN
Conclusion:- Thus, we have learned DBSCAN clustering algorithm using Rapid
Miner.
Practical No. 4
Aim:- Demonstration of decision tree using Rapid Miner.
Tool:- Rapid Miner 5.3.000
Theory:-
Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.
A decision tree for the concept buys computer indicates whether a customer at All Electronics is likely to purchase a computer. Each leaf node represents a class (either buys_computer = yes or buys_computer = no).
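For intuition, the same kind of tree can be induced in a few lines of Python with scikit-learn. This is only an illustrative sketch; the tiny one-hot-encoded buys_computer table below is hypothetical:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training tuples: attribute tests on the internal nodes,
# class labels (yes/no) on the leaves.
data = pd.DataFrame({
    "age_youth":   [1, 1, 0, 0, 0, 1],
    "income_high": [1, 1, 1, 0, 0, 0],
    "student":     [0, 0, 0, 1, 1, 1],
    "buys":        ["no", "no", "yes", "yes", "yes", "yes"],
})

tree = DecisionTreeClassifier(criterion="entropy").fit(data.drop(columns="buys"), data["buys"])
print(export_text(tree, feature_names=["age_youth", "income_high", "student"]))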
Procedure:-
1. Go to Repository → Samples → data → Golf.
2. Drag & drop the selected data set onto the main process.
3. Go to Operators → Modeling → Classification and Regression → Tree Induction → Decision Tree, then drag and drop it onto the main process.
4. Connect the respective nodes.
5. Run the program.
6. Study the diagrammatic representation of the classification output for the given example.
Fig 4.2 Text View of Decision Tree
Fig 4.3 Tree View of Decision Tree
Conclusion:-Thus, we have learned decision trees using Rapid Miner.
Practical No. 5
Aim:- Demonstration of Naïve Bayes classification on a given data set using Rapid Miner.
Tool :- Rapid Miner 5.3.000
Theory:-
A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be 'independent feature model'. In simple terms, a Naive Bayes classifier assumes that the presence (or absence) of a particular feature (i.e. attribute) of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a Naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.
The advantage of the Naive Bayes classifier is that it only requires a small amount of training data to estimate the means and variances of the variables necessary for classification. Because the variables are assumed independent, only the variances of the variables for each label need to be determined, and not the entire covariance matrix.
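The estimation of per-class means and variances described above is exactly what a Gaussian Naive Bayes implementation does. A minimal hedged sketch in Python with scikit-learn (illustrative only; the iris data set stands in for the practical's data):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB().fit(X_train, y_train)  # estimates per-class means and variances
print("accuracy:", model.score(X_test, y_test))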
Procedure:-
1. Go to Repository → Samples → data → Weighting.
2. Drag & drop the selected data set onto the main process.
3. Go to Operators → Modeling → Classification and Regression → Bayesian Modelling → Naive Bayes, then drag and drop it onto the main process.
4. Connect the respective nodes.
5. Run the program.
6. Study the diagrammatic representation of the classification output for the given example.
Fig 5.2 Text View of Naïve Bayes classification
Fig 5.3 Plot View of Naïve Bayes classification
Conclusion:-
Thus, we have learned Naïve Bayes classification using Rapid Miner.
Practical No. 6
Aim:- Demonstration of k-means clustering algorithm using Rapid Miner.
Tool:- Rapid Miner 7.6.001
Theory:-
This operator performs clustering using the k-means algorithm. Clustering is concerned with grouping together objects that are similar to each other and dissimilar to the objects belonging to other clusters; it is a technique for extracting information from unlabeled data. k-means is an exclusive clustering algorithm, i.e. each object is assigned to precisely one of a set of clusters. Objects in one cluster are similar to each other, and the similarity between objects is based on a measure of the distance between them.
Clustering can be very useful in many different scenarios; e.g. in a marketing application we may be interested in finding clusters of customers with similar buying behavior.
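A minimal Python sketch of the same idea with scikit-learn (illustrative only; the synthetic blobs below are an assumption standing in for, say, customer data):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three synthetic, well-separated groups stand in for e.g. customer segments.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # one centroid per cluster
print(km.labels_[:10])      # each object is assigned to exactly one cluster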
Procedure:-
1. Go to Repository → Samples → data and choose a data set (DB).
2. Drag & drop the selected data set onto the main process.
3. Go to Operators → Modeling → Clustering and Segmentation → k-Means, then drag and drop it onto the main process.
4. Connect the respective nodes.
5. Run the program.
6. Study the diagrammatic representation of the clustering output for the given example.
Output :-
Fig 6.1 k-means clustering algorithm
Fig 6.2 Text View of k-means clustering algorithm
Fig 6.3 Graph View of k-means clustering algorithm
Fig 6.4 Centroid Plot View of k-means clustering algorithm
Conclusion:-Thus, we have learned k-means clustering algorithm using Rapid Miner.
Practical No. 7
Aim:- Demonstration of Market Basket analysis using Association rule mining in
Rapid Miner.
Tool:- Rapid Miner 7.6.001
Theory:-
These models build upon the association rule mining framework, but provide additional analytic capabilities beyond simple associations. The first model allows mining a transactional database for negative patterns, represented as dissociation item sets and dissociation rules. The second model, substitutive item sets, filters items and item sets that can be used interchangeably as substitutes, i.e., item sets that appear in the transactional database in very similar contexts. Finally, the third model, recommendation rules, uses an additional item set interestingness measure, namely coverage, to construct a set of recommended items using a greedy search procedure.
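For the classic (positive) half of this framework, frequent item sets and association rules can be mined in Python; a minimal hedged sketch using the third-party mlxtend library (the toy baskets and thresholds are assumptions):

import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

# One row per transaction, one boolean column per item (hypothetical baskets).
baskets = pd.DataFrame(
    [[True, True, False, True],
     [True, True, True, False],
     [False, True, True, False],
     [True, True, True, False]],
    columns=["milk", "bread", "butter", "jam"],
)

frequent = fpgrowth(baskets, min_support=0.5, use_colnames=True)  # FP-Growth, as in the Rapid Miner flow
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])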
Procedure:-
1. Go to Repository → Samples → data → Weighting.
2. Drag & drop the selected data set onto the main process.
3. Select the operators: Modeling → Generate Data → Discretize → Nominal to Binominal → FP-Growth → Multiply → Create Dissociation Rules → Generate Current Selection Context → Multiply → Create Substitutive Sets → Create Recommendation Sets → Multiply.
4. Drag and drop them onto the main process.
5. Connect the respective nodes.
6. Run the program.
7. Study the diagrammatic representation of the output for the given example.
Output:-
Fig 7.1 Market Basket analysis using Association rule
Fig 7.2 Data Table Of Market Basket analysis using Association rule
Conclusion:-
Thus, we have learned Market Basket analysis using Association rule mining
in Rapid Miner.
Practical No: 8
Aim: Study of KNIME Analytical Platform.
Tool: KNIME 3.4.1
Theory:
KNIME:
KNIME, the Konstanz Information Miner, is an open-source data analytics, reporting and integration platform. It has been used in pharmaceutical research, but is also used in other areas like CRM customer data analysis, business intelligence and financial data analysis. It is based on the Eclipse platform and, through its modular API, is easily extensible. Custom nodes and types can be implemented in KNIME within hours, thus extending KNIME to comprehend and provide first-tier support for highly domain-specific data formats.
1) Technical Specification:
Released in 2004.
Latest version available is KNIME 2.9.
Licensed under the GNU General Public License.
Compatible with Linux, OS X, Windows.
Written in Java.
www.knime.org
2) General Features:
KNIME, pronounced "naim", is a nicely designed data mining tool that runs inside the Eclipse development environment.
It is a modular data exploration platform that enables the user to visually create data flows (often referred to as pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models.
The KNIME base version already incorporates over 100 processing nodes for data I/O, pre-processing and cleansing, modelling, analysis and data mining, as well as various interactive views such as scatter plots, parallel coordinates and others.
3) Specification:
Integration of the Chemistry Development Kit, with additional nodes for the processing of chemical structures, compounds, etc.
Specialized for enterprise reporting, business intelligence and data mining.
Advantages:
It integrates all analysis modules of the well-known Weka data mining environment, and additional plugins allow R scripts to be run, offering access to a vast library of statistical routines.
It is easy to try out because it requires no installation besides downloading and unarchiving.
The one aspect of KNIME that truly sets it apart from other data mining packages is its ability to interface with programs that allow for the visualization and analysis of molecular data.
Limitations:
It has only limited error measurement methods.
It has no wrapper method for descriptor selection.
It has no automatic facility for parameter optimization of machine learning/statistical methods.
Homepage:
Fig: Main Window of KNIME
Conclusion: Hence, we have studied the KNIME Analytical Platform.
Practical No. 9
Aim:- Demonstration of pre-processing on a given data set using the KNIME analytical platform.
Tool:-KNIME 3.4.1
Theory:-
Data pre-processing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: −100), impossible data combinations, missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of data come first and foremost before running an analysis.
If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data pre-processing includes cleaning, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set. Kotsiantis et al. present a well-known algorithm for each step of data pre-processing.
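The out-of-range and missing-value problems mentioned above can be illustrated in a few lines of Python; a minimal hedged sketch (the income column is hypothetical):

import pandas as pd

# Hypothetical loosely controlled data: an impossible value and a gap.
df = pd.DataFrame({"income": [52000, -100, 61000, None, 48000]})

df.loc[df["income"] < 0, "income"] = None                  # treat out-of-range values (e.g. Income: -100) as missing
df["income"] = df["income"].fillna(df["income"].median())  # impute missing values
print(df)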
Procedure:-
1. Go to File → Import KNIME Workflow → browse for the .rar file → select the file → click Next.
2. See your project in the KNIME Explorer.
3. Run the project.
Output:-
Figure 9.1 Pre-processing in KNIME
Conclusion:-
Thus, we have learned pre-processing on a given data set using the KNIME analytical platform.
Practical No. 10
Aim:- Demonstration of decision tree learning and predicting using the KNIME analytical platform.
Tool:-KNIME 3.4.1
Theory:-
Decision tree learning uses a decision tree as a predictive model which maps
observations about an item (represented in the branches) to conclusions about the
item's target value (represented in the leaves). It is one of the predictive modelling
approaches used in statistics, data mining and machine learning. Tree models where
the target variable can take a finite set of values are called classification trees; in these
tree structures, leaves represent class labels and branches represent conjunctions of
features that lead to those class labels. Decision trees where the target variable can
take continuous values (typically real numbers) are called regression trees.
In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data (but the resulting classification tree can be an input for decision making). This practical deals with decision trees in data mining.
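The classification-tree/regression-tree distinction above can be made concrete in Python; a brief hedged sketch with scikit-learn (illustrative only, separate from the KNIME workflow):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy samples of y = sin(x): a continuous target, so a regression tree applies.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

reg = DecisionTreeRegressor(max_depth=3).fit(X, y)  # leaves hold real-valued predictions
print(reg.predict([[1.5], [4.5]]))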
Procedure:-
1. Go to File → Import KNIME Workflow → browse for the .rar file → select the file → click Next.
2. See your project in the KNIME Explorer.
3. Run the project.
Output:-
Figure 10.1 Decision tree learning and predicting using the KNIME analytical platform
Conclusion:-
Thus, we have learned decision tree learning and predicting using the KNIME analytical platform.
Practical No. 11
Aim:- Demonstration of the k-means clustering algorithm using the KNIME analytical platform.
Tool:-KNIME 3.4.1
Theory:-
K-Means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean. This method produces exactly k different clusters of the greatest possible distinction. The best number of clusters k, leading to the greatest separation (distance), is not known a priori and must be computed from the data. The objective of K-Means clustering is to minimize total intra-cluster variance.
Algorithm:-
1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign objects to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid, or mean, of all objects in each cluster.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds (a from-scratch sketch of these steps follows the next paragraph).
K-Means is a relatively efficient method. However, we need to specify the number of clusters in advance, and the final result is sensitive to initialization and often terminates at a local optimum. Unfortunately, there is no global theoretical method to find the optimal number of clusters. A practical approach is to compare the outcomes of multiple runs with different k and choose the best one based on a predefined criterion. In general, a large k probably decreases the error but increases the risk of overfitting.
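A minimal from-scratch Python sketch of steps 2–5 above (illustrative only; no empty-cluster handling, and the synthetic data is an assumption):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # step 2: k random points as centers
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                  # step 3: assign to closest center (Euclidean)
        centers_new = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # step 4: recompute centroids
        if np.allclose(centers_new, centers):          # step 5: stop once assignments stabilize
            break
        centers = centers_new
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, (50, 2)) for m in (0, 3, 6)])
labels, centers = kmeans(X, k=3)
print(centers)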
Procedure:-
1. Go to File → Import KNIME Workflow → browse for the .rar file → select the file → click Next.
2. See your project in the KNIME Explorer.
3. Run the project.
Output:-
Figure 11.1 k-means clustering algorithm using the KNIME analytical platform
Figure 11.2 Data table of the k-means clustering algorithm using the KNIME analytical platform
Conclusion:- Thus, we have learned the k-means clustering algorithm using the KNIME analytical platform.
Practical No. 12
Aim:- Study of the Orange mining tool.
Tool:- Orange 3.5
Theory:- Orange is an open-source data visualization, machine learning and data
mining toolkit. It features a visual programming front-end for explorative data
analysis and interactive data visualization, and can also be used as a Python library.
Orange is a component-based visual programming software package for data
visualization, machine learning, data mining and data analysis.
Orange components are called widgets and they range from simple data visualization,
subset selection and pre-processing, to empirical evaluation of learning algorithms
and predictive modeling.
Visual programming is implemented through an interface in which workflows are created by linking predefined or user-designed widgets, while advanced users can use Orange as a Python library. Orange is an open-source software package released under the GPL. Versions up to 3.0 include core components in C++ with wrappers in Python and are available on GitHub. From version 3.0 onwards, Orange uses common Python open-source libraries for scientific computing, such as numpy, scipy and scikit-learn, while its graphical user interface operates within the cross-platform Qt framework. Orange 3 has a separate GitHub repository.
The default installation includes a number of machine learning, pre-processing and data visualization algorithms in 6 widget sets (data, visualize, classify, regression, evaluate and unsupervised). Additional functionalities are available as add-ons (bioinformatics, data fusion and text mining).
Orange is supported on macOS, Windows and Linux and can also be installed from the Python Package Index repository (pip install Orange). As of 2016, the stable version is 3.3 and runs with Python 3, while the legacy version 2.7, which runs with Python 2.7, is still available for data manipulation and widget alteration.
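Because Orange is also a Python library, the same ideas can be scripted directly; a minimal hedged sketch (Orange 3 API of roughly this era; treat the exact call forms as assumptions if your version differs):

import Orange

data = Orange.data.Table("iris")               # a built-in sample data set
learner = Orange.classification.TreeLearner()  # a classification tree learner

# Cross-validate the learner, then report classification accuracy.
results = Orange.evaluation.CrossValidation(data, [learner], k=10)
print(Orange.evaluation.CA(results))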
Features
Orange consists of a canvas interface onto which the user places widgets and creates a
data analysis workflow. Widgets offer basic functionalities such as reading the data,
showing a data table, selecting features, training predictors, comparing learning
algorithms, visualizing data elements, etc. The user can interactively explore
visualizations or feed the selected subset into other widgets.
Fig: Classification Tree widget in Orange 3.0
Canvas: graphical front-end for data analysis
Widgets:
o Data: widgets for data input, data filtering, sampling, imputation, feature manipulation and feature selection
o Visualize: widgets for common visualization (box plot, histograms, scatter plot) and multivariate visualization (mosaic display, sieve diagram)
o Classify: a set of supervised machine learning algorithms for classification
o Regression: a set of supervised machine learning algorithms for regression
o Evaluate: cross-validation, sampling-based procedures, reliability estimation and scoring of prediction methods
o Unsupervised: unsupervised learning algorithms for clustering (k-means, hierarchical clustering) and data projection techniques (multidimensional scaling, principal component analysis, correspondence analysis)
Add-ons:
Associate: widgets for mining frequent item sets and association rule learning
Bioinformatics: widgets for gene set analysis, enrichment, and access to pathway libraries
Data fusion: widgets for fusing different data sets, collective matrix factorization, and exploration of latent factors
Educational: widgets for teaching machine learning concepts, such as k-means clustering, polynomial regression, stochastic gradient descent, ...
Geo: widgets for working with geospatial data
Image analytics: widgets for working with images and ImageNet embeddings
Network: widgets for graph and network analysis
Text mining: widgets for natural language processing and text mining
Time series: widgets for time series analysis and modeling
Fig 12.1 Main Page of Orange mining tool
Conclusion:-Thus, we have studied Orange mining tool.
Practical No: 13
Aim: Demonstration of Data visualization using Orange.
Tool: Orange 3.5
Theory:
A linear projection method for explorative data analysis.
Signals
Inputs:
Data: an input data set
Data Subset: a subset of data instances
Outputs:
Selected Data: a data subset that the user has manually selected in the projection
Steps to demonstrate Linear Projection using Orange
Choose which axes are displayed in the projection and which other axes are available.
Set the color of the displayed dots (you will get colored dots for discrete values and grey-scale dots for continuous ones). Set opacity, shape and size to differentiate between instances.
Set jittering to prevent the dots from overlapping (especially for discrete attributes).
Use the select, zoom, pan and zoom-to-fit options to explore the graph. Manual selection of data instances works as a non-angular/free-hand selection tool.
Double-click to move the projection; scroll to zoom in or out.
When the box is ticked (Auto commit is on), the widget will communicate the changes automatically. Alternatively, click Commit.
Save Image saves the created image to your computer in .svg or .png format.
Output:
Fig 13.1 Data visualization using Orange
Fig 13.2 Scatter Plot of Data visualization using Orange
Fig 13.3 Classification tree of Data visualization using Orange
Conclusion:
Hence, Data visualization using Orange has been demonstrated.
Practical No: 14
Aim: Demonstration of classification using Orange Mining Tool.
Tool: Orange 3.5
Theory:
This practical reuses the Linear Projection widget; its signals and steps are described under Practical 13.
Output:
Fig 14.1 classification using Orange Mining Tool
Fig 14.2 Point Data View of classification using Orange Mining Tool
Conclusion:
Hence, we have demonstrated classification using Orange Mining Tool.
Practical No: 15
Aim: Demonstration of text mining using Orange.
Tool: Orange 3.5
Theory:
This practical reuses the Linear Projection widget; its signals and steps are described under Practical 13.
Fig 15.1 Text mining scenario
Fig 15.3 Query Window of Wikipedia
Output:
Fig 15.2 Data Table of text mining using Orange
Conclusion:
Hence, we have demonstrated text mining using Orange Mining Tool.
Practical No: 16
Aim: Demonstration of Linear Projection using Orange Mining Tool.
Tool: Orange 3.5
Theory:
A linear projection method for explorative data analysis.
Signals
Inputs:
Data: an input data set
Data Subset: a subset of data instances
Outputs:
Selected Data: a data subset that the user has manually selected in the projection
Steps to demonstrate Linear Projection using Orange
Choose which axes are displayed in the projection and which other axes are available.
Set the color of the displayed dots (you will get colored dots for discrete values and grey-scale dots for continuous ones). Set opacity, shape and size to differentiate between instances.
Set jittering to prevent the dots from overlapping (especially for discrete attributes).
Use the select, zoom, pan and zoom-to-fit options to explore the graph. Manual selection of data instances works as a non-angular/free-hand selection tool.
Double-click to move the projection; scroll to zoom in or out.
When the box is ticked (Auto commit is on), the widget will communicate the changes automatically. Alternatively, click Commit.
Save Image saves the created image to your computer in .svg or .png format.
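Outside the Orange GUI, a comparable two-dimensional linear projection can be computed in Python; a minimal hedged sketch using PCA from scikit-learn (illustrative only, not the widget's exact projection method):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
proj = PCA(n_components=2).fit_transform(X)  # project the 4-D measurements onto 2 linear axes
print(proj[:5])
print(y[:5])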
Output:
Fig 16.1 Main window of Linear Projection using Orange Mining Tool
Fig 16.2 Paint Data view of Linear Projection using Orange Mining Tool
Fig 16.3 Linear Projection using Orange Mining Tool
Fig 16.4 Rank Table of Linear Projection using Orange Mining Tool
Conclusion: Hence Linear Projection using Orange Mining Tool has been
demonstrated.
Practical No: 17
Aim: Study of Net Tool Spider.
Tool: Net Tool Spider
Theory:
Net Tool Spider:
A web spider is a software program that searches the Internet for information. The basic process of a web spider is to download a web page and to search that page for links to other web pages. It then repeats this behavior for each new page that it finds. By repeating this process, a web spider can find all of the pages within a web site, and ultimately all of the pages on the Internet.
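That download-extract-repeat loop is small enough to sketch in Python using only the standard library. This is an illustrative toy crawler (no robots.txt handling, politeness delays, or depth limits, all of which a real spider needs):

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    # Collect the href targets of all <a> tags on a page.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    seen, frontier = set(), [start_url]
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)  # breadth-first: visit the oldest discovered page next
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that fail to download
        parser = LinkParser()
        parser.feed(html)
        frontier.extend(urljoin(url, link) for link in parser.links)
    return seen

print(crawl("https://example.com"))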
Web spiders can have many purposes. The most common spiders are used by search
engines like Google, Yahoo and AltaVista. Their web spiders search the Internet for
web pages and then create indexes of all the words found on the pages. This allows us
to search the Internet quickly and easily.
Net Tools Spider is a multi-functional web spider that supports:
• Web Site Downloading: spidering web sites and saving the web pages and files that it finds to your hard drive.
• Web Mining: spidering web sites and extracting pieces of information to be used for other purposes.
• Link Checking: spidering web sites and searching for broken links.
• Web Site Searching: spidering web sites and searching for files that contain certain keywords.
Today, most Internet users limit their searches to the Web, so we'll limit this article to
search engines that focus on the contents of Web pages.
Before a search engine can tell you where a file or document is, it must be found. To
find information on the hundreds of millions of Web pages that exist, a search engine
employs special software robots, called spiders, to build lists of the words found on
Web sites. When a spider is building its lists, the process is called Web crawling.
(There are some disadvantages to calling part of the Internet the World Wide Web -- a
large set of arachnid-centric names for tools is one of them.) In order to build and
maintain a useful list of words, a search engine's spiders have to look at a lot of pages.
How does any spider start its travels over the Web? The usual starting points are lists
of heavily used servers and very popular pages. The spider will begin with a popular
site, indexing the words on its pages and following every link found within the site. In
this way, the spidering system quickly begins to travel, spreading out across the most
widely used portions of the Web.
Google began as an academic search engine. In the paper that describes how the
system was built, Sergey Brin and Lawrence Page give an example of how quickly
their spiders can work. They built their initial system to use multiple spiders, usually
three at one time. Each spider could keep about 300 connections to Web pages open at
a time. At its peak performance, using four spiders, their system could crawl over 100
pages per second, generating around 600 kilobytes of data each second.
Keeping everything running quickly meant building a system to feed necessary
information to the spiders. The early Google system had a server dedicated to
providing URLs to the spiders. Rather than depending on an Internet service provider
for the domain name server (DNS) that translates a server's name into an address,
Google had its own DNS, in order to keep delays to a minimum.
When the Google spider looked at an HTML page, it took note of two things:
· The words within the page
· Where the words were found
Words occurring in the title, subtitles, meta tags and other positions of relative
importance were noted for special consideration during a subsequent user search. The
Google spider was built to index every significant word on a page, leaving out the
articles "a," "an" and "the." Other spiders take different approaches.
These different approaches usually attempt to make the spider operate faster, allow
users to search more efficiently, or both. For example, some spiders will keep track of
the words in the title, sub-headings and links, along with the 100 most frequently used
words on the page and each word in the first 20 lines of text. Lycos is said to use this
approach to spidering the Web.
Other systems, such as AltaVista, go in the other direction, indexing every single
word on a page, including "a," "an," "the" and other "insignificant" words. The push
to completeness in this approach is matched by other systems in the attention given to
the unseen portion of the Web page, the meta tags.
· After running the Web Miner, the result will be seen as follows.
Conclusion: In this practical, the Net Tool Spider was studied and a website was mined up to two levels.