Rapid Miner is an open-source data mining software tool. It provides functionality for data loading, preprocessing, transformation, data mining, modeling, evaluation, and deployment. Rapid Miner uses learning schemes and attribute evaluators from Weka and statistical modeling schemes from R. It can be used for tasks such as text mining, feature engineering, and distributed data mining. Rapid Miner includes a graphical user interface for designing analytical workflows from operators, and it can also be used as an API or invoked from the command line.
Distributed Database Practicals
DISTRIBUTED DATABASE SYSTEM
CSIT Dept., SGBAU Amravati
Practical No: 1
Aim: Study of Rapid Miner Tools.
Tool: Rapid Miner 7.6.001
Theory:
Rapid Miner provides data mining and machine learning procedures including: data loading and transformation (extract, transform, load, a.k.a. ETL), data preprocessing and visualization, modelling, evaluation, and deployment. Rapid Miner is written in the Java programming language. It uses learning schemes and attribute evaluators from the Weka machine learning environment and statistical modelling schemes from the R Project.
Rapid Miner can define analytical steps (similar to R) and be used for analyzing data generated by high-throughput instruments such as those used in genotyping, proteomics, and mass spectrometry. It can be used for text mining, multimedia mining, feature engineering, data stream mining, development of ensemble methods, and distributed data mining. Rapid Miner functionality can be extended with additional plugins.
Rapid Miner provides a GUI to design an analytical pipeline (the "operator tree"). The GUI generates an XML (Extensible Markup Language) file that defines the analytical processes the user wishes to apply to the data. Alternatively, the engine can be called from other programs or used as an API, and individual functions can be called from the command line.
Rapid Miner is open-source and is offered free of charge as a Community
Edition released under the GNU AGPL.
Rapid Miner is a leading open-source data mining solution worldwide, owing to the combination of its leading-edge technology and its functional range, and its applications cover a wide range of real-world data mining tasks. Using Rapid Miner one can explore data, simplify the construction of analysis processes, and evaluate different approaches, trying to find the best combination of preprocessing and learning steps manually or letting Rapid Miner do so automatically.
More than 400 data mining operators can be used and almost arbitrarily combined. The setup is described by XML files which can easily be created with a graphical user interface (GUI). This XML-based scripting language turns Rapid Miner into an integrated development environment (IDE) for machine learning and data mining. Rapid Miner follows the concept of rapid prototyping, leading very quickly to the desired results. Furthermore, Rapid Miner can be used as a Java data mining library.
The development of most of the Rapid Miner concepts started in 2001 at the Artificial Intelligence Unit of the University of Dortmund. Several members of the unit started to implement and realize these concepts, which led to a first version of Rapid Miner in 2002. Since 2004, the open-source version of Rapid Miner (GPL) has been hosted on SourceForge. Since then, a large number of suggestions and extensions by external developers have also been incorporated into Rapid Miner. Today, both the open-source version and a closed-source version of Rapid Miner are maintained by Rapid-I. Although Rapid Miner is completely free and open-source, it offers many methods and possibilities not covered by other data mining suites, whether open-source or proprietary.
Features of Rapid Miner:
• Freely available open-source knowledge discovery environment
• 100% pure Java (runs on every major platform and operating system)
• KD processes are modelled as simple operator trees, which is both intuitive and powerful
• Operator trees or subtrees can be saved as building blocks for later reuse
• Internal XML representation ensures a standardized interchange format for data mining experiments
• Simple scripting language allowing for automatic large-scale experiments
• Multi-layered data view concept ensures efficient and transparent data handling
• Flexibility in using Rapid Miner:
• Graphical user interface (GUI) for interactive prototyping
• Command line mode (batch mode) for automated large-scale applications
• Java API (application programming interface) to ease usage of Rapid Miner from your own programs
• Simple plug-in and extension mechanisms; a broad variety of plugins already exists and you can easily add your own
• Powerful plotting facility offering a large set of sophisticated high-dimensional visualization techniques for data and models
• More than 400 machine learning, evaluation, input and output, pre- and post-processing, and visualization operators, plus numerous meta-optimization schemes
• Rapid Miner has been successfully applied to a wide range of applications where its rapid prototyping abilities demonstrated their usefulness, including text mining, multimedia mining, feature engineering, data stream mining and tracking drifting concepts, development of ensemble methods, and distributed data mining.
Homepage:
Fig 1.1 Operator and Parameter in Rapid Miner
Fig 1.2 Repository Windows
Conclusion: Hence, we have studied the Rapid Miner Tool.
Practical No. 2
Aim:- Demonstration of pre-processing on a given data set using Rapid Miner.
Tool:- Rapid Miner 5.3.000
Theory:-
Data pre-processing describes any type of processing performed on raw data to prepare it for another processing procedure. Commonly used as a preliminary data mining practice, data pre-processing transforms the data into a format that will be more easily and effectively processed for the user's purpose, for example as input to a neural network. There are a number of different tools and methods used for pre-processing, including: sampling, which selects a representative subset from a large population of data; transformation, which manipulates raw data to produce a single input; denoising, which removes noise from data; normalization, which organizes data for more efficient access; and feature extraction, which pulls out specified data that is significant in some particular context.
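The steps named above can also be sketched outside Rapid Miner. The following minimal Python example (illustrative only; the table and column names are hypothetical, and pandas/scikit-learn are assumed to be installed) shows sampling, imputation, transformation, and normalization on a tiny data set:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw data with a missing value in each column.
df = pd.DataFrame({
    "age": [25, 32, None, 47, 51, 38],
    "income": [30000, 54000, 41000, None, 98000, 62000],
})

sample = df.sample(frac=0.5, random_state=42)  # sampling: a representative subset
df = df.fillna(df.mean(numeric_only=True))     # cleaning: impute missing values
df["log_income"] = np.log(df["income"])        # transformation: derive a single input
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])  # normalization
print(df)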
Procedure:-
1. Create a .arff file.
2. Go to Repository → Import CSV File → Data Import Wizard → follow the steps → name the data repository.
3. Study the graphical statistical output of the example.
Fig 2.2 Text view
Fig 2.3 Decision Tree view
Conclusion:- Thus, we have learned pre-processing on a given data set using Rapid Miner.
Practical No. 3
Aim:-Demonstration of DBSCAN clustering algorithm using Rapid Miner.
Tool:- Rapid Miner 5.3.000
Theory:-
DBSCAN's definition of a cluster is based on the notion of density reachability. Basically, a point q is directly density-reachable from a point p if it is not farther away than a given distance epsilon (i.e. it is part of p's epsilon-neighborhood) and if p is surrounded by sufficiently many points that one may consider p and q to be part of a cluster. q is called density-reachable (note the distinction from "directly density-reachable") from p if there is a sequence p(1), …, p(n) of points with p(1) = p and p(n) = q where each p(i+1) is directly density-reachable from p(i).
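As a cross-check outside Rapid Miner, the same algorithm can be run in a few lines of Python with scikit-learn. This is only an illustrative sketch; the data set and the eps/min_samples values are arbitrary assumptions:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape that density-based clustering handles well.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps is the epsilon-neighborhood radius; min_samples is the density threshold.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; -1 marks noise points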
Procedure:-
1. Go to Operators → Modeling → Clustering and Segmentation.
2. Drag & drop the DBSCAN operator onto the main process.
3. Drag & drop the selected data set (DB) onto the main process.
4. Connect the respective nodes.
5. Run the program.
Output:-
Fig 3.1 DBSCAN clustering algorithm
Fig 3.2 Text View of DBSCAN
Fig 3.3 Graph View of DBSCAN
Conclusion:- Thus, we have learned DBSCAN clustering algorithm using Rapid
Miner.
Practical No. 4
Aim:- Demonstration of decision tree using Rapid Miner.
Tool:- Rapid Miner 5.3.000
Theory:-
Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.
A decision tree for the concept buys computer indicates whether a customer at All Electronics is likely to purchase a computer. Each leaf node represents a class (either buys_computer = yes or buys_computer = no).
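For intuition, the same kind of tree can be induced in a few lines of Python with scikit-learn. This is only an illustrative sketch; the tiny one-hot-encoded buys_computer table below is hypothetical:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training tuples: attribute tests on the internal nodes,
# class labels (yes/no) on the leaves.
data = pd.DataFrame({
    "age_youth":   [1, 1, 0, 0, 0, 1],
    "income_high": [1, 1, 1, 0, 0, 0],
    "student":     [0, 0, 0, 1, 1, 1],
    "buys":        ["no", "no", "yes", "yes", "yes", "yes"],
})

tree = DecisionTreeClassifier(criterion="entropy").fit(data.drop(columns="buys"), data["buys"])
print(export_text(tree, feature_names=["age_youth", "income_high", "student"]))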
Procedure:-
1. Go to Repository → Samples → data → Golf.
2. Drag & drop the selected data set onto the main process.
3. Go to Operators → Modeling → Classification and Regression → Tree Induction → Decision Tree, then drag and drop it onto the main process.
4. Connect the respective nodes.
5. Run the program.
6. Study the diagrammatic representation of the classification output for the given example.
Fig 4.2 Text View of Decision Tree
Fig 4.3 Tree View of Decision Tree
Conclusion:-Thus, we have learned decision trees using Rapid Miner.
Practical No. 5
Aim:- Demonstration of Naïve Bayes classification on a given data set using Rapid Miner.
Tool :- Rapid Miner 5.3.000
Theory:-
A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be 'independent feature model'. In simple terms, a Naive Bayes classifier assumes that the presence (or absence) of a particular feature (i.e. attribute) of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a Naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.
The advantage of the Naive Bayes classifier is that it only requires a small amount of training data to estimate the means and variances of the variables necessary for classification. Because the variables are assumed independent, only the variances of the variables for each label need to be determined, and not the entire covariance matrix.
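The estimation of per-class means and variances described above is exactly what a Gaussian Naive Bayes implementation does. A minimal hedged sketch in Python with scikit-learn (illustrative only; the iris data set stands in for the practical's data):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB().fit(X_train, y_train)  # estimates per-class means and variances
print("accuracy:", model.score(X_test, y_test))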
Procedure:-
1. Go to Repository → Samples → data → Weighting.
2. Drag & drop the selected data set onto the main process.
3. Go to Operators → Modeling → Classification and Regression → Bayesian Modelling → Naive Bayes, then drag and drop it onto the main process.
4. Connect the respective nodes.
5. Run the program.
6. Study the diagrammatic representation of the classification output for the given example.
Fig 5.2 Text View of Naïve Bayes classification
Fig 5.3 Plot View of Naïve Bayes classification
Conclusion:-
Thus, we have learned Naïve Bayes classification using Rapid Miner.
Practical No. 6
Aim:- Demonstration of k-means clustering algorithm using Rapid Miner.
Tool:- Rapid Miner 7.6.001
Theory:-
This operator performs clustering using the k-means algorithm. Clustering is concerned with grouping together objects that are similar to each other and dissimilar to the objects belonging to other clusters; it is a technique for extracting information from unlabeled data. k-means is an exclusive clustering algorithm, i.e. each object is assigned to precisely one of a set of clusters. Objects in one cluster are similar to each other, and the similarity between objects is based on a measure of the distance between them.
Clustering can be very useful in many different scenarios; e.g. in a marketing application we may be interested in finding clusters of customers with similar buying behavior.
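A minimal Python sketch of the same idea with scikit-learn (illustrative only; the synthetic blobs below are an assumption standing in for, say, customer data):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three synthetic, well-separated groups stand in for e.g. customer segments.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # one centroid per cluster
print(km.labels_[:10])      # each object is assigned to exactly one cluster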
Procedure:-
1. Go to Repository → Samples → data and choose a data set (DB).
2. Drag & drop the selected data set onto the main process.
3. Go to Operators → Modeling → Clustering and Segmentation → k-Means, then drag and drop it onto the main process.
4. Connect the respective nodes.
5. Run the program.
6. Study the diagrammatic representation of the clustering output for the given example.
Output :-
Fig 6.1 k-means clustering algorithm
Fig 6.2 Text View of k-means clustering algorithm
Fig 6.3 Graph View of k-means clustering algorithm
Fig 6.4 Centroid Plot View of k-means clustering algorithm
Conclusion:-Thus, we have learned k-means clustering algorithm using Rapid Miner.
Practical No. 7
Aim:- Demonstration of Market Basket analysis using Association rule mining in
Rapid Miner.
Tool:- Rapid Miner 7.6.001
Theory:-
These models build upon the association rule mining framework, but provide additional analytic capabilities beyond simple associations. The first model allows mining a transactional database for negative patterns, represented as dissociation item sets and dissociation rules. The second model, substitutive item sets, filters items and item sets that can be used interchangeably as substitutes, i.e., item sets that appear in the transactional database in very similar contexts. Finally, the third model, recommendation rules, uses an additional item set interestingness measure, namely coverage, to construct a set of recommended items using a greedy search procedure.
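For the classic (positive) half of this framework, frequent item sets and association rules can be mined in Python; a minimal hedged sketch using the third-party mlxtend library (the toy baskets and thresholds are assumptions):

import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

# One row per transaction, one boolean column per item (hypothetical baskets).
baskets = pd.DataFrame(
    [[True, True, False, True],
     [True, True, True, False],
     [False, True, True, False],
     [True, True, True, False]],
    columns=["milk", "bread", "butter", "jam"],
)

frequent = fpgrowth(baskets, min_support=0.5, use_colnames=True)  # FP-Growth, as in the Rapid Miner flow
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])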
Procedure:-
1. Go to Repository → Samples → data → Weighting.
2. Drag & drop the selected data set onto the main process.
3. Select the operators: Modeling → Generate Data → Discretize → Nominal to Binominal → FP-Growth → Multiply → Create Dissociation Rules → Generate Current Selection Context → Multiply → Create Substitutive Sets → Create Recommendation Sets → Multiply.
4. Drag and drop them onto the main process.
5. Connect the respective nodes.
6. Run the program.
7. Study the diagrammatic representation of the output for the given example.
Output:-
Fig 7.1 Market Basket analysis using Association rule
Fig 7.2 Data Table Of Market Basket analysis using Association rule
Conclusion:-
Thus, we have learned Market Basket analysis using Association rule mining
in Rapid Miner.
Practical No: 8
Aim: Study of KNIME Analytical Platform.
Tool: KNIME 3.4.1
Theory:
KNIME:
KNIME, the Konstanz Information Miner, is an open-source data analytics, reporting and integration platform. It has been used in pharmaceutical research, but is also used in other areas like CRM customer data analysis, business intelligence and financial data analysis. It is based on the Eclipse platform and, through its modular API, is easily extensible. Custom nodes and types can be implemented in KNIME within hours, thus extending KNIME to comprehend and provide first-tier support for highly domain-specific data formats.
1) Technical Specification:
Released in 2004.
Latest version available is KNIME 2.9.
Licensed under the GNU General Public License.
Compatible with Linux, OS X, Windows.
Written in Java.
www.knime.org
2) General Features:
KNIME, pronounced "naim", is a nicely designed data mining tool that runs inside the Eclipse development environment.
It is a modular data exploration platform that enables the user to visually create data flows (often referred to as pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models.
The KNIME base version already incorporates over 100 processing nodes for data I/O, pre-processing and cleansing, modelling, analysis and data mining, as well as various interactive views such as scatter plots, parallel coordinates and others.
3) Specification:
Integration of the Chemistry Development Kit, with additional nodes for the processing of chemical structures, compounds, etc.
Specialized for enterprise reporting, business intelligence and data mining.
Advantages:
It integrates all analysis modules of the well-known Weka data mining environment, and additional plugins allow R scripts to be run, offering access to a vast library of statistical routines.
It is easy to try out because it requires no installation besides downloading and unarchiving.
The one aspect of KNIME that truly sets it apart from other data mining packages is its ability to interface with programs that allow for the visualization and analysis of molecular data.
Limitations:
It has only limited error measurement methods.
It has no wrapper method for descriptor selection.
It has no automatic facility for parameter optimization of machine learning/statistical methods.
Homepage:
Fig: Main Window of KNIME
Conclusion: Hence, we have studied the KNIME Analytical Platform.
Practical No. 9
Aim:- Demonstration of pre-processing on a given data set using the KNIME analytical platform.
Tool:-KNIME 3.4.1
Theory:-
Data pre-processing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: −100), impossible data combinations, missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of data come first and foremost before running an analysis.
If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data pre-processing includes cleaning, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set. Kotsiantis et al. present a well-known algorithm for each step of data pre-processing.
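The out-of-range and missing-value problems mentioned above can be illustrated in a few lines of Python; a minimal hedged sketch (the income column is hypothetical):

import pandas as pd

# Hypothetical loosely controlled data: an impossible value and a gap.
df = pd.DataFrame({"income": [52000, -100, 61000, None, 48000]})

df.loc[df["income"] < 0, "income"] = None                  # treat out-of-range values (e.g. Income: -100) as missing
df["income"] = df["income"].fillna(df["income"].median())  # impute missing values
print(df)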
Procedure:-
1. Go to File → Import KNIME Workflow → browse for the .rar file → select the file → click Next.
2. See your project in the KNIME Explorer.
3. Run the project.
Output:-
Figure 9.1 Pre-processing in KNIME
Conclusion:-
Thus, we have learned pre-processing on a given data set using the KNIME analytical platform.
Practical No. 10
Aim:- Demonstration of decision tree learning and predicting using the KNIME analytical platform.
Tool:-KNIME 3.4.1
Theory:-
Decision tree learning uses a decision tree as a predictive model which maps
observations about an item (represented in the branches) to conclusions about the
item's target value (represented in the leaves). It is one of the predictive modelling
approaches used in statistics, data mining and machine learning. Tree models where
the target variable can take a finite set of values are called classification trees; in these
tree structures, leaves represent class labels and branches represent conjunctions of
features that lead to those class labels. Decision trees where the target variable can
take continuous values (typically real numbers) are called regression trees.
In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data (but the resulting classification tree can be an input for decision making). This practical deals with decision trees in data mining.
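The classification-tree/regression-tree distinction above can be made concrete in Python; a brief hedged sketch with scikit-learn (illustrative only, separate from the KNIME workflow):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy samples of y = sin(x): a continuous target, so a regression tree applies.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

reg = DecisionTreeRegressor(max_depth=3).fit(X, y)  # leaves hold real-valued predictions
print(reg.predict([[1.5], [4.5]]))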
Procedure:-
1. Go to File → Import KNIME Workflow → browse for the .rar file → select the file → click Next.
2. See your project in the KNIME Explorer.
3. Run the project.
Output:-
Figure 10.1 Decision tree learning and predicting using the KNIME analytical platform
Conclusion:-
Thus, we have learned decision tree learning and predicting using the KNIME analytical platform.
Practical No. 11
Aim:- Demonstration of the k-means clustering algorithm using the KNIME analytical platform.
Tool:-KNIME 3.4.1
Theory:-
K-Means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean. This method produces exactly k different clusters of the greatest possible distinction. The best number of clusters k, leading to the greatest separation (distance), is not known a priori and must be computed from the data. The objective of K-Means clustering is to minimize total intra-cluster variance.
Algorithm:-
1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign objects to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid, or mean, of all objects in each cluster.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds (a from-scratch sketch of these steps follows the next paragraph).
K-Means is a relatively efficient method. However, we need to specify the number of clusters in advance, and the final result is sensitive to initialization and often terminates at a local optimum. Unfortunately, there is no global theoretical method to find the optimal number of clusters. A practical approach is to compare the outcomes of multiple runs with different k and choose the best one based on a predefined criterion. In general, a large k probably decreases the error but increases the risk of overfitting.
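A minimal from-scratch Python sketch of steps 2–5 above (illustrative only; no empty-cluster handling, and the synthetic data is an assumption):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # step 2: k random points as centers
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                  # step 3: assign to closest center (Euclidean)
        centers_new = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # step 4: recompute centroids
        if np.allclose(centers_new, centers):          # step 5: stop once assignments stabilize
            break
        centers = centers_new
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, (50, 2)) for m in (0, 3, 6)])
labels, centers = kmeans(X, k=3)
print(centers)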
Procedure:-
1. Go to File → Import KNIME Workflow → browse for the .rar file → select the file → click Next.
2. See your project in the KNIME Explorer.
3. Run the project.
Output:-
Figure 11.1 k-means clustering algorithm using the KNIME analytical platform
Figure 11.2 Data table of the k-means clustering algorithm using the KNIME analytical platform
Conclusion:- Thus, we have learned the k-means clustering algorithm using the KNIME analytical platform.
Practical No. 12
Aim:- Study of the Orange mining tool.
Tool:- Orange 3.5
Theory:- Orange is an open-source data visualization, machine learning and data
mining toolkit. It features a visual programming front-end for explorative data
analysis and interactive data visualization, and can also be used as a Python library.
Orange is a component-based visual programming software package for data
visualization, machine learning, data mining and data analysis.
Orange components are called widgets and they range from simple data visualization,
subset selection and pre-processing, to empirical evaluation of learning algorithms
and predictive modeling.
Visual programming is implemented through an interface in which workflows are created by linking predefined or user-designed widgets, while advanced users can use Orange as a Python library. Orange is an open-source software package released under the GPL. Versions up to 3.0 include core components in C++ with wrappers in Python and are available on GitHub. From version 3.0 onwards, Orange uses common Python open-source libraries for scientific computing, such as numpy, scipy and scikit-learn, while its graphical user interface operates within the cross-platform Qt framework. Orange 3 has a separate GitHub repository.
The default installation includes a number of machine learning, pre-processing and data visualization algorithms in 6 widget sets (data, visualize, classify, regression, evaluate and unsupervised). Additional functionalities are available as add-ons (bioinformatics, data fusion and text mining).
Orange is supported on macOS, Windows and Linux and can also be installed from the Python Package Index repository (pip install Orange). As of 2016, the stable version is 3.3 and runs with Python 3, while the legacy version 2.7, which runs with Python 2.7, is still available for data manipulation and widget alteration.
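Because Orange is also a Python library, the same ideas can be scripted directly; a minimal hedged sketch (Orange 3 API of roughly this era; treat the exact call forms as assumptions if your version differs):

import Orange

data = Orange.data.Table("iris")               # a built-in sample data set
learner = Orange.classification.TreeLearner()  # a classification tree learner

# Cross-validate the learner, then report classification accuracy.
results = Orange.evaluation.CrossValidation(data, [learner], k=10)
print(Orange.evaluation.CA(results))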
Features
Orange consists of a canvas interface onto which the user places widgets and creates a
data analysis workflow. Widgets offer basic functionalities such as reading the data,
showing a data table, selecting features, training predictors, comparing learning
algorithms, visualizing data elements, etc. The user can interactively explore
visualizations or feed the selected subset into other widgets.
Fig: Classification Tree widget in Orange 3.0
Canvas: graphical front-end for data analysis
Widgets:
o Data: widgets for data input, data filtering, sampling, imputation, feature manipulation and feature selection
o Visualize: widgets for common visualization (box plot, histograms, scatter plot) and multivariate visualization (mosaic display, sieve diagram)
o Classify: a set of supervised machine learning algorithms for classification
o Regression: a set of supervised machine learning algorithms for regression
o Evaluate: cross-validation, sampling-based procedures, reliability estimation and scoring of prediction methods
o Unsupervised: unsupervised learning algorithms for clustering (k-means, hierarchical clustering) and data projection techniques (multidimensional scaling, principal component analysis, correspondence analysis)
Add-ons:
Associate: widgets for mining frequent item sets and association rule learning
Bioinformatics: widgets for gene set analysis, enrichment, and access to pathway libraries
Data fusion: widgets for fusing different data sets, collective matrix factorization, and exploration of latent factors
Educational: widgets for teaching machine learning concepts, such as k-means clustering, polynomial regression, stochastic gradient descent, ...
Geo: widgets for working with geospatial data
Image analytics: widgets for working with images and ImageNet embeddings
Network: widgets for graph and network analysis
Text mining: widgets for natural language processing and text mining
Time series: widgets for time series analysis and modeling
Fig 12.1 Main Page of Orange mining tool
Conclusion:-Thus, we have studied Orange mining tool.
Practical No: 13
Aim: Demonstration of Data visualization using Orange.
Tool: Orange 3.5
Theory:
A linear projection method for explorative data analysis.
Signals
Inputs:
Data: an input data set
Data Subset: a subset of data instances
Outputs:
Selected Data: a data subset that the user has manually selected in the projection
Steps to demonstrate Linear Projection using Orange
Choose which axes are displayed in the projection and which other axes are available.
Set the color of the displayed dots (you will get colored dots for discrete values and grey-scale dots for continuous ones). Set opacity, shape and size to differentiate between instances.
Set jittering to prevent the dots from overlapping (especially for discrete attributes).
Use the select, zoom, pan and zoom-to-fit options to explore the graph. Manual selection of data instances works as a non-angular/free-hand selection tool.
Double-click to move the projection; scroll to zoom in or out.
When the box is ticked (Auto commit is on), the widget will communicate the changes automatically. Alternatively, click Commit.
Save Image saves the created image to your computer in .svg or .png format.
Output:
Fig 13.1 Data visualization using Orange
Fig 13.2 Scatter Plot of Data visualization using Orange
Fig 13.3 Classification tree of Data visualization using Orange
Conclusion:
Hence, Data visualization using Orange has been demonstrated.
Practical No: 14
Aim: Demonstration of classification using Orange Mining Tool.
Tool: Orange 3.5
Theory:
This practical reuses the Linear Projection widget; its signals and steps are described under Practical 13.
Output:
Fig 14.1 classification using Orange Mining Tool
Fig 14.2 Point Data View of classification using Orange Mining Tool
Conclusion:
Hence, we have demonstrated classification using Orange Mining Tool.
Practical No: 15
Aim: Demonstration of text mining using Orange.
Tool: Orange 3.5
Theory:
This practical reuses the Linear Projection widget; its signals and steps are described under Practical 13.
Fig 15.1 Text mining scenario
Fig 15.3 Query Window of Wikipedia
Output:
Fig 15.2 Data Table of text mining using Orange
Conclusion:
Hence, we have demonstrated text mining using Orange Mining Tool.
Practical No: 16
Aim: Demonstration of Linear Projection using Orange Mining Tool.
Tool: Orange 3.5
Theory:
A linear projection method for explorative data analysis.
Signals
Inputs:
Data: an input data set
Data Subset: a subset of data instances
Outputs:
Selected Data: a data subset that the user has manually selected in the projection
Steps to demonstrate Linear Projection using Orange
Choose which axes are displayed in the projection and which other axes are available.
Set the color of the displayed dots (you will get colored dots for discrete values and grey-scale dots for continuous ones). Set opacity, shape and size to differentiate between instances.
Set jittering to prevent the dots from overlapping (especially for discrete attributes).
Use the select, zoom, pan and zoom-to-fit options to explore the graph. Manual selection of data instances works as a non-angular/free-hand selection tool.
Double-click to move the projection; scroll to zoom in or out.
When the box is ticked (Auto commit is on), the widget will communicate the changes automatically. Alternatively, click Commit.
Save Image saves the created image to your computer in .svg or .png format.
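Outside the Orange GUI, a comparable two-dimensional linear projection can be computed in Python; a minimal hedged sketch using PCA from scikit-learn (illustrative only, not the widget's exact projection method):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
proj = PCA(n_components=2).fit_transform(X)  # project the 4-D measurements onto 2 linear axes
print(proj[:5])
print(y[:5])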
Output:
Fig 16.1 Main window of Linear Projection using Orange Mining Tool
Fig 16.2 Paint Data view of Linear Projection using Orange Mining Tool
Fig 16.3 Linear Projection using Orange Mining Tool
Fig 16.4 Rank Table of Linear Projection using Orange Mining Tool
Conclusion: Hence Linear Projection using Orange Mining Tool has been
demonstrated.
Practical No: 17
Aim: Study of Net Tool Spider.
Tool: Net Tool Spider
Theory:
Net Tool Spider:
A web spider is a software program that searches the Internet for information. The basic process of a web spider is to download a web page and to search that page for links to other web pages. It then repeats this behavior for each new page that it finds. By repeating this process, a web spider can find all of the pages within a web site, and ultimately all of the pages on the Internet.
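That download-extract-repeat loop is small enough to sketch in Python using only the standard library. This is an illustrative toy crawler (no robots.txt handling, politeness delays, or depth limits, all of which a real spider needs):

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    # Collect the href targets of all <a> tags on a page.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    seen, frontier = set(), [start_url]
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)  # breadth-first: visit the oldest discovered page next
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that fail to download
        parser = LinkParser()
        parser.feed(html)
        frontier.extend(urljoin(url, link) for link in parser.links)
    return seen

print(crawl("https://example.com"))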
Web spiders can have many purposes. The most common spiders are used by search
engines like Google, Yahoo and AltaVista. Their web spiders search the Internet for
web pages and then create indexes of all the words found on the pages. This allows us
to search the Internet quickly and easily.
Net Tools Spider is a multi-functional web spider that supports:
• Web Site Downloading: spidering web sites and saving the web pages and files that it finds to your hard drive.
• Web Mining: spidering web sites and extracting pieces of information to be used for other purposes.
• Link Checking: spidering web sites and searching for broken links.
• Web Site Searching: spidering web sites and searching for files that contain certain keywords.
Today, most Internet users limit their searches to the Web, so we'll limit this article to
search engines that focus on the contents of Web pages.
Before a search engine can tell you where a file or document is, it must be found. To
find information on the hundreds of millions of Web pages that exist, a search engine
employs special software robots, called spiders, to build lists of the words found on
Web sites. When a spider is building its lists, the process is called Web crawling.
(There are some disadvantages to calling part of the Internet the World Wide Web -- a
large set of arachnid-centric names for tools is one of them.) In order to build and
maintain a useful list of words, a search engine's spiders have to look at a lot of pages.
How does any spider start its travels over the Web? The usual starting points are lists
of heavily used servers and very popular pages. The spider will begin with a popular
site, indexing the words on its pages and following every link found within the site. In
this way, the spidering system quickly begins to travel, spreading out across the most
widely used portions of the Web.
Google began as an academic search engine. In the paper that describes how the
system was built, Sergey Brin and Lawrence Page give an example of how quickly
their spiders can work. They built their initial system to use multiple spiders, usually
three at one time. Each spider could keep about 300 connections to Web pages open at
a time. At its peak performance, using four spiders, their system could crawl over 100
pages per second, generating around 600 kilobytes of data each second.
Keeping everything running quickly meant building a system to feed necessary
information to the spiders. The early Google system had a server dedicated to
providing URLs to the spiders. Rather than depending on an Internet service provider
for the domain name server (DNS) that translates a server's name into an address,
Google had its own DNS, in order to keep delays to a minimum.
When the Google spider looked at an HTML page, it took note of two things:
· The words within the page
· Where the words were found
Words occurring in the title, subtitles, meta tags and other positions of relative
importance were noted for special consideration during a subsequent user search. The
Google spider was built to index every significant word on a page, leaving out the
articles "a," "an" and "the." Other spiders take different approaches.
These different approaches usually attempt to make the spider operate faster, allow
users to search more efficiently, or both. For example, some spiders will keep track of
the words in the title, sub-headings and links, along with the 100 most frequently used
words on the page and each word in the first 20 lines of text. Lycos is said to use this
approach to spidering the Web.
Other systems, such as AltaVista, go in the other direction, indexing every single
word on a page, including "a," "an," "the" and other "insignificant" words. The push
to completeness in this approach is matched by other systems in the attention given to
the unseen portion of the Web page, the meta tags.
· After running the Web Miner, the result will be seen as follows.
Conclusion: In this practical, the Net Tool Spider was studied and a website was mined up to two levels.