SRI GURU GRANTH SAHIB WORLD UNIVERSITY
FATEHGARH SAHIB – 140406
JULY, 2014
Guided By: Presented By:
Ms. Shruti Aggarwal Prabhjit Singh Sekhon
Assistant Professor Roll No: 12012238
(CSE Department) M.Tech (CSE)
 Introduction
 Literature Review
 Problem Statement
 Results
 Conclusion
 Future Work
 References
Data mining is the computational process of discovering patterns in large data sets. The
overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.
Functions of data mining can be classified as:
 Class Description: Describes individual classes and concepts.
 Association Analysis: The discovery of association rules that define attribute-value conditions.
 Classification: The process of finding a set of models that describe and distinguish data classes or concepts using class labels.
 Cluster Analysis: Objects are clustered or grouped on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
 Outlier Analysis: Outliers are data objects that do not comply with the general behavior or model of the data.
 Evolution Analysis: Describes and models trends for objects whose behavior changes over time.
 Pre-processing: Data is collected, irrelevant data items are removed, and the data is converted into an appropriate form.
 Processing: Algorithms are applied to the data to develop and evaluate models; the models are then implemented on the data.
 Post-processing: The results of the implemented models are presented; if irrelevant data remains, the earlier steps are repeated.
There are various types of mining:
 Text Mining: Deriving high-quality information from large bodies of text.
 Visual Mining: Discovering patterns hidden in visual data that has been converted from analog to digital form.
 Audio Mining: Analyzing and searching the content of audio signals.
 Spatial Mining: Finding patterns in data with respect to geography.
 Web Mining: Discovering patterns from the web.
 Video Mining: Automatically extracting relevant video content and structure, such as moving objects, spatial correlations of features, object activity, video events, and video structure patterns.
Advantages
 Predicts future trends and customer purchase habits.
 Helps with decision making.
 Improves company revenue and lowers costs.
 Supports market basket analysis.
 Supports fraud detection.
Disadvantages
 User privacy/security concerns.
 High cost at the implementation stage.
 Possible misuse of information.
 Possible inaccuracy of data.
Web mining is the use of data mining techniques to automatically discover and
extract information from Web documents and services.
There are three general classes of information that can be discovered by web mining:
 Web Activity, from server logs and Web browser activity tracking.
 Web Graph, from links between pages, people, and other data.
 Web Content, the data found on Web pages and inside documents.
Web Content Mining is the process of extracting useful information from the
contents of web documents. There are two approaches to content mining: the
agent-based approach and the database approach.
 Agent-Based Approach: Concentrates on searching for relevant information using the characteristics of a particular domain to interpret and organize the collected information.
 Database Approach: Used for retrieving semi-structured data from the web.
Web structure mining is the process of using graph theory to analyze the node and
connection structure of a web site. According to the type of web structural data, web
structure mining can be divided into two kinds:
 Extracting patterns from hyperlinks in the web: A hyperlink is a structural component that connects a web page to a different location.
 Mining the document structure: Analysis of the tree-like structure of individual pages.
Web usage mining is the process of extracting useful information from server logs,
i.e., finding out what users are looking for on the Internet. Some users might be
looking only at textual data, whereas others might be interested in multimedia data.
 A Search Engine Spider is a program that most search engines use to find what’s
new on the Internet.
 The program starts at a website and follows every hyperlink on each page, so everything on the web will eventually be found and spidered as the so-called "spider" crawls from one website to another.
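A minimal sketch of this crawl loop, assuming the third-party requests and beautifulsoup4 packages are available; the seed URL and page limit are illustrative, not part of the original work:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def spider(seed_url, max_pages=50):
    """Breadth-first crawl: start at a seed page, follow every hyperlink."""
    frontier = deque([seed_url])   # URLs waiting to be visited
    visited = set()                # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            page = requests.get(url, timeout=5)
        except requests.RequestException:
            continue               # skip unreachable pages
        visited.add(url)
        soup = BeautifulSoup(page.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])   # resolve relative links
            if link.startswith("http") and link not in visited:
                frontier.append(link)
    return visited
```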
Examples of crawlers:
 GNU Wget: A command-line crawler written in C and released under the GPL. It is typically used to mirror Web and FTP sites.
 DataparkSearch: A crawler and search engine released under the GNU General Public License.
 GRUB: An open-source distributed search crawler that Wikia Search used to crawl the web.
 HTTrack: Uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
 FAST Crawler is a distributed crawler used by Fast Search & Transfer.
 Googlebot is used by the Google search engine and is based on C++ and Python.
 Yahoo! Slurp was the name of the Yahoo! Search crawler until Yahoo! contracted with Microsoft to use Bingbot instead.
 Information on the Web changes over time.
 Knowledge of this ‘rate of change’ is crucial, as it allows us to
estimate how often a search engine should visit each portion of
the Web in order to maintain a fresh index.
 Recognizing the relevance or importance of sites.
 Define a scoring function for relevance,

      s_θ(u) = ξ,

where θ is the set of parameters, u is the URL, and ξ is the relevance score.
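As a toy illustration of such a function, θ can be as simple as a bag of topic keywords, with s_θ(u) returning the fraction of those keywords found in the URL string (the keyword set below is hypothetical):

```python
def score(url, theta):
    """Toy relevance score s_theta(u): fraction of topic keywords in the URL."""
    url_lower = url.lower()
    hits = sum(1 for keyword in theta if keyword in url_lower)
    return hits / len(theta) if theta else 0.0

theta = {"iron", "ore", "mining"}   # hypothetical parameter set
print(score("http://en.wikipedia.org/wiki/iron_ore", theta))  # 0.666...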
 Indexing the web is a challenge due to its growing and dynamic nature.
 A single crawling process, even a multithreaded one, is insufficient for large-scale engines that need to fetch large amounts of data rapidly.
 Distributing the crawling activity splits the load, decreases hardware requirements, and at the same time increases the overall download speed and reliability.
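One way the load can be split, sketched here under the assumption of a simple host-hash partitioning scheme (not a method specified in the deck), is to assign each URL to a crawler process by hashing its host name:

```python
import hashlib
from urllib.parse import urlparse

def assign_process(url, num_processes):
    """Map a URL to one of num_processes crawlers by hashing its host name.

    Hashing on the host (not the full URL) keeps all pages of a site on one
    process, which also makes per-site politeness easier to enforce.
    """
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_processes

print(assign_process("http://en.wikipedia.org/wiki/DNA", 4))  # stable id in 0..3
```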
 Focused web crawling visits unvisited URLs and checks whether they are relevant to the search topic (see the sketch below).
 It avoids irrelevant documents.
 It reduces network traffic.
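A minimal sketch of such a focused crawl loop, assuming score is any relevance predictor (such as the toy one above) and fetch_links is a caller-supplied page fetcher; heapq is a min-heap, so scores are negated to visit the most relevant URL first:

```python
import heapq

def focused_crawl(seeds, score, fetch_links, threshold=0.5, max_pages=100):
    """Visit unvisited URLs in order of predicted relevance to the topic."""
    frontier = [(-score(u), u) for u in seeds]   # max-priority via negation
    heapq.heapify(frontier)
    visited, relevant = set(), []
    while frontier and len(visited) < max_pages:
        priority, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        if -priority < threshold:
            continue                              # skip irrelevant documents
        relevant.append(url)
        for link in fetch_links(url):             # caller supplies the fetcher
            if link not in visited:
                heapq.heappush(frontier, (-score(link), link))
    return relevant
```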
Basically, three steps are involved in the web crawling process:
The search crawler starts by crawling the pages of your site.
Then it continues indexing the words and content of the site.
Finally, it visits the links (web page addresses or URLs) found in
your site.
The behavior of a Web crawler is the outcome of a combination of policies:
 Selection policy: States which pages to download. Given the current size of the Web, even large search engines cover only a portion of the publicly available part.
 Re-visit policy: States when to check for changes to the pages. The most-used cost functions are freshness and age (defined below).
 Politeness policy: States how to avoid overloading sites; a single crawler performing multiple requests per second and/or downloading large files can place a heavy load on a server.
 Parallelization policy: A parallel crawler runs multiple crawling processes in parallel.
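For reference, the freshness and age of a page p at time t are commonly defined (following Cho and Garcia-Molina) as:

```latex
F_p(t) =
\begin{cases}
  1 & \text{if the local copy of } p \text{ is identical to the live page at time } t,\\
  0 & \text{otherwise;}
\end{cases}
\qquad
A_p(t) =
\begin{cases}
  0 & \text{if the local copy of } p \text{ is up to date at time } t,\\
  t - \text{(last modification time of } p) & \text{otherwise.}
\end{cases}
```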
 A new learning-based approach uses the Naïve Bayes classifier as the base prediction model to improve relevance prediction in focused Web crawlers.
 This learning-based focused crawling approach uses four relevance attributes to predict relevance; if dynamic updating of the training dataset is additionally allowed, the prediction accuracy is further boosted [1].
 The intelligent crawling method involves looking for specific features in a page to rank the candidate links.
 These features include page content, the URL names of referred Web pages, and the nature of the parent and sibling pages.
 It is a generic framework in that it allows the user to specify the relevance criteria [2].
 Fish-Search is an early crawler that prioritizes unvisited URLs on a queue for a specific search goal.
 The Fish-Search approach assigns priority values (1 or 0) to candidate pages using simple keyword matching (see the sketch below).
 One disadvantage of Fish-Search is that all relevant pages are assigned the same priority value of 1 based on keyword matching [3].
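A sketch of that binary assignment, assuming the search goal is given as a set of keywords (the example inputs are illustrative):

```python
def fish_priority(page_text, query_keywords):
    """Fish-Search style scoring: 1 if any query keyword matches, else 0."""
    text = page_text.lower()
    return 1 if any(k.lower() in text for k in query_keywords) else 0

print(fish_priority("Iron ore deposits and mining", {"mining"}))  # 1
print(fish_priority("Recipes for dinner", {"mining"}))            # 0
```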
 Shark-Search is a modified version of Fish-Search in which the Vector Space Model (VSM) is used.
 The priority values (more than just 1 and 0) are computed from the priority values of parent pages, the page content, and the anchor text [3].
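The VSM similarity can be sketched as a term-frequency cosine between two texts; a fuller implementation would use TF-IDF weights, but this minimal version shows the idea:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Term-frequency cosine similarity between two documents (0..1)."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("iron ore mining", "mining iron ore prices"))  # ~0.87
```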
http://en.wikipedia.org/wiki/DNA
http://dna.ancestry.com
http://www.familytreedns.com
http://learn.genetics.utah.edu
http://facebook.com
http://www.technologystudent.com/joints/iron2
http://en.wikipedia.org/wiki/iron_ore
http://www.metalprices.com/metal/iron-ore
http://miningartifacts.homestead.com/IronOres.html
http://facebook.com
http://www.worldmarket.com
http://www.marketbarchicago.com
http://finance.yahoo.com
http://www.nasdaq.com
http://facebook.com
www.freecomputerbooks.com
www.computer-books.us
www.onlinecomputerbook.com
www.freetechbook.com
http://facebook.com
 Decision tree
 Neural Network
 Naïve Bayes
 A decision tree is an important tool for machine learning. It is used as a predictive model that maps observations about an item to conclusions about the item's target value.
 Tree structures consist of leaves and branches.
 Leaves represent class labels, and branches represent conjunctions of features that lead to those class labels.
 In decision analysis, a decision tree can be used to explicitly and visually represent decisions and decision making.
 The resulting classification tree can be an input for decision making (see the sketch below).
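A hedged sketch of such a classifier using scikit-learn, assuming each URL has already been reduced to a numeric feature vector; the four features and labels below are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training set: one row per URL, e.g.
# [url_keyword_score, anchor_text_score, parent_page_score, sibling_score]
X_train = [
    [0.9, 0.8, 0.7, 0.6],   # relevant URL
    [0.1, 0.0, 0.2, 0.1],   # irrelevant URL
    [0.8, 0.6, 0.9, 0.7],
    [0.2, 0.1, 0.0, 0.3],
]
y_train = [1, 0, 1, 0]      # 1 = relevant, 0 = irrelevant

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(tree.predict([[0.7, 0.5, 0.8, 0.6]]))  # predicted class for a new URL
```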
 A neural network (NN) is an always-learning model.
 A neuron usually receives many inputs, called the input vector (an analogy of dendrites), and their weighted sum forms the output (an analogy of the synapse).
 The weighted sum of the inputs is passed to an activation function (see the sketch below).
 The brain makes up for the relatively slow rate of operation of a neuron by having a truly staggering number of neurons (nerve cells) with massive interconnections between them.
 It is estimated that there are on the order of 10 billion neurons in the human cortex, and 60 trillion connections.
 The net result is that the brain is an enormously efficient structure.
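The weighted-sum-plus-activation step can be written directly as a single artificial neuron (the weights and inputs below are illustrative):

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs through an activation."""
    z = np.dot(inputs, weights) + bias   # weighted sum of the input vector
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation

x = np.array([0.9, 0.2, 0.7])            # input vector ("dendrites")
w = np.array([0.5, -0.3, 0.8])           # learned weights (illustrative)
print(neuron(x, w, bias=0.1))            # output signal, ~0.74
```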
 The Naïve Bayes algorithm is based on probabilistic learning and classification.
 It assumes that each feature is independent of the others.
 The algorithm has proved efficient compared with many other approaches, although its independence assumption rarely holds in real-world cases.
 It exploits the fact that relevant pages are likely to link to other relevant pages: the relevance of a page a to a topic t, pointed to by a page b, is estimated from the relevance of page b to the topic t (see the sketch below).
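A minimal sketch using scikit-learn's MultinomialNB on bag-of-words features, e.g. from anchor text; the tiny corpus and labels are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical anchor-text snippets with relevance labels.
texts = ["iron ore mining prices", "dna ancestry genetics",
         "iron ore market", "social network login"]
labels = [1, 0, 1, 0]                 # 1 = relevant to "iron ore", 0 = not

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)   # bag-of-words counts

model = MultinomialNB()
model.fit(X, labels)                  # learns P(word | class), words independent

new = vectorizer.transform(["iron ore futures"])
print(model.predict(new))             # [1] -> predicted relevant
```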
 With the advent of the WWW and the increase in the size of the web, searching for relevant information on the web has become a challenging task. The result of any search is millions of documents ranked by a search engine.
 The ranking depends on keyword count, keyword density, link count, and other proprietary algorithms that are specific to each search engine.
 A user does not have an easy way to reduce those millions of documents by defining a context or refining the meaning of the keywords.
 The documents appearing in the search result are ranked according to an algorithm characteristic of each search engine, so the results are in all probability not sorted according to the user's interest.
 Due to the problems mentioned above and the keyword approach used by most search engines, the time and effort required to find the right information grow in proportion to the amount of information on the web. Data mining is a valid solution to this problem.
 Crawlers are important data mining tools that traverse the Internet to retrieve web pages relevant to a topic.
 To understand the working of content block segmentation from previous research.
 To find attributes for seed URLs and their child URLs.
 To classify the URLs according to their weight or score in the weight table.
 To prepare the full training data set and maintain the keyword table.
 To classify the relevancy of new, unseen URLs using Decision Tree Induction, Neural Network, and Naïve Bayes classifiers.
 To find the harvest ratio or precision rate for overall performance evaluation.
The precision rate is the parameter considered to form the results; the performance
of all the algorithms is measured on the basis of this rate.
Precision rate: Estimates the fraction of crawled pages that are relevant. We
depend on multiple classifiers to make this relevance judgement.
Precision Rate = No. of relevant pages / Total downloaded pages
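Computed directly (the counts below are illustrative):

```python
def precision_rate(relevant_pages, total_downloaded):
    """Fraction of crawled pages judged relevant by the classifiers."""
    return relevant_pages / total_downloaded if total_downloaded else 0.0

print(precision_rate(relevant_pages=38, total_downloaded=50))  # 0.76
```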
Hardware and Software Requirements
 Hardware:
Intel Core i3 processor.
3 GB memory.
64-bit operating system.
 Software:
MATLAB 2010a.
 The name MATLAB stands for MATrix LABoratory.
 MATLAB is a tool for numerical computation and visualization. The basic data element is a matrix, so a program that manipulates array-based data is generally fast to write.
 MATLAB is a high-performance language for technical computing. It integrates computation, visualization, and programming in one environment.
 MATLAB is an excellent tool because it has sophisticated data structures, contains built-in editing and debugging tools, and supports object-oriented programming.
 It also has easy-to-use graphics commands that make visualization of results immediately available.
 There are toolboxes (for specific applications) for signal processing, symbolic computation, control theory, simulation, optimization, and several other fields of applied science and engineering.
 Our work was to compare the accuracy of the three prime classifiers. We can conclude that, in terms of classification accuracy, the Neural Network leads Decision Tree Induction and Naïve Bayes by a big margin.
 Moreover, for more complex computation we need to improve the memory of a particular classifier by training it; in this context as well, the NN's dominance over DTI and NB prevails.
The approach can be applied to large databases for relevancy prediction.
Support vector machines could further be used.
[1] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery", Computer Networks, vol. 31, no. 11-16, pp. 1623-1640, 1999.
[2] J. Li, K. Furuse, and K. Yamaguchi, "Focused Crawling by Exploiting Anchor Text Using Decision Tree", in Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, Chiba, Japan, 2005.
[3] J. Rennie and A. McCallum, "Using Reinforcement Learning to Spider the Web Efficiently", in Proceedings of the 16th International Conference on Machine Learning (ICML-99), pp. 335-343, 1999.
  • 58. [1] S. Chakrabarti, M. Berg, and B. Dom, “Focused Crawling: A New Approach for Topic Specific Resource Discovery”, In Journal of Computer and Information Science, vol. 31, no. 11-16, pp. 1623-1640,1999. [2] Li, J., Furuse, K. & Yamaguchi, K., “ Focused Crawling by Exploiting Anchor Text Using Decision Tree”,In Special interest tracks and posters of the 14th international conference on World Wide Web,Chiba, Japan,2005 [3] J. Rennie and A. McCallum, "Using Reinforcement Learning to Spider the Web Efficiently," In proceedings of the 16th International Conference on Machine Learning(ICML-99), pp. 335-343, 1999. 58