SRI GURU GRANTH SAHIB WORLD UNIVERSITY
FATEHGARH SAHIB – 140406
JULY, 2014
Guided By: Presented By:
Ms. Shruti Aggarwal Prabhjit Singh Sekhon
Assistant Professor Roll No: 12012238
(CSE Department) M.Tech (CSE)
 Introduction
 Literature Review
 Problem Statement
 Results
 Conclusion
 Future Work
 References
Data mining is the computational process of discovering patterns in large data sets. The
overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.
Functions of data mining can be classified as:
 Class Description: Describes individual classes and concepts.
 Association Analysis: The discovery of association rules that define attribute-value conditions.
 Classification: The process of finding a set of models that describe and distinguish data classes or concepts using class labels.
 Cluster Analysis: Objects are clustered or grouped on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
 Outlier Analysis: Outliers are data objects that do not comply with the general behavior or model of the data.
 Evolution Analysis: Describes and models trends for objects whose behavior changes over time.
 Pre-processing: Data is collected, irrelevant data items are removed, and the data is converted into an appropriate form.
 Processing: Algorithms are applied to the data to develop and evaluate models; the models are then implemented on the data.
 Post-processing: The results of the implemented models are presented; if irrelevant data remains, the earlier steps are repeated.
There are various types of mining:
 Text Mining: Deriving high-quality information from large bodies of text.
 Visual Mining: Discovering patterns hidden in visual data that has been converted from analog to digital form.
 Audio Mining: Analyzing and searching the content of audio signals.
 Spatial Mining: Finding patterns in data with respect to geography.
 Web Mining: Discovering patterns from the web.
 Video Mining: Automatically extracting relevant video content and structure, such as moving objects, spatial correlations of features, object activity, video events, and video structure patterns.
Advantages
 Predicts future trends and customer purchase habits.
 Helps with decision making.
 Improves company revenue and lowers costs.
 Supports market basket analysis.
 Supports fraud detection.
Disadvantages
 User privacy/security concerns.
 High cost at the implementation stage.
 Possible misuse of information.
 Possible inaccuracy of data.
Web mining is the use of data mining techniques to automatically discover and
extract information from Web documents and services.
There are three general classes of information that can be discovered by web mining:
 Web Activity, from server logs and Web browser activity tracking.
 Web Graph, from links between pages, people, and other data.
 Web Content, the data found on Web pages and inside documents.
Web Content Mining is the process of extracting useful information from the
contents of web documents. There are two approaches to content mining: the
agent-based approach and the database approach.
 Agent-Based Approach: Concentrates on searching for relevant information using the characteristics of a particular domain to interpret and organize the collected information.
 Database Approach: Used for retrieving semi-structured data from the web.
Web structure mining is the process of using graph theory to analyze the node and
connection structure of a web site. According to the type of web structural data, web
structure mining can be divided into two kinds:
 Extracting patterns from hyperlinks in the web: A hyperlink is a structural component that connects a web page to a different location.
 Mining the document structure: Analysis of the tree-like structure of individual pages.
Web usage mining is the process of extracting useful information from server logs,
i.e., finding out what users are looking for on the Internet. Some users might be
looking only at textual data, whereas others might be interested in multimedia data.
 A Search Engine Spider is a program that most search engines use to find what’s
new on the Internet.
 The program starts at a website and follows every hyperlink on each page, so everything on the web will eventually be found and spidered as the so-called "spider" crawls from one website to another.
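A minimal sketch of this crawl loop, assuming the third-party requests and beautifulsoup4 packages are available; the seed URL and page limit are illustrative, not part of the original work:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def spider(seed_url, max_pages=50):
    """Breadth-first crawl: start at a seed page, follow every hyperlink."""
    frontier = deque([seed_url])   # URLs waiting to be visited
    visited = set()                # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            page = requests.get(url, timeout=5)
        except requests.RequestException:
            continue               # skip unreachable pages
        visited.add(url)
        soup = BeautifulSoup(page.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])   # resolve relative links
            if link.startswith("http") and link not in visited:
                frontier.append(link)
    return visited
```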
Examples of crawlers:
 GNU Wget: A command-line crawler written in C and released under the GPL. It is typically used to mirror Web and FTP sites.
 DataparkSearch: A crawler and search engine released under the GNU General Public License.
 GRUB: An open-source distributed search crawler that Wikia Search used to crawl the web.
 HTTrack: Uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
 FAST Crawler is a distributed crawler used by Fast Search & Transfer.
 Googlebot is used by the Google search engine and is based on C++ and Python.
 Yahoo! Slurp was the name of the Yahoo! Search crawler until Yahoo! contracted with Microsoft to use Bingbot instead.
 Information on the Web changes over time.
 Knowledge of this ‘rate of change’ is crucial, as it allows us to
estimate how often a search engine should visit each portion of
the Web in order to maintain a fresh index.
 Recognizing the relevance or importance of sites.
 Define a scoring function for relevance,

      s_θ(u) = ξ,

where θ is the set of parameters, u is the URL, and ξ is the relevance score.
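As a toy illustration of such a function, θ can be as simple as a bag of topic keywords, with s_θ(u) returning the fraction of those keywords found in the URL string (the keyword set below is hypothetical):

```python
def score(url, theta):
    """Toy relevance score s_theta(u): fraction of topic keywords in the URL."""
    url_lower = url.lower()
    hits = sum(1 for keyword in theta if keyword in url_lower)
    return hits / len(theta) if theta else 0.0

theta = {"iron", "ore", "mining"}   # hypothetical parameter set
print(score("http://en.wikipedia.org/wiki/iron_ore", theta))  # 0.666...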
 Indexing the web is a challenge due to its growing and dynamic nature.
 A single crawling process, even a multithreaded one, is insufficient for large-scale engines that need to fetch large amounts of data rapidly.
 Distributing the crawling activity splits the load, decreases hardware requirements, and at the same time increases the overall download speed and reliability.
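One way the load can be split, sketched here under the assumption of a simple host-hash partitioning scheme (not a method specified in the deck), is to assign each URL to a crawler process by hashing its host name:

```python
import hashlib
from urllib.parse import urlparse

def assign_process(url, num_processes):
    """Map a URL to one of num_processes crawlers by hashing its host name.

    Hashing on the host (not the full URL) keeps all pages of a site on one
    process, which also makes per-site politeness easier to enforce.
    """
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_processes

print(assign_process("http://en.wikipedia.org/wiki/DNA", 4))  # stable id in 0..3
```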
 Focused web crawling visits unvisited URLs and checks whether they are relevant to the search topic (see the sketch below).
 It avoids irrelevant documents.
 It reduces network traffic.
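A minimal sketch of such a focused crawl loop, assuming score is any relevance predictor (such as the toy one above) and fetch_links is a caller-supplied page fetcher; heapq is a min-heap, so scores are negated to visit the most relevant URL first:

```python
import heapq

def focused_crawl(seeds, score, fetch_links, threshold=0.5, max_pages=100):
    """Visit unvisited URLs in order of predicted relevance to the topic."""
    frontier = [(-score(u), u) for u in seeds]   # max-priority via negation
    heapq.heapify(frontier)
    visited, relevant = set(), []
    while frontier and len(visited) < max_pages:
        priority, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        if -priority < threshold:
            continue                              # skip irrelevant documents
        relevant.append(url)
        for link in fetch_links(url):             # caller supplies the fetcher
            if link not in visited:
                heapq.heappush(frontier, (-score(link), link))
    return relevant
```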
Basically, three steps are involved in the web crawling process:
The search crawler starts by crawling the pages of your site.
Then it continues indexing the words and content of the site.
Finally, it visits the links (web page addresses or URLs) found in
your site.
The behavior of a Web crawler is the outcome of a combination of policies:
 Selection policy: States which pages to download. Given the current size of the Web, even large search engines cover only a portion of the publicly available part.
 Re-visit policy: States when to check for changes to the pages. The most-used cost functions are freshness and age (defined below).
 Politeness policy: States how to avoid overloading sites; a single crawler performing multiple requests per second and/or downloading large files can place a heavy load on a server.
 Parallelization policy: A parallel crawler runs multiple crawling processes in parallel.
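For reference, the freshness and age of a page p at time t are commonly defined (following Cho and Garcia-Molina) as:

```latex
F_p(t) =
\begin{cases}
  1 & \text{if the local copy of } p \text{ is identical to the live page at time } t,\\
  0 & \text{otherwise;}
\end{cases}
\qquad
A_p(t) =
\begin{cases}
  0 & \text{if the local copy of } p \text{ is up to date at time } t,\\
  t - \text{(last modification time of } p) & \text{otherwise.}
\end{cases}
```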
 A new learning-based approach uses the Naïve Bayes classifier as the base prediction model to improve relevance prediction in focused Web crawlers.
 This learning-based focused crawling approach uses four relevance attributes to predict relevance; if dynamic updating of the training dataset is additionally allowed, the prediction accuracy is further boosted [1].
 The intelligent crawling method involves looking for specific features in a page to rank the candidate links.
 These features include page content, the URL names of referred Web pages, and the nature of the parent and sibling pages.
 It is a generic framework in that it allows the user to specify the relevance criteria [2].
 Fish-Search is an early crawler that prioritizes unvisited URLs on a queue for a specific search goal.
 The Fish-Search approach assigns priority values (1 or 0) to candidate pages using simple keyword matching (see the sketch below).
 One disadvantage of Fish-Search is that all relevant pages are assigned the same priority value of 1 based on keyword matching [3].
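A sketch of that binary assignment, assuming the search goal is given as a set of keywords (the example inputs are illustrative):

```python
def fish_priority(page_text, query_keywords):
    """Fish-Search style scoring: 1 if any query keyword matches, else 0."""
    text = page_text.lower()
    return 1 if any(k.lower() in text for k in query_keywords) else 0

print(fish_priority("Iron ore deposits and mining", {"mining"}))  # 1
print(fish_priority("Recipes for dinner", {"mining"}))            # 0
```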
 Shark-Search is a modified version of Fish-Search in which the Vector Space Model (VSM) is used.
 The priority values (more than just 1 and 0) are computed from the priority values of parent pages, the page content, and the anchor text [3].
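The VSM similarity can be sketched as a term-frequency cosine between two texts; a fuller implementation would use TF-IDF weights, but this minimal version shows the idea:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Term-frequency cosine similarity between two documents (0..1)."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("iron ore mining", "mining iron ore prices"))  # ~0.87
```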
http://en.wikipedia.org/wiki/DNA
http://dna.ancestry.com
http://www.familytreedns.com
http://learn.genetics.utah.edu
http://facebook.com
http://www.technologystudent.com/joints/iron2
http://en.wikipedia.org/wiki/iron_ore
http://www.metalprices.com/metal/iron-ore
http://miningartifacts.homestead.com/IronOres.html
http://facebook.com
http://www.worldmarket.com
http://www.marketbarchicago.com
http://finance.yahoo.com
http://www.nasdaq.com
http://facebook.com
www.freecomputerbooks.com
www.computer-books.us
www.onlinecomputerbook.com
www.freetechbook.com
http://facebook.com
 Decision tree
 Neural Network
 Naïve Bayes
 A decision tree is an important tool for machine learning. It is used as a predictive model that maps observations about an item to conclusions about the item's target value.
 Tree structures consist of leaves and branches.
 Leaves represent class labels, and branches represent conjunctions of features that lead to those class labels.
 In decision analysis, a decision tree can be used to explicitly and visually represent decisions and decision making.
 The resulting classification tree can be an input for decision making (see the sketch below).
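A hedged sketch of such a classifier using scikit-learn, assuming each URL has already been reduced to a numeric feature vector; the four features and labels below are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training set: one row per URL, e.g.
# [url_keyword_score, anchor_text_score, parent_page_score, sibling_score]
X_train = [
    [0.9, 0.8, 0.7, 0.6],   # relevant URL
    [0.1, 0.0, 0.2, 0.1],   # irrelevant URL
    [0.8, 0.6, 0.9, 0.7],
    [0.2, 0.1, 0.0, 0.3],
]
y_train = [1, 0, 1, 0]      # 1 = relevant, 0 = irrelevant

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(tree.predict([[0.7, 0.5, 0.8, 0.6]]))  # predicted class for a new URL
```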
 A neural network (NN) is an always-learning model.
 A neuron usually receives many inputs, called the input vector (an analogy of dendrites), and their weighted sum forms the output (an analogy of the synapse).
 The weighted sum of the inputs is passed to an activation function (see the sketch below).
 The brain makes up for the relatively slow rate of operation of a neuron by having a truly staggering number of neurons (nerve cells) with massive interconnections between them.
 It is estimated that there are on the order of 10 billion neurons in the human cortex, and 60 trillion connections.
 The net result is that the brain is an enormously efficient structure.
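The weighted-sum-plus-activation step can be written directly as a single artificial neuron (the weights and inputs below are illustrative):

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs through an activation."""
    z = np.dot(inputs, weights) + bias   # weighted sum of the input vector
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation

x = np.array([0.9, 0.2, 0.7])            # input vector ("dendrites")
w = np.array([0.5, -0.3, 0.8])           # learned weights (illustrative)
print(neuron(x, w, bias=0.1))            # output signal, ~0.74
```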
 The Naïve Bayes algorithm is based on probabilistic learning and classification.
 It assumes that each feature is independent of the others.
 The algorithm has proved efficient compared with many other approaches, although its independence assumption rarely holds in real-world cases.
 It exploits the fact that relevant pages are likely to link to other relevant pages: the relevance of a page a to a topic t, pointed to by a page b, is estimated from the relevance of page b to the topic t (see the sketch below).
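A minimal sketch using scikit-learn's MultinomialNB on bag-of-words features, e.g. from anchor text; the tiny corpus and labels are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical anchor-text snippets with relevance labels.
texts = ["iron ore mining prices", "dna ancestry genetics",
         "iron ore market", "social network login"]
labels = [1, 0, 1, 0]                 # 1 = relevant to "iron ore", 0 = not

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)   # bag-of-words counts

model = MultinomialNB()
model.fit(X, labels)                  # learns P(word | class), words independent

new = vectorizer.transform(["iron ore futures"])
print(model.predict(new))             # [1] -> predicted relevant
```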
 With the advent of the WWW and the increase in the size of the web, searching for relevant information on the web has become a challenging task. The result of any search is millions of documents ranked by a search engine.
 The ranking depends on keyword count, keyword density, link count, and other proprietary algorithms that are specific to each search engine.
 A user does not have an easy way to reduce those millions of documents by defining a context or refining the meaning of the keywords.
 The documents appearing in the search result are ranked according to an algorithm characteristic of each search engine, so the results are in all probability not sorted according to the user's interest.
 Due to the problems mentioned above and the keyword approach used by most search engines, the time and effort required to find the right information grow in proportion to the amount of information on the web. Data mining is a valid solution to this problem.
 Crawlers are important data mining tools that traverse the Internet to retrieve web pages relevant to a topic.
 To understand the working of content block segmentation from previous research.
 To find attributes for seed URLs and their child URLs.
 To classify the URLs according to their weight or score in the weight table.
 To prepare the full training data set and maintain the keyword table.
 To classify the relevancy of new, unseen URLs using Decision Tree Induction, Neural Network, and Naïve Bayes classifiers.
 To find the harvest ratio or precision rate for overall performance evaluation.
The precision rate is the parameter considered to form the results; the performance
of all the algorithms is measured on the basis of this rate.
Precision rate: Estimates the fraction of crawled pages that are relevant. We
depend on multiple classifiers to make this relevance judgement.
Precision Rate = No. of relevant pages / Total downloaded pages
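Computed directly (the counts below are illustrative):

```python
def precision_rate(relevant_pages, total_downloaded):
    """Fraction of crawled pages judged relevant by the classifiers."""
    return relevant_pages / total_downloaded if total_downloaded else 0.0

print(precision_rate(relevant_pages=38, total_downloaded=50))  # 0.76
```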
Hardware and Software Requirements
 Hardware:
Intel Core i3 processor.
3 GB memory.
64-bit operating system.
 Software:
MATLAB 2010a.
 The name MATLAB stands for MATrix LABoratory.
 MATLAB is a tool for numerical computation and visualization. The basic data element is a matrix, so a program that manipulates array-based data is generally fast to write.
 MATLAB is a high-performance language for technical computing. It integrates computation, visualization, and programming in one environment.
 MATLAB is an excellent tool because it has sophisticated data structures, contains built-in editing and debugging tools, and supports object-oriented programming.
 It also has easy-to-use graphics commands that make visualization of results immediately available.
 There are toolboxes (for specific applications) for signal processing, symbolic computation, control theory, simulation, optimization, and several other fields of applied science and engineering.
 Our work was to compare the accuracy of the three prime classifiers. We can conclude that, in terms of classification accuracy, the Neural Network leads Decision Tree Induction and Naïve Bayes by a big margin.
 Moreover, for more complex computation we need to improve the memory of a particular classifier by training it; in this context as well, the NN's dominance over DTI and NB prevails.
The approach can be applied to large databases for relevancy prediction.
Support vector machines could further be used.
[1] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery", Computer Networks, vol. 31, no. 11-16, pp. 1623-1640, 1999.
[2] J. Li, K. Furuse, and K. Yamaguchi, "Focused Crawling by Exploiting Anchor Text Using Decision Tree", in Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, Chiba, Japan, 2005.
[3] J. Rennie and A. McCallum, "Using Reinforcement Learning to Spider the Web Efficiently", in Proceedings of the 16th International Conference on Machine Learning (ICML-99), pp. 335-343, 1999.
  • 58. [1] S. Chakrabarti, M. Berg, and B. Dom, “Focused Crawling: A New Approach for Topic Specific Resource Discovery”, In Journal of Computer and Information Science, vol. 31, no. 11-16, pp. 1623-1640,1999. [2] Li, J., Furuse, K. & Yamaguchi, K., “ Focused Crawling by Exploiting Anchor Text Using Decision Tree”,In Special interest tracks and posters of the 14th international conference on World Wide Web,Chiba, Japan,2005 [3] J. Rennie and A. McCallum, "Using Reinforcement Learning to Spider the Web Efficiently," In proceedings of the 16th International Conference on Machine Learning(ICML-99), pp. 335-343, 1999. 58