SlideShare a Scribd company logo
1 of 5
Download to read offline
IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p- ISSN: 2278-8727Volume 15, Issue 3 (Nov. - Dec. 2013), PP 08-12
www.iosrjournals.org
www.iosrjournals.org 8 | Page
WSO-LINK: Algorithm to Eliminate Web Structure Outliers in
Web Pages
Rachna Singh Bhullar, Dr. Praveen Dhyani
(Computer Science Department, Guru Nanak Dev University, Amritsar, Punjab, India)
(Banasthali University – Jaipur Campus, Jaipur, Rajasthan, India)
Abstract: Web Mining is specialized field of Data Mining which deals with the methods and techniques of data
mining to extract useful patterns from the web data that is available in web server logs/databases. Web content
mining is one of the classifications of web mining which extracts information from the web documents
containing texts, links, videos and multimedia data available in World Wide Web databases. Further, web
structure mining is a kind of web content mining which extracts patterns and meaningful information from the
structure of hyperlinks contained in web documents having the same domain. The hyperlinks which are not
related to content or the invalid ones are called web structure outliers. In this paper the basic aim is to find out
these web structure outliers.
Keywords- Outliers, web outlier mining, web structure mining, Web mining, web structure documents.
I. Introduction
Millions and millions of users are uploading and downloading web data into/from the web databases in
World Wide Web. That‟s why, data in web server logs and databases are increasing exponentially. Updating and
retrieving efficient and relevant data from web databases is a major concern. The aim of our research is to
develop a new methodology for efficiently and effectively mine useful and relevant data from the web
documents having the same domain. Web mining tasks can be divided into three main categories, namely, Web
Structure Mining, Web Usage Mining and Web Content Mining. Web Structure Mining mines relevant
knowledge and meaningful patterns from the structure of hyperlinks contained in web pages. Web Usage
Mining is the application of web mining techniques to mine information from web usage logs. Web Content
Mining extracts efficient and relevant information from web pages having text, image, video and hyperlinks as
their content [3], [4], [5] and [6]. Web structure mining is a kind of Web content mining as it mines relevant
data from the hyperlinks of web documents to be mined by the algorithms of web content mining [1]. Existing
Web Content Mining algorithms focus on web documents of same domain; these algorithms do not consider
web pages with varying contents of the same domain called the Web Content Outliers. In general, Outliers are
the data that are irrelevant in terms of meaning and behavior of the existing data.
II. Outline of work
Section II provides the brief review of related work in web content mining. Section III explains the
proposed algorithm. Section IV provides the results while in section V conclusions and future work is
summarized.
III. Related Work
Outliers are those data objects which behave differently on the basis of their properties and valuable
information that they contain. Outlier Mining is mainly studied in statistics because standard distribution
techniques are applied on data objects to find out the outliers. A prior knowledge of data distribution like
Poisson, Normal, etc. is mainly required to apply the statistical techniques which are the major setback. Outlier
Detection techniques can be of following categories:
1. Statistical techniques: The statistical techniques like depth, distance, derivation and density based
techniques can be applied on numeric data objects/sets.
2. Web Text Outlier Mining Algorithm: This computes the difference in web texts within a certain domain.
3. WCO-ND algorithm: This algorithm is designed to determine the similarity between different but related
words in text processing.
All the above discussed algorithms make use of web texts present in the various web documents. For example,
consider the following scenario for a web search engine.
WSO-LINK: Algorithm to Eliminate Web Structure Outliers in Web Pages
www.iosrjournals.org 9 | Page
Figure1 figure2
In figure1 Google search engine is there and in figure2, after searching Banasthali by Google, list of web
documents containing hyperlinks (not the simple text related to Banasthali) is listed. That is why, the above
simple web discussed algorithms are not sufficient to yield the desired and efficient output.
IV. Architecture of the proposed system
In the proposed system, query given by the user is searched using a web scrapper. Web search engine,
opened in web scrapper, then generates a list of related web pages. Each web page is preprocessed by extracting
all the links in an excel file. Now corresponding to each page, a separate excel file on the disk is placed. After
this, each excel file is processed by a programming code to eliminate the web structure outliers.
WSO-LINK: Algorithm to Eliminate Web Structure Outliers in Web Pages
www.iosrjournals.org 10 | Page
V. Proposed Flowchart
WSO-LINK: Algorithm to Eliminate Web Structure Outliers in Web Pages
www.iosrjournals.org 11 | Page
VI. Proposed Algorithm
Step1:- Enter the query on the web search engine opened in the web scrapper.
Step2:- Take the input as document D to be mined.
Step3:- Document D is consisted of:
n
D=U W i
i=1
Where i=1, 2, 3…………n web pages.
Step4:- Initialize i=1
k
Step5:- Assign L [K] =U L t
t=1
Where L=name of array whose elements are links from web page W i
L=name of array whose elements links from webpage W i
t=1, 2, 3………Kwhere k= total no. of links in W i
Step6:- Our aim is to find the web structure outlier from W i which are the links not related to
searched content as well as not reachable that means the fake hyperlinks.
Step7:- We can say n n k
D=U W I = U (U L t )i
I=1 i=1 t=1
k
Step8:- for m=1 to k, check whether Lm of L[k](which is equivalent to L[k]=U Lm ) where m=1, 2..k
m=1
At first instance, m=1 L [1] is
(i) A valid hyperlink or not by checking through a java code.
(ii) A hyperlink related to the searched content or not.
Step9:- If (i) and (ii) are true then repeat the above step8 for k times.
Step10:- Repeat both the step8 & 9 for n times so that we can remove outliers from all the web pages
contained in a document „D‟.
VII. Observations
Elimination of outliers results in the reduction of space and time complexity. Quality of search engine gets
increased as web content is efficient and relevant to the searched content. In statistics, we have a measurement
to find the quality of refined pages which is known as Precision. It can be defined as the ratio between the
number of relevant pages and the total number of relevant documents returned after the elimination of outliers
[9].
Relevant documents retrieved originally
Precision = -------------------------------------------------
Refined documents retrieved
VIII. Future Work
Web mining is a growing research area in data mining research. This paper proposes an algorithm to find the
outliers to improve the efficiency of web search engine. Future work aims at experimental evaluation and
comparative study of our algorithm with results of existing web content mining algorithms.
References
Journals
[1]. Signed approach for mining web content outliers by G.Poonkuzhali, K. Thaiagrajan, K. Sarukesi and G.V.Uma, World Academy of
Science Engineering & Technology, 32, 2009.
[2]. Bing Liu, Kevin chen-chuan chang, Editorial special issue on web content mining, SIGKDD Explorations Volume 6, issue 2.
[3]. Hongqili, Zhuang Wu, Xia.ogang Ji research on the techniques for effectively searching and retrieving information from Internet
Symposium on Electronic Commerce & Security, IEEE 2008.
[4]. G.Poonkuzhali, K.Thaigarajan, K.Sarukesi, set theoretical approach for mining web content through outlier detection, International
Journal on Research & Industrial Applications, Volume 2, January 2009.
[5]. “Chinese web text outlier mining Based on domain knowledge “, by Xia Huosang , Fan Xhaoyan, Pang Liuyan in 2010 Second WRI
Global Congress on Intelligent Systems.
WSO-LINK: Algorithm to Eliminate Web Structure Outliers in Web Pages
www.iosrjournals.org 12 | Page
Proceedings
[6]. WCO ND-Mine: Algorithm for detecting web content outliers from web documents by Malik Agyemang, Ken Barker, Raja S.Anthajj,
proceedings of the 10th
IEEE symposium Computer & Communications (ISCC2005).
[7]. G.Poonkuzhali, K.Thaigarajan, K.Sarukesi elimination of redundant links in web pages –Mathematical approach, Proceedings of
World Academy of Science, Engineering & Technology, Volume 40,April 2009, PP 555-562.
[8]. Jhonshon T, Kwok I, Ng R. “Fast computation of 2-D Depth Contours” , In Proceedings of KDD 98, PP 224-228.
[9]. Knorr E.M., Ng R.T. “Algorithm for Mining distant based outliers in large datasets” in Proceedings of the 24th
VLDB conference, New
York, 1998, PP 392-403.
Books:
[10]. Data mining Concepts and Techniques by Jiawei Hen and Micheline Kamber.

More Related Content

What's hot

A Study on Web Structure Mining
A Study on Web Structure MiningA Study on Web Structure Mining
A Study on Web Structure MiningIRJET Journal
 
PAS: A Sampling Based Similarity Identification Algorithm for compression of ...
PAS: A Sampling Based Similarity Identification Algorithm for compression of ...PAS: A Sampling Based Similarity Identification Algorithm for compression of ...
PAS: A Sampling Based Similarity Identification Algorithm for compression of ...rahulmonikasharma
 
Web Mining Research Issues and Future Directions – A Survey
Web Mining Research Issues and Future Directions – A SurveyWeb Mining Research Issues and Future Directions – A Survey
Web Mining Research Issues and Future Directions – A SurveyIOSR Journals
 
COST-SENSITIVE TOPICAL DATA ACQUISITION FROM THE WEB
COST-SENSITIVE TOPICAL DATA ACQUISITION FROM THE WEBCOST-SENSITIVE TOPICAL DATA ACQUISITION FROM THE WEB
COST-SENSITIVE TOPICAL DATA ACQUISITION FROM THE WEBIJDKP
 
A comprehensive study of mining web data
A comprehensive study of mining web dataA comprehensive study of mining web data
A comprehensive study of mining web dataeSAT Publishing House
 
Web personalization using clustering of web usage data
Web personalization using clustering of web usage dataWeb personalization using clustering of web usage data
Web personalization using clustering of web usage dataijfcstjournal
 
Literature Survey on Web Mining
Literature Survey on Web MiningLiterature Survey on Web Mining
Literature Survey on Web MiningIOSR Journals
 
Cloud Storage Client Application Analysis
Cloud Storage Client Application AnalysisCloud Storage Client Application Analysis
Cloud Storage Client Application AnalysisCSCJournals
 
Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...Editor IJCATR
 
Integrated Web Recommendation Model with Improved Weighted Association Rule M...
Integrated Web Recommendation Model with Improved Weighted Association Rule M...Integrated Web Recommendation Model with Improved Weighted Association Rule M...
Integrated Web Recommendation Model with Improved Weighted Association Rule M...ijdkp
 
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...IJSRD
 
Information Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis ApproachInformation Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis ApproachAIRCC Publishing Corporation
 
Web content mining a case study for bput results
Web content mining a case study for bput resultsWeb content mining a case study for bput results
Web content mining a case study for bput resultseSAT Publishing House
 
RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING ijcax
 

What's hot (18)

A Study on Web Structure Mining
A Study on Web Structure MiningA Study on Web Structure Mining
A Study on Web Structure Mining
 
PAS: A Sampling Based Similarity Identification Algorithm for compression of ...
PAS: A Sampling Based Similarity Identification Algorithm for compression of ...PAS: A Sampling Based Similarity Identification Algorithm for compression of ...
PAS: A Sampling Based Similarity Identification Algorithm for compression of ...
 
Web Mining Research Issues and Future Directions – A Survey
Web Mining Research Issues and Future Directions – A SurveyWeb Mining Research Issues and Future Directions – A Survey
Web Mining Research Issues and Future Directions – A Survey
 
COST-SENSITIVE TOPICAL DATA ACQUISITION FROM THE WEB
COST-SENSITIVE TOPICAL DATA ACQUISITION FROM THE WEBCOST-SENSITIVE TOPICAL DATA ACQUISITION FROM THE WEB
COST-SENSITIVE TOPICAL DATA ACQUISITION FROM THE WEB
 
A comprehensive study of mining web data
A comprehensive study of mining web dataA comprehensive study of mining web data
A comprehensive study of mining web data
 
Web personalization using clustering of web usage data
Web personalization using clustering of web usage dataWeb personalization using clustering of web usage data
Web personalization using clustering of web usage data
 
Literature Survey on Web Mining
Literature Survey on Web MiningLiterature Survey on Web Mining
Literature Survey on Web Mining
 
Cloud Storage Client Application Analysis
Cloud Storage Client Application AnalysisCloud Storage Client Application Analysis
Cloud Storage Client Application Analysis
 
Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...
 
L017418893
L017418893L017418893
L017418893
 
Integrated Web Recommendation Model with Improved Weighted Association Rule M...
Integrated Web Recommendation Model with Improved Weighted Association Rule M...Integrated Web Recommendation Model with Improved Weighted Association Rule M...
Integrated Web Recommendation Model with Improved Weighted Association Rule M...
 
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
 
01635156
0163515601635156
01635156
 
Minning www
Minning wwwMinning www
Minning www
 
Information Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis ApproachInformation Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis Approach
 
Web content minin
Web content mininWeb content minin
Web content minin
 
Web content mining a case study for bput results
Web content mining a case study for bput resultsWeb content mining a case study for bput results
Web content mining a case study for bput results
 
RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING
 

Viewers also liked

Experimental & Finite Element Analysis of Left Side Lower Wishbone Arm of Ind...
Experimental & Finite Element Analysis of Left Side Lower Wishbone Arm of Ind...Experimental & Finite Element Analysis of Left Side Lower Wishbone Arm of Ind...
Experimental & Finite Element Analysis of Left Side Lower Wishbone Arm of Ind...IOSR Journals
 
Analysis of the Demand for Eggs in City Of Malang
Analysis of the Demand for Eggs in City Of MalangAnalysis of the Demand for Eggs in City Of Malang
Analysis of the Demand for Eggs in City Of MalangIOSR Journals
 
Reduce Evaporation Losses from Water Reservoirs
Reduce Evaporation Losses from Water ReservoirsReduce Evaporation Losses from Water Reservoirs
Reduce Evaporation Losses from Water ReservoirsIOSR Journals
 
Effects of Variable Fluid Properties and MHD on Mixed Convection Heat Transfe...
Effects of Variable Fluid Properties and MHD on Mixed Convection Heat Transfe...Effects of Variable Fluid Properties and MHD on Mixed Convection Heat Transfe...
Effects of Variable Fluid Properties and MHD on Mixed Convection Heat Transfe...IOSR Journals
 
Performance Analysis of OFDM in Combating Multipath Fading
Performance Analysis of OFDM in Combating Multipath FadingPerformance Analysis of OFDM in Combating Multipath Fading
Performance Analysis of OFDM in Combating Multipath FadingIOSR Journals
 
The Effect Of Planning And Control To Bureaucracy Behavior In The Improving S...
The Effect Of Planning And Control To Bureaucracy Behavior In The Improving S...The Effect Of Planning And Control To Bureaucracy Behavior In The Improving S...
The Effect Of Planning And Control To Bureaucracy Behavior In The Improving S...IOSR Journals
 
Idiosyncratic Effect of Corporate Solvency Management Strategies on Corporate...
Idiosyncratic Effect of Corporate Solvency Management Strategies on Corporate...Idiosyncratic Effect of Corporate Solvency Management Strategies on Corporate...
Idiosyncratic Effect of Corporate Solvency Management Strategies on Corporate...IOSR Journals
 
Steganography Technique of Sending Random Passwords on Receiver’s Mobile (A N...
Steganography Technique of Sending Random Passwords on Receiver’s Mobile (A N...Steganography Technique of Sending Random Passwords on Receiver’s Mobile (A N...
Steganography Technique of Sending Random Passwords on Receiver’s Mobile (A N...IOSR Journals
 
An Effective Model for Evaluating Organizational Risk and Cost in ERP Impleme...
An Effective Model for Evaluating Organizational Risk and Cost in ERP Impleme...An Effective Model for Evaluating Organizational Risk and Cost in ERP Impleme...
An Effective Model for Evaluating Organizational Risk and Cost in ERP Impleme...IOSR Journals
 
Educational Setup by a Global Repute Company of Greater Noida (Uttar Pradesh)...
Educational Setup by a Global Repute Company of Greater Noida (Uttar Pradesh)...Educational Setup by a Global Repute Company of Greater Noida (Uttar Pradesh)...
Educational Setup by a Global Repute Company of Greater Noida (Uttar Pradesh)...IOSR Journals
 
Impact of Trade Associations on Entrepreneurial Traits in Nigeria’s Transport...
Impact of Trade Associations on Entrepreneurial Traits in Nigeria’s Transport...Impact of Trade Associations on Entrepreneurial Traits in Nigeria’s Transport...
Impact of Trade Associations on Entrepreneurial Traits in Nigeria’s Transport...IOSR Journals
 
Computations of the Ground State Cohesive Properties Of Alas Crystalline Stru...
Computations of the Ground State Cohesive Properties Of Alas Crystalline Stru...Computations of the Ground State Cohesive Properties Of Alas Crystalline Stru...
Computations of the Ground State Cohesive Properties Of Alas Crystalline Stru...IOSR Journals
 
Integrating ICT in Re-Branding Nigerian Youths for Constructive Empowerment a...
Integrating ICT in Re-Branding Nigerian Youths for Constructive Empowerment a...Integrating ICT in Re-Branding Nigerian Youths for Constructive Empowerment a...
Integrating ICT in Re-Branding Nigerian Youths for Constructive Empowerment a...IOSR Journals
 
A survey on context aware system & intelligent Middleware’s
A survey on context aware system & intelligent Middleware’sA survey on context aware system & intelligent Middleware’s
A survey on context aware system & intelligent Middleware’sIOSR Journals
 
Study on Coefficient of Permeability of Copper slag when admixed with Lime an...
Study on Coefficient of Permeability of Copper slag when admixed with Lime an...Study on Coefficient of Permeability of Copper slag when admixed with Lime an...
Study on Coefficient of Permeability of Copper slag when admixed with Lime an...IOSR Journals
 

Viewers also liked (20)

C0751720
C0751720C0751720
C0751720
 
Experimental & Finite Element Analysis of Left Side Lower Wishbone Arm of Ind...
Experimental & Finite Element Analysis of Left Side Lower Wishbone Arm of Ind...Experimental & Finite Element Analysis of Left Side Lower Wishbone Arm of Ind...
Experimental & Finite Element Analysis of Left Side Lower Wishbone Arm of Ind...
 
Analysis of the Demand for Eggs in City Of Malang
Analysis of the Demand for Eggs in City Of MalangAnalysis of the Demand for Eggs in City Of Malang
Analysis of the Demand for Eggs in City Of Malang
 
Reduce Evaporation Losses from Water Reservoirs
Reduce Evaporation Losses from Water ReservoirsReduce Evaporation Losses from Water Reservoirs
Reduce Evaporation Losses from Water Reservoirs
 
Effects of Variable Fluid Properties and MHD on Mixed Convection Heat Transfe...
Effects of Variable Fluid Properties and MHD on Mixed Convection Heat Transfe...Effects of Variable Fluid Properties and MHD on Mixed Convection Heat Transfe...
Effects of Variable Fluid Properties and MHD on Mixed Convection Heat Transfe...
 
K0736871
K0736871K0736871
K0736871
 
Performance Analysis of OFDM in Combating Multipath Fading
Performance Analysis of OFDM in Combating Multipath FadingPerformance Analysis of OFDM in Combating Multipath Fading
Performance Analysis of OFDM in Combating Multipath Fading
 
The Effect Of Planning And Control To Bureaucracy Behavior In The Improving S...
The Effect Of Planning And Control To Bureaucracy Behavior In The Improving S...The Effect Of Planning And Control To Bureaucracy Behavior In The Improving S...
The Effect Of Planning And Control To Bureaucracy Behavior In The Improving S...
 
Idiosyncratic Effect of Corporate Solvency Management Strategies on Corporate...
Idiosyncratic Effect of Corporate Solvency Management Strategies on Corporate...Idiosyncratic Effect of Corporate Solvency Management Strategies on Corporate...
Idiosyncratic Effect of Corporate Solvency Management Strategies on Corporate...
 
Steganography Technique of Sending Random Passwords on Receiver’s Mobile (A N...
Steganography Technique of Sending Random Passwords on Receiver’s Mobile (A N...Steganography Technique of Sending Random Passwords on Receiver’s Mobile (A N...
Steganography Technique of Sending Random Passwords on Receiver’s Mobile (A N...
 
An Effective Model for Evaluating Organizational Risk and Cost in ERP Impleme...
An Effective Model for Evaluating Organizational Risk and Cost in ERP Impleme...An Effective Model for Evaluating Organizational Risk and Cost in ERP Impleme...
An Effective Model for Evaluating Organizational Risk and Cost in ERP Impleme...
 
E01053234
E01053234E01053234
E01053234
 
Educational Setup by a Global Repute Company of Greater Noida (Uttar Pradesh)...
Educational Setup by a Global Repute Company of Greater Noida (Uttar Pradesh)...Educational Setup by a Global Repute Company of Greater Noida (Uttar Pradesh)...
Educational Setup by a Global Repute Company of Greater Noida (Uttar Pradesh)...
 
Impact of Trade Associations on Entrepreneurial Traits in Nigeria’s Transport...
Impact of Trade Associations on Entrepreneurial Traits in Nigeria’s Transport...Impact of Trade Associations on Entrepreneurial Traits in Nigeria’s Transport...
Impact of Trade Associations on Entrepreneurial Traits in Nigeria’s Transport...
 
Computations of the Ground State Cohesive Properties Of Alas Crystalline Stru...
Computations of the Ground State Cohesive Properties Of Alas Crystalline Stru...Computations of the Ground State Cohesive Properties Of Alas Crystalline Stru...
Computations of the Ground State Cohesive Properties Of Alas Crystalline Stru...
 
Integrating ICT in Re-Branding Nigerian Youths for Constructive Empowerment a...
Integrating ICT in Re-Branding Nigerian Youths for Constructive Empowerment a...Integrating ICT in Re-Branding Nigerian Youths for Constructive Empowerment a...
Integrating ICT in Re-Branding Nigerian Youths for Constructive Empowerment a...
 
A survey on context aware system & intelligent Middleware’s
A survey on context aware system & intelligent Middleware’sA survey on context aware system & intelligent Middleware’s
A survey on context aware system & intelligent Middleware’s
 
Study on Coefficient of Permeability of Copper slag when admixed with Lime an...
Study on Coefficient of Permeability of Copper slag when admixed with Lime an...Study on Coefficient of Permeability of Copper slag when admixed with Lime an...
Study on Coefficient of Permeability of Copper slag when admixed with Lime an...
 
K0526068
K0526068K0526068
K0526068
 
G0563337
G0563337G0563337
G0563337
 

Similar to WSO-LINK: Algorithm to Eliminate Web Structure Outliers in Web Pages

A Study On Web Structure Mining
A Study On Web Structure MiningA Study On Web Structure Mining
A Study On Web Structure MiningNicole Heredia
 
STRATEGY AND IMPLEMENTATION OF WEB MINING TOOLS
STRATEGY AND IMPLEMENTATION OF WEB MINING TOOLSSTRATEGY AND IMPLEMENTATION OF WEB MINING TOOLS
STRATEGY AND IMPLEMENTATION OF WEB MINING TOOLSAM Publications
 
Comparable Analysis of Web Mining Categories
Comparable Analysis of Web Mining CategoriesComparable Analysis of Web Mining Categories
Comparable Analysis of Web Mining Categoriestheijes
 
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web LogsWeb Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logsijsrd.com
 
Review on an automatic extraction of educational digital objects and metadata...
Review on an automatic extraction of educational digital objects and metadata...Review on an automatic extraction of educational digital objects and metadata...
Review on an automatic extraction of educational digital objects and metadata...IRJET Journal
 
a novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studioa novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studioINFOGAIN PUBLICATION
 
BIDIRECTIONAL GROWTH BASED MINING AND CYCLIC BEHAVIOUR ANALYSIS OF WEB SEQUEN...
BIDIRECTIONAL GROWTH BASED MINING AND CYCLIC BEHAVIOUR ANALYSIS OF WEB SEQUEN...BIDIRECTIONAL GROWTH BASED MINING AND CYCLIC BEHAVIOUR ANALYSIS OF WEB SEQUEN...
BIDIRECTIONAL GROWTH BASED MINING AND CYCLIC BEHAVIOUR ANALYSIS OF WEB SEQUEN...ijdkp
 
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...IOSR Journals
 
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web CrawlerIRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web CrawlerIRJET Journal
 
Web Page Recommendation Using Web Mining
Web Page Recommendation Using Web MiningWeb Page Recommendation Using Web Mining
Web Page Recommendation Using Web MiningIJERA Editor
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)theijes
 
Data mining in web search engine optimization
Data mining in web search engine optimizationData mining in web search engine optimization
Data mining in web search engine optimizationBookStoreLib
 
A Study of Pattern Analysis Techniques of Web Usage
A Study of Pattern Analysis Techniques of Web UsageA Study of Pattern Analysis Techniques of Web Usage
A Study of Pattern Analysis Techniques of Web Usageijbuiiir1
 
content extraction
content extractioncontent extraction
content extractionCharmi Patel
 
Study on Web Content Extraction Techniques
Study on Web Content Extraction TechniquesStudy on Web Content Extraction Techniques
Study on Web Content Extraction Techniquesijtsrd
 
MINING FUZZY ASSOCIATION RULES FROM WEB USAGE QUANTITATIVE DATA
MINING FUZZY ASSOCIATION RULES FROM WEB USAGE QUANTITATIVE DATAMINING FUZZY ASSOCIATION RULES FROM WEB USAGE QUANTITATIVE DATA
MINING FUZZY ASSOCIATION RULES FROM WEB USAGE QUANTITATIVE DATAcscpconf
 
Mining Fuzzy Association Rules from Web Usage Quantitative Data
Mining Fuzzy Association Rules from Web Usage Quantitative Data Mining Fuzzy Association Rules from Web Usage Quantitative Data
Mining Fuzzy Association Rules from Web Usage Quantitative Data csandit
 

Similar to WSO-LINK: Algorithm to Eliminate Web Structure Outliers in Web Pages (20)

A Study On Web Structure Mining
A Study On Web Structure MiningA Study On Web Structure Mining
A Study On Web Structure Mining
 
STRATEGY AND IMPLEMENTATION OF WEB MINING TOOLS
STRATEGY AND IMPLEMENTATION OF WEB MINING TOOLSSTRATEGY AND IMPLEMENTATION OF WEB MINING TOOLS
STRATEGY AND IMPLEMENTATION OF WEB MINING TOOLS
 
Comparable Analysis of Web Mining Categories
Comparable Analysis of Web Mining CategoriesComparable Analysis of Web Mining Categories
Comparable Analysis of Web Mining Categories
 
H0314450
H0314450H0314450
H0314450
 
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web LogsWeb Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
 
Review on an automatic extraction of educational digital objects and metadata...
Review on an automatic extraction of educational digital objects and metadata...Review on an automatic extraction of educational digital objects and metadata...
Review on an automatic extraction of educational digital objects and metadata...
 
G017334248
G017334248G017334248
G017334248
 
a novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studioa novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studio
 
BIDIRECTIONAL GROWTH BASED MINING AND CYCLIC BEHAVIOUR ANALYSIS OF WEB SEQUEN...
BIDIRECTIONAL GROWTH BASED MINING AND CYCLIC BEHAVIOUR ANALYSIS OF WEB SEQUEN...BIDIRECTIONAL GROWTH BASED MINING AND CYCLIC BEHAVIOUR ANALYSIS OF WEB SEQUEN...
BIDIRECTIONAL GROWTH BASED MINING AND CYCLIC BEHAVIOUR ANALYSIS OF WEB SEQUEN...
 
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
 
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web CrawlerIRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
IRJET-Deep Web Crawling Efficiently using Dynamic Focused Web Crawler
 
Web Page Recommendation Using Web Mining
Web Page Recommendation Using Web MiningWeb Page Recommendation Using Web Mining
Web Page Recommendation Using Web Mining
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
H017554148
H017554148H017554148
H017554148
 
Data mining in web search engine optimization
Data mining in web search engine optimizationData mining in web search engine optimization
Data mining in web search engine optimization
 
A Study of Pattern Analysis Techniques of Web Usage
A Study of Pattern Analysis Techniques of Web UsageA Study of Pattern Analysis Techniques of Web Usage
A Study of Pattern Analysis Techniques of Web Usage
 
content extraction
content extractioncontent extraction
content extraction
 
Study on Web Content Extraction Techniques
Study on Web Content Extraction TechniquesStudy on Web Content Extraction Techniques
Study on Web Content Extraction Techniques
 
MINING FUZZY ASSOCIATION RULES FROM WEB USAGE QUANTITATIVE DATA
MINING FUZZY ASSOCIATION RULES FROM WEB USAGE QUANTITATIVE DATAMINING FUZZY ASSOCIATION RULES FROM WEB USAGE QUANTITATIVE DATA
MINING FUZZY ASSOCIATION RULES FROM WEB USAGE QUANTITATIVE DATA
 
Mining Fuzzy Association Rules from Web Usage Quantitative Data
Mining Fuzzy Association Rules from Web Usage Quantitative Data Mining Fuzzy Association Rules from Web Usage Quantitative Data
Mining Fuzzy Association Rules from Web Usage Quantitative Data
 

More from IOSR Journals (20)

A011140104
A011140104A011140104
A011140104
 
M0111397100
M0111397100M0111397100
M0111397100
 
L011138596
L011138596L011138596
L011138596
 
K011138084
K011138084K011138084
K011138084
 
J011137479
J011137479J011137479
J011137479
 
I011136673
I011136673I011136673
I011136673
 
G011134454
G011134454G011134454
G011134454
 
H011135565
H011135565H011135565
H011135565
 
F011134043
F011134043F011134043
F011134043
 
E011133639
E011133639E011133639
E011133639
 
D011132635
D011132635D011132635
D011132635
 
C011131925
C011131925C011131925
C011131925
 
B011130918
B011130918B011130918
B011130918
 
A011130108
A011130108A011130108
A011130108
 
I011125160
I011125160I011125160
I011125160
 
H011124050
H011124050H011124050
H011124050
 
G011123539
G011123539G011123539
G011123539
 
F011123134
F011123134F011123134
F011123134
 
E011122530
E011122530E011122530
E011122530
 
D011121524
D011121524D011121524
D011121524
 

Recently uploaded

UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineeringmalavadedarshan25
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEroselinkalist12
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 

Recently uploaded (20)

UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 

WSO-LINK: Algorithm to Eliminate Web Structure Outliers in Web Pages

  • 1. IOSR Journal of Computer Engineering (IOSR-JCE) e-ISSN: 2278-0661, p- ISSN: 2278-8727Volume 15, Issue 3 (Nov. - Dec. 2013), PP 08-12 www.iosrjournals.org www.iosrjournals.org 8 | Page WSO-LINK: Algorithm to Eliminate Web Structure Outliers in Web Pages Rachna Singh Bhullar, Dr. Praveen Dhyani (Computer Science Department, Guru Nanak Dev University, Amritsar, Punjab, India) (Banasthali University – Jaipur Campus, Jaipur, Rajasthan, India) Abstract: Web Mining is specialized field of Data Mining which deals with the methods and techniques of data mining to extract useful patterns from the web data that is available in web server logs/databases. Web content mining is one of the classifications of web mining which extracts information from the web documents containing texts, links, videos and multimedia data available in World Wide Web databases. Further, web structure mining is a kind of web content mining which extracts patterns and meaningful information from the structure of hyperlinks contained in web documents having the same domain. The hyperlinks which are not related to content or the invalid ones are called web structure outliers. In this paper the basic aim is to find out these web structure outliers. Keywords- Outliers, web outlier mining, web structure mining, Web mining, web structure documents. I. Introduction Millions and millions of users are uploading and downloading web data into/from the web databases in World Wide Web. That‟s why, data in web server logs and databases are increasing exponentially. Updating and retrieving efficient and relevant data from web databases is a major concern. The aim of our research is to develop a new methodology for efficiently and effectively mine useful and relevant data from the web documents having the same domain. Web mining tasks can be divided into three main categories, namely, Web Structure Mining, Web Usage Mining and Web Content Mining. Web Structure Mining mines relevant knowledge and meaningful patterns from the structure of hyperlinks contained in web pages. Web Usage Mining is the application of web mining techniques to mine information from web usage logs. Web Content Mining extracts efficient and relevant information from web pages having text, image, video and hyperlinks as their content [3], [4], [5] and [6]. Web structure mining is a kind of Web content mining as it mines relevant data from the hyperlinks of web documents to be mined by the algorithms of web content mining [1]. Existing Web Content Mining algorithms focus on web documents of same domain; these algorithms do not consider web pages with varying contents of the same domain called the Web Content Outliers. In general, Outliers are the data that are irrelevant in terms of meaning and behavior of the existing data. II. Outline of work Section II provides the brief review of related work in web content mining. Section III explains the proposed algorithm. Section IV provides the results while in section V conclusions and future work is summarized. III. Related Work Outliers are those data objects which behave differently on the basis of their properties and valuable information that they contain. Outlier Mining is mainly studied in statistics because standard distribution techniques are applied on data objects to find out the outliers. A prior knowledge of data distribution like Poisson, Normal, etc. is mainly required to apply the statistical techniques which are the major setback. Outlier Detection techniques can be of following categories: 1. Statistical techniques: The statistical techniques like depth, distance, derivation and density based techniques can be applied on numeric data objects/sets. 2. Web Text Outlier Mining Algorithm: This computes the difference in web texts within a certain domain. 3. WCO-ND algorithm: This algorithm is designed to determine the similarity between different but related words in text processing. All the above discussed algorithms make use of web texts present in the various web documents. For example, consider the following scenario for a web search engine.
  • 2. WSO-LINK: Algorithm to Eliminate Web Structure Outliers in Web Pages www.iosrjournals.org 9 | Page Figure1 figure2 In figure1 Google search engine is there and in figure2, after searching Banasthali by Google, list of web documents containing hyperlinks (not the simple text related to Banasthali) is listed. That is why, the above simple web discussed algorithms are not sufficient to yield the desired and efficient output. IV. Architecture of the proposed system In the proposed system, query given by the user is searched using a web scrapper. Web search engine, opened in web scrapper, then generates a list of related web pages. Each web page is preprocessed by extracting all the links in an excel file. Now corresponding to each page, a separate excel file on the disk is placed. After this, each excel file is processed by a programming code to eliminate the web structure outliers.
  • 3. WSO-LINK: Algorithm to Eliminate Web Structure Outliers in Web Pages www.iosrjournals.org 10 | Page V. Proposed Flowchart
  • 4. WSO-LINK: Algorithm to Eliminate Web Structure Outliers in Web Pages www.iosrjournals.org 11 | Page VI. Proposed Algorithm Step1:- Enter the query on the web search engine opened in the web scrapper. Step2:- Take the input as document D to be mined. Step3:- Document D is consisted of: n D=U W i i=1 Where i=1, 2, 3…………n web pages. Step4:- Initialize i=1 k Step5:- Assign L [K] =U L t t=1 Where L=name of array whose elements are links from web page W i L=name of array whose elements links from webpage W i t=1, 2, 3………Kwhere k= total no. of links in W i Step6:- Our aim is to find the web structure outlier from W i which are the links not related to searched content as well as not reachable that means the fake hyperlinks. Step7:- We can say n n k D=U W I = U (U L t )i I=1 i=1 t=1 k Step8:- for m=1 to k, check whether Lm of L[k](which is equivalent to L[k]=U Lm ) where m=1, 2..k m=1 At first instance, m=1 L [1] is (i) A valid hyperlink or not by checking through a java code. (ii) A hyperlink related to the searched content or not. Step9:- If (i) and (ii) are true then repeat the above step8 for k times. Step10:- Repeat both the step8 & 9 for n times so that we can remove outliers from all the web pages contained in a document „D‟. VII. Observations Elimination of outliers results in the reduction of space and time complexity. Quality of search engine gets increased as web content is efficient and relevant to the searched content. In statistics, we have a measurement to find the quality of refined pages which is known as Precision. It can be defined as the ratio between the number of relevant pages and the total number of relevant documents returned after the elimination of outliers [9]. Relevant documents retrieved originally Precision = ------------------------------------------------- Refined documents retrieved VIII. Future Work Web mining is a growing research area in data mining research. This paper proposes an algorithm to find the outliers to improve the efficiency of web search engine. Future work aims at experimental evaluation and comparative study of our algorithm with results of existing web content mining algorithms. References Journals [1]. Signed approach for mining web content outliers by G.Poonkuzhali, K. Thaiagrajan, K. Sarukesi and G.V.Uma, World Academy of Science Engineering & Technology, 32, 2009. [2]. Bing Liu, Kevin chen-chuan chang, Editorial special issue on web content mining, SIGKDD Explorations Volume 6, issue 2. [3]. Hongqili, Zhuang Wu, Xia.ogang Ji research on the techniques for effectively searching and retrieving information from Internet Symposium on Electronic Commerce & Security, IEEE 2008. [4]. G.Poonkuzhali, K.Thaigarajan, K.Sarukesi, set theoretical approach for mining web content through outlier detection, International Journal on Research & Industrial Applications, Volume 2, January 2009. [5]. “Chinese web text outlier mining Based on domain knowledge “, by Xia Huosang , Fan Xhaoyan, Pang Liuyan in 2010 Second WRI Global Congress on Intelligent Systems.
  • 5. WSO-LINK: Algorithm to Eliminate Web Structure Outliers in Web Pages www.iosrjournals.org 12 | Page Proceedings [6]. WCO ND-Mine: Algorithm for detecting web content outliers from web documents by Malik Agyemang, Ken Barker, Raja S.Anthajj, proceedings of the 10th IEEE symposium Computer & Communications (ISCC2005). [7]. G.Poonkuzhali, K.Thaigarajan, K.Sarukesi elimination of redundant links in web pages –Mathematical approach, Proceedings of World Academy of Science, Engineering & Technology, Volume 40,April 2009, PP 555-562. [8]. Jhonshon T, Kwok I, Ng R. “Fast computation of 2-D Depth Contours” , In Proceedings of KDD 98, PP 224-228. [9]. Knorr E.M., Ng R.T. “Algorithm for Mining distant based outliers in large datasets” in Proceedings of the 24th VLDB conference, New York, 1998, PP 392-403. Books: [10]. Data mining Concepts and Techniques by Jiawei Hen and Micheline Kamber.