Searching for useful information on the web is a popular activity, but pages are often cluttered with irrelevant content, or noise, that makes extracting useful information difficult. Search engines, crawlers and information agents frequently fail to separate relevant information from this noise, which underlines how important clean search results are. Earlier research either located noisy data only at the edges of a web page or considered the whole page for noise detection. In this paper, we propose a simple priority-assignment approach for distinguishing the main content of a page from the noise. We first partition the whole page into a number of disjoint blocks using an HTML-tag-based technique. We then assign each block a priority level derived from the priorities of its HTML tags, using an aggregate priority calculation. The resulting priority value for each block helps rank the overall results in online searching. Blocks with higher priority are treated as informative blocks and preserved in a database for future use, whereas lower-priority blocks are treated as noisy blocks and excluded from further search operations. Our experimental results show considerable improvement in noisy-block elimination and online page ranking, with reduced search time compared to other known approaches. Moreover, the accuracy obtained by applying Naive Bayes text classification to our output is about 90 percent, which is high compared to other methods.
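As an illustration only (not the paper's actual algorithm), a minimal Python sketch of tag-based partitioning with aggregate priority scoring might look like the following; the priority table and the choice of block granularity are assumptions:

```python
# Minimal sketch: partition a page into blocks and score them by tag priority.
from bs4 import BeautifulSoup

# Hypothetical priorities: content-bearing tags score high, chrome tags low.
TAG_PRIORITY = {"p": 3, "h1": 3, "h2": 3, "table": 2, "div": 1, "a": 0, "script": 0}

def block_priority(block):
    """Aggregate priority of a block: sum of priorities over its descendant tags."""
    return sum(TAG_PRIORITY.get(tag.name, 1) for tag in block.find_all(True))

def split_and_rank(html):
    soup = BeautifulSoup(html, "html.parser")
    # Treat each direct child of <body> as one disjoint block.
    blocks = soup.body.find_all(recursive=False) if soup.body else []
    return sorted(blocks, key=block_priority, reverse=True)
```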
Web Content Mining Based on DOM Intersection and Visual Features Concept (ijceronline)
Structured data extraction from deep web pages is a challenging task due to the complex underlying structure of such pages, and because website developers follow different page-design techniques. Extracting data from web pages is highly useful for building databases for a wide range of applications. A large number of techniques have been proposed to address this problem, but each comes with its own limitations and constraints. This paper presents two approaches to structured data extraction. The first is a non-generic solution based on template detection using the intersection of the Document Object Model (DOM) trees of several web pages from the same website; it performs well in terms of efficiency and in accurately locating the main data on a given page. The second is based on a partial tree alignment mechanism that uses visual features such as the length, size, and position of web tables on the page. This approach is generic, since it does not depend on a particular website or page template, and it reliably locates multiple data regions, data records and data items within a given page. We compared our results with an existing mechanism and found them considerably better across a number of web pages.
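A hedged sketch of the first approach's core idea, template detection by intersecting DOM tag paths across pages of one site; the path representation is an assumption, not the paper's exact method:

```python
# Sketch: template detection via intersection of DOM tag paths across pages.
from bs4 import BeautifulSoup

def tag_paths(html):
    """Collect root-to-node tag paths, e.g. 'html/body/div/table'."""
    soup = BeautifulSoup(html, "html.parser")
    paths = set()
    for node in soup.find_all(True):
        ancestors = [p.name for p in reversed(list(node.parents))
                     if p.name and p.name != "[document]"]
        paths.add("/".join(ancestors + [node.name]))
    return paths

def template_paths(pages):
    """Paths shared by every page of a site approximate the template."""
    common = tag_paths(pages[0])
    for html in pages[1:]:
        common &= tag_paths(html)
    return common

# Nodes whose paths are NOT in the template are candidates for the main data.
```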
This document summarizes previous work on content extraction from web pages and proposes a new approach. It discusses existing methods that use techniques like entropy analysis, DOM trees, clustering, and ratios of text, links and tags. The proposed approach combines word to leaf ratio with text link ratio and link text ratio to identify informative nodes in the DOM tree. It calculates weights and relative positions of nodes to select the most informative content. The method will be tested on different website types and compared to existing approaches.
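A minimal sketch of ratio-based node scoring in this spirit; the tag list and thresholds are invented for illustration:

```python
# Sketch: score DOM nodes by text-to-link ratio to find informative nodes.
from bs4 import BeautifulSoup

def link_text_ratio(node):
    text = node.get_text(" ", strip=True)
    link_text = " ".join(a.get_text(" ", strip=True) for a in node.find_all("a"))
    return len(link_text) / max(len(text), 1)

def informative_nodes(html, max_link_ratio=0.25, min_words=30):
    soup = BeautifulSoup(html, "html.parser")
    hits = []
    for node in soup.find_all(["div", "td", "section", "article"]):
        text = node.get_text(" ", strip=True)
        # Informative nodes carry much plain text and little anchor text.
        if len(text.split()) >= min_words and link_text_ratio(node) <= max_link_ratio:
            hits.append(node)
    return hits
```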
This document provides an introduction to web structure mining and discusses two popular methods: HITS and PageRank. It begins with an overview of web mining categories including web content mining, web structure mining, and web usage mining. Web structure mining focuses on the hyperlink structure of the web and analyzes link relationships between pages. HITS and PageRank are two algorithms that have been proposed to handle potential correlations between linked pages and improve predictive accuracy.
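For concreteness, a standard PageRank power-iteration sketch (the textbook formulation, not tied to this particular document):

```python
# Sketch: PageRank by power iteration over an adjacency list.
def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            # Each page distributes its rank equally over its out-links;
            # dangling pages are ignored here for brevity.
            share = rank[p] / len(outs) if outs else 0.0
            for q in outs:
                new[q] = new.get(q, 0.0) + damping * share
        rank = new
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```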
Integrating content search with structure analysis for hypermedia retrieval a... (unyil96)
This document summarizes research on integrating content search and structure analysis for hypermedia retrieval and management. It discusses how link analysis and topic distillation techniques can organize query results and identify authoritative pages. Database approaches aim to facilitate search, navigation and associating web pages through extended query languages and logical document representations. Overall the paper outlines the state-of-the-art in utilizing both content and link structure to improve hypermedia search and organization.
This document discusses interactive visualization techniques for information retrieval. It begins by stating that information retrieval systems often return many results, some more relevant than others. While search engines have grown, problems remain with low precision and recall. Visualization techniques can help users better understand retrieval results. The document then reviews several visualization methods like tree views, title views, and bubble views that can enhance web information retrieval systems by helping users browse, filter, and reformulate queries. It argues visualization is an effective tool for dealing with large numbers of documents returned in web searches.
Intelligent Web Crawling (WI-IAT 2013 Tutorial), Denis Shestakov
<<< Slides can be found at http://www.slideshare.net/denshe/intelligent-crawling-shestakovwiiat13 >>>
-------------------
Web crawling, the process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents, from a simple website-backup program to a major web search engine. Due to the astronomical amount of data already published on the Web and the ongoing exponential growth of web content, any party that wants to take advantage of massive-scale web data faces a high barrier to entry. We start with background on web crawling and the structure of the Web. We then discuss different crawling strategies and describe adaptive web crawling techniques that lead to better overall crawl performance. We finally overview some of the challenges in web crawling, covering topics such as collaborative web crawling, crawling the deep Web and crawling multimedia content. Our goals are to introduce the intelligent systems community to the challenges in web crawling research, present intelligent web crawling approaches, and engage researchers and practitioners with open issues and research problems. The presentation should interest the web intelligence and intelligent agent technology communities, as it focuses on the use of intelligent/adaptive techniques in the web crawling domain.
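As background to the crawling strategies discussed, a minimal breadth-first crawler sketch; it is a generic baseline, not the tutorial's system, and omits the politeness delays and robots.txt handling a real crawler needs:

```python
# Sketch: a minimal breadth-first web crawler with a frontier queue.
from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=50):
    frontier, seen, pages = deque([seed]), {seed}, {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue  # skip unreachable pages
        pages[url] = html
        # Enqueue every newly discovered out-link.
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            nxt = urljoin(url, a["href"])
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return pages
```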
-------------------
Enhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering ... (inventionjournals)
This document discusses an enhanced web usage mining system using fuzzy clustering and collaborative filtering recommendation algorithms. It aims to address challenges with existing recommender systems like producing low quality recommendations for large datasets. The system architecture uses fuzzy clustering to predict future user access based on browsing behavior. Collaborative filtering is then used to produce expected results by combining fuzzy clustering outputs with a web database. This approach aims to provide users with more relevant recommendations in a shorter time compared to other systems.
Tutorial given at ICWE'13, Aalborg, Denmark on 08.07.2013
Abstract:
Web crawling, the process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents, from a simple website-backup program to a major web search engine. Due to the astronomical amount of data already published on the Web and the ongoing exponential growth of web content, any party that wants to take advantage of massive-scale web data faces a high barrier to entry. In this tutorial, we will introduce the audience to five topics: architecture and implementation of a high-performance web crawler, collaborative web crawling, crawling the deep Web, crawling multimedia content, and future directions in web crawling research.
To cite this tutorial:
Please refer to http://dx.doi.org/10.1007/978-3-642-39200-9_49
A language independent web data extraction using vision based page segmentati... (eSAT Publishing House)
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
A machine learning approach to web page filtering using ... (butest)
This document describes a machine learning approach to web page filtering that combines content and structural analysis. The proposed approach represents web pages with features extracted from content, such as terms and phrases, and from links. These features are used as input for machine learning algorithms like neural networks and support vector machines to classify pages. An experiment compares this approach to keyword-based and lexicon-based filtering, finding the proposed approach generally performs better, especially with few training examples. The approach could benefit topic-specific search engines and other applications.
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag... (IOSR Journals)
The document proposes an innovative vision-based page segmentation (IVBPS) algorithm to improve hidden web content extraction. It aims to overcome limitations of existing approaches that rely heavily on HTML structure. IVBPS extracts blocks from the visual representation of a page and clusters them to segment the page semantically. It uses layout features like position and appearance to locate data regions and extract records. The algorithm analyzes the entire page structure rather than local regions, allowing it to retain content DOM tree methods may discard. This is expected to significantly improve hidden web extraction performance.
Search Interfaces on the Web: Querying and Characterizing, PhD dissertation (Denis Shestakov)
Full-text of my PhD dissertation titled "Search Interfaces on the Web: Querying and Characterizing" defended in ICT-Building, Turku, Finland on 12.06.2008
Thesis contributions:
* New methods for deep Web characterization
* Estimating the scale of a national segment of the Web
* Building a publicly available dataset describing >200 web databases on the Russian Web
* Designing and implementing the I-Crawler, a system for automatic finding and classifying search interfaces
* Technique for recognizing and analyzing JavaScript-rich and non-HTML searchable forms
* Introducing a data model for representing search interfaces and result pages
* New user-friendly and expressive form query language for querying search interfaces and extracting data from result pages
* Designing and implementing a prototype system for querying web databases
* Bibliography with over 110 references to publications in the area of deep Web
The document discusses integrating library resources into the Moodle e-learning environment. It describes installing Moodle, configuring courses, and creating library blocks within courses to provide links to resources like the library catalog, databases, guides, and more. HTML codes are used to embed these resources. Plugins like BigBlueButton are also discussed to enable video conferencing. Collaboration between librarians, IT, and course administrators is emphasized to maximize use of library resources through the LMS.
This document discusses fuzzy clustering techniques for web mining. It proposes a fuzzy hierarchical clustering method to create clusters of web documents using fuzzy equivalence relations. The method aims to improve information retrieval by grouping similar documents into clusters. It describes how fuzzy clustering is suitable for web mining given the fuzzy nature of the web. It also provides background on related topics like web mining taxonomy, document clustering algorithms, and challenges of information retrieval on the web.
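A sketch of one classical route to fuzzy clustering with fuzzy equivalence relations, the max-min transitive closure followed by a lambda-cut; whether the paper uses exactly this construction is an assumption:

```python
# Sketch: cluster documents via the max-min transitive closure of a fuzzy relation.
import numpy as np

def transitive_closure(R):
    """Iterate R = max(R, R o R) with max-min composition until stable."""
    while True:
        # (R o R)[i, k] = max_j min(R[i, j], R[j, k])
        comp = np.max(np.minimum(R[:, :, None], R[None, :, :]), axis=1)
        R2 = np.maximum(R, comp)
        if np.allclose(R2, R):
            return R2
        R = R2

def lambda_cut_clusters(R, lam):
    """Documents i, j share a cluster when closure(R)[i, j] >= lam."""
    T = transitive_closure(R) >= lam
    n, labels, cluster = len(T), [-1] * len(T), 0
    for i in range(n):
        if labels[i] == -1:
            for j in range(n):
                if T[i, j]:
                    labels[j] = cluster
            cluster += 1
    return labels
```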
A survey on Design and Implementation of Clever Crawler Based On DUST Removal (IJSRD)
Nowadays, the World Wide Web has become a popular medium for searching information, business, trading and so on. A well-known problem faced by web crawlers is the existence of a large fraction of distinct URLs that correspond to pages with duplicate or near-duplicate contents; an estimated 29% of web pages are duplicates. Such URLs, commonly named DUST, represent an important problem for search engines. Early efforts to deal with this problem focused on comparing document content to detect duplicates. To avoid fetching content at all, later methods learn normalization rules that transform all duplicate URLs into the same canonical form; the challenging aspect of this strategy is deriving a set of rules that is both general and precise. DUSTER is a new approach to detecting and eliminating redundant content: while crawling the web, it uses a multi-sequence alignment strategy to learn rewriting rules that transform a URL into another URL likely to have the same content. This alignment strategy can achieve reductions in the number of duplicate URLs up to 54% larger than previous methods.
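To make the idea concrete, a sketch of rule-based URL canonicalization; the rewrite rules below are invented examples, whereas DUSTER learns such rules via multi-sequence alignment:

```python
# Sketch: applying URL rewrite rules to collapse DUST before crawling.
import re

# Each rule rewrites a URL pattern into a canonical form (illustrative only).
RULES = [
    (re.compile(r"^http://"), "https://"),       # protocol canonicalization
    (re.compile(r"/index\.html?$"), "/"),        # default-document removal
    (re.compile(r"\?sessionid=[^&#]*$"), ""),    # drop a lone session parameter
]

def canonicalize(url):
    for pattern, repl in RULES:
        url = pattern.sub(repl, url)
    return url

def dedupe(urls):
    """Keep one representative URL per canonical form."""
    seen, out = set(), []
    for u in urls:
        c = canonicalize(u)
        if c not in seen:
            seen.add(c)
            out.append(u)
    return out
```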
Intelligent web crawling
Denis Shestakov, Aalto University
Slides for tutorial given at WI-IAT'13 in Atlanta, USA on November 20th, 2013
Outline:
- overview of web crawling;
- intelligent web crawling;
- open challenges
This document discusses web page classification using a rule-based system. It begins by introducing the problem of classifying the large amount of unstructured information on the web. It then discusses various approaches to web page classification, including text content-based categorization and link and content analysis. The document focuses on using a rule-based classifier to assign HTML documents to predefined categories by checking for occurrences of system rules in the HTML content. Finally, it discusses common machine learning algorithms that have been used for web page classification, such as association rule mining, naive Bayes, support vector machines, logistic regression, and decision trees. The goal of the proposed system is to enhance other web page classification systems by enabling online classification of web pages.
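A toy sketch of the rule-occurrence checking such a classifier performs; the rule table is invented for illustration:

```python
# Sketch: rule-based HTML classification by keyword-rule occurrences.
from bs4 import BeautifulSoup

RULES = {
    "sports": {"match", "league", "tournament"},
    "finance": {"stock", "market", "earnings"},
}

def classify(html):
    # Check which category's rules occur most often in the page text.
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    words = set(text.lower().split())
    scores = {cat: len(words & keys) for cat, keys in RULES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"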
Comparable Analysis of Web Mining Categories (theijes)
Web data mining is a current field of analysis that combines two research areas: data mining and the World Wide Web. Web data mining research draws on several disciplines, such as databases, artificial intelligence and information retrieval. Mining techniques are categorized into web content mining, web structure mining and web usage mining. In this work, these mining techniques are analyzed. The analysis concludes that web content mining deals with an unstructured or semi-structured view of data, web structure mining deals with link structure, and web usage mining mainly concerns user interaction.
Web content mining mines data from web pages including text, images, audio, video, metadata and hyperlinks. It examines the content of web pages and search results to extract useful information. Web content mining helps understand customer behavior, evaluate website performance, and boost business through research. It can classify data into structured, unstructured, semi-structured and multimedia types and applies techniques such as information extraction, topic tracking, summarization, categorization and clustering to analyze the data.
Various FAIR criteria pertaining to machine interaction with scholarly artifacts can commonly be addressed by means of repository-wide affordances that are uniformly provided for all hosted artifacts rather than through artifact-specific interventions. If various repository platforms provide such affordances in an interoperable manner, devising tools - for both human and machine use - that leverage them becomes easier.
My involvement, over the years, in a range of interoperability efforts has brought the insight that two factors strongly influence adoption: addressing a burning issue and delivering a KISS solution to tackle it. Undoubtedly, FAIR and FAIR DOs are burning issues. FAIR Signposting <https://signposting.org/FAIR/> is an ad-hoc repository interoperability effort that squarely fits in this problem space and that purposely specifies a KISS solution, hoping to inspire wide adoption.
The document discusses web content mining. It covers topics such as web content data structure including unstructured, semi-structured, and structured data. It also discusses techniques used for web content mining such as classification, clustering, and association. Finally, it provides examples of applications such as structured data extraction, sentiment analysis of reviews, and targeted advertising.
Boilerplate Removal and Content Extraction from Dynamic Web Pages (IJCSEA Journal)
Web pages contain not only main content but also other elements such as navigation panels, advertisements and links to related documents. To ensure high quality, a good boilerplate-removal algorithm is needed to extract only the relevant contents from a web page. The goal of content extraction, or boilerplate detection, is to separate the main content from navigation chrome, advertising blocks and copyright notices. The system presented here removes boilerplate and extracts the main content in two phases, a feature extraction phase and a clustering phase, classifying each part of an HTML page as noise or content. The content extraction algorithm achieves high performance without parsing DOM trees. Observing that a single HTML line may not contain a complete piece of information and that long texts are distributed across nearby lines, the system uses a line-block concept to measure the distance between any two neighboring text lines, and extracts features such as text-to-tag ratio (TTR), anchor-text-to-text ratio (ATTR) and a new content feature, title keywords density (TKD), to distinguish noise from content. After feature extraction, the system feeds these features into a threshold method to classify each block as content or non-content.
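A minimal sketch of the text-to-tag ratio (TTR) feature with a threshold cut, one of the three features the system uses; the threshold value is an assumption:

```python
# Sketch: per-line text-to-tag ratio (TTR) with a threshold classification.
import re

TAG = re.compile(r"<[^>]+>")

def ttr(line):
    """Ratio of visible text length to tag count on one HTML source line."""
    tags = len(TAG.findall(line))
    text = len(TAG.sub("", line).strip())
    return text / max(tags, 1)

def content_lines(html, threshold=10.0):
    """Keep lines whose text clearly outweighs their markup."""
    return [ln for ln in html.splitlines() if ttr(ln) >= threshold]
```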
No other medium has taken such a meaningful place in our lives in so short a time as the world's largest data network, the World Wide Web. However, when searching for information on this network, the user is exposed to an ever-growing flood of information, which is a blessing and a curse at the same time. The explosive growth and popularity of the World Wide Web has produced a huge number of information sources on the Internet. As websites grow more complicated, building web information extraction systems becomes more difficult and time-consuming, so scalable automatic Web Information Extraction (WIE) is in high demand. Information extraction from the World Wide Web operates at four levels: free-text level, record level, page level and site level. In this paper, the target extraction task is record-level extraction. Nwe Nwe Hlaing | Thi Thi Soe Nyunt | Myat Thet Nyo, "The Data Records Extraction from Web Pages", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-5, August 2019. URL: https://www.ijtsrd.com/papers/ijtsrd28010.pdf Paper URL: https://www.ijtsrd.com/computer-science/world-wide-web/28010/the-data-records-extraction-from-web-pages/nwe-nwe-hlaing
A language independent web data extraction using vision based page segmentati... (eSAT Journals)
Abstract: Web usage mining is the process of extracting useful information from server logs, i.e. users' history, to find out what users are looking for on the Internet. Some users may be looking only at textual data, whereas others may be interested in multimedia data. One could retrieve data by copying and pasting it into a relevant document, but this is tedious and time-consuming, and difficult when there is plenty of data to retrieve. Extracting structured data from a web page is a challenging problem due to complicated page structures. Earlier approaches depended on the web page's programming language: their main task was to analyze the HTML source code, including scripts such as JavaScript and cascading styles embedded in the HTML files, which makes it hard for such solutions to infer the regularity of a page's structure from tag structures alone. To overcome this problem we use the VIPS algorithm, which is language-independent: it primarily uses the visual features of a web page to perform data extraction. Keywords: web mining, web data extraction.
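Visual segmentation needs a renderer, so any sketch must assume the visual blocks are already available; the scoring below over assumed (x, y, width, height, text) tuples only illustrates the kind of layout features VIPS-style methods exploit:

```python
# Sketch: rank pre-extracted visual blocks by layout features (assumed inputs).
def block_score(block, page_width=1280, page_height=2000):
    x, y, w, h, text = block
    area = (w * h) / float(page_width * page_height)   # bigger blocks score higher
    # Blocks horizontally centered on the page are favored.
    centrality = 1.0 - abs((x + w / 2) - page_width / 2) / (page_width / 2)
    top_penalty = 0.5 if y < 100 else 1.0              # headers/menus sit near the top
    return area * centrality * top_penalty * min(len(text) / 500.0, 1.0)

def main_block(blocks):
    """Pick the block most likely to hold the main content."""
    return max(blocks, key=block_score)
```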
A Novel Data Extraction and Alignment Method for Web Databases (IJMER)
International Journal of Modern Engineering Research (IJMER) is a peer-reviewed, online journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment…. And many more.
Nowadays, the explosive growth of the World Wide Web generates a tremendous amount of web data, and web data mining has consequently become an important technique for discovering useful information and knowledge. Web mining is a vivid research area closely related to Information Extraction (IE). Automatic content extraction from web pages is a challenging yet significant problem in information retrieval and data mining. Web content mining refers to the discovery of useful information from web content such as text, images and videos; web content extraction organizes data instances into groups whose members are similar in some way, helping the user select topics of interest, and the technology is also useful in management information systems. This paper surveys web content extraction techniques. Aye Pwint Phyu | Khaing Khaing Wai, "Study on Web Content Extraction Techniques", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-5, August 2019. URL: https://www.ijtsrd.com/papers/ijtsrd27931.pdf Paper URL: https://www.ijtsrd.com/computer-science/data-miining/27931/study-on-web-content-extraction-techniques/aye-pwint-phyu
This document presents a heuristic approach for extracting web content using tag trees. The approach first parses the web page into a tag tree using an HTML parser. Objects and separators are then extracted from the nested tag tree using heuristics such as pattern repeating, standard deviation, and sibling-tag heuristics. The main content is then identified using the extracted separators. Experimental results showed the technique outperformed existing methods by extracting only the content relevant to the user's query.
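A sketch of the pattern-repeating heuristic on sibling tags; the signature definition and repeat threshold are assumptions:

```python
# Sketch: a parent whose children share one tag signature is a likely data region.
from collections import Counter
from bs4 import BeautifulSoup

def child_signature(node):
    # A child's signature: its own tag plus the tags of its immediate children.
    return (node.name,) + tuple(c.name for c in node.find_all(recursive=False))

def repeating_regions(html, min_repeats=3):
    soup = BeautifulSoup(html, "html.parser")
    regions = []
    for parent in soup.find_all(True):
        children = parent.find_all(recursive=False)
        if len(children) >= min_repeats:
            sigs = Counter(child_signature(c) for c in children)
            _, count = sigs.most_common(1)[0]
            if count >= min_repeats:
                regions.append(parent)
    return regions
```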
Annotation for query result records based on domain specific ontology (ijnlc)
The World Wide Web is enriched with a large collection of data, scattered across deep web databases and web pages in unstructured or semi-structured formats. Recently evolved, customer-friendly web applications need special data extraction mechanisms to draw the required data out of the deep web according to the end user's query and populate the output page dynamically, as fast as possible. Existing web data extraction methods are based on supervised learning (wrapper induction), and in the past few years researchers have turned to automatic web data extraction methods based on similarity measures. Among automatic methods, our existing Combining Tag and Value Similarity method fails to identify attributes in the query result table. We therefore propose a novel approach for data extraction and label assignment called Annotation for Query Result Records based on domain-specific ontology. First, a domain ontology is constructed using information from the query interface and query result pages obtained from the web. Next, using this domain ontology, a meaningful label is automatically assigned to each column of the extracted query result records.
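A toy sketch of the label-assignment step, matching column values against an ontology vocabulary; the two-attribute ontology is invented for illustration:

```python
# Sketch: assign ontology labels to result-table columns by value overlap.
ONTOLOGY = {
    "author": {"smith", "jones", "lee"},
    "format": {"hardcover", "paperback", "ebook"},
}

def label_column(values):
    values = {v.strip().lower() for v in values}
    best, best_overlap = "unknown", 0
    for attr, vocab in ONTOLOGY.items():
        overlap = len(values & vocab)
        if overlap > best_overlap:
            best, best_overlap = attr, overlap
    return best

def annotate(columns):
    """columns: list of value lists, one per extracted column."""
    return [label_column(col) for col in columns]
```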
Vision Based Deep Web data Extraction on Nested Query Result Records (IJMER)
This document summarizes a research paper on vision-based deep web data extraction from nested query result records. It proposes a technique to extract data from web pages using different font styles, sizes, and cascading style sheets. The extracted data is then aligned into a table using alignment algorithms, including pair-wise, holistic, and nested-structure alignment. The goal is to remove immaterial information from query result pages to facilitate analysis of the extracted data.
The document discusses using machine learning models for web content mining and news article classification. Specifically, it proposes using a support vector machine (SVM) model with features extracted from the document object model (DOM) tree to classify news articles into categories like title, date, body text, and noise. The SVM model is trained on a manually labeled dataset and can handle the nonlinear and complex patterns in the data better than rule-based models. The preprocessing step prunes noisy leaf nodes from the DOM tree before feature extraction and model training are performed to classify the remaining leaf nodes.
IRJET: SVM-based Web Content Mining with Leaf Classification Unit from DOM-Tree (IRJET Journal)
This document discusses using machine learning models and DOM tree analysis to extract important content from news articles for the purpose of topic detection. Specifically, it proposes using a support vector machine (SVM) model with "leaf classification units" from the DOM tree to remove noise data like images, ads, and recommended articles. This approach is meant to generalize to different article structures compared to rule-based models. The document reviews related work using DOM trees and statistical data for web content extraction and visual wrappers. It also discusses using various kernel functions in SVMs for non-linearly separable data.
This document discusses extracting main content from deep web pages that contain multiple data regions. It proposes a hybrid approach with two steps: 1) Using visual features to identify the different data regions in the DOM tree. 2) Independently mining positive data records and data items from each data region using vision-based page segmentation. Related work on single-region deep web page extraction is also reviewed. The technique aims to automatically extract information from complex pages containing multiple, independent data listings.
IRJET: Noisy Content Detection on Web Data using Machine Learning (IRJET Journal)
The document presents a study on detecting noisy content on web data using machine learning techniques. It aims to identify and filter out advertisements, irrelevant data and other noisy content when users retrieve information from websites. The study uses algorithms like support vector machines, artificial neural networks, decision trees and k-nearest neighbors to classify web content as noisy or not. A proposed methodology involves training models using these machine learning algorithms and evaluating their performance for noisy content detection on web data.
This document proposes a rule-based system for classifying web pages. It discusses challenges with classifying unstructured web content and outlines various machine learning approaches used for web page classification, including association rule mining, naive Bayes, support vector machines, logistic regression, and decision trees. The proposed system uses a rule-based classifier to assign HTML documents to predefined categories based on checking for rule occurrences in the document content. The system is designed to improve web page classification by enabling online classification of pages.
Business Intelligence Solution Using Search Engine (ankur881120)
The document describes a business intelligence solution that uses a search engine to index and search web pages. It discusses using crawlers to index web pages and store them in a repository. An indexer then generates an inverted index from the repository to support keyword searches. The system architecture includes the repository, indexer, and search functionality. It also describes the database structure used to store crawled URLs, the index, and search results. The project aims to build a basic search engine to demonstrate the proposed business intelligence solution.
This document discusses web mining and divides it into three categories: web content mining, web structure mining, and web usage mining. Web content mining examines the actual content of web pages and can utilize techniques like keyword searching, classification, clustering, and natural language processing. Web structure mining analyzes the hyperlink structure between pages. Web usage mining examines log files that record how users interact with and move between websites. The document provides examples of how these different types of web mining can be applied, such as for targeted advertising.
1) The document provides an introduction to HTML, HTML5, Web 2.0, Web 3.0 and related technologies. It discusses the history and evolution of these technologies over time. 2) Key topics covered include the basic structure of an HTML document, common HTML tags like <head>, <body>, <header>, <footer>, and the features introduced in HTML5 like audio, video, and canvas. 3) The role of organizations like W3C and WHATWG in developing web standards is also summarized.
This document provides a survey of web clustering engines. It discusses how web clustering engines organize search results by topic to complement conventional search engines, which return a flat list of ranked results. The document outlines the key stages in developing a web clustering engine, including acquiring search results, preprocessing, clustering, and visualization. It also reviews several existing commercial and open source web clustering systems and discusses evaluating the retrieval performance of these systems.
IRJET: Multi-Stage Smart Deep Web Crawling Systems: A Review (IRJET Journal)
This document summarizes research on multi-stage smart deep web crawling systems. It discusses challenges in efficiently locating deep web interfaces due to their large numbers and dynamic nature. It proposes a three-stage crawling framework to address these challenges. The first stage performs site-based searching to prioritize relevant sites. The second stage explores sites to efficiently search for forms. An adaptive learning algorithm selects features and constructs link rankers to prioritize relevant links for fast searching. Evaluation on real web data showed the framework achieves substantially higher harvest rates than existing approaches.
ISOLATING INFORMATIVE BLOCKS FROM LARGE WEB PAGES USING HTML TAG PRIORITY ASSIGNMENT BASED APPROACH
Electrical & Computer Engineering: An International Journal (ECIJ), Volume 4, Number 3, September 2015
DOI: 10.14810/ecij.2015.4305
Rasel Kabir, Shaily Kabir, and Shamiul Amin
Department of Computer Science & Engineering, University of Dhaka, Bangladesh
ABSTRACT
Searching useful information from the web, a popular activity, often involves huge irrelevant contents or
noises leading to difficulties in extracting useful information. Indeed, search engines, crawlers and
information agents may often fail to separate relevant information from noises indicating significance of
efficient search results. Earlier, some research works locate noisy data only at the edges of the web page;
while others prefer to consider the whole page for noisy data detection. In our paper, we propose a simple
priority-assignment based approach with a view to differentiating main contents of the page from the
noises. In our proposed technique, we first make partition of the whole page into a number of disjoint
blocks using HTML tag based technique. Next, we determine a priority level for each block based on
HTML tags priority while considering aggregate priority calculation. This assignment process gives a
priority value to each block which helps rank the overall search results in online searching. In our work,
the blocks with higher priority are termed as informative blocks and preserved in database for future use,
whereas lower priority blocks are considered as noisy blocks and are not used for further data searching
operation. Our experimental results show considerable improvement in noisy block elimination and in
online page ranking with limited searching time as compared to other known approaches. Moreover, the
obtained accuracy from our approach by applying the Naive Bayes text classification method is about 90
percent, quite high as compared to others.
KEYWORDS
Web Page Cleaning, Informative Block, Noise, HTML Tag Priority
1. INTRODUCTION
Data mining has attracted a great deal of attention in recent years, particularly in research, industry, and media. It has gained importance as it is required for extracting useful information from the wide availability of huge amounts of data. In particular, the discovery and analysis of useful information or content from the World Wide Web has become an everyday necessity as raw data is constantly exploding. However, the presence of various irrelevant information in web pages, such as banner advertisements, navigation bars, and copyright notices, may hamper the performance of search engines as well as web mining. Generally, the irrelevant contents in the pages are termed web noises and are classified as global noise and local noise. While global noise includes mirror sites and identical pages with different URLs, local noise refers to intra-page redundancy (i.e., the irrelevant contents within a page). Noise may mislead the search engine to index redundant contents and retrieve irrelevant results. Moreover, this may hamper the performance of web mining, as it extracts patterns based on the whole page content instead of the informative content, which in turn negatively affects search results and time. Therefore, removal of such noises is critical for receiving more effective and accurate web mining results.
In this paper, we concentrate on the discovery and removal of local noises from web pages. We propose a simple algorithm based on HTML tag priority assignment for isolating the main page contents from the noises. In our approach, we first partition the whole page into a number of disjoint blocks by using the HTML tag tree. For each block, we determine a priority level based on the HTML tags' priority. Here, we apply an aggregate tag priority calculation, which assigns a priority value to each block and helps rank the overall search results in online searching. In our work, the blocks with higher priority are termed informative blocks and preserved in a database for data searching operations, whereas lower-priority blocks are considered noisy blocks and are not used further. Our experimental results show considerable improvement in noisy block elimination and in online page ranking with limited searching time as compared to the known approach Elimination of Noisy Information from Web Pages (ENIW) introduced in [1].
The rest of this paper is organized as follows. Section 2 reviews previous works on the detection and cleaning of noisy blocks from web pages and points out weaknesses in the existing literature. In Section 3, we introduce our proposed priority assignment based algorithm for noisy block removal. Section 4 presents results from various experiments, and Section 5 concludes the paper together with future work.
2. LITERATURE REVIEW
A significant amount of research has been conducted on isolating informative blocks from the noises. Several papers preferred to employ the Document Object Model (DOM) tree for block identification. Among them, Cai et al. [3] proposed a vision-based page segmentation (VIPS) algorithm that fragments the web page into a number of blocks by using the DOM tree with a combination of human visual cues including tag cue, color cue, size cue, and others. Based on visual perception, it simulates how a user understands the web layout structure. Oza and Mishra [1] proposed a method which extracts the blocks from the page by using the DOM tree. They considered all the tags of a page as a tree structure. They first removed the tags with no text and also discarded the tags unrelated to the page content, such as <script>, <style>, <form>, <marquee> and <meta>. They then built the DOM tree structure with the remaining tags. Finally, they identified the location of content from the tree and extracted the content with an HTML parser. Nevertheless, it is important to mention that the tags <marquee> and <meta> do not always contain noisy information; moreover, their method provides no priority value by which the online search results can be ranked. They also did not consider the ratio of inner to outer hyperlinks to detect noisy data. Li and Ezeife [2] proposed a system, WebPageCleaner, for eliminating the noise blocks from web pages. They applied the VIPS algorithm [3] for block extraction. By analysing different block features, such as the block location, the percentage of hyperlinks in the block, and the level of similarity of block contents to others, they identified the relevant page blocks. Important blocks were kept to be used for web content mining using Naive Bayes text classification [7]. However, locating blocks based on the edge of the page is not always correct, as at present many article-based websites place the advertisement panel at different template locations. Furthermore, Yi and Liu [6] introduced a feature weighting technique to deal with page noise. They first constructed a compressed structure tree for getting the common structure and comparable blocks in a set of pages. Next, the importance of each node in the compressed tree is estimated by employing an information based measure. Lastly, they assigned a weight to each word feature in the content block based on the tree and its node importance values. The resulting weights are used for noise detection.
Besides, many researchers used HTML tag information for noise removal from the page. Lin and Ho [4] proposed a method based on the HTML tag <TABLE> to partition a page into a number of content blocks. They calculated the entropy value of each content block by measuring the entropy values of the features (terms) present in the block. They dynamically selected the entropy threshold for partitioning the blocks into informative or redundant. Instead, Bar-Yossef et al. [5] proposed a template detection algorithm based on the number of links in an HTML element. This algorithm partitions each page into several blocks, called page-lets, where each page-let is a unit with a well-defined topic or functionality. Templates are then detected by identifying duplicate page-lets. However, duplicity of page-lets may mislead the overall work.
3. PROPOSED WORK
Our proposed approach for isolating informative blocks from a large web page is mainly based on the HTML tag tree. The tag tree is applied to segment the whole page into a number of non-overlapping blocks. Here, a priority value is assigned to each block using different HTML tags at block identification time. This priority is later useful for selecting the informative blocks and rejecting the noisy blocks.
3.1. Priority Assignment Based Approach (PAB)
Our proposed priority assignment based (PAB) approach takes a number of web pages as input
and returns a set of informative blocks for each page. The proposed approach consists of three
major steps - Block Extraction, Assignment of Block Priority, and Isolation of Informative
Blocks.
3.1.1 Block Extraction
In this step, a web page is divided into multiple blocks by using the HTML tag tree [8]. To identify blocks, some important HTML tags are applied: Title (<title>), Heading (<h1> through <h6>), Paragraph (<p>), Bold (<b>), Strong (<strong>), Hyperlink (<a>), Italic (<i>), etc. It is noted that these are the tags referenced by Search Engine Optimizers (SEO) [9]. Besides them, some medium-level HTML tags are also utilized for detecting blocks; in our work, Table (<table>) and List (<li>) are considered the medium-level tags. Moreover, HTML tags such as <script>, <style>, <iframe>, <object>, and <form> are considered irrelevant tags, and all contents between these irrelevant tags are removed. This operation eliminates a large amount of unnecessary data, which we term noisy data. Figure 3.1 presents the segmentation of a page into a number of blocks using the HTML tags in our approach. In block identification, our approach first selects the block containing only the Title (<title>) tag of the page. Subsequently, a sequential search is performed from the Body (<body>) tag to choose the next available tag <T>; all contents before the closing tag </T> are then considered a single block, and all other tags encountered before </T> are considered inner tags. By following this technique all the blocks of the page are identified. It is important to mention that if no tag is found between <body> and </body>, then one single block is considered for the whole page.
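As an illustration, the following is a minimal Python sketch of this block extraction step (our implementation is in PHP; the simplified page model here, in which each top-level tag under <body> opens a block, and the names BlockExtractor, IRRELEVANT, and VOID are assumptions for illustration only):

# Illustrative sketch only: top-level elements under <body> become blocks,
# and content inside the irrelevant tags is discarded, as described above.
# The <title> block, which our approach handles separately, is ignored here.
from html.parser import HTMLParser

IRRELEVANT = {"script", "style", "iframe", "object", "form"}  # content is noise
VOID = {"img", "br", "hr", "input", "meta", "link"}           # no closing tag

class BlockExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []      # one dict of tags and text fragments per block
        self.depth = 0        # nesting depth inside the current block
        self.in_body = False
        self.skip = 0         # > 0 while inside an irrelevant tag

    def handle_starttag(self, tag, attrs):
        if not self.in_body:
            self.in_body = (tag == "body")
            return
        if tag in IRRELEVANT:
            self.skip += 1
            return
        if self.skip:
            return
        if self.depth == 0:   # the next available tag <T> opens a new block
            self.blocks.append({"tags": [], "text": []})
        self.blocks[-1]["tags"].append(tag)
        if tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in IRRELEVANT:
            self.skip = max(0, self.skip - 1)
            return
        if self.in_body and not self.skip and self.depth and tag not in VOID:
            self.depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip and self.depth and data.strip():
            self.blocks[-1]["text"].append(data.strip())

page = "<html><body><h1>News</h1><p>Main <b>content</b></p><script>ads()</script></body></html>"
extractor = BlockExtractor()
extractor.feed(page)
print(extractor.blocks)
# [{'tags': ['h1'], 'text': ['News']}, {'tags': ['p', 'b'], 'text': ['Main', 'content']}]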
4. Electrical & Computer Engineering: An International Journal (ECIJ) Volume 4, Number 3, September 2015
64
3.1.2 Assignment of Block Priority
The extracted blocks are given a priority value in this step. For the priority calculation, our approach uses an HTML tags priority table (PT) for the blocks, which is based on SEO [10]. In the PT, a category label and a priority value are heuristically assigned to the important tags. Table 3.1 is used for measuring the priority values of the blocks. Generally, a title matching the user search query implies that the page contents meet the user requirement. Therefore, our approach automatically gives a priority of 1.0 to the title block. For the other blocks, a priority value is assigned by accumulating the priorities of all inner and outer tags. Eq.(1) is used for the block priority aggregation. For any block bi = {t1, t2, ..., tn},

P(bi) = Σ_{j=1..n} P(tj)    Eq.(1)

where tj represents the j-th tag contained in bi and P(tj) is its priority from the PT. Moreover, the proposed method gives a priority of 0.0 to all blocks having more than 50% outer hyperlinks.
3.1.3 Isolation of Informative Blocks
Our PAB approach removes all blocks having the priority value 0.0 from the block list of each page. These blocks are termed the noisy blocks. The remaining blocks are kept as database records for use in online data searching and data mining operations. These blocks are termed the informative blocks.
Figure 3.1 depicts a number of page blocks marked by using the HTML tag tree. The priority of each block can be measured by applying the tag priorities given in Table 3.1. For Block-1, containing a Header (<h1>) tag, a priority value of 1.0 is assigned. Block-2 contains 10 hyperlinks; though this block starts with the Bold (<b>) tag, all hyperlinks belong to other websites. In other words, the total number of hyperlinks is 10 and the total number of outer links is 10, so the block contains 100% outer links. Therefore, Block-2 is directly set to a priority of 0.0. Again, Block-3 contains inner tags: it starts with <p>, followed by two inner <b> tags. Thus, the priority value of Block-3 is 0.1 + 0.4 + 0.4 = 0.9, where 0.1 is for <p> and 0.4 is for each <b>. In a similar manner, the priority value of Block-4 is set to 0.7. Table 3.2 presents the priorities of all the blocks identified in Figure 3.1.
Table 3.1. HTML Tags Priority Table (PT)

SL | HTML Tags                           | Category Level | Priority Value
1  | <TITLE>, <H1>                       | Cat-1.0        | 1.0
2  | <H2>                                | Cat-0.9        | 0.9
3  | <H3>                                | Cat-0.8        | 0.8
4  | <H4>                                | Cat-0.7        | 0.7
5  | <H5>                                | Cat-0.6        | 0.6
6  | <H6>                                | Cat-0.5        | 0.5
7  | Bold (<b>), Strong (<strong>)       | Cat-0.4        | 0.4
8  | Image with alternative text (<img>) | Cat-0.3        | 0.3
9  | Hyperlink (<a>), Italic (<i>)       | Cat-0.2        | 0.2
10 | Paragraph (<p>)                     | Cat-0.1        | 0.1
11 | Reserved for Noisy Blocks           | Cat-0.0        | 0.0
Figure 3.1. A Sample Large Web Page with Multiple Blocks and Advertisement
Table 3.2. Block Priority Table

SL | Block Number | Tags within Block              | Priority Value
1  | Block-1      | <H1>                           | 1.0
2  | Block-2      | <b>, <a>, but 100% outer links | 0.0
3  | Block-3      | <p>, <b>, <b>                  | 0.1+0.4+0.4 = 0.9
4  | Block-4      | <p>, <i>, <b>                  | 0.1+0.2+0.4 = 0.7
5  | Block-5      | <p>, <b>                       | 0.1+0.4 = 0.5
6  | Block-6      | <p>                            | 0.1
7  | Block-7      | <p>, <b>, <a>, 0% outer links  | 0.1+0.2+0.2×4 = 1.1
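The following minimal Python sketch (continuing the illustrative sketch above; the priority values are those of Table 3.1, and the helper name block_priority is an assumption) shows how Eq.(1) and the 50%-outer-link rule combine to reproduce, for example, the Block-2 and Block-3 values of Table 3.2:

# Illustrative sketch of Eq.(1) with the Table 3.1 priorities and the
# 50%-outer-link rule; tag lists per block come from the extraction step.
PT = {"title": 1.0, "h1": 1.0, "h2": 0.9, "h3": 0.8, "h4": 0.7, "h5": 0.6,
      "h6": 0.5, "b": 0.4, "strong": 0.4, "img": 0.3, "a": 0.2, "i": 0.2,
      "p": 0.1}

def block_priority(tags, total_links=0, outer_links=0):
    # Blocks with more than 50% outer hyperlinks are noisy (priority 0.0).
    if total_links and outer_links / total_links > 0.5:
        return 0.0
    return round(sum(PT.get(t, 0.0) for t in tags), 2)

print(block_priority(["p", "b", "b"]))                              # Block-3: 0.9
print(block_priority(["b", "a"], total_links=10, outer_links=10))   # Block-2: 0.0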
3.2. Proposed PAB Algorithm
Our proposed PAB algorithm has two phases. The first phase, SetPriorityLevel( ), takes a set of web pages as input. Using the HTML tag tree, it divides each page into a number of blocks. It also performs the first-phase noisy data elimination by removing all unnecessary contents between the irrelevant tags. The second phase, DetectSecondNoisyBlock( ), identifies the second-phase noisy blocks based on the priority value. For the priority measure, it aggregates all inner and outer HTML tag priorities. Note that the complexity of the PAB algorithm is O(p*b*t), where p is the number of pages, b is the number of total blocks per page, and t is the total number of HTML tags per block.
Algorithm 3.2.1: SetPriorityLevel( )
Input: A set of web pages
Output: An informative block database for the web page
Begin
1. Apply HTML tag tree to the pages and extract block features.
2. For each web page, do steps a, b and c
a. Remove all contents between the tags <script>, <style>, <iframe>, <object> and <form>.
b. Call DetectSecondNoisyBlock( ) to detect and remove the second phase noisy blocks.
c. Store all informative blocks with higher priority in a block database.
3. Return an informative block database for the pages.
End
Algorithm 3.2.2: DetectSecondNoisyBlock( )
Input: A set of blocks for a page and a priority table (PT).
Output: A set of informative blocks
Begin
1. Check the title block and assign a priority value of 1.0 by using PT.
2. Find all inner and outer tags of each block and calculate the priority value for the block by
aggregating tags' priority.
3. For all hyperlinks in a block, calculate the ratio of inner and outer links (in %). If the outer
links are more than 50% of the total links, then the block is considered a noisy block and its
priority value is set to 0.0.
4. All blocks with priority value greater than 0.0 are considered as informative blocks.
5. Merge all duplicate informative blocks.
End
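The listings above can be read as the following end-to-end Python sketch, reusing the BlockExtractor and block_priority helpers sketched earlier; outer-link counting and duplicate-block merging (steps 3 and 5 of DetectSecondNoisyBlock( )) are omitted for brevity, so this is an approximation rather than our PHP implementation:

# Illustrative two-phase pipeline: phase 1 extracts blocks and removes
# contents of irrelevant tags; phase 2 scores blocks via Eq.(1) and keeps
# only the informative ones in the block database.
def pab_pipeline(pages):
    database = []                                  # informative block database
    for html in pages:
        extractor = BlockExtractor()               # phase 1: block extraction
        extractor.feed(html)
        for block in extractor.blocks:
            p = block_priority(block["tags"])      # phase 2: Eq.(1) aggregation
            if p > 0.0:                            # keep only informative blocks
                database.append({"priority": p, "text": " ".join(block["text"])})
    return database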
4. EXPERIMENTAL RESULTS
All experiments were performed on an AMD FX(tm)-4100 quad-core processor capable of processing up to 3.60 GHz. The system also had 6 GB of RAM and ran the Windows 7 64-bit operating system. NetBeans IDE 7.3.1 was used as the tool for implementing both the proposed PAB and the existing ENIW algorithm; the two algorithms were implemented in PHP.
For the experiments, we used 500 distinct web pages downloaded from the commercial article web portal http://www.ezinearticles.com. The collected pages were categorized into 3 major groups: Arts and Entertainment, Computers and Technology, and Food and Drink. Details of the distribution of the pages are shown in Table 4.1.
Table 4.1. Collection of web pages used in the experiment

Website               | Arts and Entertainment | Computers and Technology | Food and Drink | Total Web Pages (Experimented)
www.ezinearticles.com | 200                    | 200                      | 100            | 500
We evaluated the performance of PAB with respect to the following parameters: (1) number of extracted informative blocks, (2) total size of the informative block database, (3) accuracy measure based on the Naive Bayes text classification method, (4) standard error (SE), (5) online search results based on a user search query and search time in seconds, and (6) average searching time in seconds.
4.1. Number of Extracted Informative Blocks
After applying our PAB approach for informative block extraction, we got a total of 120,237 blocks, of which 111,918 are informative and the remaining 8,319 are noisy. In the case of the ENIW approach, we got 113,624 informative and 6,613 noisy blocks. The comparison between PAB and ENIW based on the total number of blocks, extracted informative blocks, noisy blocks, and total time (in sec) for informative block extraction is given in Table 4.1.1.
Table 4.1.1. Number of blocks comparison between PAB and ENIW

Total Pages | Total Blocks | Approach | Total Extracted Informative Blocks | Total Eliminated Noisy Blocks | Total Time for Block Extraction (in sec)
500         | 120,237      | PAB      | 111,918                            | 8,319                         | 89.56
            |              | ENIW     | 113,624                            | 6,613                         | 95.58
4.2. Total Size of Informative Block Database
Database size is one of the most important factors for an online search engine. A large database not only involves huge storage space but also needs longer searching time. As the user wants to see the search results as quickly as possible based on their search query, a large database directly affects the performance of the search engine in a negative way. For the experiments, we used 500 articles from http://www.ezinearticles.com. After applying our PAB approach, we got 111,918 informative blocks occupying 19.4 MB of database space, whereas for ENIW we got 113,624 informative blocks occupying 22.6 MB. Figure 4.2.1 shows the comparison of database size between PAB and ENIW.
Figure 4.2.1. Database size comparison between PAB and ENIW
4.3. Accuracy Measure based on Naive Bayes Text Classification Method
To compare the accuracy of PAB and ENIW, we applied the Naive Bayes text classification method [7]. This text classifier, a probabilistic learning method, is able to detect the category of a hidden test set depending on training sets from various categories. Here, the accuracy is calculated by using the following equation:

Accuracy = (Number of correctly categorised test pages / Total number of test pages) × 100%    Eq.(2)
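For illustration, the following sketch evaluates this accuracy with a multinomial Naive Bayes text classifier; the method is the one cited in [7], but no library is named in this paper, so the use of scikit-learn and the toy training pages below are assumptions:

# Illustrative only: toy stand-ins for the three page categories of Table 4.1.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train_texts = ["guitar concert review", "cpu benchmark guide", "pasta recipe tips"]
train_labels = ["Arts", "Technology", "Food"]
test_texts, test_labels = ["new laptop processor guide"], ["Technology"]

vectorizer = CountVectorizer()
classifier = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)
predicted = classifier.predict(vectorizer.transform(test_texts))
print(accuracy_score(test_labels, predicted) * 100)  # Eq.(2): percent correct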
From a total of 500 pages, we randomly generated 7 different combinations of training and test datasets in order to ensure that the accuracy of category identification is not sensitive to a particular partition of the datasets. The random combinations are given in Table 4.3.1. Besides, the percentage accuracy using the Naive Bayes text classification method for the PAB and ENIW approaches is shown in Table 4.3.2.
Table 4.3.1. Random combinations of the training and test datasets

Case | Training Pages | Test Pages
1    | 490            | 10
2    | 480            | 20
3    | 470            | 30
4    | 460            | 40
5    | 450            | 50
6    | 440            | 60
7    | 430            | 70
Table 4.3.2. Percentage of accuracy based on the Naive Bayes text classification method

Case | Test Pages | ENIW  | PAB
1    | 10         | 80%   | 80%
2    | 20         | 85%   | 80%
3    | 30         | 83%   | 83%
4    | 40         | 85%   | 86%
5    | 50         | 86%   | 88%
6    | 60         | 83.3% | 88.3%
7    | 70         | 80%   | 90%
From Table 4.3.2, we can observe that for Case-1, the Naive Bayes text classification method correctly detects 8 web pages from 10 randomly selected hidden test pages for both ENIW and PAB. For Case-2, it correctly categorises 15 web pages for ENIW and 16 web pages for PAB out of 20 randomly selected hidden test pages. For Case-3, it correctly categorises 25 web pages for both ENIW and PAB out of 30. For Case-4, it correctly classifies 34 web pages for ENIW and 35 web pages for PAB out of 40. For Case-5, it correctly categorises 43 web pages for ENIW and 44 web pages for PAB out of 50. For Case-6, it correctly categorises 50 pages for ENIW and 53 pages for PAB out of 60. For Case-7, it correctly categorises 56 web pages for ENIW and 63 web pages for PAB out of 70. Figure 4.3.1 presents the percentage accuracy of PAB and ENIW based on the seven distinct test sets.
Figure 4.3.1. Average accuracy based on Naive Bayes Text Classification method
4.4. Standard Error
A low standard error (SE) indicates high accuracy. If the variance is denoted by σ² and n is the number of test cases, then the SE is given by:

SE = √(σ² / n) = σ / √n    Eq.(3)
By using Eq.(3), the comparison of SE between PAB and ENIW is shown in Table 4.4.1.
Table 4.4.1. Comparison of standard error between PAB and ENIW

Approach | Standard Error
PAB      | 0.1496
ENIW     | 0.20396
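As a sketch of Eq.(3): this paper does not state over which sample the variance σ² is taken, so the snippet below simply illustrates the formula on the seven PAB accuracies of Table 4.3.2 (as fractions) rather than reproducing the reported Table 4.4.1 values:

# Illustrative computation of Eq.(3) only; the sample choice is an assumption.
import statistics

pab_accuracies = [0.80, 0.80, 0.83, 0.86, 0.88, 0.883, 0.90]  # Table 4.3.2, PAB
sigma = statistics.pstdev(pab_accuracies)                     # population std. deviation
se = sigma / len(pab_accuracies) ** 0.5                       # Eq.(3): σ / √n
print(round(se, 4))                                           # ≈ 0.015 for these values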
4.5. Online Search Result based on User Search Query
We compared PAB with ENIW and also with real web page data, which we call raw data, with respect to the online search results. In PAB, we discarded more noisy data and, as a result, received a lower number of informative blocks as compared to ENIW, while the raw data is a mixture of noisy and informative blocks. For comparing the three approaches based on a user search query, we applied the search keyword "Computer".
Table 4.5.1. Comparison of online search results between PAB, ENIW, and Raw data for the search query "Computer"

Keyword  | Approach | Searching Time (in sec) | Total Page Links Found in Search Result
Computer | Raw data | 0.278                   | 140
         | ENIW     | 0.235                   | 137
         | PAB      | 0.012                   | 136
Table 4.5.1 presents the searching time and search results for Raw data, ENIW, and PAB. Here, the keyword density is used to sort the search results in the cases of Raw data and ENIW. In the case of PAB, however, we applied both the block priority and the keyword density to sort the search results. Eq.(4) shows the block ranking calculation of PAB:

Bp = Pb × Kd    Eq.(4)

where Bp is the block priority, Pb is the total tag priority value of the block, and Kd is the keyword density of that block. Though from Table 4.5.1 we can observe a similar number of page links in the search results, the searching times vary considerably. The searching time of our PAB is very small as compared to both the existing ENIW technique and the Raw data. Thus, for an online user, our proposed PAB algorithm makes it possible to get accurate results within a very short time from a large database.
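A toy ranking sketch under the Eq.(4) reconstruction above (score = total tag priority × keyword density); the block data and the helper name keyword_density are illustrative assumptions only:

# Illustrative ranking of blocks by Bp = Pb * Kd for a given search keyword.
def keyword_density(text, keyword):
    words = text.lower().split()
    return words.count(keyword.lower()) / len(words) if words else 0.0

blocks = [
    {"page": "page1", "priority": 0.9, "text": "computer science and computer jobs"},
    {"page": "page2", "priority": 0.5, "text": "slow cooking with low heat"},
]
ranked = sorted(blocks,
                key=lambda b: b["priority"] * keyword_density(b["text"], "Computer"),
                reverse=True)                      # Eq.(4) score per block
print([b["page"] for b in ranked])                 # ['page1', 'page2']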
4.6. Average Searching Time in Second
We performed searches based on 50 randomly selected keywords available in our search database. We found that the average searching time of PAB is significantly smaller as compared to both ENIW and the Raw data. Figure 4.6.1 shows the average searching time comparison among PAB, ENIW, and Raw data.
Figure 4.6.1 Average searching time (in second) comparison among PAB, ENIW and Raw data
5. CONCLUSION AND FUTURE WORKS
Extraction of useful information and removal of noisy data are well-known problems in web content mining. Despite much research, weaknesses remain in existing algorithms for getting rid of noisy data. This paper has aimed to provide an improved solution to this end. The proposed PAB method removes noisy blocks with high accuracy when searching and sorts blocks according to their URL ranking. Our PAB algorithm scans the whole page in two passes: in the first pass, it partitions the full page into multiple blocks; later, it removes as noisy the blocks that fall under the pre-defined HTML tags. Experimental results show considerable improvement in noisy block elimination and in online page ranking with limited searching time from our PAB algorithm as compared to other known approaches. In addition, the accuracy obtained from our PAB by applying the Naive Bayes text classification method is about 90 percent, quite high as compared to others. Our future research aims to go further: we intend to group similar informative blocks in a large web page by using a clustering method, which is likely to improve experimental results in online searching.
REFERENCES
[1] Alpa K. Oza and Shailendra Mishra. Elimination of noisy information from web pages. In International Journal of Recent Technology and Engineering, volume 2, pages 574-581, Piplani-BHEL, Bhopal, India, 2013. IJRTE.
[2] Jing Li and C. I. Ezeife. Cleaning web pages for effective web content mining. In Proceedings of the 17th International Conference on Database and Expert Systems Applications, DEXA'06, pages 560-571, Berlin, Heidelberg, 2006. Springer-Verlag.
[3] Deng Cai, Shipeng Yu, Ji-Rong Wen, and Wei-Ying Ma. VIPS: A vision-based page segmentation algorithm, 2003.
[4] Shian-Hua Lin and Jan-Ming Ho. Discovering informative content blocks from web documents. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'02, pages 588-593, New York, NY, USA, 2002. ACM.
[5] Ziv Bar-Yossef and Sridhar Rajagopalan. Template detection via data mining and its applications. In Proceedings of the 11th International Conference on World Wide Web, WWW'02, pages 580-591, New York, NY, USA, 2002. ACM.
[6] Lan Yi and Bing Liu. Web page cleaning for web mining through feature weighting. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, IJCAI'03, pages 43-48, San Francisco, CA, USA, 2003. Morgan Kaufmann Publishers Inc.
[7] Ashraf M. Kibriya, Eibe Frank, Bernhard Pfahringer, and Geoffrey Holmes. Multinomial naive Bayes for text categorization revisited. In Proceedings of the 17th Australian Joint Conference on Advances in Artificial Intelligence, AI'04, pages 488-499, Berlin, Heidelberg, 2004. Springer-Verlag.
[8] HTML tag tree. http://www.openbookproject.net/tutorials/getdown/css/lesson4.html
[9] List of main HTML tags. Online; accessed 25 March 2014. http://www.webseomasters.com/forum/index.php?showtopic=81
[10] HTML tag priority table. Online; accessed 25 March 2014. http://wiki.showitfast.com/SEO
AUTHORS
Rasel Kabir was born in Dhaka, Bangladesh in 1989. He received his B.Sc (2010) and M.Sc (2012) degrees in Computer Science & Engineering from the University of Dhaka. His research interests include data mining and search engine optimization. Currently, he is working as a software engineer at a software company.
Shaily Kabir received her B.Sc (Hons.) and M.S degrees in Computer Science and Engineering from the University of Dhaka, Bangladesh, and an M.Comp.Sc degree in Computer Science from Concordia University, Canada. Currently she is working as an Associate Professor in the Department of Computer Science and Engineering, University of Dhaka, Bangladesh. Her research interests include computer networks and network security, data and web mining, and database management systems.
Shamiul Amin was born in Dhaka, Bangladesh in 1989. He received his B.Sc (2010) and M.Sc (2012) degrees in Computer Science & Engineering from the University of Dhaka. His research interests include data mining and search engine ranking techniques. Currently, he is an Assistant Professor in the Department of CSE, Dhaka International University.