This document proposes a new system for extracting data from JavaScript web applications. It aims to address the challenge of extracting dynamic content that is generated through asynchronous JavaScript calls. The key aspects of the proposed system include using a headless browser to fetch pages and deal with dynamically generated DOM, extracting JSON data cached in pages rather than parsing DOM trees, and allowing users to define extraction rules and schedules through a configuration system. The preliminary results showed the system could successfully extract data from modern JavaScript web applications.
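The JSON-extraction idea above can be sketched in a few lines: rather than walking the rendered DOM, pull the state object that many JavaScript applications cache in a script tag. This is a minimal illustration, not the proposed system itself; the `window.__INITIAL_STATE__` variable name is a common convention and an assumption here.

```python
import json
import re

def extract_embedded_json(html: str, var_name: str = "__INITIAL_STATE__") -> dict:
    """Pull a JSON blob cached in a <script> tag instead of parsing the DOM tree."""
    pattern = re.compile(r"window\." + re.escape(var_name) + r"\s*=\s*(\{.*?\});", re.DOTALL)
    match = pattern.search(html)
    if match is None:
        raise ValueError(f"no embedded JSON found for window.{var_name}")
    return json.loads(match.group(1))

# A toy page whose app state is embedded as JSON, as modern SPAs often do
page = '<script>window.__INITIAL_STATE__ = {"items": [{"id": 1, "title": "Widget"}]};</script>'
data = extract_embedded_json(page)
```

In practice the HTML would first be fetched by a headless browser so the script tags are present; the extraction step itself stays this simple.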
In today's information-driven world, businesses increasingly operate electronically. With so much commerce now conducted on the World Wide Web (WWW), it is vital for website owners to provide an attractive platform that draws more customers to their sites. Presenting information effectively is key to bringing in more customers and users; the customer is the end user, whose access to that information ultimately benefits the site owner. In this paper we define web mining and present a method for applying it to understand user and website behaviour, which in turn improves the site's information to attract more users. The paper also surveys research on pattern extraction and web content mining and shows how they can act as a catalyst for e-business.
The document discusses several topics related to SharePoint 2010 enterprise content management including recommended approaches for data access, document sets, routing, managed metadata, taxonomy, performance and scalability, and digital asset management. It provides details on the REST interface, client side object model, and ASP.NET web services for data access. Document sets, routing, and managed metadata/taxonomy are described for organizing content. Considerations for performance, scalability, and digital asset management are also summarized.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nanotechnology & Science, Power Electronics, Electronics & Communication Engineering, Computational Mathematics, Image Processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design, etc.
This document provides an introduction to web structure mining and discusses two popular methods: HITS and PageRank. It begins with an overview of web mining categories including web content mining, web structure mining, and web usage mining. Web structure mining focuses on the hyperlink structure of the web and analyzes link relationships between pages. HITS and PageRank are two algorithms that have been proposed to handle potential correlations between linked pages and improve predictive accuracy.
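As an illustration of the link-analysis idea behind these methods, a textbook power-iteration PageRank over a small link graph might look like this (a sketch, not the original algorithm's implementation):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {page: [outlinks]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outlinks in links.items():
            if outlinks:
                # each page splits its damped rank among its outlinks
                share = damping * rank[p] / len(outlinks)
                for q in outlinks:
                    new[q] += share
            else:
                # dangling page: spread its rank uniformly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

Here C, which is linked from both A and B, ends up ranked above B, which only A links to.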
This document provides an overview of relevant approaches for accessing open data programmatically and data-as-a-service (DaaS) solutions. It discusses common data access methods like web APIs, OData, and SPARQL and describes several DaaS platforms that simplify publishing and consuming open data. It also outlines requirements for a proposed open DaaS platform called DaPaaS that aims to address challenges in open data management and application development.
This document provides background information on web analytics and summarizes the materials used in a study comparing hosted and server-side web analytics solutions. The study examines Google Analytics, a hosted solution, and AWStats Analytics, a server-side solution, to analyze metrics for the metkaweb.fi website over three months. Key metrics like unique visits, page views, and time on site are compared between the two tools to evaluate their results and insights. The case study aims to identify best practices for selecting analytics tools based on target metrics and capabilities.
Web Mining Research Issues and Future Directions – A Survey (IOSR Journals)
This document summarizes research on web mining techniques. It begins with an abstract describing how web mining aims to extract useful information from vast amounts of unstructured web data. It then reviews various web mining techniques including web content mining, web structure mining, and web usage mining. The document surveys literature on pattern extraction techniques such as association rule mining, clustering, classification, and sequential pattern mining. It also discusses challenges in pre-processing web data and issues related to scaling up data mining algorithms for large web datasets. In closing, the document outlines future research directions in web mining including dealing with unstructured data and multimedia content.
The document describes LinkedIn's Segmentation & Targeting Platform, a big data application built on Hadoop. It allows users to define segments of users based on attributes and target them for marketing campaigns. Attributes can be computed from multiple data sources and consolidated. Segments are defined through a self-service portal using SQL-like queries. The platform processes complex queries quickly, letting campaigns move at business speed while handling LinkedIn's massive data volumes.
Data mining in web search engine optimization (BookStoreLib)
This document presents a proposed approach for optimizing web search by incorporating user feedback to improve result rankings. The approach uses keyword analysis on the user query to initially retrieve and rank relevant web pages. It then analyzes user responses like likes/dislikes and visit counts to update the page rankings. Experimental results on sample education queries show how page rankings change as user responses increase likes for certain pages. The approach aims to provide more useful search results by better reflecting individual user preferences.
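The feedback-driven re-ranking described above might be sketched as follows; the blending formula and the 0.1 like weight are illustrative assumptions, not the paper's actual model:

```python
def rerank(pages, feedback, like_weight=0.1):
    """Re-rank pages by blending base relevance with net user likes."""
    def score(page):
        likes, dislikes = feedback.get(page["url"], (0, 0))
        return page["relevance"] + like_weight * (likes - dislikes)
    return sorted(pages, key=score, reverse=True)

pages = [{"url": "/a", "relevance": 0.9},
         {"url": "/b", "relevance": 0.8}]
feedback = {"/b": (5, 1)}  # 5 likes, 1 dislike recorded for /b
ranked = rerank(pages, feedback)
```

With this feedback, /b overtakes /a despite its lower base relevance, which is exactly the behaviour the experiments in the summary describe.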
This document discusses using data mining and k-means cluster analysis to classify search engine optimization (SEO) techniques. It begins with an introduction to SEO and data mining. The paper aims to analyze various SEO techniques used by webmasters and classify them using a data mining approach. Specifically, it uses k-means cluster analysis on SEO techniques to group similar techniques together and identify those with the biggest impact on webpage ranking. The literature review covers past work analyzing SEO techniques and using data mining methods like clustering for search engine optimization.
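A toy version of the clustering step could look like this; the two-dimensional feature vectors for SEO techniques are hypothetical, and a real study would use richer features (and typically a library implementation of k-means):

```python
def kmeans(points, k, iters=20):
    """Plain k-means: assign each point to its nearest centroid, then re-average.
    Centroids are seeded with the first k points for reproducibility."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels, centroids

# Hypothetical per-technique features: (keyword density, scaled inbound links)
techniques = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.8), (0.85, 0.9)]
labels, _ = kmeans(techniques, k=2)
```

Techniques with similar feature profiles fall into the same cluster, which is the grouping step the paper's analysis relies on.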
International Journal of Engineering Research and Development (IJERD) – IJERD Editor
A Generic Model for Student Data Analytic Web Service (SDAWS) – Editor IJCATR
Any university management system accumulates a large volume of data, and analytics can be applied to it to gather useful information that aids academic decision making. This paper demonstrates the significance of a data analytic web service in the education domain. Such a service can be integrated easily with the University Management System or any other university application. Analytics as a web service offers many benefits over traditional analysis methods: the service can be hosted on a web server and accessed over the internet, or deployed on the campus's private cloud, and data from courses across different departments can be uploaded and analyzed easily. In this paper we design a web service framework for educational data mining that provides analysis as a service.
Effective Multi-stream Joining in Apache Samza Framework (Tao Feng)
This document discusses two models for performing multi-stream joins in Apache Samza: All-In-One (AIO) joining and Step-By-Step (SBS) joining. It analyzes the performance tradeoffs of each model in terms of memory usage, latency, and other metrics. The authors implemented both models in Samza and evaluated their performance to determine the best approach for different scenarios.
This document introduces databases and database management systems (DBMS). It describes types of databases and applications, basic definitions, typical DBMS functionality including querying and updating data, an example of a university database, characteristics of the database approach, database users, advantages of databases, and situations when databases may not be needed.
This document provides an overview of databases and database management systems (DBMS). It discusses types of databases and applications, basic definitions, typical DBMS functionality including querying and updating data, and provides an example of a university database. It describes the main characteristics of the database approach, different types of database users, and advantages and limitations of using a DBMS.
The document summarizes updates from the University of Oregon's Information Services department, including:
- Active Directory was updated to the 2008 R2 functional level, the root domain was upgraded, and some child domains were removed.
- New integrated services like Jabber IM and Likewise (Unix/Mac integration with Active Directory) were introduced.
- Information Services will create a centralized Exchange 2010 email and calendaring solution for the university to replace existing legacy systems.
- Questions from attendees about account management, recommended client software, licensing, and migration process were answered.
A detail survey of page re ranking various web features and techniques (ijctet)
This document discusses techniques for page re-ranking on websites based on user behavior analysis. It describes how web usage mining involves analyzing web server logs to extract patterns in user behavior. Common techniques discussed for page re-ranking include Markov models, data mining approaches like clustering and association rule mining, and analyzing linked web page structures. The goal is to better understand user interests and predict future page access to improve information retrieval and optimize website design.
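A first-order Markov model of page transitions, one of the techniques mentioned, can be sketched from session data like so (a minimal illustration with made-up sessions):

```python
from collections import Counter, defaultdict

def build_transitions(sessions):
    """First-order Markov model: count page-to-page transitions across sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return counts

def predict_next(counts, page):
    """Most frequently observed follow-up page, or None if the page is unseen."""
    if page not in counts:
        return None
    return counts[page].most_common(1)[0][0]

sessions = [["home", "products", "cart"],
            ["home", "products", "reviews"],
            ["home", "about"]]
model = build_transitions(sessions)
```

Since two of three sessions go from "home" to "products", the model predicts "products" as the next page after "home"; higher-order models condition on longer histories at the cost of sparser counts.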
Microsoft Access is a database management system from Microsoft that combines the relational Microsoft Jet Database Engine with a graphical user interface and software-development tools. It allows users to create databases to store and organize data, design forms and reports, and build applications without advanced programming skills. Access stores all database objects like tables, queries, forms and reports in a single file with the .accdb or .mdb extension. It supports the use of Visual Basic for Applications for more advanced programming and automation.
Data preparation for mining world wide web browsing patterns (1999) – OUM SAOKOSAL
The document discusses preparing web server log data for mining browsing patterns. It presents techniques for identifying unique users and user sessions from server logs. It also defines several methods for dividing user sessions into meaningful transactions for discovering association rules. The proposed transaction identification methods are evaluated on real world data using the WEBMINER data mining system.
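The user-session identification step described here is often approximated with an idle-timeout heuristic; a minimal sketch (the 30-minute timeout and the tuple log format are assumptions for illustration) might be:

```python
from datetime import datetime, timedelta

def sessionize(log_entries, timeout_minutes=30):
    """Group (ip, timestamp, url) log entries into sessions: a new session
    starts when the same IP has been idle longer than the timeout."""
    entries = sorted(log_entries, key=lambda e: (e[0], e[1]))
    timeout = timedelta(minutes=timeout_minutes)
    sessions = []
    for ip, ts, url in entries:
        last = sessions[-1] if sessions else None
        if last and last["ip"] == ip and ts - last["last_seen"] <= timeout:
            last["pages"].append(url)
            last["last_seen"] = ts
        else:
            sessions.append({"ip": ip, "pages": [url], "last_seen": ts})
    return sessions

log = [("1.2.3.4", datetime(2024, 1, 1, 10, 0), "/home"),
       ("1.2.3.4", datetime(2024, 1, 1, 10, 5), "/docs"),
       ("1.2.3.4", datetime(2024, 1, 1, 12, 0), "/home")]
result = sessionize(log)
```

The two-hour gap before the last request starts a second session for the same IP; real pipelines also use user agents, cookies, and referrer fields, since IP alone conflates users behind proxies.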
This chapter discusses database management, covering key topics such as:
1. The functions of database management systems and why data is important for organizations.
2. The differences between file processing systems and database approaches, and benefits of databases like reduced redundancy and improved access.
3. Popular database models like relational, object-oriented, and multidimensional databases.
4. Database administration tasks like design, security, backup and recovery procedures.
5. The roles of database analysts and administrators in managing, maintaining and protecting organizational data.
2013 MN IT Govt Symposium - Implement No Code Solutions with SharePoint and I... (Don Donais)
This document summarizes a presentation given by Donald Donais on implementing no-code solutions using SharePoint and InfoPath. The presentation covered assessing needs before building a solution, the tools used including SharePoint and InfoPath, basics of how InfoPath and SharePoint work together, features of InfoPath like validation and rules, tips for automation, considerations for deployment and customization. The goal was to demonstrate how InfoPath forms can be used to create automated workflows and interfaces without custom coding.
This document discusses web usage mining and clickstream data analysis. It provides an overview of web usage mining goals such as understanding user behavior and customizing websites. It also describes different types of clickstream data sources like web server log files, page tags, and cookies. Additionally, it covers various aspects of processing clickstream data like sessionization, path completion, and data integration to model usage patterns and identify frequent behaviors. The overall aim is to apply these analytical techniques to clickstream data from a website to help maximize revenue.
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
The papers for publication in The International Journal of Engineering & Science are selected through rigorous peer reviews to ensure originality, timeliness, relevance, and readability.
SPSNYC14 - Must Love Term Sets: The New and Improved Managed Metadata Service... (Jonathan Ralton)
This document summarizes a presentation on the new and improved Managed Metadata Service in SharePoint 2013. The presentation covers the content management capabilities in SharePoint, the services architecture including service applications and proxies, and the new information architecture features in the Term Store. Key changes discussed include content type syndication across site collections using the Content Type Hub and enhanced management of terms, term sets and term set groups in the centralized Term Store.
A language independent web data extraction using vision based page segmentati... (eSAT Publishing House)
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
The Internet is the largest source of information created by humanity, containing material in a variety of formats such as text, audio, and video. Web scraping is one way to harvest it: a set of strategies for obtaining information from a website programmatically instead of copying the data manually. Many web-based data extraction methods are designed to solve specific problems and work on ad-hoc domains, and various tools and technologies have been developed to facilitate web scraping. Unfortunately, the appropriateness and ethics of using these tools are often overlooked. Hundreds of web scraping tools are available today, most of them designed for Java, Python, and Ruby, spanning both open source and commercial software.
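For static pages, a scraping step as small as link extraction can be written with the Python standard library alone (JavaScript-rendered content would instead require a headless browser to produce the HTML first); a minimal sketch:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect every href found in <a> tags of a static HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/a">A</a><a href="https://example.com">B</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
```

The same parser can be fed HTML fetched over the network; rate limiting and the site's robots.txt and terms of use are where the ethical questions raised above come in.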
This document discusses building a simulation to optimize a data webhousing system and meta-search engine through hardware and software configuration and tuning techniques. It outlines steps for the configuration process, including setting up hardware infrastructure, developing the meta-search engine and public web server, creating a web application, initializing and monitoring the data webhouse, applying ranking models periodically, and refreshing the data. Implementation issues covered include user authentication, classifying and categorizing users, analyzing clickstream data, and an example scenario of clickstream data collection. The goal is to implement technologies like data webhousing and perform tuning to take advantage of their capabilities.
The Internet is the largest source of information created by humanity. It contains a variety of materials available in various formats, such as text, audio, video, and much more. In all, web scraping is one way. There is a set of strategies here in which we get information from the website instead of copying the data manually. Many webbased data extraction methods are designed to solve specific problems and work on ad hoc domains. Various tools and technologies have been developed to facilitate web scraping. Unfortunately, the appropriateness and ethics of using these web scraping tools are often overlooked. There are hundreds of web scraping software available today, most of them designed for Java, Python, and Ruby. There is also open-source software and commercial software. Web-based software such as YahooPipes, Google Web Scrapers, and Firefox extensions for Outwit are the best tools for beginners in web cutting. Web extraction is basically used to cut this manual extraction and editing process and provide an easy and better way to collect data from a web page and convert it into the desired format and save it to a local or archive directory. In this study, among other kinds of scrub, we focus on those techniques that extract the content of a web page. In particular, we use scrubbing techniques for a variety of diseases with their own symptoms and precautions.
Implementation of Intelligent Web Server Monitoringiosrjce
This document discusses the implementation of intelligent web server monitoring. It begins by introducing how web usage mining can be used to analyze user access patterns on a website to improve website design. It then discusses how client-side data collection using AJAX can more precisely calculate page browsing times compared to only using server log files. The document provides details on data sources for web usage mining, common log file formats, and examples of web usage mining techniques like statistical analysis, graph mining, and sequential pattern mining. It also reviews related literature on web usage mining and discusses how AJAX works to facilitate client-side data collection.
This document summarizes techniques for monitoring web server usage through web usage mining and statistical analysis. It discusses collecting data from server logs, client-side through scripts and modified browsers, and proxies. Server logs only provide partial information and client-side collection requires user cooperation. The combination of statistical analysis and web usage mining using both server and client-side data can provide powerful insights into how users browse websites to help evaluate websites and improve their design.
What are the different types of web scraping approachesAparna Sharma
The importance of Web scraping is increasing day by day as the world is depending more and more on data and it will increase more in the coming future. And web applications like Newsdata.io news API that is working on Web scraping fundamentals. More and more web data applications are being created to satisfy the data-hungry infrastructures. And do check out the top 21 list of web scraping tools in 2022
a novel technique to pre-process web log data using sql server management studioINFOGAIN PUBLICATION
This document summarizes a research paper that proposes a novel technique for pre-processing web log data using SQL Server Management Studio. The paper first discusses how web log data contains irrelevant information that needs to be cleaned through pre-processing before analysis. It then describes the contents of a typical web log file and provides a sample of raw web log data. The paper presents an algorithm for data cleaning and implements it using SQL queries to clean the web log data by removing records with certain file extensions and incomplete URLs. It shows that the data was reduced from over 200,000 records to around 25,000 after pre-processing. The paper concludes that pre-processing is an important step for filtering and organizing data before applying data mining techniques.
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...IOSR Journals
The document proposes an innovative vision-based page segmentation (IVBPS) algorithm to improve hidden web content extraction. It aims to overcome limitations of existing approaches that rely heavily on HTML structure. IVBPS extracts blocks from the visual representation of a page and clusters them to segment the page semantically. It uses layout features like position and appearance to locate data regions and extract records. The algorithm analyzes the entire page structure rather than local regions, allowing it to retain content DOM tree methods may discard. This is expected to significantly improve hidden web extraction performance.
International conference On Computer Science And technologyanchalsinghdm
ICGCET 2019 | 5th International Conference on Green Computing and Engineering Technologies. The conference will be held on 7th September - 9th September 2019 in Morocco. International Conference On Engineering Technology
The conference aims to promote the work of researchers, scientists, engineers and students from across the world on advancement in electronic and computer systems.
The document describes a relational web wrapper that extracts related metadata from webpages including the title, description, and keywords. It presents the algorithm for the web wrapper and evaluates its performance on three test websites. The results show that the wrapper extracts data efficiently in terms of time (average extraction time of 4-7 milliseconds) and accuracy (over 95% for keywords and 100% for title and description). The wrapper stores the extracted metadata in Excel sheets. In conclusion, the relational web wrapper provides an efficient way to extract related metadata from webpages with minimal user effort.
This document describes a project to build a web crawler and search engine to provide student information to students. It will scrape data like exam results, college details, and fees from other websites and provide the information to students in a searchable online interface. The system will include a desktop application for scraping data and storing it in a SQL Server database. It will also have a web application for students to search for their results or compare results with other students. The project aims to make student exam data and materials easily available from a single portal.
Components of a Generic Web Application ArchitectureMadonnaLamin1
The web application is composed of a complex architecture of varied components and layers. The request generated by the user passes through all these layers. When a user makes a request on a website, various components of the applications, user interfaces, middleware systems, database, servers and the browser interact with each other
Web Content Mining Based on Dom Intersection and Visual Features Conceptijceronline
Structured Data extraction from deep Web pages is a challenging task due to the underlying complex structures of such pages. Also website developer generally follows different web page design technique. Data extraction from webpage is highly useful to build our own database from number applications. A large number of techniques have been proposed to address this problem, but all of them have inherent limitations because they present different limitations and constraints for extracting data from such webpages. This paper presents two different approaches to get structured data extraction. The first approach is non-generic solution which is based on template detection using intersection of Document Object Model Tree of various webpages from the same website. This approach is giving better result in terms of efficiency and accurately locating the main data at the particular webpage. The second approach is based on partial tree alignment mechanism based on using important visual features such as length, size, and position of web table available on the webpages. This approach is a generic solution as it does not depend on one particular website and its webpage template. It is perfectly locating the multiple data regions, data records and data items within a given web page. We have compared our work's result with existing mechanism and found our result much better for number webpage
IRJET- Web Scraping Techniques to Collect Bank Offer Data from Bank WebsiteIRJET Journal
The document discusses using web scraping techniques to collect bank offer data from websites. It describes how web scraping works by analyzing website content, extracting relevant data, and formatting it into a structured database or spreadsheet. The paper then presents the process used to scrape bank offer data from Indian websites, including developing a Python script to automate scraping, scheduling regular scraping, cleaning the extracted data, and transforming it into a standardized format for analysis. The results section demonstrates the web scraping process and shows how the extracted data is further transformed using an ETL process into a clean dataset for analytics purposes.
IRJET- Web Scraping Techniques to Collect Bank Offer Data from Bank WebsiteIRJET Journal
This document discusses using web scraping techniques to collect bank offer data from websites. It describes how an offer scavenger software can automate the extraction of relevant data from websites and organize it in a predefined format like a database. The document then provides details on how the researchers collected bank offer data from websites like centralbank.net.in using web scraping and Python libraries. It explains the data extraction, transformation and loading process to clean the scraped data and load it into a database. Some preliminary statistics are also generated from the collected data. Finally, it discusses some legal aspects of using web scraping techniques.
Automatic recommendation for online users using web usage miningIJMIT JOURNAL
A real world challenging task of the web master of an organization is to match the needs of user and keep their attention in their web site. So, only option is to capture the intuition of the user and provide them with the recommendation list. Most specifically, an online navigation behavior grows with each passing day, thus extracting information intelligently from it is a difficult issue. Web master should use web usage mining method to capture intuition. A WUM is designed to operate on web server logs which contain user’s navigation. Hence, recommendation system using WUM can be used to forecast the navigation pattern of user and recommend those to user in a form of recommendation list. In this paper, we propose a two tier architecture for capturing users intuition in the form of recommendation list containing pages visited by user and pages visited by other user’s having similar usage profile. The practical implementation of proposed architecture and algorithm shows that accuracy of user intuition capturing is improved.
Automatic Recommendation for Online Users Using Web Usage Mining IJMIT JOURNAL
A real world challenging task of the web master of an organization is to match the needs of user and keep their attention in their web site. So, only option is to capture the intuition of the user and provide them with the recommendation list. Most specifically, an online navigation behavior grows with each passing day, thus extracting information intelligently from it is a difficult issue. Web master should use web usage mining method to capture intuition. A WUM is designed to operate on web server logs which contain user’s navigation. Hence, recommendation system using WUM can be used to forecast the navigation pattern of user and recommend those to user in a form of recommendation list. In this paper, we propose a two tier architecture for capturing users intuition in the form of recommendation list containing pages visited by user and pages visited by other user’s having similar usage profile. The practical implementation of proposed architecture and algorithm shows that accuracy of user intuition capturing is improved.
Identifying the Number of Visitors to improve Website Usability from Educatio...Editor IJCATR
Web usage mining deals with understanding the Visitor’s behaviour with a Website. It helps in understanding the concerns
such as present and future probability of every website user, relationship between behaviour and website usability. It has different
branches such as web content mining, web structure and web usage mining. The focus of this paper is on web mining usage patterns of
an educational institution web log data. There are three types of web related log data namely web access log, error log and proxy log
data. In this paper web access log data has been used as dataset because the web access log data is the typical source of navigational
behaviour of the website visitor. The study of web server log analysis is helpful in applying the web mining techniques.
Website architecture involves planning the technical, functional, and visual components of a website before development. It uses a logical layout to align the website with user and business needs. A 2-tier architecture involves clients connecting to web servers via browsers. A 3-tier architecture separates presentation, functional logic, and data management layers. Client-side static mashups run applications locally after downloading data from websites. Server-side static mashups combine multiple websites' content without downloads. Dynamic mashups assemble content from different sources on the client-side. General web application architecture involves users initiating applications across multiple cooperating websites.
Static Enabler: A Response Enhancer for Dynamic Web ApplicationsOsama M. Khaled
In this paper, we describe a solution for dynamic web applications that need to optimize their performance for
responsiveness. The solution is described as a design pattern to be shared among developers from different
technical backgrounds. The Static Enabler design pattern enhances the performance of a dynamic web
application by converting some of its pages into static ones without losing the reference to the original pages.
For citation:
10. Osama M. Khaled and Hoda M. Hosny. Static Enabler: A Response Enhancer for Dynamic Web Applications. In the (VikingPLOP 2004) September 16th-19th, Uppsala, Sweden.
State of Artificial intelligence Report 2023kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
II. BACKGROUND AND RELATED WORK
A. Web Data Extraction
Web Data Extraction systems have been used in a variety of applications, including document analysis, business intelligence, social media analytics, etc. Systems and tools have been developed for extracting data from unstructured documents (emails, business forms, or technical papers) and from semi-structured documents, i.e., Web documents [3]. In this research, we focus on extracting data from Web documents, which form a massive body of semi-structured information presented in HTML format. To efficiently pull data out of HTML documents, existing systems adopt a broad class of techniques: text processing, DOM parsing, natural language processing, and machine learning [4, 5]. High-efficiency algorithms and state-of-the-art approaches are implemented as open-source libraries and commercial software.
The design of a Web Data Extraction system consists of two parts that apply different techniques. The first part comprises the algorithms for extracting data from HTML documents. Since a Web document is a hierarchy of HTML elements, usually represented as a DOM (Document Object Model), most data extraction systems rely on tree-based techniques, i.e., addressing, matching, and weighing tree nodes [6]. The second part, called a wrapper, is a procedure that implements one or more data extraction algorithms [7]. To carry out the data extraction process, the wrapper repeatedly runs the data extraction algorithms, then transforms and merges their results into a structured format. Clearly, exploiting the DOM of Web documents is the key mechanism in existing Web Data Extraction approaches.
B. JavaScript Web Application
The growing popularity of JavaScript has changed the way Web applications are developed. JavaScript Web applications have evolved the process of rendering Web pages and manipulating the DOM to offer a better browsing experience. One of the most widely adopted architectures for JavaScript Web applications is the Single Page Application (SPA). An SPA is a Web application that fully loads all resources in the initial request; individual components, including parts of the DOM, can then be replaced or updated independently in response to user interaction [8]. Another technique that adds interactive capability to SPAs is Asynchronous JavaScript and XML (AJAX). Instead of making a new request and updating the whole page with new data every time, an SPA acquires only data, usually in JSON format, from the server by sending asynchronous requests in the background. When the requested data arrives, the SPA injects it into HTML elements by dynamically manipulating the DOM.
Since asynchronous data transfer and dynamic DOM manipulation are the key features of SPAs and modern JavaScript Web applications, extracting data from this kind of application has become a new limitation of existing Web Data Extraction systems, which rely mainly on DOM processing. This research therefore aims to find an optimal solution that enables data extraction from modern JavaScript Web applications.
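The request/response cycle described above can be sketched as follows. This is a minimal, self-contained illustration: the back-end service is mocked with a resolved Promise, the page is represented as a string, and names such as `fetchProducts` and `renderList` are hypothetical:

```javascript
// Minimal sketch of the SPA/AJAX pattern described above.
// The server is mocked; in a real SPA the data would come from
// an asynchronous HTTP request (e.g. XMLHttpRequest or fetch).
function fetchProducts() {
  // Simulated back-end Web service returning JSON data.
  return Promise.resolve([
    { name: "Widget", price: 9.99 },
    { name: "Gadget", price: 19.99 },
  ]);
}

function renderList(products) {
  // The SPA injects the received JSON into HTML elements, and pages
  // often cache the raw JSON inside the document as well.
  const items = products
    .map((p) => `<li>${p.name}: ${p.price}</li>`)
    .join("");
  return [
    `<script type="application/json" id="cached-data">` +
      JSON.stringify(products) +
      `</script>`,
    `<ul>${items}</ul>`,
  ].join("\n");
}

async function main() {
  const products = await fetchProducts();
  console.log(renderList(products));
}
main();
```

Note that the rendered page carries the data twice: once as visible HTML and once as a cached JSON chunk, which is the property the proposed approach exploits.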
III. PROPOSED APPROACH
The key idea of our approach is a new design and implementation of the data extraction engine. We propose a new Web Data Extraction engine that utilizes a headless browser for fetching Web documents and dealing with dynamically generated DOM.

Rather than performing tree-based techniques on the DOM, our approach focuses on extracting the JSON data that is asynchronously transferred and cached inside the target Web documents. To assist end users in data extraction tasks, our approach implements a task scheduler for semi-automatically repeating the extraction process. A configuration system allows users to define extraction rules and other settings. A data transformation module translates the semi-structured data into structured data that can serve as input to a data analytics process. In the following subsections, we describe the overall architecture of our approach and highlight the key ideas and techniques we applied to the challenges of extracting data from JavaScript Web applications.
A. Overview of Architecture
An overview of the proposed architecture is shown in Fig. 1. The proposed system consists of five major modules, as follows:
Fig. 1. An overview of the proposed architecture
Extraction Engine. Once the extraction process starts, the Extraction Engine is responsible for fetching documents and extracting data according to the extraction rules provided by the Configuration Manager.
Task Scheduler. To control the schedule and frequency of running the Extraction Engine, the Task Scheduler reads its configuration from the Configuration Manager and invokes the Extraction Engine following the user-defined schedules.
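One way such a scheduler could derive the next invocation time from a user-defined schedule is sketched below; the schedule format (`frequency`, `hour`) is an assumption for illustration, not the paper's exact configuration syntax:

```javascript
// Hypothetical sketch of the Task Scheduler's timing logic:
// compute the next invocation time from a user-defined schedule.
function nextRun(schedule, now) {
  if (schedule.frequency === "daily") {
    const next = new Date(now);
    next.setHours(schedule.hour, 0, 0, 0);
    // If today's slot has already passed, schedule for tomorrow.
    if (next <= now) next.setDate(next.getDate() + 1);
    return next;
  }
  if (schedule.frequency === "hourly") {
    return new Date(now.getTime() + 60 * 60 * 1000);
  }
  throw new Error(`unknown frequency: ${schedule.frequency}`);
}

// Example: a daily schedule firing at 03:00 local time.
const now = new Date("2019-01-15T10:00:00Z");
const next = nextRun({ frequency: "daily", hour: 3 }, now);
console.log(next.toISOString());
```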
Raw Data Storage. This module is a NoSQL data store designed to keep the original JSON data output by the Extraction Engine. Once JSON data has been extracted from a target Web document, the Extraction Engine stores the original version of the JSON data here.
Data Transformation. Using pre-defined rules from the Configuration Manager, this module transforms JSON data from Raw Data Storage into relational data and stores it in a relational database management system. Since transformation rules can be added or changed later, keeping the raw data in separate storage gives flexibility to the transformation process and benefits future data analytics.
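The transformation from raw JSON records to relational rows can be sketched as follows; the rule format (a list of source-field to column mappings) is an assumption for illustration, not the paper's exact rule syntax:

```javascript
// Illustrative sketch of the Data Transformation step: flattening
// extracted JSON records into rows of a relational table.
// Missing source fields become NULL columns.
function transform(rawRecords, rules) {
  return rawRecords.map((record) =>
    rules.columns.map((col) => record[col.source] ?? null)
  );
}

// Example raw records as extracted from a target page.
const raw = [
  { name: "Widget", offers: "9.99", url: "https://example.com/w" },
  { name: "Gadget", offers: "19.99" }, // no url field
];
// Hypothetical transformation rules: source field -> target column.
const rules = {
  columns: [
    { source: "name", column: "product_name" },
    { source: "offers", column: "price" },
    { source: "url", column: "product_url" },
  ],
};
console.log(transform(raw, rules));
```

Each inner array corresponds to one row ready for insertion into the relational database.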
Configuration Manager. To enable semi-automatic data extraction and transformation, the Configuration Manager provides user-defined configurations, including data extraction rules, the invocation schedule, and data transformation rules, to support the execution of the other modules.
B. Extraction Engine
To deal with the dynamically generated DOM used in modern JavaScript Websites, the Extraction Engine employs a headless browser for fetching the target Web documents. In this way, the other processes of the Extraction Engine can extract data from the DOM of the fetched documents as if they were ordinary HTML documents containing a static DOM. The detailed process of the Extraction Engine is illustrated in Fig. 2.
During our experiments, by analyzing the DOM of the tested Websites, we found that modern JavaScript Websites acquire data from back-end Web services in JSON format and store the received JSON data in some part of the DOM as cached data. Our approach therefore leverages this embedded JSON data: it uses the DOM parser to extract the embedded JSON rather than collecting text values inside HTML tags, as existing approaches do. By extracting cached JSON data from the target Web documents, our method can overcome a common problem of Web extraction engines, namely coping with changes to DOM elements, because the structure of the JSON data rarely changes. Thus, the major task of the Extraction Engine is finding and extracting the JSON chunk that represents the required information on the target Web pages.
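As a concrete illustration of this idea, the sketch below pulls cached JSON out of a fetched document. It assumes the JSON is cached in `<script type="application/json">` or JSON-LD blocks and uses a simple regular expression in place of a full DOM parser, so it is a simplification of the engine described here, not its actual implementation:

```javascript
// Simplified sketch: extract JSON chunks cached inside a page.
// Matches <script type="application/json"> and
// <script type="application/ld+json"> blocks.
function extractCachedJson(html) {
  const re =
    /<script[^>]*type="application\/(?:ld\+)?json"[^>]*>([\s\S]*?)<\/script>/g;
  const chunks = [];
  let m;
  while ((m = re.exec(html)) !== null) {
    try {
      chunks.push(JSON.parse(m[1]));
    } catch (_) {
      // Skip malformed chunks instead of failing the whole extraction.
    }
  }
  return chunks;
}

// Example fetched document with one cached JSON chunk.
const page = `
<html><head>
<script type="application/ld+json">{"name":"Widget","offers":"9.99"}</script>
</head><body><ul><li>Widget</li></ul></body></html>`;
console.log(extractCachedJson(page));
```

Because only the cached JSON is read, cosmetic changes to the surrounding HTML markup do not break the extraction.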
The major challenge for the Extraction Engine is finding the cached JSON data in the DOM of the documents. Our solution is to allow users to specify the target data in the form of data extraction rules. An example of the data extraction process is shown in Fig. 3. The processing flow of the example can be described as follows:

Step 1. Read the DOM extraction rules from the Configuration Manager.
Step 2. Parse the fetched documents to create a DOM tree.
Fig. 3. An example of the data extraction process
Fig. 2. The detailed process of the Extraction Engine
Step 3. Traverse each element in the DOM tree.
Step 4. Compare the current DOM element with the values specified in the extraction rules.
According to the rule “matching_condition: all”, the DOM Parser traverses each element of the fetched documents and finds the element that contains all the values specified in the extraction rules, i.e., “@type”, “image”, “name”, “offers”, and “url”. If all the specified values are found in a DOM element, the DOM Parser extracts one parent level of JSON data from the target document, i.e., “itemListElements”, the JSON collection that contains the data of all products on the target page. Otherwise, it moves on to the next DOM element (Step 3). Since the output JSON chunk may contain unrelated or uninteresting data, the JSON parser is responsible for selecting data according to the criteria specified in the extraction rules. Finally, the selected JSON data is stored in Raw Data Storage for the data transformation process.
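The matching step above can be sketched as follows. This is a hedged illustration, assuming the cached JSON has already been parsed into plain objects; the rule and data shapes are invented for the example:

```javascript
// Sketch of "matching_condition: all": an object matches when it
// contains every key named in the extraction rule.
function matchesAll(obj, keys) {
  return obj !== null && typeof obj === "object" &&
    keys.every((k) => k in obj);
}

// Recursively search parsed JSON for the parent-level collection
// whose items all match the rule, mirroring the "one parent level"
// extraction described above.
function findCollection(node, keys) {
  if (Array.isArray(node)) {
    if (node.length > 0 && node.every((item) => matchesAll(item, keys))) {
      return node; // the collection itself is the result
    }
    for (const item of node) {
      const found = findCollection(item, keys);
      if (found) return found;
    }
  } else if (node !== null && typeof node === "object") {
    for (const value of Object.values(node)) {
      const found = findCollection(value, keys);
      if (found) return found;
    }
  }
  return null;
}

// Example cached JSON resembling the structure described above.
const cached = {
  itemListElements: [
    { "@type": "Product", image: "a.jpg", name: "Widget", offers: "9.99", url: "/w" },
    { "@type": "Product", image: "b.jpg", name: "Gadget", offers: "19.99", url: "/g" },
  ],
};
const rule = ["@type", "image", "name", "offers", "url"];
console.log(findCollection(cached, rule));
```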
C. Configuration File
The configuration file is defined as a well-formed JSON document that contains the setting parameters for each module in a separate configuration section. The template of the configuration file, with a short explanation of the configuration values in each section, is illustrated in TABLE. I.
TABLE. I. TEMPLATE OF THE CONFIGURATION FILE
Configuration Sections: Target Web Document, Task Scheduler,
DOM Parser, Data Transformation
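A minimal configuration file following this template might look as follows. This is a sketch under stated assumptions: apart from “matching_condition” and the four section titles, every key name and value here is illustrative rather than the paper's actual schema.

```json
{
  "target_web_document": {
    "url": "http://www.example-shop.com/category",
    "url_parameters": ["page=1"]
  },
  "task_scheduler": {
    "frequency": "daily"
  },
  "dom_parser": {
    "matching_condition": "all",
    "values": ["@type", "image", "name", "offers", "url"]
  },
  "data_transformation": {
    "table": "products",
    "fields": ["name", "description", "price"]
  }
}
```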
IV. IMPLEMENTATION
A. System Development
In order to enable users to easily extract data from Web
documents and to experiment with our solution on real-world
JavaScript Web applications, we have developed our proposed
system as a command-line application using Node.js. We have
selected PhantomJS [9] for implementing the headless browser
in the Extraction Engine. MongoDB and MySQL are used for the
NoSQL data storage and the RDBMS, respectively. A user can start
the extraction process by calling the system via a command-line
tool with a valid configuration file as a parameter. The output
of the system is relational data stored in a relational
database corresponding to the configuration.
B. Experiment
We have conducted a preliminary evaluation of our system
using a data extraction scenario.
Scenario. Extract product information from a shopping
Website: This scenario simulates a data gathering task that
aims to collect product information, including name,
description, and price, from a shopping Website. We
selected Lazada [10], the top online shopping Website in
Thailand, as the target Website for this scenario. The
extraction rules and configurations were set to retrieve the
name, description, and price of selected product categories.
The expected extraction result is a time series of data that
can express pricing trends and support future price
prediction. The frequency of extraction was set to once per
day because the discount campaigns of Lazada do not
commonly change within a day. An example of a target Web
document for this scenario is shown in Fig. 4.
Fig. 4. An Example of target document of the experiment
C. Result and Discussion
The experiment was designed to let our system extract
information on 30 products from 3 different product groups
represented by 3 URL parameters. Each URL parameter was
tested with 5 different DOM extraction configurations. The
result column represents the number of records that were
extracted and stored in the relational database. The extracted
items were compared with the original information to determine
accuracy and invalidity, as displayed in the invalid column. The
evaluation result is illustrated in TABLE. II.
The results show that our system successfully extracted
the data of interest from the target Web documents. However,
invalid items were found when the DOM extraction
configurations were not correctly set. We found that the JSON data
of the target Website are distributed across many locations in the
DOM. Consequently, the first and second tries of our experiment
failed because the number of matching keywords in the
extraction rule was not enough to distinguish the cached JSON
data. Thus, selecting effective keywords for the DOM extraction
rule is an important factor for our approach. Moreover, there are
some considerations and limitations that should be discussed.
1) JavaScript page navigation: In some scenarios, modern
JavaScript Web applications implement their navigation system
using a lazy loading technique. The first request receives only the
first set of data; the remaining data are fetched by clicking
a button, which sends another request for the next set of
data to the back-end system. As a result, we had to add more
URL parameters to the configuration to simulate the requests that
fetch the remaining data. In order to automatically navigate
through all data, this feature should be added to the Extraction
Engine.
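The manual workaround above could be partially scripted by expanding one base URL into one request per lazy-loaded page; the base URL and parameter name below are hypothetical.

```javascript
// Sketch: expand a paginated URL into one request per lazy-loaded page,
// mirroring the manual URL-parameter workaround described above.
// The base URL and parameter name are hypothetical.
function pageUrls(base, param, pages) {
  return Array.from({ length: pages }, (_, i) => `${base}?${param}=${i + 1}`);
}
console.log(pageUrls("http://www.example-shop.com/phones", "page", 3));
```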
2) Duplicate data: Since we cannot perfectly set the
frequency of extraction to match the changes of data in the
target Web sites, repeatedly extracting the same documents
caused duplicate records in both scenarios. Thus, we have
added a configuration setting, i.e. “allow_dupplicate”, to enable
the user to specify whether duplication is allowed or not.
3) Limitation: Our proposed system is designed to extract
JSON data from the target Web documents, so the target Web
sites are required to expose JSON data in their DOM. However,
to be seen and ranked by search engines, some modern
JavaScript applications render their pages on the server side like
ordinary client-server Web applications. Thus, we believe that
extracting data from the DOM is still an important technique for
Web data extraction. As future work, we plan to integrate
state-of-the-art DOM extraction techniques with our proposed
method to allow universal Web data extraction.
V. CONCLUSION
This paper has presented a semi-automatic Web data
extraction system that aims to extract data from modern
JavaScript Web applications. The proposed system enables
users to select data from existing Web documents by defining
extraction rules, schedules, storage, and data transformation
logic. The design of a Web data extraction engine that utilizes a
headless browser for fetching dynamic Web documents and
applies DOM parsing techniques to extract the embedded
JSON data has been presented and implemented as a command-line
tool. The preliminary evaluation results have shown that our
proposed system successfully extracts data from modern
JavaScript Web applications, e.g. online shopping sites. The
limitations and future work have also been discussed.
REFERENCES
[1] C. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan, “A Survey of
Web Information Extraction Systems,” IEEE Trans. Knowl. Data Eng.,
vol. 18, no. 10, pp. 1411–1428, 2006.
[2] E. Ferrara, G. Fiumara, and R. Baumgartner, “Web Data Extraction,
Applications and Techniques: A Survey,” ACM Trans. Comput. Log.,
vol. V, no. June, pp. 1–20, 2010.
[3] N. Negm, P. Elkafrawy, and A. B. Salem, “A Survey of Web Information
Extraction Tools,” Int. J. Comput. Appl., vol. 43, no. 7, pp. 975–8887,
2012.
[4] T. Gogar, O. Hubacek, and J. Sedivy, “Deep neural networks for web page
information extraction,” in IFIP Advances in Information and
Communication Technology, 2016, vol. 475, pp. 154–163.
[5] D. Laishram and M. Sebastian, “Extraction of Web News from Web
Pages Using a Ternary Tree Approach,” in Proceedings - 2015 2nd IEEE
International Conference on Advances in Computing and Communication
Engineering, ICACCE 2015, 2015, pp. 628–633.
[6] S. López, J. Silva, and D. Insa, “Using the DOM Tree for Content
Extraction,” Electron. Proc. Theor. Comput. Sci., vol. 98, pp. 46–59,
2012.
[7] S. Flesca, G. Manco, E. Masciari, E. Rende, and A. Tagarelli, “Web
wrapper induction: a brief survey,” AI Commun., vol. 17, no. 2, pp. 57–
61, 2004.
[8] M. A. Jadhav, B. R. Sawant, and A. Deshmukh, “Single Page Application
using AngularJS,” in International Journal of Computer Science and
Information Technologies, 2015, vol. 6, no. 3, pp. 2876–2879.
[9] PhantomJS, “Full web stack No browser required.” [Online]. Available:
http://phantomjs.org/.
[10] Lazada, “Lazada.” [Online]. Available: http://www.lazada.co.th/.
TABLE. II. EVALUATION RESULT