Most computer and mobile phone users rely on the web to search for information. Web search engines carry out this searching; the results they return are supplied to them by a software module known as the web crawler. The web keeps growing round the clock, and the principal problem is searching this huge collection for specific information. Even deciding whether a web page is relevant to a search topic is a difficult question. This paper proposes a crawler, called the "PDD crawler", which follows both a link-based and a content-based approach. The crawler uses a new strategy to compute the relevance of a page: it analyses the content of the page based on the information contained in the various tags of the HTML source code and computes a total weight for the page. The page with the highest weight thus has the richest matching content and the highest relevance.
page. This technique ensures that similar pages get downloaded, hence the name focused web crawler [3]. A web crawler needs to search for information among web pages identified by URLs. If we consider each web page as a node, the World Wide Web can be seen as a data structure that resembles a graph. To traverse such a structure, the crawler needs traversal mechanisms similar to those used for graphs, such as breadth-first search (BFS) or depth-first search (DFS). The proposed crawler follows a simple breadth-first search approach. The start URL given as input to the crawler can be seen as the start node of the graph; the hyperlinks extracted from the web page associated with this URL serve as its child nodes, and so on, so a hierarchy is maintained in this structure. A child can also point back to a parent if the web page associated with the child node's URL contains a hyperlink to any of the parent node URLs; the structure is therefore a graph, not a tree. Web crawling can be viewed as putting items in a queue and picking a single item from it each time: when a web page is crawled, the hyperlinks extracted from that page are appended to the end of the queue, and the hyperlink at the front of the queue is picked up to continue the crawling loop. A web crawler thus runs an iterative, potentially endless crawling loop. Since a crawler is a software module that deals with the World Wide Web, a few constraints [4] have to be addressed: high-speed internet connectivity, memory for its data structures, processing power for the parsing and crawling algorithms, and disk storage for temporary data.
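To make this loop concrete, the following minimal sketch (in Python; our own illustration, not code from the paper) shows a breadth-first crawl driven by a FIFO queue: the frontier is seeded with the start URL, one URL is picked from the front of the queue at a time, and the hyperlinks extracted from each fetched page are appended to the end. The fetch_page and extract_links helpers are assumed placeholders for an HTTP downloader and an HTML link extractor.

from collections import deque

def bfs_crawl(seed_url, fetch_page, extract_links, max_pages=100):
    # Breadth-first crawl starting from seed_url.
    # fetch_page(url) -> str and extract_links(html, base_url) -> list of URLs
    # are assumed helpers (downloader and link extractor), not defined in the paper.
    frontier = deque([seed_url])       # FIFO queue of unvisited URLs
    visited = set()                    # guards against revisiting (the web is a graph, not a tree)
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()       # pick the hyperlink at the front of the queue
        if url in visited:
            continue
        visited.add(url)
        html = fetch_page(url)         # download the page source code
        crawled.append(url)
        for link in extract_links(html, url):
            if link not in visited:
                frontier.append(link)  # append extracted hyperlinks to the end of the queue
    return crawled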
2. RELATED WORK
Mejdl S. Safran, Abdullah Althagafi and Dunren Che, in "Improving Relevance Prediction for Focused Web Crawlers" [4], observe that a key issue in designing a focused web crawler is how to determine whether an unvisited URL is relevant to the search topic. Their paper proposes a new learning-based approach to improve relevance prediction in focused web crawlers. Naïve Bayes is used as the base prediction model, which can, however, easily be switched for a different prediction model. Experimental results show that the approach is valid and more efficient than related approaches.

S. Lawrence and C. L. Giles, in "Searching the World Wide Web" [1], state that an analysis of the coverage and recency of the major web search engines yields some surprising results; analysis of the overlap between pairs of engines gives an estimated lower bound of 320 million pages on the size of the indexable web. Pavalam S. M., S. V. Kasmir Raja, Jawahar M. and Felix K. Akorli, in "Web Crawler in Mobile Systems" [6], note that with the advent of internet technology the amount of data has exploded, and that large volumes of data can be explored easily through search engines to extract valuable information. Carlos Castillo, Mauricio Marin and Andrea Rodriguez, in "Scheduling Algorithms for Web Crawling" [7], present a comparative study of strategies for web crawling. They show that a combination of breadth-first ordering with largest sites first is a practical alternative, since it is fast, simple to implement, and able to retrieve the best-ranked pages at a rate closer to the optimum than the other alternatives. The study was performed on a large sample of the Chilean web, crawled using simulators so that all strategies were compared under the same conditions, and validated with actual crawls. It also explored the effects of large-scale parallelism in the page retrieval task and of multiple-page requests in a single connection for effective amortization of latency times.
Junghoo Cho and Hector Garcia-Molina, in "Effective Page Refresh Policies for Web Crawlers" [8], study how to keep local copies of remote data sources fresh when the source data is updated autonomously and independently. In particular, the authors study the problem of a web crawler that maintains local copies of remote web pages for web search engines. In this context, remote data sources (web sites) do not notify the copies (web crawlers) of new changes, so the sources must be polled periodically and it is very difficult to keep the copies completely up to date. The paper proposes various refresh policies and studies their effectiveness. It first formalizes the notion of freshness of copied data by defining two freshness metrics, and then proposes a Poisson process as a change model for the data sources. Based on this framework, the effectiveness of the proposed refresh policies is examined analytically and experimentally. The results show that a Poisson process is a good model of how web pages change, and that the proposed refresh policies improve the freshness of data very significantly; in certain cases, the authors obtained improvements of orders of magnitude over existing policies. The Algorithm Design Manual [9] by Steven S. Skiena is intended as a manual on algorithm design, providing access to combinatorial algorithm technology for both students and computer professionals. It is divided into two parts, Techniques and Resources: the former is a general guide to techniques for the design and analysis of computer algorithms, while the Resources section is intended for browsing and reference and comprises a catalog of algorithmic resources, implementations, and an extensive bibliography. Artificial Intelligence Illuminated [10] by Ben Coppin introduces a number of search methods and discusses how effective they are in different situations. Depth-first search and breadth-first search are the best-known and most widely used search methods, and the book examines why this is so and how they are implemented. It also looks at a number of properties of search methods, including optimality and completeness [11][12], which can be used to determine how useful a search method will be for solving a particular problem.
3. DESIGN METHODOLOGY
In general, a web crawler must provide the following features [13]:
1. A web crawler must be robust. Spider traps are part of the hyperlink structure of the World Wide Web: some servers create traps that mislead crawlers into endlessly traversing an unnecessary part of the web. Our crawler must be made spider-trap proof.
2. Web pages reside on web servers, i.e. on different machines hosting these pages, and each site has its own crawling policy; our crawler must respect the boundaries that each server draws (a sketch of such a check follows this list).
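One common way to honour per-server crawling policies, not prescribed by the paper but standard practice, is to consult each site's robots.txt before fetching a URL. A minimal sketch using Python's standard urllib.robotparser module (the user-agent string is illustrative):

from urllib.parse import urlparse
from urllib import robotparser

def allowed_by_robots(url, user_agent="PDD-crawler"):
    # Check the target site's robots.txt before fetching a URL.
    parts = urlparse(url)
    parser = robotparser.RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()                  # fetch and parse robots.txt
    except OSError:
        return True                    # assumption: treat an unreachable robots.txt as permissive
    return parser.can_fetch(user_agent, url)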
3.1 Existing Methodology [14] [15] [16]
Figure 1. High-level architecture of a standard web crawler
The components of the web crawler are described as follows [4]:
1. Seed: the starting URL from which the crawler starts traversing the World Wide Web recursively.
2. Multithreaded downloader: downloads the page source code whose URL is specified by the seed.
3. Queue: contains the unvisited hyperlinks extracted from the pages.
4. Scheduler: a first-come, first-served (FCFS) scheduler used to schedule the pages in the queue.
5. Storage: a volatile or non-volatile data storage component.
3.2 Proposed Approach
1. This paper proposes the "PDD crawler", which is both link based and content based. Content that went unused by previous crawlers also takes part in the parsing activity; the proposed crawler is therefore a link-based as well as a content-based crawler.
2. Since a true analysis of the web page takes place (i.e. the entire page is searched for the content queried by the user), this crawler is well suited for business analysis purposes.
3.3 Architecture
Figure 2. Proposed Architecture
Before describing the actual working of the proposed crawler, it is important to note that:
1. Every URL has a web page associated with it. Web pages take the form of HTML source code in which information is divided among various tags; page parsing and information retrieval are therefore based solely on the tags of the HTML source code.
2. Web pages have a certain defined structure, a hierarchy of sorts, which defines the importance of the tags within a page. The meta tag is the most important, i.e. the tag with the highest priority, whereas the body tag has the lowest priority.
The proposed "PDD crawler" follows these steps (a sketch of the resulting loop is given after this list):
1. The seed URL is used to connect to the World Wide Web. Along with the seed, the user provides a search string, which acts as the query being searched for. If both the URL and the string are valid, the crawler proceeds; otherwise it stops.
2. Every web page contains tags that may hold information, and integer weights are assigned to these tags. If the search string is found on the page (once or multiple times), each tag's weight is multiplied by the number of occurrences within that tag, and these products are summed to obtain the total page weight.
3. The seed page is downloaded and all URLs on it are extracted. The seed page is then checked for relevant information: if it contains any, its weight is calculated and the page is set aside for further use; otherwise the page is irrelevant to the user's query and is discarded.
4. The extracted URLs are then scheduled to be downloaded one by one, and the above process is repeated (download a page and check its contents). If a page yields a positive weight, it is set aside, and the same procedure is applied recursively to every page.
5. The page with the greatest weight has the most relevant content, so the results are sorted in descending order of weight.
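The following sketch (Python; our own illustration rather than the authors' implementation) ties these steps together. It assumes a page_weight(html, url, query) helper implementing the tag-weighting scheme of Section 3.5 (a possible version is sketched in that section), the fetch_page and extract_links placeholders used earlier, and the relevance threshold t > 3 stated in Section 3.5.

from collections import deque

RELEVANCE_THRESHOLD = 3                 # assumption taken from Section 3.5: keep pages with t > 3

def pdd_crawl(seed_url, query, fetch_page, extract_links, page_weight, max_pages=100):
    # Download each page, compute its weight for the query, keep relevant pages,
    # and schedule the extracted URLs for crawling.
    frontier = deque([seed_url])
    visited = set()
    relevant = []                       # (weight, url) pairs for pages set aside
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch_page(url)
        weight = page_weight(html, url, query)
        if weight > RELEVANCE_THRESHOLD:          # relevant page: set it aside
            relevant.append((weight, url))
        for link in extract_links(html, url):     # schedule extracted URLs for later iterations
            if link not in visited:
                frontier.append(link)
    # the heaviest page is the most relevant, so sort in descending order of weight
    return sorted(relevant, key=lambda pair: pair[0], reverse=True)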
3.4 Architecture Components
1. Page repository: the web pages that have been traversed and parsed. It may be temporary storage from which the crawler can reference the pages, or the browser's cache, from which pages can be referenced for faster access.
2. Page parsing mechanism: takes pages one at a time from the page repository, searches for the keyword or string in each page, and assigns a weight to the page accordingly.
3. Content analysis: the final phase, which decides whether a page is relevant or has to be discarded. Relevance is decided on the basis of the relevance threshold maintained in the algorithm.
3.5 Working of the "PDD Crawler"
The proposed "PDD crawler" works on the source code of a page: the downloader uses the source code to analyse the tags and contents of the page, obtain the page weight, and calculate the degree of relevance of the page. To make a simple web page, the following tags are needed:
1. <HTML> tells the browser that the page is written in HTML format.
2. <HEAD> is a kind of preface of vital information that does not appear on the screen.
3. <TITLE> holds the title of the web page; this is the information viewers see on the upper bar of their screen.
4. <BODY> is where the content of the page goes: the words and pictures that people read on the screen.
5. <META> is used to provide structured metadata about a web page. Multiple meta elements with different attributes are often used on the same page. Meta elements can specify the page description, keywords and any other metadata not provided through the other head elements and attributes.
Consider the source code associated with the web page at the URL
http://www.myblogindia.com/html/default.asp, given below:
<html>
<head>
<meta name="description" content="Free HTML Web tutorials">
<meta name="keywords" content="HTML, CSS, XML">
<meta name="author" content="RGCER">
<meta charset="UTF-8">
<title>HTML title of page</title>
</head>
<body>
This is my very own HTML page. This page is just for reference.
</body>
</html>
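The occurrence counts that the weighting scheme needs can be gathered directly from source code like the sample above. The following is a minimal sketch using Python's standard html.parser module; it is an illustration under the assumptions of this paper, not the authors' parser. For the query "html" on the sample page it yields the same counts as the worked example below (2 in META, 1 in TITLE, 1 in BODY, 0 in headings), with the URL occurrence counted directly on the URL string.

from html.parser import HTMLParser

class TagOccurrenceCounter(HTMLParser):
    """Counts case-insensitive occurrences of a search string inside the
    tags the weighting scheme cares about (META, TITLE, headings, BODY)."""

    TRACKED = {"title", "body", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self, query):
        super().__init__()
        self.query = query.lower()
        self.counts = {"meta": 0, "title": 0, "heading": 0, "body": 0}
        self._open = []  # stack of tracked tags currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._open.append(tag)
        elif tag == "meta":
            # META information lives in attribute values, not in text nodes.
            for _, value in attrs:
                if value:
                    self.counts["meta"] += value.lower().count(self.query)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        if not self._open:
            return
        hits = data.lower().count(self.query)
        tag = self._open[-1]
        if tag == "title":
            self.counts["title"] += hits
        elif tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.counts["heading"] += hits
        else:
            self.counts["body"] += hits

# Usage on the sample page (page_source holds the HTML shown above):
# counter = TagOccurrenceCounter("html"); counter.feed(page_source)
# counter.counts  -> {'meta': 2, 'title': 1, 'heading': 0, 'body': 1}
# Nu is counted directly on the URL string: url.lower().count("html") -> 1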
The downloader obtains this source code from the page. Downloading a page refers to fetching its source code, analysing it and performing some actions on it; this is what we call parsing. The analysis of the source code proceeds as follows:
1. Let the total weight of the page be t units.
2. The BODY tag has weight B units.
3. The TITLE tag has weight T units.
4. The META tag has weight M units.
5. The heading tags (h1 through h6) have weight H units.
6. The URL has weight U units.
7. Let the number of occurrences of the search string in BODY be Nb.
8. Let the number of occurrences of the search string in TITLE be Nt.
9. Let the number of occurrences of the search string in META be Nm.
10. Let the number of occurrences of the search string in the headings be Nh.
11. Let the number of occurrences of the search string in the URL be Nu.
The total weight of the page is then:
t = (Nb*B) + (Nt*T) + (Nm*M) + (Nh*H) + (Nu*U)
Assumptions for calculating the page weight are: M = 5 units, U = 4 units, T = 3 units, H = 2 units, B = 1 unit. Suppose the search string the user entered is "html" (not case-sensitive). The number of occurrences of "html" in the tags is: Nb = 1, Nt = 1, Nm = 2, Nh = 0, Nu = 1. The total weight of the page comes out to be:
t = (1*1) + (1*3) + (2*5) + (0*2) + (1*4)
t = 1 + 3 + 10 + 0 + 4
t = 18
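The same computation can be written as a small function. The tag weights below are the assumed values stated above, and the occurrence counts are those of the sample page; the function reproduces t = 18.

# Tag weights as assumed above: M = 5, U = 4, T = 3, H = 2, B = 1.
WEIGHTS = {"meta": 5, "url": 4, "title": 3, "heading": 2, "body": 1}

def page_weight(occurrences, weights=WEIGHTS):
    """t = sum over tags of (occurrences of the search string in the tag * tag weight)."""
    return sum(weights[tag] * count for tag, count in occurrences.items())

counts = {"meta": 2, "url": 1, "title": 1, "heading": 0, "body": 1}  # sample page, query "html"
print(page_weight(counts))  # -> 18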
Conclusion:
1. The page weight t > 3, so the page is relevant.
2. A page is relevant if and only if t > 3.
3. All pages with t <= 3 are discarded.
4. The scheme only works for static pages.
5. Static pages are those that do not change, update or alter their content on a regular basis, i.e. the data on such pages changes either periodically or not at all.
6. Some pages are non-crawlable (e.g. pages whose META tag addresses robots); a simple check for this is sketched after this list.
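As an illustration of point 6, a crawler might skip such pages by inspecting the robots META tag. This is a simplified, hypothetical check, not part of the proposed algorithm:

import re

def is_crawlable(page_source):
    """Return False if the page carries a robots META tag asking crawlers
    not to index or follow it (simplified, illustrative check)."""
    meta = re.search(r'<meta[^>]+name=["\']robots["\'][^>]*>', page_source, re.IGNORECASE)
    return not (meta and re.search(r'noindex|nofollow|none', meta.group(0), re.IGNORECASE))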
4. RESULTS
The proposed PDD Crawler was implemented on the following hardware and software: OS - Microsoft Windows 7 Ultimate, Version 6.1.7600 Build 7600; Processor - Intel(R) Core(TM) i5-3230M CPU @ 2.60 GHz. Its performance was compared with that of the intelligent web crawler, a link-based crawler. After running both crawlers in the same hardware and software environment with the same input parameters, the precision rates shown in Table 1 were observed. The performance of the proposed framework was evaluated against the intelligent web crawler by applying both crawlers to the same problem domains: Book show, Book to read, Cricket match and Match making. The experiments were conducted on web data covering different social aspects, containing the same words with different meanings, e.g. "book" meaning the reservation of some resource for the domain Book show, versus "book" meaning a physical object made of bound pages for the domain Book to read. From these test cases it is evident that, if correct seed URLs are provided according to the domain sense of the word, then:
1. The precision ranges from 20% to 70%, which is fairly acceptable for the crawl.
2. The test results show that the search is narrowed down and very specific results are obtained, in sorted order.
3. Sorting the results causes the most relevant link to be displayed first to the user, saving valuable search time.
Table 1: Result Comparisons
4. The testing of the framework was carried out on the live web; judging by the reported precision, the accuracy of the proposed focused web crawler is promising. A small illustrative precision calculation follows this list.
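Precision here is the usual ratio of relevant pages retrieved to the total pages retrieved. The counts in the snippet below are made up purely for illustration; they are not taken from Table 1.

def precision(relevant_retrieved, total_retrieved):
    """Precision = relevant pages retrieved / total pages retrieved."""
    return relevant_retrieved / total_retrieved if total_retrieved else 0.0

# Hypothetical example: 14 relevant links out of 20 crawled -> 0.7,
# i.e. the upper end of the 20-70% range reported above.
print(precision(14, 20))  # -> 0.7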
The above test results depend entirely on the seed URLs. The seed URLs provide the correct domain for the search and help initiate and guide the crawl within the domain of interest. Thereafter, the focused nature of the proposed crawler steers the search towards target links.
5. CONCLUSION AND FUTURE WORK
The main advantage of the proposed crawler over the intelligent crawler and other focused crawlers is that it does not need any relevance feedback or training procedure in order to act intelligently. Two kinds of change were found when comparing the results of the two crawlers:
1. The number of extracted documents was reduced: link analysis removed a great deal of irrelevant web pages.
2. Crawling time was reduced: once the irrelevant web pages were deleted, the crawling load decreased.
In conclusion, with link analysis and page analysis in the proposed crawler, crawling precision is increased and the crawling rate is also improved. This will be an important tool for search engines and will thus facilitate newer versions of search engines.
AUTHORS
Prashant Dahiwale is a doctoral research fellow in Computer Science and Engineering at G H Raisoni College of Engineering and Research, Nagpur, under RTM Nagpur University. He has 5 years of experience teaching M.Tech and B.E. Computer Science and Engineering courses and 1 year of industrial experience. His research area is information retrieval and web mining. He completed a post-PG diploma from CDAC in 2010, an M.Tech in Computer Science and Engineering from Visvesvaraya National Institute of Technology (VNIT), Nagpur, in 2009, and a B.E. from RTM Nagpur University, Nagpur, in 2006. He has published various research papers in international and national journals and conferences, has been a four-time finalist of the Asia regional ACM ICPC since 2010, and has organised various workshops on programming language fundamentals and advanced concepts.
Dr M M Raghuwanshi is a Professor in the Department of Computer Science and Engineering, RGCER, Nagpur, India. He has more than 25 years of teaching experience and 5 years of industrial experience. His research area is genetic algorithms. He has chaired multiple international and national conferences and is an IEEE reviewer. He holds a Ph.D. from Visvesvaraya National Institute of Technology, Nagpur, and an M.Tech from IIT Kharagpur. He is the author of a textbook on algorithms and data structures and has taught multiple subjects in M.Tech and B.E. courses. He has organised multiple workshops on theory of computation and genetic algorithms and has published various research papers in international and national journals and conferences.