Rotto Link Web Crawler 
Summer Internship Report 
Submitted By 
Akshay Pratap Singh 
2011ECS01 
Under the Guidance of 
Mr. Saurabh Kumar 
Senior Developer 
Ophio Computer Solutions Pvt. Ltd. 
in partial fulfillment of Summer Internship for the award of the degree 
of 
Bachelor of Technology 
in 
Computer Science & Engineering 
SHRI MATA VAISHNO DEVI UNIVERSITY 
KATRA, JAMMU & KASHMIR 
JUNE - JULY
2014
UNDERTAKING 
I hereby certify that the Colloquium Report entitled “Rotto Link Web Crawler”, submitted in partial fulfillment of the requirements for the award of Bachelor of Technology in Computer Science and Engineering to the School of Computer Science & Engineering of Shri Mata Vaishno Devi University, Katra, J&K, is an authentic record of my own study carried out during the period June to July 2014. 
The matter presented in this report has not been submitted by me for the award of any other degree elsewhere. The content of the report does not violate any copyright, and due credit is given to the source of information, if any. 
Name: Akshay Pratap Singh 
Entry Number: 2011ECS01 
Place: SMVDU, Katra 
Date: 1 December 2014
Certificate
About the Company 
Ophio is a private company where a team of passionate, dedicated programmers and creative designers develops outstanding services and applications for the Web, iPhone, iPad, Android and the Mac. 
Their approach is simple: they take well-designed products and make them function beautifully. The company specializes in creating unique, immersive and stunning web and mobile applications, videos and animations. At Ophio, they turn digital dreams into reality. 
The Ophio team is a core group of skilled development experts, which allows them to bring projects to life and add an extra dimension of interactivity to all their work. Whether it is responsive builds, CMS sites, microsites or full e-commerce systems, they have the team to create superb products. 
With the launch of the iPhone 5 and the opening of the 3G spectrum in Asian countries, there is a huge demand for iPhones. Ophio helps product owners reach their customers with creative and interactive applications built by its team of experts. 
The future lies in open source. The Android platform is robust and open and is meant for rapid development; developers at Ophio exploit it fully and build content-rich applications for mobile devices. 
Ophio comprises a team of 20 members, of which 16 are developers, 2 are motion designers and 2 are QA analysts. The team is best described as youthful, ambitious, amiable and passionate, delivering high-quality work on time and bringing value to every project and client.
Table of Contents 
1. Abstract 
2. Project Description 
2.1 Web Crawler 
2.2 Rotto Link Web Crawler 
2.3 Back­End 
2.3.1 Web Api 
2.3.2 Crawler Module 
2.3.2.1 GRequests 
2.3.2.2 BeautifulSoup 
2.3.2.3 NLTK 
2.3.2.4 SqlAlchemy 
2.3.2.5 Redis 
2.3.2.6 LogBook 
2.3.2.7 SMTP 
2.4 Front­End 
2.5 Screenshots of Application 
3. Scope 
4. Conclusion
Abstract 
The Rotto (a.k.a. broken) link web crawler is an application that extracts broken links (i.e. dead links) from a complete website. The application takes a seed link, the URL of the website to be crawled, visits every page of that website and searches for broken links. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds the non-broken hyperlinks to the list of URLs to visit, called the crawl frontier (i.e. the worker queue). Broken links are added to the database along with the keywords matched in the content of the web page. This process continues until the site has been crawled completely. 
The application follows the REST architecture for its web API. The Web API takes a target/seed URL and a set of keywords to be searched in pages that have broken hyperlinks, and returns a result containing the set of links to pages that have broken hyperlinks. 
The Web API has two endpoints, which perform two actions: 
● An HTTP GET request containing a seed/target URL and an array of keywords in JSON form. This returns a job ID to the user, which can be used to retrieve the results. 
● An HTTP GET request containing a job ID. This returns the result as the set of pages that match the keywords sent in the earlier request and that also contain broken links. 
All requests and responses are in JSON form. 
The application uses two Redis workers: one dispatches pending websites from the database to a worker queue, and the other crawls the websites in the queue. As the crawler visits all pages of a website, it stores the results in the database under the corresponding job ID. 
The application uses the SQLite engine, with SQLAlchemy handling database operations. For the UI, the application interface is designed with the AngularJS framework.
Project Description 
2.1 Web Crawler 
A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these 
URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called 
the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. If 
the crawler is performing archiving of websites, it copies and saves the information as it goes. 
Such archives are usually stored such that they can be viewed, read and navigated as they were 
on the live web, but are preserved as 'snapshots'. 
The large volume implies that the crawler can only download a limited number of the Web pages 
within a given time, so it needs to prioritize its downloads. The high rate of change implies that 
the pages might have already been updated or even deleted. 
Fig. A sequence flow of a web crawler
2.2 Rotto Link Web Crawler 
The Rotto (a.k.a. broken) link web crawler extracts the broken links (i.e. dead links) from a complete website. The application takes a seed link, the URL of the website to be crawled, visits every page of that website and searches for broken links. As the crawler visits these URLs, it identifies all the hyperlinks in each web page and splits them into two groups: 
● internal links (i.e. links that refer to pages within the same site), and 
● external links (i.e. links that refer to pages outside the site). 
All hyperlinks are then checked by requesting only the response headers and testing whether the hyperlink returns a 404 error. Here, HTTP status code 404 is treated as broken. All internal non-broken hyperlinks are pushed into the list of URLs to visit, called the crawl frontier (i.e. the worker queue). 
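A minimal sketch of this header-based check, using the `requests` library (the application itself batches such checks through GRequests, described in section 2.3.2.1); the helper name here is illustrative:

```python
import requests

def is_broken(url, timeout=10):
    """Return True if the URL answers with HTTP 404, the status treated as broken."""
    try:
        # A HEAD request fetches only the response headers, not the page body.
        response = requests.head(url, allow_redirects=True, timeout=timeout)
        return response.status_code == 404
    except requests.RequestException:
        # Network errors are not counted here; only an explicit 404 marks a link broken.
        return False
```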
The content of each web page is also extracted and cleaned so that it can be searched for the keywords supplied by the user. For searching/matching keywords in the page content, a well-known algorithm is implemented: the Aho-Corasick string matching algorithm. 
“The Aho–Corasick string matching algorithm is a string searching algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all patterns simultaneously. The complexity of the algorithm is linear in the length of the patterns plus the length of the searched text plus the number of output matches.” 
A separate Python module implements this algorithm for the application. The module has a class named AhoCorasick, with methods to search a list of keywords in a text, and the crawler uses it to match keywords against the content of each web page. If broken links are found in a page, the matched keywords are stored in the database along with the list of all broken links on that page. 
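A minimal sketch of what such a module might look like; the class name `AhoCorasick` and the idea of a search method follow the description above, while the implementation details are assumptions rather than the application's actual code:

```python
from collections import deque

class AhoCorasick:
    """Multi-pattern matcher: builds a trie with failure links, then scans the
    text once, reporting every keyword that occurs in it."""

    def __init__(self, keywords):
        self.goto = [{}]          # outgoing trie edges per node
        self.fail = [0]           # failure link per node
        self.output = [set()]     # keywords recognised at each node
        for word in keywords:
            self._insert(word.lower())
        self._build_failure_links()

    def _insert(self, word):
        node = 0
        for ch in word:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.output.append(set())
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.output[node].add(word)

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())  # depth-1 nodes keep fail = root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                # Follow the parent's failure chain until a node with an edge
                # labelled `ch` is found (or the root is reached).
                fallback = self.fail[node]
                while fallback and ch not in self.goto[fallback]:
                    fallback = self.fail[fallback]
                self.fail[child] = self.goto[fallback].get(ch, 0)
                self.output[child] |= self.output[self.fail[child]]

    def search(self, text):
        """Return the set of keywords that occur anywhere in `text`."""
        found, node = set(), 0
        for ch in text.lower():
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            found |= self.output[node]
        return found

# Example: AhoCorasick(["python", "crawler"]).search("a python link crawler")
# -> {'python', 'crawler'}
```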
This whole process iteratively follows the sequence flow above until all the web pages in the worker queue have been processed. 
The application is primarily divided into two parts: a back-end section that performs the crawling, and a front-end section that collects the input data for a crawl.
Fig. A Flowchart of a Crawler 
2.3 Back-End 
The back end of the application is built on the Flask Python microframework. 
Flask is a lightweight web application framework written in Python, based on the Werkzeug WSGI toolkit and the Jinja2 template engine. It is BSD licensed. 
Flask takes the flexible Python programming language and provides a simple template for web development. Once imported into Python, Flask can be used to save time building web applications. An example of an application powered by the Flask framework is the community web page for Flask.
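A minimal, self-contained Flask application, shown only to illustrate the style of route declaration the back end is built on; the route and handler here are illustrative, not the crawler's actual endpoints:

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/v1.0/ping/", methods=["GET"])
def ping():
    # jsonify turns a Python dict into a JSON HTTP response.
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(debug=True)
```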
The back end of the application further consists of two parts: the REST Web API and the crawler modules. 
● The Web API acts as an interface between the front end and the back end of the application. 
● The crawler modules carry out the actual work: dispatching, scraping, storing data and mailing. 
2.3.1 Web API 
The application's web API conforms to the REST standard and has two main endpoints: one takes the request for a website to be crawled, and the other returns the result when queried with the job ID. The Web API only accepts HTTP JSON requests and responds with a JSON object as output. 
The two endpoints are described in detail below: 
● /api/v1.0/crawl/ - takes as input three fields, i.e. the seed URL of the website, an array of keywords and the e-mail ID of the user, in a JSON object. Returns a JSON object containing the serialized Website class object, with fields such as the website/job ID, status, etc. 
Example of HTTP GET request: 
Example of HTTP GET Response:
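The report illustrated this exchange with screenshots; a hedged sketch using the `requests` library is shown below. The field names (`url`, `keywords`, `email`) and the response shape are assumptions based on the fields described above, not the exact payloads:

```python
import requests

# Hypothetical request payload, following the three fields described above.
payload = {
    "url": "http://example.com",
    "keywords": ["python", "crawler"],
    "email": "user@example.com",
}

# The report describes this endpoint as an HTTP GET carrying a JSON body.
response = requests.get("http://localhost:5000/api/v1.0/crawl/", json=payload)

# A plausible response, serializing the Website model (field names assumed):
# {"id": 1, "url": "http://example.com", "status": "pending",
#  "keywords": ["python", "crawler"], "last_time_crawled": null}
print(response.json())
```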
● /api/v1.0/crawl/<website/job-id> - takes as input a website/job ID appended to the endpoint. Returns a JSON object containing the Website class model object, which includes the fields related to the website described above. The results are returned as a field of this object: an array of hyperlinks to pages that contain broken links, together with the list of broken links on each page and the subset of the user's keywords matched on that page. 
Example of HTTP GET Response:
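Again, the actual response was shown as a screenshot; the sketch below illustrates a plausible shape for the result object, with all field names assumed:

```python
import requests

job_id = 1  # the website/job ID returned by the first endpoint
response = requests.get("http://localhost:5000/api/v1.0/crawl/%d" % job_id)

# A hypothetical result: pages containing broken links, the broken links
# themselves, and the keywords matched on each page.
# {
#   "id": 1,
#   "status": "done",
#   "result": [
#     {"page": "http://example.com/blog/post-1",
#      "broken_links": ["http://example.com/old-page"],
#      "matched_keywords": ["python"]}
#   ]
# }
print(response.json())
```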
The Web API also returns descriptive HTTP errors in the response headers, along with a message describing the error: 
● HTTP Error Code 500 : Internal server error 
● HTTP Error Code 405 : Method not allowed 
● HTTP Error Code 400 : Bad Request 
● HTTP Error Code 404 : Not Found 
Example of HTTP Error Response: 
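A hedged sketch of how a client might observe such an error; the body shown in the comment is illustrative, since the actual message text is not reproduced here:

```python
import requests

response = requests.get("http://localhost:5000/api/v1.0/crawl/9999")
if response.status_code == 404:
    # Hypothetical error body:
    # {"error": "Not Found", "message": "No website found for this job id"}
    print(response.status_code, response.json())
```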
2.3.2 Crawler Module 
The crawler module is the heart of the application. It performs several vital processes: dispatching a set of websites from the database to the worker queue, crawling the web pages popped from the worker queue, storing data in the database and mailing the result link back to the user. 
Several Python packages are used to fetch and manipulate web pages. The packages used in this application are as follows: 
● GRequests - to fetch the content of a web page, given the URL of the page. 
● BeautifulSoup - to extract plain text and links from the contents of a web page. 
● NLTK - to convert UTF-8 text into plain text. 
● SQLAlchemy - an ORM (Object Relational Mapper) for database-intensive tasks. 
● Redis - for implementing the workers that spawn crawling jobs from the worker queue. 
● Logbook - for application logging. 
● smtplib - the Python mailing module used for sending mail. 
2.3.2.1 GRequests 
GRequests allows you to use Requests with Gevent to make asynchronous HTTP requests easily.
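A minimal sketch of the asynchronous batch pattern GRequests enables, here used to head-check several links at once (the URLs are placeholders):

```python
import grequests

urls = ["http://example.com/a", "http://example.com/b"]

# Build unsent requests, then fire them concurrently on gevent greenlets.
pending = (grequests.head(u, allow_redirects=True, timeout=10) for u in urls)
responses = grequests.map(pending)

# A response of None means the request itself failed (e.g. a connection error).
broken = [u for u, r in zip(urls, responses)
          if r is not None and r.status_code == 404]
print(broken)
```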
2.3.2.2 BeautifulSoup 
Beautiful Soup sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, 
and modifying the parse tree.
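A small example of the two operations the crawler needs from BeautifulSoup, extracting hyperlinks and plain text from HTML (the markup here is made up):

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Hello <a href='/about'>about us</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

links = [a.get("href") for a in soup.find_all("a")]  # ['/about']
text = soup.get_text(" ", strip=True)                # 'Hello about us'
print(links, text)
```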
2.3.2.3 NLTK 
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. 
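The application uses NLTK only for text clean-up; as a generic illustration of the library (not the application's actual code), tokenizing a sentence might look like this:

```python
import nltk

nltk.download("punkt", quiet=True)  # tokenizer models, fetched once

text = "Broken links hurt user experience. Crawlers can find them."
tokens = nltk.word_tokenize(text)
# ['Broken', 'links', 'hurt', 'user', 'experience', '.', 'Crawlers', ...]
print(tokens)
```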
2.3.2.4 SqlAlchemy 
SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL. 
It provides a full suite of well-known enterprise-level persistence patterns, designed for efficient and high-performing database access, adapted into a simple and Pythonic domain language. 
Two class models are used for storing data related to users and website results. 
● Website Class Model: fields related to website results. 
○ id : unique id of the website (also used as the job id). 
○ url : root url of the website. 
○ last_time_crawled : time stamp of the last crawl. 
○ status : status of the website. 
○ keywords : keywords to be searched in the web pages. 
○ result : result for the website, in JSON form. 
○ user_id : id of the user who requested the crawl of this website.
● User Class Model: fields related to the user. 
○ id : unique id of the user. 
○ email_id : mail id of the user, to which the result will be mailed. 
○ websites : the websites requested by this user. 
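A hedged sketch of how these two models might be declared with SQLAlchemy's declarative mapping and the SQLite engine mentioned in the abstract; the column types, lengths and database file name are assumptions:

```python
from sqlalchemy import (Column, DateTime, ForeignKey, Integer, String, Text,
                        create_engine)
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    email_id = Column(String(120), unique=True)   # the result link is mailed here
    websites = relationship("Website", back_populates="user")

class Website(Base):
    __tablename__ = "websites"
    id = Column(Integer, primary_key=True)        # doubles as the job id
    url = Column(String(500))                     # root/seed URL of the site
    last_time_crawled = Column(DateTime)
    status = Column(String(20), default="pending")
    keywords = Column(Text)                       # stored, e.g., as JSON text
    result = Column(Text)                         # crawl result in JSON form
    user_id = Column(Integer, ForeignKey("users.id"))
    user = relationship("User", back_populates="websites")

engine = create_engine("sqlite:///crawler.db")    # SQLite, as used by the application
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
```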
2.3.2.5 Redis 
RQ (Redis Queue) is a simple Python library for queueing jobs and processing them in the background with workers. It is backed by Redis and is designed to have a low barrier to entry. It can be integrated into your web stack easily. 
This application uses two Redis (RQ) workers, sketched after this list: 
● DISPATCHER - a worker that pops five websites to be crawled from the database and pushes them into the worker queue. 
● CRAWLER - a worker that pops a web hyperlink from the worker queue, processes the page, extracts the broken links, enqueues newly found hyperlinks to be crawled into the worker queue, inserts the result into the database and mails a link back to the user to access the result.
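A minimal sketch of the enqueueing pattern behind both workers; the job functions here are hypothetical stand-ins for the application's own (in a real deployment they live in an importable module so that an `rq worker` process can pick them up):

```python
from redis import Redis
from rq import Queue

def dispatch_pending_websites(batch_size):
    """Hypothetical dispatcher job: move pending sites into the worker queue."""
    ...

def crawl_page(url):
    """Hypothetical crawler job: fetch one page and record its broken links."""
    ...

queue = Queue(connection=Redis())

# DISPATCHER behaviour: enqueue a batch of five pending websites.
queue.enqueue(dispatch_pending_websites, 5)

# CRAWLER behaviour: each discovered page becomes its own background job.
queue.enqueue(crawl_page, "http://example.com/blog/post-1")
```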
2.3.2.6 LogBook 
Logbook is based on the concept of loggers that are extensible by the application. Each logger and handler, as well as other parts of the system, may inject additional information into the logging record, improving the usefulness of log entries. 
It also supports the ability to inject additional information for all logging calls happening in a specific thread or for the whole application. For example, this makes it possible for a web application to add request-specific information to each log record, such as the remote address, request URL, HTTP method and more.
The logging system is (besides the stack) stateless and makes unit testing it very simple. If context 
managers are used, it is impossible to corrupt the stack, so each test can easily hook in custom log 
handlers. 
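A small illustration of Logbook's handler and context-manager style (the file name and messages are illustrative):

```python
from logbook import FileHandler, Logger

log = Logger("crawler")

# The context manager scopes the handler, so records emitted inside the
# block are written to crawler.log.
handler = FileHandler("crawler.log", level="INFO")
with handler.applicationbound():
    log.info("Crawl started")
    log.warn("Broken link found: http://example.com/old-page")
```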
2.3.2.7 SMTP 
The smtplib module defines an SMTP client session object that can be used to send mail to any 
Internet machine with an SMTP or ESMTP listener daemon.
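A hedged sketch of how the result link might be mailed with smtplib; the SMTP host, credentials and message text are placeholders, not the application's configuration:

```python
import smtplib
from email.mime.text import MIMEText

msg = MIMEText("Your crawl has finished: http://localhost:5000/api/v1.0/crawl/1")
msg["Subject"] = "Rotto Link Web Crawler - results ready"
msg["From"] = "crawler@example.com"
msg["To"] = "user@example.com"

# Placeholder SMTP server and credentials.
with smtplib.SMTP("smtp.example.com", 587) as server:
    server.starttls()
    server.login("crawler@example.com", "password")
    server.send_message(msg)
```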
2.4 Front-End 
To make the application's user interface more interactive, the AngularJS front-end framework is used. 
HTML is great for declaring static documents, but it falters when we try to use it for declaring dynamic views in web applications. AngularJS lets you extend HTML vocabulary for your application. The resulting environment is extraordinarily expressive, readable, and quick to develop. 
AngularJS is a toolset for building the framework most suited to your application development. 
It is fully extensible and works well with other libraries. Every feature can be modified or 
replaced to suit your unique development workflow and feature needs. 
The application UI takes input in three stages: 
● Target Website URL : a valid hyperlink to the website to be crawled. 
● Keywords : keywords to be searched on pages that contain dead links. 
● User Mail : the e-mail ID of the user, to which the result link is mailed once crawling is done. 
When the user submits the form, the UI makes an HTTP GET request to the back-end Web API. The request contains the three input fields described above and their respective values in JSON form.
2.5 Screenshots of Application 
● Input field for the seed URL of the website to be crawled 
● Input field for the set of keywords to be matched 
● Input field for the e-mail ID of the user, to which the result hyperlink is sent once crawling is complete 
● Confirm details and submit request page 
● Result page showing the list of hyperlinks of pages that contain broken links, together with the broken links and the set of keywords matched on each page, in a nested form.
Scope 
Hidden Web data integration is a major challenge nowadays. Because of the autonomous and heterogeneous nature of hidden web content, traditional search engines have become an ineffective way to search this kind of data. They can neither integrate the data nor query the hidden web sites. Hidden Web data needs syntactic and semantic matching to achieve fully automatic integration. 
The Rotto web crawler can be widely used in the web industry to search links and content. Many companies run heavy websites such as news, blogging, educational and government sites. They add a large number of pages and hyperlinks, referring to internal pages or to other websites, every day. Old content on these sites is rarely reviewed by the admin for correctness. As time passes, URLs mentioned in pages turn into dead links, and the admin is never notified. An application like this can be very useful for finding broken links on such websites, helping the site admin maintain content with fewer flaws. 
The application's keyword search service helps the owner of a site find the articles around which links are broken. This helps the owner keep pages on a specific topic error-free. 
This crawler enhances the overall user experience and the robustness of the web platform.
Conclusion 
During the project development, we studied Web crawling at many different levels. Our main objectives were to develop a model for Web crawling, to study crawling strategies and to build a Web crawler implementing them. 
In this work, various challenges in the area of hidden web data extraction and their possible solutions have been discussed. Although this system extracts, collects and integrates data from various websites successfully, the work could be extended in the near future. A search crawler has been created and tested on a particular domain (i.e. text and hyperlinks). This work could be extended to other domains by integrating it with a unified search interface.
