MINI PROJECT REPORT
on
WEB SCRAPER IN PHP
INTRODUCTION
• WHAT IS WEB SCRAPING?
• Web scraping is a software technique for extracting information from websites.
Web scraping is closely related to web indexing, which indexes information on the web using
a bot or web crawler and is a universal technique adopted by most search engines. In
contrast, web scraping focuses more on the transformation of unstructured data on the web,
typically in HTML format, into structured data that can be stored and analysed in a central
local database or spreadsheet. Uses of web scraping include online price comparison,
contact scraping, weather data monitoring, website change detection,
research, web mashups and web data integration.
TEAM MEMBERS
1. MANISH BHATTACHARYA
(MRT11UGBCS025)
2. RITESH SINGH
(MRT11UGBCS048)
3. SAIFUR REHMAN
(MRT11UGBCS033)
S.R.S
• SYSTEM REQUIREMENTS
o WINDOWS 7 OS
o 1 GB RAM
• SOFTWARE REQUIREMENTS
o PHP VERSION 5.4.16
o APACHE 2.4.4
o MySQL 5.6.12
HOW IT WORKS?
• Web scraping involves downloading a website's responses (HTML pages
are plain text) and then analysing them with a programming language
to extract the data you need.
• The scraper loads the HTML code of a website, then searches for
particular tags using regular expressions (regex).
• PHP has offered two styles of regular expression:
1) POSIX regular expressions (the ereg functions, deprecated since PHP 5.3)
2) Perl-style regular expressions (the preg functions)
• We have used Perl-style regular expressions, e.g.
o "/<title>(.*)<\/title>/" will extract the title of the web page
(the "/" inside the closing tag must be escaped because "/" is also the pattern delimiter)
• Perl and Python are often recommended for this kind of work.
• It doesn't matter which language you use, as long as it supports
advanced text-parsing capabilities such as regular expressions and
HTML parsers.
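
The title-extraction step described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: extract_title() is a hypothetical helper name, and the sample HTML string stands in for a page that would normally be downloaded first (for example with file_get_contents() or cURL). The pattern uses "#" as its delimiter so that the slash in the closing tag needs no escaping, and a non-greedy group so that matching stops at the first closing tag.

```php
<?php
// Minimal sketch: extract the <title> of a page with a Perl-style (PCRE) regex.
// extract_title() is a hypothetical helper name, not taken from the project.
function extract_title($html)
{
    // i = case-insensitive, s = "." also matches newlines,
    // (.*?) = non-greedy, so matching stops at the first </title>.
    if (preg_match('#<title[^>]*>(.*?)</title>#is', $html, $matches)) {
        return trim($matches[1]);
    }
    return null; // no <title> element found
}

// Sample input; a real scraper would fetch this string from a URL.
$html = "<html><head><title> Example Page </title></head><body></body></html>";
echo extract_title($html), "\n"; // prints "Example Page"
```

Regular expressions work for simple, well-known tags such as the title, but they become fragile on nested or malformed markup, which is where an HTML parser is usually the safer choice.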
HOW THIS IDEA CAME UP?
• MANISH: About five months ago I wrote a crawler to find all the links
on a page (you can find it here:
https://github.com/introvertmac/MY_HOF ).
• I thought of using that concept to create a better crawler that finds links,
metadata, keywords, author, images and more.
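
As a rough sketch of what such a crawler might collect, the example below pulls links, image sources and named meta tags (such as keywords and author) out of an HTML string using PHP's bundled DOM extension. It is an illustration under assumptions, not the project's implementation: scrape_page() is a hypothetical name, the sample HTML and its values are made up, and an HTML parser is used here in place of regular expressions.

```php
<?php
// Sketch only: collect links, images and meta tags from an HTML string
// using PHP's DOM extension. scrape_page() is a hypothetical name.
function scrape_page($html)
{
    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect parser warnings
    // internally instead of printing them, then discard them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = array('links' => array(), 'images' => array(), 'meta' => array());

    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $result['links'][] = $a->getAttribute('href');
        }
    }
    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            $result['images'][] = $img->getAttribute('src');
        }
    }
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Only named meta tags (keywords, author, description, ...) are kept.
        if ($meta->hasAttribute('name')) {
            $name = strtolower($meta->getAttribute('name'));
            $result['meta'][$name] = $meta->getAttribute('content');
        }
    }
    return $result;
}

// Made-up sample page used purely for illustration.
$html = '<html><head><meta name="author" content="Jane Doe">'
      . '<meta name="keywords" content="php, scraper"></head>'
      . '<body><a href="/about">About</a><img src="logo.png"></body></html>';
$page = scrape_page($html);
print_r($page);
```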
USES
• It gives a brief idea of how a search engine works;
search engines work on the same concept.
• They use bots/crawlers/scrapers to fetch web pages,
index them on their own servers and return search
results for keywords.
• If the data feed fails or there is a problem in the code,
you can set up your script to email you an alert so
that you can correct the errors.
• It can also be used for various other purposes:
o For Research
o For Businesses: Market Analysis
o For Marketing: Lead Generation
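
The email alert mentioned above could be wired up along the following lines. This is only a sketch under assumptions: run_with_alert(), the notifier callbacks and the address admin@example.com are all hypothetical, and the demonstration records the alert in an array instead of sending mail so that it runs without a mail server.

```php
<?php
// Sketch: run a scraping step and notify someone if it throws.
// run_with_alert() and both notifier callbacks are illustrative assumptions.
function run_with_alert($task, $notify)
{
    try {
        return call_user_func($task);
    } catch (Exception $e) {
        // Pass a subject and the error message to whatever notifier is given.
        call_user_func($notify, 'Scraper failed', $e->getMessage());
        return null;
    }
}

// In a real deployment the notifier could wrap PHP's mail() function.
// The address is a placeholder; this closure is defined but not called here.
$mailer = function ($subject, $body) {
    mail('admin@example.com', $subject, $body);
};

// For demonstration, record alerts in an array instead of sending email.
$log = array();
$recorder = function ($subject, $body) use (&$log) {
    $log[] = $subject . ': ' . $body;
};

$result = run_with_alert(function () {
    throw new RuntimeException('data feed returned no rows');
}, $recorder);

print_r($log);
```

Passing the notifier in as a callback keeps the scraping logic independent of how alerts are delivered, so the recorder can be swapped for the mail-based notifier without touching run_with_alert().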
IS IT LEGAL?
• Web scraping may be against the terms of use of some
websites.
• The enforceability of these terms is unclear. While
outright duplication of original expression will in many
cases be illegal, courts in the United States have ruled that
duplication of facts is allowable.
• U.S. courts have acknowledged that users of "scrapers"
may be held liable for committing trespass to chattels,
which involves a computer system itself being
considered personal property upon which the user of a
scraper is trespassing.
ANY QUESTIONS?
THANK YOU
