Web scraping (also called web harvesting or web data extraction) is a form of data scraping used to extract data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser.
2. Content
• What is Web Scraping?
• Why Web Scraping is done?
• How Web Scraping is done?
• References
3. What is Web Scraping?
• Scraping
Using tools to gather meaningful data.
A wide range of web scraping techniques and tools exist. These range from something as simple as copy/paste up through automation tools, HTML parsing, APIs, and full programming.
4. • HTTP
HyperText Transfer Protocol
The protocol machines use to exchange information over the Internet, enabling multimedia data exchange on the World Wide Web. The protocol defines authentication, requests, status codes, persistent connections, the client/server request/response cycle, etc.
A client typically accesses a server on port 80, and the server responds with a document in a declarative format (HTML, XML, JSON, etc.).
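To make the request/response cycle concrete, here is a minimal sketch of what an HTTP GET request looks like on the wire; the hostname `example.com` and the path `/index.html` are placeholders, not taken from the slides.

```python
# Sketch of a raw HTTP/1.1 GET request, built as a plain string.
# Hostname and path are placeholder values for illustration.
request = (
    "GET /index.html HTTP/1.1\r\n"  # request line: method, path, protocol version
    "Host: example.com\r\n"         # Host header is mandatory in HTTP/1.1
    "Connection: close\r\n"         # ask the server to close after responding
    "\r\n"                          # blank line terminates the header section
)
print(request)
```

The server would answer with a status line (e.g. `HTTP/1.1 200 OK`), response headers, and then the document body.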
5. • HTML
HyperText Markup Language
The standard markup language of the Web.
As the Web evolves, so does the proliferation of technical wrappers surrounding the visible content of websites (text and data).
6. • Parsing
The act of analyzing strings and symbols to reveal only the data you need.
It also means resolving a component of one particular type into a desired type.
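A small sketch of parsing with Python's standard-library `html.parser`: it walks the tags and keeps only the text inside `<h1>` elements, discarding the surrounding markup. The sample HTML string is made up for illustration.

```python
from html.parser import HTMLParser

# Parse HTML and collect only the text inside <h1> tags,
# ignoring all other markup and content.
class HeadingParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headings.append(data)

parser = HeadingParser()
parser.feed("<html><body><h1>Latest News</h1><p>Story text...</p></body></html>")
print(parser.headings)  # → ['Latest News']
```

The same pattern extends to any tag or attribute you want to isolate: the parser streams through the document and your handlers decide what counts as "the data you need".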
7. • Crawling
Moving across or through a website in an attempt to gather data from more than one URL or page.
A web crawler (also known as a web spider or web robot) is a program or automated script that browses the World Wide Web.
Many legitimate sites, in particular search engines, use spidering as a means of providing up-to-date data.
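The core step that lets a crawler move from page to page is extracting the links a page contains. A sketch of that step, again with the standard-library `html.parser` (the HTML snippet is a made-up sample; a real crawler would first download each page over HTTP and also respect `robots.txt`):

```python
from html.parser import HTMLParser

# Collect the href targets of all <a> tags on a page.
# A crawler adds these URLs to its queue of pages to visit next.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<p><a href="/page1">One</a> and <a href="/page2">Two</a></p>')
print(collector.links)  # → ['/page1', '/page2']
```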
8. Why Web Scraping is done?
• To gather data from websites.
• To collect training data.
• For marketing.
• To scrape search engine results for SEO tracking.
• To scrape people's profiles from social networks for tracking online reputation.
9. How Web Scraping is done?
Web scraping can be done in any of the following ways:
» Manual
» Automated Tools
» By Using Scripts
10. • Manual
1. Open the website.
2. Open its page source.
3. Search for the particular tag.
4. Copy the desired information.
5. Put it in a file.
11. • Automated Tools
There is a variety of automated tools on the market in which you only need to specify the tag, the output file, and its format.
12. HTTrack
• It is a free and open-source web crawler and offline browser, designed to download websites.
• HTTrack allows users to download World Wide Web sites from the Internet to a local computer. By default, HTTrack arranges the downloaded site by the original site's relative link structure. The downloaded (or "mirrored") website can be browsed by opening a page of the site in a browser.
14. Import.io
• It is a market-leading SaaS solution; free and paid versions are available.
• import.io is a web-based platform for extracting data from websites without writing any code.
• The tool allows people to convert unstructured web data into a structured format for use in Machine Learning, Artificial Intelligence, Retail Price Monitoring, and Store Locators, as well as academic and other research. It is also used extensively by investigative journalists.
15. • By Using Scripts
In this method, the user has to write a complete script to extract the desired data from the website.
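A sketch of what such a script can look like end to end: parse a page and write the extracted items to CSV. The HTML below stands in for an already-downloaded page, and the `<li class="item">` structure is an assumption for illustration; in practice the script would first fetch the page over HTTP.

```python
import csv
import io
from html.parser import HTMLParser

# Stand-in for a downloaded page; a real script would fetch this over HTTP.
PAGE = """
<ul>
  <li class="item">Alpha Widget</li>
  <li class="item">Beta Widget</li>
</ul>
"""

# Extract the text of every <li class="item"> element.
class ItemParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.capture = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "item") in attrs:
            self.capture = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.capture = False

    def handle_data(self, data):
        if self.capture and data.strip():
            self.items.append(data.strip())

parser = ItemParser()
parser.feed(PAGE)

# Write the extracted data to CSV (an in-memory buffer here;
# a real script would open an output file instead).
out = io.StringIO()
writer = csv.writer(out)
for item in parser.items:
    writer.writerow([item])
print(out.getvalue())
```

This is the trade-off of the scripting approach: more work up front than an automated tool, but full control over what is extracted and how it is stored.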
Image source:
https://i.stack.imgur.com/UdEFd.jpg