Web Scraping
Submitted By:
Bhawesh Rajpal
Submitted To:
Mr. Kuldeep Yadav
I.T. Department
Content
• What is Web Scraping?
• Why is Web Scraping done?
• How is Web Scraping done?
• References
What is Web Scraping?
• Scraping
Using tools to gather
meaningful data.
A wide range of web
scraping techniques and
tools exist. These can be
as simple as copy/paste
and increase in complexity
to automation tools, HTML
parsing, APIs and
programming.
• HTTP
HyperText Transfer Protocol
The protocol machines use to
exchange information over the
Internet, enabling multimedia
data exchange on the World
Wide Web (WWW). The protocol
defines aspects of authentication,
requests, status codes,
persistent connections,
client/server request/response,
etc.
A client typically accesses a
server on port 80 (port 443 for
HTTPS); the response carries a
document in a declared format
(HTML, XML, JSON, etc.)
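The request/response cycle described above can be sketched with Python's standard library alone. This is a minimal illustration, not a production client: `DemoHandler`, `fetch` and `demo` are hypothetical names, and the server is a throwaway local one bound to a free port (rather than port 80) so the example runs offline.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body><h1>Hello</h1></body></html>"
        self.send_response(200)                         # status code
        self.send_header("Content-Type", "text/html")   # format of the body
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(host, port, path="/"):
    """Send one GET request and return (status, content type, body text)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return (
            response.status,
            response.getheader("Content-Type"),
            response.read().decode("utf-8"),
        )
    finally:
        conn.close()


def demo():
    # Port 0 asks the OS for any free port; a public site would use 80 or 443.
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch("127.0.0.1", server.server_port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

The status code and the Content-Type header are the two things a scraper usually checks before it tries to parse the body.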
• HTML
HyperText Markup
Language
The standard markup
language on the Web
As the web evolves, so
does the proliferation of
technical wrappers
surrounding the visible
content of websites (text
and data).
• Parsing
The act of analyzing
the strings and
symbols to reveal
only the data you
need.
It also means
converting a component
of one type into the
desired type.
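Both senses of parsing can be shown in a few lines. The sketch below assumes a made-up page fragment (`SAMPLE`) in which prices sit inside `<span class="price">`; the class name, `PriceParser` and `extract_prices` are illustrative, not taken from any real site. It uses Python's built-in `html.parser` to keep only the needed data and converts each string into a float.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> as a float."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            # Resolve the string (e.g. "1.50") into the desired type, a float.
            self.prices.append(float(data.strip()))


def extract_prices(html):
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


# Hypothetical markup standing in for a downloaded page.
SAMPLE = """
<ul>
  <li>Pen <span class="price">1.50</span></li>
  <li>Notebook <span class="price">4.25</span></li>
  <li>Bag <span class="stock">12</span></li>
</ul>
"""

if __name__ == "__main__":
    print(extract_prices(SAMPLE))  # [1.5, 4.25]
```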
• Crawling
Moving across or through a
website in an attempt to gather
data from more than one URL or
page.
A web crawler (also known as
a web spider or web robot) is a
program or automated script
which browses the World
Wide Web.
Many legitimate sites, in
particular search engines, use
spidering as a means of
providing up-to-date data.
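A crawler is essentially a queue of URLs plus a set of pages already seen. The sketch below assumes a hypothetical three-page site (`SITE`) kept in memory so that it runs without network access; in practice the `fetch` argument would be a function that downloads the page over HTTP.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical site held in memory so the sketch runs offline.
SITE = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": '<a href="/b.html">B</a> <a href="/">Home</a>',
    "http://example.test/b.html": '<a href="http://other.test/">External</a>',
}


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch):
    """Breadth-first crawl; returns the URLs in the order they were visited."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # page not available, skip it
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


if __name__ == "__main__":
    for page in crawl("http://example.test/", SITE.get):
        print(page)
```

The `seen` set is what stops the crawler from looping forever when pages link back to each other, as the home page and `a.html` do here.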
Why is Web Scraping done?
• To gather data for websites.
• To collect training data.
• For marketing.
• To scrape search engine results for SEO tracking.
• To scrape people's profiles from social networks
to track online reputation.
How is Web Scraping done?
Web scraping can be done in any of the following
ways:
» Manual
» Automated Tools
» By Using Scripts
• Manual
1. Open the website.
2. Open its page
source.
3. Search for the
particular tag.
4. Copy the desired
information.
5. Put it in a file.
• Automated Tools
There is a variety of
automated tools on the
market for which you
just need to specify
the tag, the output file
and its format.
HTTrack
• It is a free and open-source web crawler and offline
browser, designed to download websites.
• HTTrack allows users to download World Wide
Web sites from the Internet to a local
computer. By default, HTTrack arranges the
downloaded site by the original site's relative
link-structure. The downloaded (or "mirrored")
website can be browsed by opening a page of the
site in a browser.
HTTrack (continued)
Image Source: https://en.wikipedia.org/wiki/HTTrack
Import.io
• It is a market-leading SaaS solution, with free and
paid versions available.
• import.io is a web-based platform for extracting
data from websites without writing any code.
• The tool allows people to convert
unstructured web data into a structured format
for use in Machine Learning, Artificial
Intelligence, Retail Price Monitoring, Store
Locators as well as academic and other research.
It is also used extensively by investigative
journalists.
• By Using Scripts
In this method, the user
writes a complete script
to extract the desired
data from the website.
Image source: https://i.stack.imgur.com/UdEFd.jpg
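A script of this kind automates the manual steps: take the page source, search for a particular tag, copy its contents, and put them in a file. The sketch below is one possible shape, using only Python's standard library. `PAGE` is a stand-in for a downloaded page, and `TagTextParser` and `scrape_to_csv` are hypothetical names chosen for illustration.

```python
import csv
import os
import tempfile
from html.parser import HTMLParser

# Stand-in for a downloaded page; a real script would fetch this over HTTP.
PAGE = """
<html><body>
  <h2>Beautiful Soup</h2>
  <h2>Scrapy</h2>
  <p>Not a heading</p>
  <h2>Cheerio</h2>
</body></html>
"""


class TagTextParser(HTMLParser):
    """Collect the text inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self._depth = 0
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._depth += 1
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.items[-1] += data


def scrape_to_csv(html, tag, path):
    """Search for `tag`, copy its text, and put it in a CSV file."""
    parser = TagTextParser(tag)
    parser.feed(html)
    rows = [item.strip() for item in parser.items]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([tag])                     # header row
        writer.writerows([row] for row in rows)    # one value per line
    return rows


if __name__ == "__main__":
    out_path = os.path.join(tempfile.mkdtemp(), "headings.csv")
    print(scrape_to_csv(PAGE, "h2", out_path), "->", out_path)
```

Libraries such as Beautiful Soup and Scrapy package the searching and copying steps behind a more convenient interface, but the overall flow stays the same.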
Using Python
• Beautiful Soup
Image Source: https://first-web-scraper.readthedocs.io/en/latest/
• Using Scrapy
Image Source: https://doc.scrapy.org/en/latest/intro/tutorial.html
Using Node.js
• Cheerio
Image Source: https://www.codementor.io/johnnyb/how-to-write-a-web-scraper-in-nodejs-du108266t
Is it LEGAL????
Maybe…
References
1. https://docs.google.com/presentation/d/1QVUR3B4QDgM5fLBtFditwKyGwij0hM1qDCUL56vs34k/edit#slide=id.p [What is web scraping & basic definitions]
2. https://en.wikipedia.org/wiki/Web_scraping [Basic Definitions]
3. https://en.wikipedia.org/wiki/HTTrack [HTTrack]
4. https://en.wikipedia.org/wiki/Import.io [Import.io]
5. https://first-web-scraper.readthedocs.io/en/latest/ [Beautiful Soup]
6. https://doc.scrapy.org/en/latest/intro/tutorial.html [Scrapy]
7. https://www.codementor.io/johnnyb/how-to-write-a-web-scraper-in-nodejs-du108266t [Cheerio]
Web scraping & browser automation
