Name: Shubham Pralhad Jaybhaye
ROLL NO: 2111023
SUBJECT: DATA STRUCTURE
CE-1 PRESENTATION
TOPIC: WEB SCRAPING
•What is Web Scraping
•Workflow of a web scraper
•Useful libraries available
•Which library to use for which job
•Legality
WEB SCRAPING
WHAT IS IT?
Web scraping is a technique for gathering data or information from web pages.
You could revisit your favorite website every time it updates to check for
new information, or you could write a web scraper to do it for you!
WEB SCRAPING
•It is a method for extracting data from a website
that does not have an API, or for extracting a LOT
of data that an API would not allow because of
rate limiting.
•Through web scraping we can extract any
data that we can see while browsing the
web.
USAGE
WEB SCRAPING IN REAL LIFE
•Extract product information
•Extract job postings and internships
•Extract offers and discounts from deal-of-the-day websites
•Extract data to make a search engine
•Gather weather data
•etc.
ADVANTAGES
WEB SCRAPING VS. USING AN API
•Web scraping is not bound by an API's rate limits
(although a site can still throttle or block heavy traffic)
•Access the website and gather data anonymously
•Some websites do not have an API
•Some data is not accessible through an API
•and many more!
WORKFLOW
ESSENTIAL PARTS OF WEB SCRAPING
Web scraping follows this workflow:
• Get the website - using an HTTP library
• Parse the HTML document - using any parsing library
• Store the results - in a database, CSV, text file, etc.
We will focus more on parsing.
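The three steps can be sketched with only the standard library. This is a minimal illustration, not the presenter's code: the names `fetch`, `TitleAndLinks`, `parse`, and `store`, as well as the sample page, are assumptions made for the example.

```python
import csv
import io
import urllib.request
from html.parser import HTMLParser


def fetch(url):
    """Step 1: get the page with an HTTP library (not called in this demo)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


class TitleAndLinks(HTMLParser):
    """Step 2: collect the <title> text and every <a href=...> value."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse(html):
    parser = TitleAndLinks()
    parser.feed(html)
    return parser.title, parser.links


def store(title, links, out):
    """Step 3: write one CSV row per link to the file-like object `out`."""
    writer = csv.writer(out)
    writer.writerow(["page_title", "link"])
    for link in links:
        writer.writerow([title, link])


# Hypothetical page standing in for the result of fetch("http://example.com/")
sample = ("<html><head><title>Deals</title></head><body>"
          "<a href='/a'>A</a><a href='/b'>B</a></body></html>")
title, links = parse(sample)
buf = io.StringIO()
store(title, links, buf)
```

In a real scraper, `buf` would be an open file or a database writer, and `sample` would come from `fetch`.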
LIBRARIES
USEFUL LIBRARIES AVAILABLE
•BeautifulSoup (bs4)
•lxml
•selenium
•re
•scrapy
HTTP LIBRARIES
USEFUL LIBRARIES AVAILABLE
• Requests
html = requests.get('https://www.google.com').text
• urllib (urllib2 in Python 2)
html = urllib.request.urlopen('http://python.org/').read()
• httplib/httplib2
h = httplib2.Http(".cache")
(resp_headers, content) = h.request("http://pydelhi.org/", "GET")
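As a rough sketch of the "get the website" step using only the standard library, the helper below sends a `User-Agent` header, applies a timeout, and decodes the body with the charset the server declares. The name `fetch_text`, the user-agent string, and the `data:` URL used for the offline demonstration are all assumptions for illustration.

```python
import urllib.request
from urllib.parse import quote


def fetch_text(url, user_agent="example-scraper/0.1", timeout=10):
    """Download `url` and return the response body as text."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # Fall back to UTF-8 when the response declares no charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Offline demonstration: a data: URL stands in for a real website.
demo_url = "data:text/html;charset=utf-8," + quote("<title>Hi</title>")
page = fetch_text(demo_url)
```

For a real site, pass an `http://` or `https://` URL instead of `demo_url`.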
PARSING LIBRARIES
USEFUL LIBRARIES AVAILABLE
•BeautifulSoup (bs4)
tree = BeautifulSoup(html_doc, 'html.parser')
title = tree.title
•lxml
tree = lxml.html.fromstring(html_doc)
title = tree.xpath('//title/text()')
•re
title = re.findall('<title>(.*?)</title>', html_doc)
BEAUTIFULSOUP
PROS AND CONS !
•We can learn it fast
soup = BeautifulSoup(html_doc, 'html.parser')
last_a_tag = soup.find("a", id="link3")
all_b_tags = soup.find_all("b")
•very easy to use
•purely in Python
•slow :(
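A small, self-contained run of the calls shown on this slide might look like the following. The sample document and its `link1` to `link3` ids are made up for the example, and the parser is set explicitly to Python's built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for this illustration.
html_doc = """
<html><head><title>Story</title></head>
<body>
<p><b>Sisters</b></p>
<a href="/elsie" id="link1">Elsie</a>
<a href="/lacie" id="link2">Lacie</a>
<a href="/tillie" id="link3">Tillie</a>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

page_title = soup.title.string            # text inside <title>
last_a_tag = soup.find("a", id="link3")   # first <a> whose id is "link3"
all_b_tags = soup.find_all("b")           # list of every <b> tag
hrefs = [a["href"] for a in soup.find_all("a")]
```

Attribute access such as `last_a_tag["href"]` and `last_a_tag.get_text()` then gives the link target and its visible text.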
LXML
PROS AND CONS !
The lxml XML toolkit provides Pythonic bindings for the C
libraries libxml2 and libxslt without sacrificing speed.
•very fast
•not pure Python
•If you have no "pure Python" requirement, use
lxml
•lxml has supported Python 2.x and 3.x
(recent releases require Python 3)
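For comparison, here is a minimal sketch of the same kind of lookup with lxml and XPath. The sample document is hypothetical; `//` searches the whole tree, `text()` selects text nodes, and `@href` selects an attribute.

```python
from lxml import html as lxmlhtml

# Hypothetical document used only for this illustration.
html_doc = (
    "<html><head><title>Story</title></head><body>"
    "<a href='/elsie' id='link1'>Elsie</a>"
    "<a href='/tillie' id='link3'>Tillie</a>"
    "</body></html>"
)

tree = lxmlhtml.fromstring(html_doc)

titles = tree.xpath("//title/text()")               # text of every <title>
link3_href = tree.xpath('//a[@id="link3"]/@href')   # href of the <a> with id link3
link_texts = [a.text_content() for a in tree.xpath("//a")]
```

Note that `xpath` always returns a list, even when only one node matches.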
RE
PROS AND CONS !
•requires you to learn its symbols, e.g.
'.', '*', '$', '^', '\b', '\w'
•can become complex
•built into Python
•a part of the standard library
•very fast
•available in every Python version
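The symbols listed above can be seen at work in a short sketch. The sample HTML and the "INR" price format are invented for the example, and `?`, `+`, and `\d` are extra symbols not named on the slide.

```python
import re

# Hypothetical page fragment used only for this illustration.
html_doc = (
    "<html><head><title>Deal of the Day</title></head>"
    "<body><p>Price: 499 INR</p><p>Old price: 799 INR</p></body></html>"
)

# '.' matches any character, '*' repeats it, '?' keeps the match short,
# and the parentheses capture the part we want back.
titles = re.findall(r"<title>(.*?)</title>", html_doc)

# '\b' is a word boundary, '\d' a digit, '+' means one or more.
prices = re.findall(r"\b(\d+) INR\b", html_doc)

# '^' and '$' anchor the start and end, '\w' matches a letter, digit, or '_'.
is_slug = bool(re.match(r"^\w+$", "deal_of_the_day"))
not_slug = bool(re.match(r"^\w+$", "deal of the day"))
```

Patterns like these work well for small, predictable fragments; nested or irregular HTML is where they tend to become complex.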
COMPARISON
BS4 VS. LXML VS. RE
import re
import time
import urllib.request

from bs4 import BeautifulSoup
from lxml import html as lxmlhtml

def timeit(fn, *args):
    # Call fn 100 times and print the total elapsed time in milliseconds.
    t1 = time.time()
    for i in range(100):
        fn(*args)
    t2 = time.time()
    print('%s took %0.3f ms' % (fn.__name__, (t2 - t1) * 1000.0))

def bs_test(html):
    soup = BeautifulSoup(html, 'html.parser')
    return soup.html.head.title

def lxml_test(html):
    tree = lxmlhtml.fromstring(html)
    return tree.xpath('//title')[0].text_content()

def regex_test(html):
    # Same pattern as on the parsing-libraries slide.
    return re.findall('<title>(.*?)</title>', html)[0]

if __name__ == '__main__':
    url = 'http://pydelhi.org'
    # Decode to str so the str regex pattern can search the page.
    html = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
    for fn in (bs_test, lxml_test, regex_test):
        timeit(fn, html)
RESULT
•manoj@manoj:~/Desktop$ python test.py
•bs_test took 1851.457 ms
•lxml_test took 232.942 ms
•regex_test took 7.186 ms
•lxml took about 32x more time than re;
BeautifulSoup took about 258x more time than re!
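The slowdown factors can be recomputed directly from the three timings printed above. The dictionary below simply restates those numbers; the variable names are chosen for the example.

```python
# Timings in milliseconds, copied from the benchmark output above.
timings_ms = {"bs_test": 1851.457, "lxml_test": 232.942, "regex_test": 7.186}

baseline = timings_ms["regex_test"]
ratios = {name: ms / baseline for name, ms in timings_ms.items()}

for name, ratio in ratios.items():
    print("%s: %.1fx the time of regex_test" % (name, ratio))
```

This gives roughly 32x for lxml and roughly 258x for BeautifulSoup relative to the regular expression.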
Is Web scraping legal?
•In short, web scraping in itself is not illegal.
However, some rules need to be followed. For example,
web scraping can become illegal when data that is not
publicly available is extracted.
THANK YOU

Web Scraping Basics and Useful Libraries
