SlideShare a Scribd company logo
1 of 17
Download to read offline
• Captcha for learning and collecting data
GOOLE ENERGY REDUCTION
GOOLE ENERGY REDUCTION
RECAPTCHA
RECAPTCHA
RECAPTCHA
MACHINE LEARNING
web crawlers part 2
ISYOUR BROWSER UNIQUE?
• https://amiunique.org
• https://panopticlick.eff.org
ROBOTS.TXT
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
SITEMAPS
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/</loc>
<lastmod>2005-01-01</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
</urlset>
CRAWLERTRAPS
• Captcha
• Hidden fields
• Hidden in CSS
ONION ;)
import socks
import socket
from urllib.request import urlopen
socks.set_default_proxy(socks.SOCKS5, "localhost", 9150)
socket.socket = socks.socksocket
print(urlopen('http://icanhazip.com').read())
URLOPEN
from urllib.request import urlopen
html = urlopen("http://sampleshop.pl/product/happy-ninja-2/")
print(html.read())
“BeautifulSoup tries to make sense of the nonsensical;
it helps format and organize the messy web by fixing
bad HTML and presenting us with easily-traversible
Python objects representing XML structures.”
BEAUTIFUL SOUP
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://sampleshop.pl")
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)
REQUESTS
import requests
from bs4 import BeautifulSoup
session = requests.Session()
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
'Accept': ‘text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,
*/*;q=0.8'
}
url = "https://www.whatismybrowser.com/developers/
what-http-headers-is-my-browser-sending"
req = session.get(url, headers = headers)
bsObj = BeautifulSoup(req.text, "html.parser")
print(bsObj.find("table", {"class": "table-striped"}).get_text)
REQUESTS
import requests
from bs4 import BeautifulSoup
session = requests.Session()
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
'Accept': ‘text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,
*/*;q=0.8'
}
url = "http://sampleshop.pl/shop"
req = session.get(url, headers = headers)
bsObj = BeautifulSoup(req.text, "html.parser")
print(bsObj.find("p", {"class": "site-description"}).get_text())

More Related Content

What's hot

Apache Web Services
Apache Web ServicesApache Web Services
Apache Web Serviceslkurriger
 
BDD / cucumber /Capybara
BDD / cucumber /CapybaraBDD / cucumber /Capybara
BDD / cucumber /CapybaraShraddhaSF
 
NodeWay in my project & sails.js
NodeWay in my project & sails.jsNodeWay in my project & sails.js
NodeWay in my project & sails.jsDmytro Ovcharenko
 
Integration made easy with Apache Camel
Integration made easy with Apache CamelIntegration made easy with Apache Camel
Integration made easy with Apache CamelRosen Spasov
 
Spring Booted, But... @JCConf 16', Taiwan
Spring Booted, But... @JCConf 16', TaiwanSpring Booted, But... @JCConf 16', Taiwan
Spring Booted, But... @JCConf 16', TaiwanPei-Tang Huang
 
PHP MVC Tutorial 2
PHP MVC Tutorial 2PHP MVC Tutorial 2
PHP MVC Tutorial 2Yang Bruce
 
Handling 10k requests per second with Symfony and Varnish - SymfonyCon Berlin...
Handling 10k requests per second with Symfony and Varnish - SymfonyCon Berlin...Handling 10k requests per second with Symfony and Varnish - SymfonyCon Berlin...
Handling 10k requests per second with Symfony and Varnish - SymfonyCon Berlin...Alexander Lisachenko
 
Varnish more than a cache
Varnish more than a cacheVarnish more than a cache
Varnish more than a cachebloeffeld
 
Selenium Tips & Tricks - StarWest 2015
Selenium Tips & Tricks - StarWest 2015Selenium Tips & Tricks - StarWest 2015
Selenium Tips & Tricks - StarWest 2015Andrew Krug
 
Puppet Camp London 2014: MCollective as an Integration Layer
Puppet Camp London 2014: MCollective as an Integration LayerPuppet Camp London 2014: MCollective as an Integration Layer
Puppet Camp London 2014: MCollective as an Integration LayerPuppet
 
EasyEngine - Command-Line tool to manage WordPress Sites on Nginx
EasyEngine - Command-Line tool to manage WordPress Sites on NginxEasyEngine - Command-Line tool to manage WordPress Sites on Nginx
EasyEngine - Command-Line tool to manage WordPress Sites on NginxrtCamp
 
Single page apps with drupal 7
Single page apps with drupal 7Single page apps with drupal 7
Single page apps with drupal 7Chris Tankersley
 
플랫폼 통합을 위한 Client Module 개발 & 배포
플랫폼 통합을 위한 Client Module 개발 & 배포플랫폼 통합을 위한 Client Module 개발 & 배포
플랫폼 통합을 위한 Client Module 개발 & 배포흥래 김
 
CIRCUIT 2015 - Monitoring AEM
CIRCUIT 2015 - Monitoring AEMCIRCUIT 2015 - Monitoring AEM
CIRCUIT 2015 - Monitoring AEMICF CIRCUIT
 
Integration Of Mulesoft and Apache Active MQ
Integration Of Mulesoft and Apache Active MQIntegration Of Mulesoft and Apache Active MQ
Integration Of Mulesoft and Apache Active MQGaurav Talwadker
 
Packing it all: JavaScript module bundling from 2000 to now
Packing it all: JavaScript module bundling from 2000 to nowPacking it all: JavaScript module bundling from 2000 to now
Packing it all: JavaScript module bundling from 2000 to nowDerek Willian Stavis
 
[20180226] I understand webpacker perfectly
[20180226] I understand webpacker perfectly[20180226] I understand webpacker perfectly
[20180226] I understand webpacker perfectlyYuta Suzuki
 

What's hot (20)

Apache Web Services
Apache Web ServicesApache Web Services
Apache Web Services
 
Sails.js Intro
Sails.js IntroSails.js Intro
Sails.js Intro
 
Nuxt.js - Introduction
Nuxt.js - IntroductionNuxt.js - Introduction
Nuxt.js - Introduction
 
BDD / cucumber /Capybara
BDD / cucumber /CapybaraBDD / cucumber /Capybara
BDD / cucumber /Capybara
 
NodeWay in my project & sails.js
NodeWay in my project & sails.jsNodeWay in my project & sails.js
NodeWay in my project & sails.js
 
Integration made easy with Apache Camel
Integration made easy with Apache CamelIntegration made easy with Apache Camel
Integration made easy with Apache Camel
 
Spring Booted, But... @JCConf 16', Taiwan
Spring Booted, But... @JCConf 16', TaiwanSpring Booted, But... @JCConf 16', Taiwan
Spring Booted, But... @JCConf 16', Taiwan
 
PHP MVC Tutorial 2
PHP MVC Tutorial 2PHP MVC Tutorial 2
PHP MVC Tutorial 2
 
Handling 10k requests per second with Symfony and Varnish - SymfonyCon Berlin...
Handling 10k requests per second with Symfony and Varnish - SymfonyCon Berlin...Handling 10k requests per second with Symfony and Varnish - SymfonyCon Berlin...
Handling 10k requests per second with Symfony and Varnish - SymfonyCon Berlin...
 
Varnish more than a cache
Varnish more than a cacheVarnish more than a cache
Varnish more than a cache
 
Selenium Tips & Tricks - StarWest 2015
Selenium Tips & Tricks - StarWest 2015Selenium Tips & Tricks - StarWest 2015
Selenium Tips & Tricks - StarWest 2015
 
Puppet Camp London 2014: MCollective as an Integration Layer
Puppet Camp London 2014: MCollective as an Integration LayerPuppet Camp London 2014: MCollective as an Integration Layer
Puppet Camp London 2014: MCollective as an Integration Layer
 
EasyEngine - Command-Line tool to manage WordPress Sites on Nginx
EasyEngine - Command-Line tool to manage WordPress Sites on NginxEasyEngine - Command-Line tool to manage WordPress Sites on Nginx
EasyEngine - Command-Line tool to manage WordPress Sites on Nginx
 
Single page apps with drupal 7
Single page apps with drupal 7Single page apps with drupal 7
Single page apps with drupal 7
 
플랫폼 통합을 위한 Client Module 개발 & 배포
플랫폼 통합을 위한 Client Module 개발 & 배포플랫폼 통합을 위한 Client Module 개발 & 배포
플랫폼 통합을 위한 Client Module 개발 & 배포
 
CIRCUIT 2015 - Monitoring AEM
CIRCUIT 2015 - Monitoring AEMCIRCUIT 2015 - Monitoring AEM
CIRCUIT 2015 - Monitoring AEM
 
Integration Of Mulesoft and Apache Active MQ
Integration Of Mulesoft and Apache Active MQIntegration Of Mulesoft and Apache Active MQ
Integration Of Mulesoft and Apache Active MQ
 
Packing it all: JavaScript module bundling from 2000 to now
Packing it all: JavaScript module bundling from 2000 to nowPacking it all: JavaScript module bundling from 2000 to now
Packing it all: JavaScript module bundling from 2000 to now
 
[20180226] I understand webpacker perfectly
[20180226] I understand webpacker perfectly[20180226] I understand webpacker perfectly
[20180226] I understand webpacker perfectly
 
10 common cf server challenges
10 common cf server challenges10 common cf server challenges
10 common cf server challenges
 

Similar to Web crawlers part-2-20161104

DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ...
DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ...DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ...
DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ...Aman Kohli
 
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014Amazon Web Services
 
From Zero to Performance Hero in Minutes - Agile Testing Days 2014 Potsdam
From Zero to Performance Hero in Minutes - Agile Testing Days 2014 PotsdamFrom Zero to Performance Hero in Minutes - Agile Testing Days 2014 Potsdam
From Zero to Performance Hero in Minutes - Agile Testing Days 2014 PotsdamAndreas Grabner
 
Web Performance Optimierung - DWX13
Web Performance Optimierung - DWX13Web Performance Optimierung - DWX13
Web Performance Optimierung - DWX13Walter Ebert
 
[MeetUp][1st] 오픈소스를 활용한 xflow 수집-시각화
[MeetUp][1st] 오픈소스를 활용한 xflow 수집-시각화[MeetUp][1st] 오픈소스를 활용한 xflow 수집-시각화
[MeetUp][1st] 오픈소스를 활용한 xflow 수집-시각화InfraEngineer
 
High performance website
High performance websiteHigh performance website
High performance websiteChamnap Chhorn
 
Website Performance Basics
Website Performance BasicsWebsite Performance Basics
Website Performance Basicsgeku
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站areyouok
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站topgeek
 
EscConf - Deep Dive Frontend Optimization
EscConf - Deep Dive Frontend OptimizationEscConf - Deep Dive Frontend Optimization
EscConf - Deep Dive Frontend OptimizationJonathan Klein
 
Front End Website Optimization
Front End Website OptimizationFront End Website Optimization
Front End Website OptimizationGerard Sychay
 
A web perf dashboard up & running in 90 minutes presentation
A web perf dashboard up & running in 90 minutes presentationA web perf dashboard up & running in 90 minutes presentation
A web perf dashboard up & running in 90 minutes presentationJustin Dorfman
 
Consuming REST Services in BizTalk 2010
Consuming REST Services in BizTalk 2010Consuming REST Services in BizTalk 2010
Consuming REST Services in BizTalk 2010Daniel Toomey
 
Csdn Drdobbs Tenni Theurer Yahoo
Csdn Drdobbs Tenni Theurer YahooCsdn Drdobbs Tenni Theurer Yahoo
Csdn Drdobbs Tenni Theurer Yahooguestb1b95b
 
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get Diagnostics
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get DiagnosticsBoris Stoyanov - Troubleshooting the Virtual Router - Run and Get Diagnostics
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get DiagnosticsShapeBlue
 
vdocuments.site_nginx-essential.pdf
vdocuments.site_nginx-essential.pdfvdocuments.site_nginx-essential.pdf
vdocuments.site_nginx-essential.pdfcrezzcrezz
 
How to use database component using stored procedure call
How to use database component using stored procedure callHow to use database component using stored procedure call
How to use database component using stored procedure callprathyusha vadla
 
Rich Portlet Development in uPortal
Rich Portlet Development in uPortalRich Portlet Development in uPortal
Rich Portlet Development in uPortalJennifer Bourey
 

Similar to Web crawlers part-2-20161104 (20)

DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ...
DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ...DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ...
DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ...
 
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
 
From Zero to Performance Hero in Minutes - Agile Testing Days 2014 Potsdam
From Zero to Performance Hero in Minutes - Agile Testing Days 2014 PotsdamFrom Zero to Performance Hero in Minutes - Agile Testing Days 2014 Potsdam
From Zero to Performance Hero in Minutes - Agile Testing Days 2014 Potsdam
 
Web Performance Optimierung - DWX13
Web Performance Optimierung - DWX13Web Performance Optimierung - DWX13
Web Performance Optimierung - DWX13
 
[MeetUp][1st] 오픈소스를 활용한 xflow 수집-시각화
[MeetUp][1st] 오픈소스를 활용한 xflow 수집-시각화[MeetUp][1st] 오픈소스를 활용한 xflow 수집-시각화
[MeetUp][1st] 오픈소스를 활용한 xflow 수집-시각화
 
High performance website
High performance websiteHigh performance website
High performance website
 
Website Performance Basics
Website Performance BasicsWebsite Performance Basics
Website Performance Basics
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站
 
EscConf - Deep Dive Frontend Optimization
EscConf - Deep Dive Frontend OptimizationEscConf - Deep Dive Frontend Optimization
EscConf - Deep Dive Frontend Optimization
 
T5 Oli Aro
T5 Oli AroT5 Oli Aro
T5 Oli Aro
 
Front End Website Optimization
Front End Website OptimizationFront End Website Optimization
Front End Website Optimization
 
A web perf dashboard up & running in 90 minutes presentation
A web perf dashboard up & running in 90 minutes presentationA web perf dashboard up & running in 90 minutes presentation
A web perf dashboard up & running in 90 minutes presentation
 
Consuming REST Services in BizTalk 2010
Consuming REST Services in BizTalk 2010Consuming REST Services in BizTalk 2010
Consuming REST Services in BizTalk 2010
 
Csdn Drdobbs Tenni Theurer Yahoo
Csdn Drdobbs Tenni Theurer YahooCsdn Drdobbs Tenni Theurer Yahoo
Csdn Drdobbs Tenni Theurer Yahoo
 
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get Diagnostics
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get DiagnosticsBoris Stoyanov - Troubleshooting the Virtual Router - Run and Get Diagnostics
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get Diagnostics
 
vdocuments.site_nginx-essential.pdf
vdocuments.site_nginx-essential.pdfvdocuments.site_nginx-essential.pdf
vdocuments.site_nginx-essential.pdf
 
How to use database component using stored procedure call
How to use database component using stored procedure callHow to use database component using stored procedure call
How to use database component using stored procedure call
 
Rich Portlet Development in uPortal
Rich Portlet Development in uPortalRich Portlet Development in uPortal
Rich Portlet Development in uPortal
 
Nginx Essential
Nginx EssentialNginx Essential
Nginx Essential
 

More from Patryk Omiotek

Unlocking Realtime Web Applications - 4Developers Katowice 2023
Unlocking Realtime Web Applications  - 4Developers Katowice 2023Unlocking Realtime Web Applications  - 4Developers Katowice 2023
Unlocking Realtime Web Applications - 4Developers Katowice 2023Patryk Omiotek
 
TensorFlow for beginners
TensorFlow for beginnersTensorFlow for beginners
TensorFlow for beginnersPatryk Omiotek
 
How the Internet of Things will change our lives?
How the Internet of Things will change our lives?How the Internet of Things will change our lives?
How the Internet of Things will change our lives?Patryk Omiotek
 
How to build own IoT Platform
How to build own IoT PlatformHow to build own IoT Platform
How to build own IoT PlatformPatryk Omiotek
 

More from Patryk Omiotek (6)

Unlocking Realtime Web Applications - 4Developers Katowice 2023
Unlocking Realtime Web Applications  - 4Developers Katowice 2023Unlocking Realtime Web Applications  - 4Developers Katowice 2023
Unlocking Realtime Web Applications - 4Developers Katowice 2023
 
TensorFlow for beginners
TensorFlow for beginnersTensorFlow for beginners
TensorFlow for beginners
 
Docker how to
Docker how toDocker how to
Docker how to
 
How the Internet of Things will change our lives?
How the Internet of Things will change our lives?How the Internet of Things will change our lives?
How the Internet of Things will change our lives?
 
How to build own IoT Platform
How to build own IoT PlatformHow to build own IoT Platform
How to build own IoT Platform
 
WordpUp Lublin #1
WordpUp Lublin #1WordpUp Lublin #1
WordpUp Lublin #1
 

Recently uploaded

Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012rehmti665
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一Fs
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMartaLoveguard
 
Call Girls In Mumbai Central Mumbai ❤️ 9920874524 👈 Cash on Delivery
Call Girls In Mumbai Central Mumbai ❤️ 9920874524 👈 Cash on DeliveryCall Girls In Mumbai Central Mumbai ❤️ 9920874524 👈 Cash on Delivery
Call Girls In Mumbai Central Mumbai ❤️ 9920874524 👈 Cash on Deliverybabeytanya
 
VIP Call Girls Kolkata Ananya 🤌 8250192130 🚀 Vip Call Girls Kolkata
VIP Call Girls Kolkata Ananya 🤌  8250192130 🚀 Vip Call Girls KolkataVIP Call Girls Kolkata Ananya 🤌  8250192130 🚀 Vip Call Girls Kolkata
VIP Call Girls Kolkata Ananya 🤌 8250192130 🚀 Vip Call Girls Kolkataanamikaraghav4
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)Christopher H Felton
 
Russian Call Girls in Kolkata Samaira 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Samaira 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls in Kolkata Samaira 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Samaira 🤌 8250192130 🚀 Vip Call Girls Kolkataanamikaraghav4
 
The Intriguing World of CDR Analysis by Police: What You Need to Know.pdf
The Intriguing World of CDR Analysis by Police: What You Need to Know.pdfThe Intriguing World of CDR Analysis by Police: What You Need to Know.pdf
The Intriguing World of CDR Analysis by Police: What You Need to Know.pdfMilind Agarwal
 
Chennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts service
Chennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts serviceChennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts service
Chennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts servicesonalikaur4
 
Sushant Golf City / best call girls in Lucknow | Service-oriented sexy call g...
Sushant Golf City / best call girls in Lucknow | Service-oriented sexy call g...Sushant Golf City / best call girls in Lucknow | Service-oriented sexy call g...
Sushant Golf City / best call girls in Lucknow | Service-oriented sexy call g...akbard9823
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With RoomVIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Roomishabajaj13
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhimiss dipika
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一Fs
 
Low Rate Call Girls Kolkata Avani 🤌 8250192130 🚀 Vip Call Girls Kolkata
Low Rate Call Girls Kolkata Avani 🤌  8250192130 🚀 Vip Call Girls KolkataLow Rate Call Girls Kolkata Avani 🤌  8250192130 🚀 Vip Call Girls Kolkata
Low Rate Call Girls Kolkata Avani 🤌 8250192130 🚀 Vip Call Girls Kolkataanamikaraghav4
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一Fs
 
VIP Kolkata Call Girl Dum Dum 👉 8250192130 Available With Room
VIP Kolkata Call Girl Dum Dum 👉 8250192130  Available With RoomVIP Kolkata Call Girl Dum Dum 👉 8250192130  Available With Room
VIP Kolkata Call Girl Dum Dum 👉 8250192130 Available With Roomdivyansh0kumar0
 
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130  Available With RoomVIP Kolkata Call Girl Alambazar 👉 8250192130  Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Roomdivyansh0kumar0
 

Recently uploaded (20)

Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptx
 
Call Girls In Mumbai Central Mumbai ❤️ 9920874524 👈 Cash on Delivery
Call Girls In Mumbai Central Mumbai ❤️ 9920874524 👈 Cash on DeliveryCall Girls In Mumbai Central Mumbai ❤️ 9920874524 👈 Cash on Delivery
Call Girls In Mumbai Central Mumbai ❤️ 9920874524 👈 Cash on Delivery
 
VIP Call Girls Kolkata Ananya 🤌 8250192130 🚀 Vip Call Girls Kolkata
VIP Call Girls Kolkata Ananya 🤌  8250192130 🚀 Vip Call Girls KolkataVIP Call Girls Kolkata Ananya 🤌  8250192130 🚀 Vip Call Girls Kolkata
VIP Call Girls Kolkata Ananya 🤌 8250192130 🚀 Vip Call Girls Kolkata
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
Russian Call Girls in Kolkata Samaira 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Samaira 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls in Kolkata Samaira 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Samaira 🤌 8250192130 🚀 Vip Call Girls Kolkata
 
The Intriguing World of CDR Analysis by Police: What You Need to Know.pdf
The Intriguing World of CDR Analysis by Police: What You Need to Know.pdfThe Intriguing World of CDR Analysis by Police: What You Need to Know.pdf
The Intriguing World of CDR Analysis by Police: What You Need to Know.pdf
 
Chennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts service
Chennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts serviceChennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts service
Chennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts service
 
Sushant Golf City / best call girls in Lucknow | Service-oriented sexy call g...
Sushant Golf City / best call girls in Lucknow | Service-oriented sexy call g...Sushant Golf City / best call girls in Lucknow | Service-oriented sexy call g...
Sushant Golf City / best call girls in Lucknow | Service-oriented sexy call g...
 
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITM
 
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With RoomVIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Room
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhi
 
Model Call Girl in Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in  Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in  Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Jamuna Vihar Delhi reach out to us at 🔝9953056974🔝
 
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
定制(AUT毕业证书)新西兰奥克兰理工大学毕业证成绩单原版一比一
 
Low Rate Call Girls Kolkata Avani 🤌 8250192130 🚀 Vip Call Girls Kolkata
Low Rate Call Girls Kolkata Avani 🤌  8250192130 🚀 Vip Call Girls KolkataLow Rate Call Girls Kolkata Avani 🤌  8250192130 🚀 Vip Call Girls Kolkata
Low Rate Call Girls Kolkata Avani 🤌 8250192130 🚀 Vip Call Girls Kolkata
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
 
VIP Kolkata Call Girl Dum Dum 👉 8250192130 Available With Room
VIP Kolkata Call Girl Dum Dum 👉 8250192130  Available With RoomVIP Kolkata Call Girl Dum Dum 👉 8250192130  Available With Room
VIP Kolkata Call Girl Dum Dum 👉 8250192130 Available With Room
 
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130  Available With RoomVIP Kolkata Call Girl Alambazar 👉 8250192130  Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Room
 

Web crawlers part-2-20161104

  • 1. • Captcha for learning and collecting data
  • 8. ISYOUR BROWSER UNIQUE? • https://amiunique.org • https://panopticlick.eff.org
  • 10. SITEMAPS <?xml version="1.0" encoding="UTF-8"?> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> <url> <loc>http://www.example.com/</loc> <lastmod>2005-01-01</lastmod> <changefreq>monthly</changefreq> <priority>0.8</priority> </url> </urlset>
  • 11. CRAWLERTRAPS • Captcha • Hidden fields • Hidden in CSS
  • 12. ONION ;) import socks import socket from urllib.request import urlopen socks.set_default_proxy(socks.SOCKS5, "localhost", 9150) socket.socket = socks.socksocket print(urlopen('http://icanhazip.com').read())
  • 13. URLOPEN from urllib.request import urlopen html = urlopen("http://sampleshop.pl/product/happy-ninja-2/") print(html.read())
  • 14. “BeautifulSoup tries to make sense of the nonsensical; it helps format and organize the messy web by fixing bad HTML and presenting us with easily-traversible Python objects representing XML structures.”
  • 15. BEAUTIFUL SOUP from urllib.request import urlopen from bs4 import BeautifulSoup html = urlopen("http://sampleshop.pl") bsObj = BeautifulSoup(html.read(), "html.parser") print(bsObj.h1)
  • 16. REQUESTS import requests from bs4 import BeautifulSoup session = requests.Session() headers = { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36', 'Accept': ‘text/html,application/xhtml+xml,application/xml;q=0.9,image/webp, */*;q=0.8' } url = "https://www.whatismybrowser.com/developers/ what-http-headers-is-my-browser-sending" req = session.get(url, headers = headers) bsObj = BeautifulSoup(req.text, "html.parser") print(bsObj.find("table", {"class": "table-striped"}).get_text)
  • 17. REQUESTS import requests from bs4 import BeautifulSoup session = requests.Session() headers = { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36', 'Accept': ‘text/html,application/xhtml+xml,application/xml;q=0.9,image/webp, */*;q=0.8' } url = "http://sampleshop.pl/shop" req = session.get(url, headers = headers) bsObj = BeautifulSoup(req.text, "html.parser") print(bsObj.find("p", {"class": "site-description"}).get_text())