Web scraping using semi-automated browsing
Michael Kurzmeier
TOC
Introduction
Scope and Features
Example application
What is WebDriver?
From W3C Recommendation:
WebDriver is a remote control interface that enables introspection and control of user agents.
Provided is a set of interfaces to discover and manipulate DOM elements in web documents and to control the
behavior of a user agent. It is primarily intended to allow web authors to write tests that automate a user agent
from a separate controlling process, but may also be used in such a way as to allow in-browser scripts to control a
— possibly separate — browser.
https://www.w3.org/TR/webdriver1/
Can it be used for scraping?
WebDriver can:
● Open a browser window
● Open a URL
● Identify the page status (loads, timeout, redirect)
● Identify and interact with elements (click links, extract data, tell whether an element is present or not)
So … sounds good?
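The capabilities listed above can be sketched in a few lines. This is a minimal illustration, not the author's code: `PageStatus` and `probe` are hypothetical names, and the function only relies on calls that Selenium's Python `WebDriver` objects provide (`get`, `current_url`, `find_elements`). Errors are caught broadly so the sketch has no hard dependency on Selenium; with a real driver they would typically be `TimeoutException` or `WebDriverException`.

```python
from dataclasses import dataclass


@dataclass
class PageStatus:
    requested_url: str
    final_url: str = ""
    loaded: bool = False
    redirected: bool = False
    iframe_count: int = 0
    error: str = ""


def probe(driver, url: str) -> PageStatus:
    """Open `url` and report whether it loaded, redirected, or contains iframes."""
    status = PageStatus(requested_url=url)
    try:
        driver.get(url)  # blocks until the page has loaded or the timeout fires
    except Exception as exc:  # broad on purpose, see the note above
        status.error = type(exc).__name__
        return status
    status.loaded = True
    status.final_url = driver.current_url
    # Crude redirect check (assumption): ignore a trailing slash, compare the rest.
    status.redirected = status.final_url.rstrip("/") != url.rstrip("/")
    # "tag name" is a W3C WebDriver locator strategy (the value of Selenium's By.TAG_NAME).
    status.iframe_count = len(driver.find_elements("tag name", "iframe"))
    return status


# Usage with a real browser (assumes Selenium 4 and Firefox are installed):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   driver.set_page_load_timeout(30)
#   print(probe(driver, "https://www.w3.org/TR/webdriver1/"))
#   driver.quit()
```

Passing the driver in as a parameter keeps the check independent of the browser used, so the same function can be exercised with a stand-in object before it is pointed at a live site.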
Pros and Cons
WebDriver actually loads the entire page in a browser - it is slow, but it can handle backend-dependent sites better than wget
It offers more interaction with page elements, but this means you need to know the target site well
It is a “what you see is what you get” solution, which is good and bad at the same time
Target is iframe (the defaced page)
But metadata is outside iframe
Need screenshot to scan captures more quickly
Captchas, JS, popups
URL schema for MD: http://zone-h.com/mirror/id/1002
Page mirror: http://zonehmirrors.org/defaced/alldas/2000/10/12/www.jpbdbkl.gov.my/
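Because the target sits inside an iframe while the metadata sits in the surrounding page, one capture has to read both documents. The sketch below shows one way to do that, assuming the page is already loaded in a Selenium-style driver. `Capture` and `capture_mirror` are hypothetical names, and taking the first iframe on the page is an assumption about the layout, not something the slides state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Capture:
    outer_html: str             # page around the iframe; the metadata lives here
    iframe_html: Optional[str]  # the mirrored (defaced) page, if an iframe exists
    screenshot_path: str


def capture_mirror(driver, screenshot_path: str) -> Capture:
    """Capture the already-loaded page: outer HTML, iframe HTML, screenshot."""
    outer_html = driver.page_source
    iframe_html = None
    frames = driver.find_elements("tag name", "iframe")
    if frames:
        driver.switch_to.frame(frames[0])        # enter the iframe's document
        try:
            iframe_html = driver.page_source     # now the iframe's source
        finally:
            driver.switch_to.default_content()   # always return to the outer page
    driver.save_screenshot(screenshot_path)      # PNG of the visible viewport
    return Capture(outer_html, iframe_html, screenshot_path)
```

`save_screenshot` captures only the visible viewport. Selenium's Firefox driver additionally offers `save_full_page_screenshot`, which is closer to the full-page screenshot the workflow calls for, but it is Firefox-specific.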
Workflow
● Open Browser (profile with Noscript enabled)
● Go to https://www.zone-h.org/mirror/id/{Number}
● If page loads: capture metadata, load iframe
○ If iframe loads: capture iframe, take full-page screenshot
● If multiple pages do not load:
○ Assume you are in a bad batch and increment first by 1-9, then by 10-100
● Continue on until banned for the day (~800 requests) or target reached
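The control flow of this workflow can be sketched without a browser. The version below is an illustration under stated assumptions, not the author's script: `next_increment` and `crawl` are hypothetical names, `capture` stands for any function that opens a URL and returns whether the page loaded, and the thresholds of 3 and 6 consecutive failures for "bad batch" are invented for the example. The URL template and the limit of about 800 requests come from the workflow above.

```python
import random
from typing import Callable, List, Optional, Tuple

URL_TEMPLATE = "https://www.zone-h.org/mirror/id/{number}"


def next_increment(consecutive_failures: int, rng: random.Random,
                   bad_batch_after: int = 3, escalate_after: int = 6) -> int:
    """Pick the step to the next ID; thresholds are assumptions for this sketch."""
    if consecutive_failures < bad_batch_after:
        return 1                       # normal case: try the next ID
    if consecutive_failures < escalate_after:
        return rng.randint(1, 9)       # bad batch suspected: small random jump
    return rng.randint(10, 100)        # still failing: larger random jump


def crawl(start_id: int, target_id: int, capture: Callable[[str], bool],
          max_requests: int = 800,
          rng: Optional[random.Random] = None) -> List[Tuple[int, bool]]:
    """Visit IDs until the target is passed or the request budget is spent."""
    rng = rng or random.Random()
    log: List[Tuple[int, bool]] = []
    current, failures = start_id, 0
    while current <= target_id and len(log) < max_requests:
        ok = capture(URL_TEMPLATE.format(number=current))
        log.append((current, ok))
        failures = 0 if ok else failures + 1
        current += next_increment(failures, rng)
    return log
```

A successful capture resets the failure counter, so the crawler returns to stepping by one as soon as it leaves a bad batch. The returned log of `(id, success)` pairs records which IDs were skipped by the random jumps.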
Sample output with two failed captures and random ID increments
Application
Go where no wget has gone before -> scrape interactive, high-value sites
Very customizable scraping solutions
More target interaction possible -> queries, searches etc.
Links
https://www.gnu.org/software/wget/
https://www.selenium.dev/documentation/webdriver/
https://github.com/mkrzmr/Political-Expression-in-Web-Defacements-Crawler/blob/main/zoneH_IWL_GUI.py
'Political Expression in Web Defacements' Thesis
