3. What is WebDriver?
From W3C Recommendation:
WebDriver is a remote control interface that enables introspection and control of user agents.
Provided is a set of interfaces to discover and manipulate DOM elements in web documents and to control the
behavior of a user agent. It is primarily intended to allow web authors to write tests that automate a user agent
from a separate controlling process, but may also be used in such a way as to allow in-browser scripts to control a
— possibly separate — browser.
https://www.w3.org/TR/webdriver1/
4. Can it be used for scraping?
WebDriver can:
● Open a browser window
● Open a URL
● Identify the page status (loads, timeout, redirect)
● Identify and interact with elements (click links, extract data, tell if an element is present or not)
So … sounds good?
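The capabilities above map directly onto the Selenium WebDriver API. Below is a minimal sketch, assuming Selenium 4 and a Firefox driver on PATH (both assumptions, as is the example URL); `element_present` works with any driver-like object exposing Selenium's `find_elements`:

```python
def element_present(driver, css_selector):
    # True if at least one element matches the CSS selector.
    # "css selector" is the locator strategy string Selenium's By.CSS_SELECTOR maps to.
    return len(driver.find_elements("css selector", css_selector)) > 0

def main():
    # Lazy imports: selenium is only needed for a live browser run.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException

    driver = webdriver.Firefox()                # open a browser window
    driver.set_page_load_timeout(30)
    try:
        driver.get("https://example.org/")      # open a URL; raises on timeout
    except TimeoutException:
        print("page timed out")
    print("final URL:", driver.current_url)     # differs from the request URL on redirect
    if element_present(driver, "a"):            # interact with elements
        driver.find_element("css selector", "a").click()
    driver.quit()

# main()  # uncomment to drive a real browser
```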
5. Pros and Cons
WebDriver actually loads the entire page in a browser: it is slow, but it handles backend-dependent sites better than wget
It offers more interaction with page elements, but this means you need to know the target site well
It is a “what you see is what you get” solution, which is good and bad at the same time
6. Target is an iframe (the defaced page)
● The defaced page sits inside an iframe, but the metadata is outside the iframe
● A screenshot is needed to scan captures more quickly
● Obstacles: captchas, JavaScript, popups
● URL schema for metadata: http://zone-h.com/mirror/id/1002
● Page mirror: http://zonehmirrors.org/defaced/alldas/2000/10/12/www.jpbdbkl.gov.my/
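Because the metadata URL schema is mechanical, ID enumeration can be scripted. A small helper following the schema on this slide (the function name and constant are ours):

```python
# Schema from the slide: http://zone-h.com/mirror/id/<N>
ZONE_H_METADATA = "http://zone-h.com/mirror/id/{}"

def metadata_url(mirror_id):
    # Build the metadata page URL for a numeric mirror ID.
    return ZONE_H_METADATA.format(int(mirror_id))
```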
7. Workflow
● Open Browser (profile with Noscript enabled)
● Go to https://www.zone-h.org/mirror/id/{Number}
● If page loads: capture metadata, load iframe
○ If iframe loads: capture iframe, take full-page screenshot
● If several consecutive pages do not load:
○ Assume you are in a bad batch and increment first by 1-9, then by 10-100
● Continue until banned for the day (~800 requests) or the target is reached
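The bad-batch stepping above can be sketched as a pure ID function, decoupled from the browser. The exact thresholds below are one possible reading of "increment first by 1-9, then by 10-100" and are assumptions:

```python
def next_mirror_id(current_id, consecutive_failures):
    # No failures: step to the next mirror ID.
    if consecutive_failures == 0:
        return current_id + 1
    # A few failures in a row: probe ahead in small steps (1-9).
    if consecutive_failures < 10:
        return current_id + consecutive_failures
    # Many failures: assume a bad batch and jump in larger steps,
    # growing by 10 per extra failure, capped at 100.
    return current_id + min(10 * (consecutive_failures - 9), 100)
```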
9. Application
Go where no wget has gone before -> scrape interactive, high-value sites
Very customizable scraping solutions
More target interaction possible -> queries, searches etc.
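"More target interaction" typically means filling and submitting forms, e.g. a site's own search box. A hedged sketch (the selector default is a placeholder; `run_search` accepts any driver-like object with Selenium's `find_element`, and the returned element's `clear`/`send_keys`/`submit` are standard WebElement methods):

```python
def run_search(driver, query, box_selector="input[name=q]"):
    # Locate the search box, type the query, and submit its form.
    box = driver.find_element("css selector", box_selector)
    box.clear()
    box.send_keys(query)
    box.submit()
```

After submission, the same element-inspection techniques from the workflow slide apply to the results page.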