Web scraping
Generally a bad idea
Web scraping
If it sounds painful
That’s because it is
Web scraping
Should I do it?
No
Thanks for coming
Any questions?
What is web scraping
● Programmatically extracting data from web pages
Example
Web scraping is a horrible idea
● The scripts are tightly coupled to the page's HTML
● The scripts are fragile and prone to breaking
● Identifying which HTML elements to extract is messy work
● Legal gray area
● You could be blocked from the web site
Sometimes web scraping is all we have
● The data isn’t accessible any other way
● We still need the data
Benefits of web scraping
● Automation
● Scalability
Techniques to demonstrate
1. Simple technique
○ For simple, static web pages
2. Advanced technique
○ For pages where JavaScript must execute before the data appears
○ For pages that need interaction
○ For pages that need authentication
Tools
1. Simple technique
○ request-promise
○ cheerio
2. Advanced technique
○ nightmare (headless browser, built on Electron)
○ cheerio
Live coding
The code:
https://github.com/ashleydavis/brisjs-web-scraping-talk
The pages to scrape:
Simple: https://quotes.wsj.com/AU/XASX/CBA
Advanced: https://www.asx.com.au/asx/share-price-research/company/CBA
Production issues...
Performance
● Reuse one Nightmare instance and batch requests through it
● Disable image download
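A sketch of both performance points, under two assumptions: that Nightmare passes `webPreferences` through to Electron, and that Electron's `images` web preference turns image loading off. `fastOptions`, `scrapeBatch` and `main` are hypothetical names.

```javascript
// Options for a Nightmare instance that skips image downloads (assumed
// behaviour of Electron's `images` web preference).
function fastOptions() {
    return { show: false, webPreferences: { images: false } };
}

// Scrape a batch of URLs with ONE browser instance instead of starting
// a new Electron process per page.
async function scrapeBatch(browser, urls, waitSelector) {
    const results = [];
    for (const url of urls) {
        const html = await browser
            .goto(url)
            .wait(waitSelector)
            .evaluate(() => document.body.innerHTML);
        results.push({ url, html });
    }
    return results;
}

async function main(urls, waitSelector) {
    const Nightmare = require("nightmare"); // required lazily: it pulls in Electron
    const browser = new Nightmare(fastOptions());
    try {
        return await scrapeBatch(browser, urls, waitSelector);
    } finally {
        await browser.end(); // one shutdown for the whole batch
    }
}
```

The pages are visited one after another on purpose: a single instance can only be on one page at a time, and sequential requests are also gentler on the site being scraped.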
Debugging
● Show the Electron window
● Enable devtools
● Handle errors from Nightmare
● Display logging from the headless browser
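A sketch that combines the four debugging points, based on the `show` and `openDevTools` options and the `console` and `page` events described in the Nightmare README. The helper names are hypothetical and the log format is arbitrary.

```javascript
// Options that make the normally invisible browser visible.
function debugOptions() {
    return {
        show: true,                       // show the Electron window
        openDevTools: { mode: "detach" }, // open devtools in a separate window
    };
}

// Forward console output and page errors from the browser to a logger.
// Handlers must be attached before calling goto().
function attachLogging(browser, log = console.log) {
    browser.on("console", (type, ...args) => log(`[browser ${type}]`, ...args));
    browser.on("page", (type, message, stack) => log(`[page ${type}]`, message, stack));
    return browser;
}

async function debugScrape(url, waitSelector) {
    const Nightmare = require("nightmare"); // required lazily: it pulls in Electron
    const browser = attachLogging(new Nightmare(debugOptions()));
    try {
        return await browser
            .goto(url)
            .wait(waitSelector)
            .evaluate(() => document.body.innerHTML);
    } catch (err) {
        // Navigation failures and wait timeouts from Nightmare end up here.
        console.error("Scrape failed:", err);
        throw err;
    } finally {
        await browser.end();
    }
}
```

In production the same code can run with `show: false` and no devtools; only the options object needs to change.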
Resources
● Code
○ github.com/ashleydavis/brisjs-web-scraping-talk
● Contact
○ Email: ashley@codecapers.com.au
○ Twitter: @ashleydavis75
○ GitHub:
■ ashleydavis
■ data-forge
● Data Wrangling with JavaScript
○ datawranglingwithjavascript.com
● The Data Wrangler
○ the-data-wrangler.com
My book

Web scraping