Screen scraping - taking data from what was intended for the screen. A common task many people try. Usually awful: regular expressions against HTML, things breaking, ick.
The website we’ll be looking at. Let’s get everyone’s Twitter handle and name. Darn. No API. No problem!
Web inspector - this website is nicely put together. Each attendee is in its own element. Handy!
Use jQuery to select all the attendee elements; Chrome provides a nice little output. Now, let’s write a script to get at the contents of those elements.
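As a rough sketch of that console session - the .attendee class name is invented here for illustration, so substitute whatever selector the real page uses:

    // Run in the Chrome console on the attendee page.
    // ".attendee" is a hypothetical selector - use the page's real one.
    $('.attendee');          // jQuery collection of every attendee element
    $('.attendee').length;   // quick sanity check on how many we matched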
Make an array. Go over each attendee, pull out the elements with the Twitter handle and name, then push an object with those two values into the array. Copy and paste the code into the console and see what the value of attendees is. DARN. Extra characters.
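Something like the following sketch, again assuming made-up .attendee, .twitter and .name selectors:

    // Hypothetical selectors (.attendee, .twitter, .name) - adjust to the real markup.
    var attendees = [];

    $('.attendee').each(function () {
      var $el = $(this);
      attendees.push({
        twitter: $el.find('.twitter').text(),
        name: $el.find('.name').text()
      });
    });

    attendees; // inspect in the console - note the stray whitespace and newlines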
Small modification: let’s use jQuery’s trim function. You could use regular expressions, but why not use jQuery? Performance isn’t as critical here as it is in code you ship to browsers - you’re dealing with one known environment, not every combination of device and browser. Cool, it worked.
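The same loop with the values run through $.trim() (same hypothetical selectors as before):

    // Trim the text so the stray whitespace and newlines disappear.
    var attendees = [];

    $('.attendee').each(function () {
      var $el = $(this);
      attendees.push({
        twitter: $.trim($el.find('.twitter').text()),
        name: $.trim($el.find('.name').text())
      });
    });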
Enter Node.js. Let’s take this script and move it to the server. Require the jsdom module, then use env - it creates a closed environment and only executes the scripts you specify (no script tags from the page, etc.). Anything inside the callback runs effectively on “document ready”, so we can put the script that ran in Chrome there. Then we’ll convert the results to JSON with stringify and print them to the console, which is standard output.
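A sketch of that server-side script, using the classic jsdom.env API (newer jsdom releases expose a different API); the attendee URL and the selectors are placeholders:

    // scrape.js - a sketch using the legacy jsdom.env API.
    var jsdom = require('jsdom');

    jsdom.env(
      'http://example.com/attendees',                  // placeholder URL for the attendee page
      ['http://code.jquery.com/jquery-1.6.min.js'],    // only scripts we list here get executed
      function (errors, window) {
        // Runs once the document is ready, with jQuery available on the window.
        var $ = window.$;
        var attendees = [];

        $('.attendee').each(function () {
          var $el = $(this);
          attendees.push({
            twitter: $.trim($el.find('.twitter').text()),
            name: $.trim($el.find('.name').text())
          });
        });

        // Stringify and dump to standard output.
        console.log(JSON.stringify(attendees));
      }
    );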
That worked, but let’s write it to a file. First, let’s make the JSON nicer by adding indentation. Then all we have to do to write to a file is require the fs module and replace the console.log with fs.writeFile, specifying the filename, and we’ll log a message once the file is written. Now you’ve scraped a screen. The great thing about all of this is that it’s very flexible - if the page changes, we just need to change the selector.
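The tail end of the script might look roughly like this (attendees.json is an arbitrary filename):

    // Pretty-print the JSON and write it to disk instead of standard output.
    var fs = require('fs');
    var json = JSON.stringify(attendees, null, 2);   // 2-space indentation makes the file readable

    fs.writeFile('attendees.json', json, function (err) {
      if (err) throw err;
      console.log('Wrote ' + attendees.length + ' attendees to attendees.json');
    });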
T-mark is hiring - we are looking for a Linux systems administrator / Django programmer.
Modern Screen Scraping with Node.js, jsdom and jQuery