the SEO’s guide to: SCRAPING EVERYTHING  @eppievojt, digital marketing consultant, JPL
NEXT LEVEL XPATH-ING  Use Case 1: Does site x link to any page on eppie.net?
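NEXT LEVEL XPATH-ING  Scrape partial matches using XPath’s “contains” function to find inexact data.  What we know: 1) Link will contain http://www.eppie.net in the href attribute. 2) Some people like to hurt the internet by capitalizing URLs, so we’ll need to account for that. 3) People who link to you don’t care about your desire for canonicalization.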
DO YOU LINK TO ME?  //a[contains(@href,'http://www.eppie.net')]  PROBLEM: FAILS TO ACCOUNT FOR CASE SENSITIVITY
Add translate() to normalize case:  //a[contains(translate(@href,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')...
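For readers working outside a spreadsheet, the case-insensitive "does this page link to me?" check can be approximated in plain Python. This is a minimal sketch, not the deck's own method: it uses the standard library's html.parser instead of an XPath engine, and the names LinkCollector, links_to, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_to(html, target="http://www.eppie.net"):
    """Return every href that contains `target`, ignoring case."""
    collector = LinkCollector()
    collector.feed(html)
    # Lowercasing both sides plays the role of translate() in the XPath.
    return [h for h in collector.hrefs if target.lower() in h.lower()]


# Hypothetical sample markup, not taken from a real page.
sample = (
    '<p><a href="HTTP://WWW.EPPIE.NET/About">capitalized link</a> '
    '<a href="http://example.com/">unrelated link</a></p>'
)
print(links_to(sample))  # ['HTTP://WWW.EPPIE.NET/About']
```

An empty result on a page that used to link to you is the "link removed" signal.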
How you can use this: Get notified when a link is removed + Make contact to potentially save dropping link (friendly reminde...
NEXT LEVEL XPATH-ING  Use Case 2: Find every external link from cnn.com
NEXT LEVEL XPATH-ING  Combine attribute selectors to more ...  What we know: 1) External links a...
SCRAPE ALL EXTERNAL LINKS  //a[contains(@href,'http://') and not(contains(@href,'cnn.com'))]
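The two combined conditions (href contains http://, href does not contain the site's own domain) can be sketched in plain Python as well. This is an illustrative stand-in using the standard library's html.parser, not an XPath evaluation. HrefCollector, external_links, and the sample markup are hypothetical names and data.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(html, own_domain="cnn.com"):
    """Return absolute http:// links that do not mention `own_domain`."""
    collector = HrefCollector()
    collector.feed(html)
    # Same two conditions as the slide: contains 'http://' AND NOT own domain.
    return [h for h in collector.hrefs if "http://" in h and own_domain not in h]


# Hypothetical sample markup, not taken from cnn.com.
sample = (
    '<a href="/world">relative, internal</a>'
    '<a href="http://www.cnn.com/us">absolute, internal</a>'
    '<a href="http://example.org/story">external</a>'
)
print(external_links(sample))  # ['http://example.org/story']
```

Like the expression on the slide, this only matches http:// links, so catching https:// links would take an extra condition. The length of the returned list gives the external link count.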
How you can use this: Identify if a page is too spammed out to bother with by pulling external link counts. Find expired or...
LINK TYPE IDENTIFICATION  Use Case 3: How are they ranking? What kind of links do they have?
LINK TYPE IDENTIFICATION  XPath’s ancestor axis lets us ...  What we know: A link inside a containing element with...
LINK TYPE IDENTIFICATION  //a[@href='http://randfishkin.com/blog']/ancestor::*[contains(@id|@class,'comment')]  ...
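The ancestor test can be imitated in plain Python by keeping a stack of open elements while parsing: when the target link appears, every element on the stack is one of its ancestors. This is a rough sketch built on the standard library's html.parser, with hypothetical names (CommentLinkFinder, is_comment_link) and made-up sample markup. It checks the id and the class separately. In XPath 1.0, contains() on the node-set @id|@class only looks at the first node of that set, so the two approaches can disagree when an element carries both attributes.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class CommentLinkFinder(HTMLParser):
    """Flag whether a target link sits inside an element marked 'comment'."""

    def __init__(self, target_href):
        super().__init__()
        self.target_href = target_href
        self.stack = []  # one (tag, is_comment_container) pair per open element
        self.found_in_comment = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href") == self.target_href:
            # Walking the ancestor axis: inspect every element still open.
            if any(is_comment for _, is_comment in self.stack):
                self.found_in_comment = True
        if tag not in VOID_TAGS:
            marker = (attrs.get("id") or "") + " " + (attrs.get("class") or "")
            self.stack.append((tag, "comment" in marker))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def is_comment_link(html, target_href):
    finder = CommentLinkFinder(target_href)
    finder.feed(html)
    return finder.found_in_comment


# Hypothetical sample markup for illustration only.
in_comment = ('<div id="comments"><ul><li class="comment-body">'
              '<a href="http://randfishkin.com/blog">Rand</a></li></ul></div>')
in_body = '<p><a href="http://randfishkin.com/blog">Rand</a></p>'
print(is_comment_link(in_comment, "http://randfishkin.com/blog"))  # True
print(is_comment_link(in_body, "http://randfishkin.com/blog"))     # False
```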
Why you might use this: Analyze competitors’ strategies for acquiring links. Find what types of links are being used to get g...
REGEX TO THE RESCUE  Use Case 4: I’ve scraped some data, now I need to extract some small portion of it that XPath can...
REGEX TO THE RESCUE  Use regular expressions to pattern match ...  Example: Extract all @mentions o...
EXTRACT @ MENTIONS  /(?:^|\s)@([A-Za-z0-9_]+)/gi
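The same pattern carries over to Python's re module almost unchanged. This is a small sketch. The function name extract_mentions and the sample text are made up for illustration. The character class is written as A-Za-z0-9_ because a range like A-z would also match a few punctuation characters that sit between Z and a in ASCII.

```python
import re

# Start of text or whitespace, then "@", then the captured username.
MENTION = re.compile(r"(?:^|\s)@([A-Za-z0-9_]+)")


def extract_mentions(text):
    """Return the usernames of all @mentions in `text`, without the '@'."""
    return MENTION.findall(text)


# Hypothetical sample text for illustration.
sample = "@eppievojt works for @jplcreative, email me at someone@example.com"
print(extract_mentions(sample))  # ['eppievojt', 'jplcreative']
```

Requiring whitespace or the start of the text before the @ is what keeps the email address in the sample from being picked up as a mention.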
Why you might use this: Pull contact information from a web site (Twitter username, email address) to improve outreach eff...
BEYOND THE SPREADSHEET  Use Case 5: I want to chain processes together, process lots of data, or allow multiple users...
BEYOND THE SPREADSHEET  Scraping outside the spreadsheet  PHP Scraping Overview: 1) CURL target...
BEYOND THE SPREADSHEET  Simple PHP Scraper Class: http://www.scrapeeverything.com
SHOW SOME LOVE  I’m @eppievojt and I work for @jplcreative  eppie.net  linkdetective.com  jplcreative.com

The SEO's Guide to Scraping Everything

29,823 views


NEXT LEVEL XPATH-ING  Scrape partial matches using XPath’s “contains” function to find inexact data.  What we know: 1) Link will contain http://www.eppie.net in the href attribute. 2) Some people like to hurt the internet by capitalizing URLs, so we’ll need to account for that. 3) People who link to you don’t care about your desire for canonicalization.

Published in: Technology, Design