NEXT LEVEL! XPATH-ING!
Scrape partial matches using XPath's "contains" function to find inexact data.
What we know:
1) Link will contain http://www.eppie.net in the href attribute
2) Some people like to hurt the internet by capitalizing URLs, so we'll need to account for that
3) People who link to you don't care about your desire for canonicalization
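The three points above can be sketched in Python with lxml (an illustrative stand-in for whatever XPath engine you use; eppie.net is the example domain from the slide). XPath's translate() lower-cases the href before contains() runs, which handles the capitalized URLs from point 2:

```python
from lxml import html

# Sample page: one normal link, one shouting-caps link, one unrelated link
page = html.fromstring("""
<body>
  <a href="http://www.eppie.net/tools">plain link</a>
  <a href="HTTP://WWW.EPPIE.NET/blog">capitalized link</a>
  <a href="http://somewhere-else.example/">unrelated link</a>
</body>
""")

# translate() lower-cases @href first, so inexact (capitalized,
# non-canonical) matches are still found by contains()
xpath = ("//a[contains(translate(@href,"
         "'ABCDEFGHIJKLMNOPQRSTUVWXYZ',"
         "'abcdefghijklmnopqrstuvwxyz'),"
         "'eppie.net')]/@href")
matches = page.xpath(xpath)
print(matches)  # both eppie.net links, regardless of case
```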
DO YOU LINK TO ME?!
How you can use this:
- Get notified when a link is removed + make contact to potentially save the dropping link (friendly reminder, buy the expiring domain, recreate the dead resource)
- Integrate into your link outreach process + get a notification when a link goes live
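The link-monitor idea reduces to one question per check: is the expected link still in the fetched HTML? A minimal sketch, assuming Python with lxml (fetching and alerting are left out; the sample HTML is made up):

```python
from lxml import html

def link_still_present(page_html: str, target: str) -> bool:
    """Return True if any anchor's href contains `target`, case-insensitively."""
    doc = html.fromstring(page_html)
    xpath = (
        "//a[contains(translate(@href,"
        "'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),"
        f"'{target.lower()}')]"
    )
    return bool(doc.xpath(xpath))

# In a real monitor you would fetch the page on a schedule,
# then alert when the link disappears between runs:
before = '<p>See <a href="HTTP://WWW.EPPIE.NET/">this site</a></p>'
after = '<p>no links here any more</p>'
print(link_still_present(before, "eppie.net"))  # True
print(link_still_present(after, "eppie.net"))   # False -> time to reach out
```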
NEXT LEVEL! XPATH-ING!
Combine attribute selectors to more accurately target useful information.
What we know:
1) External links all contain http://
2) Internal links can also use http://
3) So we need to exclude http:// links to the current domain
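Points 1–3 combine into a single XPath predicate: match hrefs containing http, then exclude the current domain. A sketch assuming lxml, with eppie.net standing in for "the current domain":

```python
from lxml import html

page = html.fromstring("""
<body>
  <a href="http://www.eppie.net/about">internal, absolute URL</a>
  <a href="/contact">internal, relative URL</a>
  <a href="http://authority-site.example/">external link</a>
  <a href="https://another.example/page">external link (https)</a>
</body>
""")

# External = contains http (which also matches https),
# but NOT pointing at the current domain
externals = page.xpath(
    "//a[contains(@href,'http') and not(contains(@href,'eppie.net'))]/@href")
print(externals)
print(len(externals))  # external link count for the page
```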
SCRAPE ALL! EXTERNAL LINKS!
How you can use this:
- Identify if a page is too spammed out to bother with by pulling external link counts
- Find expired or expiring domains being linked to from authority sites; purchase and rebuild or redirect those sites
- Broken link building automation
LINK TYPE! IDENTIFICATION!
XPath's ancestor axis lets us leverage semantic markup to identify link types.
What we know:
A link inside a containing element with an id or class name including the word "comment," "footer," or "blogroll" is highly suggestive of link type.
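The ancestor-axis check can be sketched like this (lxml again; the ids, class names, and domains are illustrative): walk up from each anchor and look for the tell-tale container names.

```python
from lxml import html

page = html.fromstring("""
<body>
  <div id="main-content">
    <p>Great post! <a href="http://earned.example/">earned editorial link</a></p>
  </div>
  <div class="comments">
    <a href="http://placed.example/">link dropped in a comment</a>
  </div>
  <div id="footer">
    <a href="http://blogroll.example/">footer link</a>
  </div>
</body>
""")

# ancestor::* walks up the tree; any container whose id/class
# includes "comment", "footer", or "blogroll" flags the link as placed
placed = page.xpath(
    "//a[ancestor::*[contains(@class,'comment') or contains(@id,'comment')"
    " or contains(@class,'footer') or contains(@id,'footer')"
    " or contains(@class,'blogroll')]]/@href")
print(placed)  # the comment and footer links, not the earned one
```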
Why you might use this:
- Analyze competitors' strategies for acquiring links
- Find what types of links are being used to get good anchor text
- Improve workflow: ignore placed links (comments, directory submissions, article submissions, blog networks, etc.) and work on a smaller subset of EARNED links for manual analysis
REGEX TO! THE RESCUE!
Why you might use this:
- Pull contact information from a web site (Twitter username, email address) to improve outreach efforts
- Extract code fragments (like Analytics IDs and AdSense IDs) for improved competitive research
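A sketch of both bullets with Python's re module, pulling a Google Analytics ID, an AdSense publisher ID, and a Twitter handle out of raw page source (the IDs and handle below are made-up examples):

```python
import re

page_source = """
<script>
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-1234567-2']);
</script>
<script>google_ad_client = "ca-pub-1234567890123456";</script>
Contact us: <a href="https://twitter.com/example_handle">@example_handle</a>
"""

# Classic GA account IDs look like UA-<account>-<property>
analytics_ids = re.findall(r"UA-\d{4,10}-\d{1,4}", page_source)
# AdSense publisher IDs are pub- followed by 16 digits
adsense_ids = re.findall(r"pub-\d{16}", page_source)
# Twitter handles: up to 15 word characters after twitter.com/
twitter_handles = re.findall(r"twitter\.com/([A-Za-z0-9_]{1,15})", page_source)

print(analytics_ids)    # ['UA-1234567-2']
print(adsense_ids)      # ['pub-1234567890123456']
print(twitter_handles)  # ['example_handle']
```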
BEYOND THE! SPREADSHEET!
Scraping outside the spreadsheet allows for more complex systems to be built.
PHP Scraping Overview:
1) cURL the target page
2) Convert to a DOM object
3) Run XPath queries
4) Store data or hit an API
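The slide names PHP (cURL → DOM → XPath → store/API); here is a minimal equivalent of the same four steps in Python, with the fetch URL and storage step left as hypothetical stand-ins:

```python
import urllib.request
from lxml import html

def scrape_external_links(page_html: str, own_domain: str) -> list:
    # Step 2: convert raw HTML to a DOM object
    doc = html.fromstring(page_html)
    # Step 3: run the XPath query
    return doc.xpath(
        "//a[contains(@href,'http') and not(contains(@href,'%s'))]/@href"
        % own_domain)

def run(url: str, own_domain: str) -> list:
    # Step 1: fetch the target page (cURL's job in the PHP version)
    with urllib.request.urlopen(url) as resp:
        page_html = resp.read().decode("utf-8", errors="replace")
    links = scrape_external_links(page_html, own_domain)
    # Step 4: store the data or hit an API -- here we just return it
    return links

# Steps 2-3 work the same on any HTML string:
sample = ('<p><a href="http://other.example/">out</a> '
          '<a href="http://www.eppie.net/">us</a></p>')
print(scrape_external_links(sample, "eppie.net"))  # ['http://other.example/']
```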