The SEO's Guide to Scraping Everything
Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

The SEO's Guide to Scraping Everything Presentation Transcript

  • 1. The SEO’s Guide to: SCRAPING EVERYTHING. @eppievojt, digital marketing consultant, JPL
  • 2. NEXT LEVEL XPATH-ING. Use Case 1: Does site x link to any page on eppie.net?
  • 3. NEXT LEVEL XPATH-ING. Scrape partial matches using XPath’s “contains” function to find inexact data.
    What we know:
    1) Link will contain http://www.eppie.net in the href attribute
    2) Some people like to hurt the internet by capitalizing URLs, so we’ll need to account for that
    3) People who link to you don’t care about your desire for canonicalization
  • 4. DO YOU LINK TO ME? //a[contains(@href,'http://www.eppie.net')]
    PROBLEM: FAILS TO ACCOUNT FOR CASE SENSITIVITY
  • 5. DO YOU LINK TO ME? Add translate() to normalize case:
    //a[contains(translate(@href,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),'http://www.eppie.net')]
  • 6. DO YOU LINK TO ME? How you can use this:
    Get notified when a link is removed
    + Make contact to potentially save a dropping link (friendly reminder, buy expiring domain, recreate dead resource)
    Integrate into link outreach process
    + Get notification when link goes live
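
The case-insensitive link check from slides 4 to 6 can be sketched outside a spreadsheet as well. The block below is a minimal illustration in standard-library Python, not the author's tooling: it does not evaluate XPath, it collects href values with `html.parser` and applies the same "contains, ignoring case" test. The names `LinkCollector`, `links_to`, and the sample markup are assumptions made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_to(html, target="http://www.eppie.net"):
    """Return hrefs that contain `target`, compared in lower case.

    Lower-casing both sides plays the role of translate() in the XPath.
    """
    collector = LinkCollector()
    collector.feed(html)
    return [h for h in collector.hrefs if target.lower() in h.lower()]


# Hypothetical page fragment: one capitalized link to the target, one unrelated link.
sample = (
    '<p><a href="HTTP://WWW.EPPIE.NET/About">about</a> '
    '<a href="http://example.com/">other</a></p>'
)
print(links_to(sample))
```

Running the check on a schedule and comparing results between runs is one way to approach the "get notified when a link is removed" idea; that scheduling part is left out here.
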
  • 7. NEXT LEVEL XPATH-ING. Use Case 2: Find every external link from cnn.com
  • 8. NEXT LEVEL XPATH-ING. Combine attribute selectors to more accurately target useful information.
    What we know:
    1) External links all contain http://
    2) Internal links can also use http://
    3) So we need to exclude http:// links to the current domain
  • 9. SCRAPE ALL EXTERNAL LINKS. //a[contains(@href,'http://') and not(contains(@href,'cnn.com'))]
  • 10. SCRAPE ALL EXTERNAL LINKS. How you can use this:
    Identify if a page is too spammed out to bother with by pulling external link counts
    Find expired or expiring domains being linked to from authority sites. Purchase and rebuild or redirect those sites.
    Broken link building automation
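
As a rough sketch of the external-link filter and the "external link count" idea, the standard-library Python below applies the same two conditions as the XPath on slide 9 (href contains http:// and does not contain the site's own domain). It is a plain-Python equivalent rather than an XPath evaluation, and `HrefParser`, `external_links`, and the sample markup are hypothetical names for illustration. Like the slide's expression, it only looks for http://, so https:// links would be missed.

```python
from html.parser import HTMLParser


class HrefParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(html, own_domain="cnn.com"):
    """Keep absolute http:// links that do not mention the page's own domain."""
    parser = HrefParser()
    parser.feed(html)
    return [h for h in parser.hrefs if "http://" in h and own_domain not in h]


# Hypothetical fragment: an absolute internal link, a relative link, an external link.
sample = (
    '<a href="http://www.cnn.com/world">internal</a>'
    '<a href="/local">relative</a>'
    '<a href="http://example.org/story">external</a>'
)
found = external_links(sample)
print(len(found), found)
```
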
  • 11. LINK TYPE IDENTIFICATION. Use Case 3: How are they ranking? What kind of links do they have?
  • 12. LINK TYPE IDENTIFICATION. XPath’s ancestor axis lets us leverage semantic markup to ID link types.
    What we know: A link inside a containing element with an id or class name including the word “comment,” “footer,” or “blogroll” is highly suggestive of type.
  • 13. LINK TYPE IDENTIFICATION. //a[@href='http://randfishkin.com/blog']/ancestor::*[contains(@id|@class,'comment')]
    Was Rand comment-spamming his way to the top? This [...] tells the story...
  • 14. SCRAPE ALL EXTERNAL LINKS. Why you might use this:
    Analyze competitors’ strategies for acquiring links
    Find what types of links are being used to get good anchor text
    Improve workflow: Ignore placed links (comments, directory submissions, article submissions, blog networks, etc.) and work on a smaller subset of EARNED links for manual analysis
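
The ancestor-axis idea can be approximated without an XPath engine by tracking which elements are open when a link is encountered. The sketch below uses only Python's standard library and checks ancestor id and class values for the three words named on slide 12; it is an illustration under assumptions, not the deck's method. `LinkContextParser`, `VOID_TAGS`, and the sample markup are invented for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LinkContextParser(HTMLParser):
    """For each link to target_href, list marker words found in ancestor id/class."""

    def __init__(self, target_href, markers=("comment", "footer", "blogroll")):
        super().__init__()
        self.target_href = target_href
        self.markers = markers
        self.stack = []   # (tag, "id class") for every currently open element
        self.found = []   # one list of matched markers per matching link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href") == self.target_href:
            labels = " ".join(label for _, label in self.stack).lower()
            self.found.append([m for m in self.markers if m in labels])
        if tag not in VOID_TAGS:
            label = f"{attrs.get('id') or ''} {attrs.get('class') or ''}"
            self.stack.append((tag, label))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


# Hypothetical page: the same URL once in body content, once inside a comment list.
sample = (
    '<div id="content"><a href="http://randfishkin.com/blog">earned</a></div>'
    '<ul class="comment-list"><li><a href="http://randfishkin.com/blog">Rand</a></li></ul>'
)
parser = LinkContextParser("http://randfishkin.com/blog")
parser.feed(sample)
print(parser.found)
```

An empty list for a link suggests none of the marker words appear in its ancestors, which is the "smaller subset" worth a manual look; a non-empty list flags a likely placed link.
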
  • 15. REGEX TO THE RESCUE. Use Case 4: I’ve scraped some data, now I need to extract some small portion of it that XPath can’t do on its own (easily)
  • 16. REGEX TO THE RESCUE. Use regular expressions to pattern match structured text.
    Example: Extract all @mentions of a specific user from a tweet or page
  • 17. REGEX TO THE RESCUE
  • 18. REGEX TO THE RESCUE
  • 19. REGEX TO THE RESCUE
  • 20. REGEX TO THE RESCUE
  • 21. EXTRACT @MENTIONS. /(?:^|\s)@([A-Za-z0-9_]+)/gi
  • 22. REGEX TO THE RESCUE. Why you might use this:
    Pull contact information from a web site (Twitter username, email address) to improve outreach efforts
    Extract code fragments (like Analytics IDs and AdSense IDs) for improved competitive research
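
A short sketch of the @mention extraction in Python's `re` module follows. It assumes the intended pattern is "start of text or whitespace, then @, then letters, digits, or underscores", with the whitespace class written as `\s` and the letter range written explicitly as `A-Za-z` (a range written `A-z` would also match a few punctuation characters). The function name `extract_mentions` and the sample text are made up for illustration.

```python
import re

# Start of string or whitespace, then "@", then the username (captured).
MENTION = re.compile(r"(?:^|\s)@([A-Za-z0-9_]+)")


def extract_mentions(text):
    """Return the usernames mentioned in text, without the leading @."""
    return MENTION.findall(text)


# Hypothetical text: two mentions plus an email address that should be ignored.
sample = "Thanks @eppievojt and @jplcreative! Mail someone@example.com for slides."
print(extract_mentions(sample))
```

Requiring whitespace or the start of the text before the @ is what keeps email addresses out of the results.
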
  • 23. BEYOND THE SPREADSHEET. Use Case 5: I want to chain processes together, process lots of data, or allow multiple users to leverage what I build.
  • 24. BEYOND THE SPREADSHEET. Scraping outside the spreadsheet allows for more complex systems to be built.
    PHP Scraping Overview:
    1) CURL target page
    2) Convert to DOM Object
    3) Run XPath Queries
    4) Store Data or Hit API
  • 25. BEYOND THE SPREADSHEET. Simple PHP Scraper Class: http://www.scrapeeverything.com
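
The four-step overview (fetch, convert to a DOM object, run XPath queries, store) can be sketched in a few lines. The deck's own scraper is a PHP class that is not reproduced here; the block below is a separate, minimal Python illustration using only the standard library. It assumes well-formed markup, because `xml.etree.ElementTree` is an XML parser with a limited XPath subset; real pages usually need a lenient HTML parser such as the third-party lxml. The names `fetch`, `scrape_links`, `store`, and `fake_fetch` are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def fetch(url):
    """Step 1: download the target page (the PHP version would use cURL)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_links(url, fetcher=fetch):
    """Steps 2-3: parse the markup into a tree and query it for links."""
    root = ET.fromstring(fetcher(url))  # assumption: markup is well-formed
    return [(url, a.get("href"), "".join(a.itertext()).strip())
            for a in root.findall(".//a[@href]")]


def store(rows, out):
    """Step 4: write the rows somewhere; here, CSV to any file-like object."""
    writer = csv.writer(out)
    writer.writerow(["source", "href", "anchor_text"])
    writer.writerows(rows)


def fake_fetch(url):
    """Offline stand-in for the network call, so the demo runs anywhere."""
    return ('<html><body><p><a href="http://www.eppie.net/">Eppie</a></p>'
            '<a name="top">no href</a></body></html>')


buffer = io.StringIO()
rows = scrape_links("http://example.com/", fetcher=fake_fetch)
store(rows, buffer)
print(buffer.getvalue())
```

Passing the fetcher in as a parameter keeps the parsing and storage steps testable without network access, and the `store` step could just as well post each row to an API.
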
  • 26. SHOW SOME LOVE. I’m @eppievojt and I work for @jplcreative.
    eppie.net, linkdetective.com, jplcreative.com