The SEO's Guide to Scraping Everything

  1. The SEO’s Guide to: SCRAPING EVERYTHING. @eppievojt, digital marketing consultant, JPL
  2. NEXT LEVEL XPATH-ING. Use Case 1: Does site x link to any page on eppie.net?
  3. NEXT LEVEL XPATH-ING. Scrape partial matches using XPath’s “contains” function to find inexact data. What we know: 1) Link will contain http://www.eppie.net in the href attribute 2) Some people like to hurt the internet by capitalizing URLs, so we’ll need to account for that 3) People who link to you don’t care about your desire for canonicalization
  4. DO YOU LINK TO ME? //a[contains(@href,'http://www.eppie.net')] PROBLEM: FAILS TO ACCOUNT FOR CASE SENSITIVITY
  5. DO YOU LINK TO ME? Add translate() to normalize case: //a[contains(translate(@href,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),'http://www.eppie.net')]
  6. DO YOU LINK TO ME? How you can use this: Get notified when a link is removed (make contact to potentially save the dropping link: friendly reminder, buy expiring domain, recreate dead resource). Integrate into link outreach process (get notified when a link goes live).
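
As a rough sketch of Use Case 1 outside a spreadsheet: Python's standard library has no full XPath 1.0 engine (no contains() or translate()), so the example below reproduces the same logic with html.parser instead of running the slide's expression. Lowercasing the href plays the role of translate(). The names LinkCollector, links_to, and the sample markup are hypothetical and not from the deck.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, but not values.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def links_to(html, target="http://www.eppie.net"):
    """Return every href that contains `target`, ignoring case."""
    collector = LinkCollector()
    collector.feed(html)
    needle = target.lower()
    # href.lower() stands in for the translate() call on the slide.
    return [href for href in collector.hrefs if needle in href.lower()]


# Hypothetical page fragment: one capitalized link to the target, one unrelated link.
sample = (
    '<p><a href="HTTP://WWW.EPPIE.NET/Blog">one</a> '
    '<a href="http://example.com/">two</a></p>'
)
print(links_to(sample))
```

Run on a schedule against a page that is known to link to you, an empty result would be the signal that the link has been removed.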
  7. NEXT LEVEL XPATH-ING. Use Case 2: Find every external link from cnn.com
  8. NEXT LEVEL XPATH-ING. Combine attribute selectors to more accurately target useful information. What we know: 1) External links all contain http:// 2) Internal links can also use http:// 3) So we need to exclude http:// links to the current domain
  9. SCRAPE ALL EXTERNAL LINKS. //a[contains(@href,'http://') and not(contains(@href,'cnn.com'))]
  10. SCRAPE ALL EXTERNAL LINKS. How you can use this: Identify if a page is too spammed out to bother with by pulling external link counts. Find expired or expiring domains being linked to from authority sites; purchase and rebuild or redirect those sites. Broken link building automation.
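
The two conditions in the Use Case 2 expression can be sketched in Python as well. This is an illustration using only the standard library's html.parser rather than an XPath engine; AnchorHrefs, external_links, and the sample markup are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class AnchorHrefs(HTMLParser):
    """Gather href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(html, own_domain="cnn.com"):
    """Keep links that contain 'http://' and do not mention own_domain."""
    parser = AnchorHrefs()
    parser.feed(html)
    return [
        href
        for href in parser.hrefs
        # Same two tests as the slide: contains(http://) and not contains(domain).
        if "http://" in href and own_domain not in href
    ]


# Hypothetical fragment: an absolute internal link, a relative link, an external link.
sample = (
    '<a href="http://www.cnn.com/world">world</a>'
    '<a href="/politics">politics</a>'
    '<a href="http://example.org/story">story</a>'
)
links = external_links(sample)
print(len(links), links)
```

The length of the returned list gives the external link count mentioned on slide 10. Like the slide's expression, this test matches only the literal string http://, so https:// links would need an additional condition.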
  11. LINK TYPE IDENTIFICATION. Use Case 3: How are they ranking? What kind of links do they have?
  12. LINK TYPE IDENTIFICATION. XPath’s ancestor axis lets us leverage semantic markup to ID link types. What we know: A link inside a containing element with an id or class name including the word “comment,” “footer,” or “blogroll” is highly suggestive of type.
  13. LINK TYPE IDENTIFICATION. "//a[@href='http://randfishkin.com/blog']/ancestor::*[contains(@id|@class,'comment')]" Was Rand comment-spamming his way to the top? This + OSE tells the story...
  14. SCRAPE ALL EXTERNAL LINKS. Why you might use this: Analyze competitors’ strategies for acquiring links. Find what types of links are being used to get good anchor text. Improve workflow: ignore placed links (comments, directory submissions, article submissions, blog networks, etc.) and work on a smaller subset of EARNED links for manual analysis.
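
The ancestor check in Use Case 3 can be approximated without XPath by tracking which elements are open when a link is encountered. The sketch below assumes reasonably well-nested HTML and uses only the standard library; LinkTypeFinder, VOID_TAGS, and the sample markup are hypothetical. Unlike the slide's single contains() call, it tests both the id and the class of every ancestor.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LinkTypeFinder(HTMLParser):
    """For each link to target_href, record which keywords appear in the
    id or class of any enclosing (ancestor) element."""

    def __init__(self, target_href, keywords=("comment", "footer", "blogroll")):
        super().__init__()
        self.target_href = target_href
        self.keywords = keywords
        self.stack = []    # (tag, "id class") for every currently open element
        self.matches = []  # one list of matched keywords per matching link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href") == self.target_href:
            # The stack holds only ancestors here; the <a> is pushed afterwards.
            labels = " ".join(label for _, label in self.stack)
            self.matches.append([k for k in self.keywords if k in labels])
        if tag not in VOID_TAGS:
            label = f"{attrs.get('id') or ''} {attrs.get('class') or ''}"
            self.stack.append((tag, label))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


# Hypothetical fragment: one editorial link and one link inside a comment list.
sample = (
    '<div id="content"><a href="http://randfishkin.com/blog">post</a></div>'
    '<ul class="comment-list"><li>'
    '<a href="http://randfishkin.com/blog">Rand</a></li></ul>'
)
finder = LinkTypeFinder("http://randfishkin.com/blog")
finder.feed(sample)
print(finder.matches)
```

An empty inner list suggests a link with no telltale container, which is the smaller subset slide 14 proposes keeping for manual analysis.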
  15. REGEX TO THE RESCUE. Use Case 4: I’ve scraped some data, now I need to extract some small portion of it that XPath can’t do on its own (easily)
  16. REGEX TO THE RESCUE. Use regular expressions to pattern match structured text. Example: Extract all @mentions of a specific user from a tweet or page
  17. REGEX TO THE RESCUE
  18. REGEX TO THE RESCUE
  19. REGEX TO THE RESCUE
  20. REGEX TO THE RESCUE
  21. EXTRACT @ MENTIONS. /(?:^|\s)@([a-z0-9_]+)/gi
  22. REGEX TO THE RESCUE. Why you might use this: Pull contact information from a web site (Twitter username, email address) to improve outreach efforts. Extract code fragments (like Analytics IDs and AdSense IDs) for improved competitive research.
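
A minimal sketch of the @mention pattern from slide 21 using Python's re module. It assumes the pattern is meant to read `\s` (whitespace) before the @ sign and a case-insensitive `[a-z0-9_]` character class. The function name and sample text are hypothetical.

```python
import re

# A mention is an @ at the start of the text or after whitespace,
# followed by letters, digits, or underscores (the captured username).
MENTION = re.compile(r"(?:^|\s)@([a-z0-9_]+)", re.IGNORECASE)


def mentions(text):
    """Return the usernames mentioned in text, without the leading @."""
    return MENTION.findall(text)


# Hypothetical text: two mentions and one email address that should be skipped.
text = "@eppievojt works with @jplcreative, mail me@example.com"
print(mentions(text))
```

Requiring whitespace or the start of the text before the @ is what keeps email addresses from being reported as mentions.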
  23. BEYOND THE SPREADSHEET. Use Case 5: I want to chain processes together, process lots of data, or allow multiple users to leverage what I build.
  24. BEYOND THE SPREADSHEET. Scraping outside the spreadsheet allows for more complex systems to be built. PHP Scraping Overview: 1) CURL target page 2) Convert to DOM Object 3) Run XPath Queries 4) Store Data or Hit API
  25. BEYOND THE SPREADSHEET. Simple PHP Scraper Class: http://www.scrapeeverything.com
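
The four steps in the PHP overview on slide 24 map onto a small pipeline in any language. The sketch below is not the PHP scraper class referenced on slide 25; it is a Python illustration, limited to the standard library, in which urllib stands in for cURL, html.parser stands in for the DOM and XPath steps, and a CSV writer stands in for storage. All function names are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch_page(url):
    """Step 1: download the target page (the deck uses cURL from PHP)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html):
    """Steps 2 and 3: parse the markup and pull out the data of interest."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.hrefs


def store_links(url, hrefs, out):
    """Step 4: store one CSV row per link (source page, link target)."""
    writer = csv.writer(out)
    for href in hrefs:
        writer.writerow([url, href])


def scrape(url, out, fetch=fetch_page):
    """Chain the steps; `fetch` is injectable so the pipeline can run offline."""
    hrefs = extract_links(fetch(url))
    store_links(url, hrefs, out)
    return len(hrefs)


# Offline demonstration with a stubbed fetch, so no network request is made.
buffer = io.StringIO()
count = scrape(
    "http://example.com/",
    buffer,
    fetch=lambda url: '<a href="http://www.eppie.net/">eppie</a>',
)
print(count, buffer.getvalue().strip())
```

Passing the fetch function as a parameter is a design choice for this sketch: it keeps the parsing and storage steps testable without hitting a live site, and leaves room to swap in a different downloader later.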
  26. SHOW SOME LOVE. I’m @eppievojt and I work for @jplcreative. eppie.net, linkdetective.com, jplcreative.com