Regular Expressions (RegEx) for SEO

Regular Expressions are highly technical. This training covers the basics of RegEx and also gives examples of how to use it.

Take some time to go through each example and try to figure it out on your own.

Published in: Technology, Design

  1. 1. Regular Expressions for SEO: The Coolest Pattern Matching Search Language... Troy Boileau | Team Leader, SEO & Inbound Marketing Consultant. For Powered by Search Internal | October 2013
  2. 2. We’re in business because we believe that great brands need both voice and visibility in order to connect people with what matters. A boutique, full-service digital marketing agency in Toronto, Powered by Search is a PROFIT HOT 50-ranked agency that delivers search engine optimization, pay per click advertising, local search, social media marketing, and online reputation management services. Some of our clients... Featured in...
  3. 3. RegEx Basics | Practical SEO Uses | RegEx Puzzles for Homework
  4. 4. Regular Expressions for SEO
  5. 5. http://xkcd.com/
  6. 6. RegEx Basics
  7. 7. RegEx Basics: Use Sublime Text. This is the sexiest text editor / IDE you’ll ever use. It’s lightweight, too. It’s the text editor you’ll fall in love with.
  8. 8. RegEx Basics: Literal Matching. Text: I want to match this. RegEx: match this. RegEx matches literal strings. This is like running a normal search in Word. Pretty cool, huh?
  9. 9. RegEx Basics: Anchors. Text: I want this, I want that, I want I want I want. RegEx: ^I want (matches only the first “I want”). RegEx: I want$ (matches only the last “I want”). There are a couple of special characters called “anchors.” The caret (^) represents the beginning of a line. The dollar sign ($) represents the end of a line. You see these a lot in .htaccess files.
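The anchor slide can be checked with a minimal sketch. It assumes Python’s built-in re module rather than Sublime Text, and the variable names are only illustrative.

```python
import re

text = "I want this, I want that, I want I want I want"

# ^ pins the match to the start of the line, so only the first "I want" counts.
at_start = [m.start() for m in re.finditer(r"^I want", text)]

# $ pins the match to the end of the line, so only the last "I want" counts.
at_end = [m.start() for m in re.finditer(r"I want$", text)]

# Without an anchor, every occurrence matches.
anywhere = re.findall(r"I want", text)

print(at_start, at_end, len(anywhere))
```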
  10. 10. RegEx Basics: Special Characters. There are also a series of other special characters. These are: • [ - Starts a character class (more later). • \ - Escapes or modifies the character after it. • . - Wildcard. It represents any character (by default, any character except a newline). • | - OR, so (this|that|the other) means this, that, or the other. • ( - Starts a group. • ) - Ends a group. To match any of these as a literal character, put a backslash in front of it. This also applies to ?+*^$ which we’ve talked about or will get to later.
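A short sketch of why the backslash matters, assuming Python’s re module. The sample strings are made up for illustration; re.escape is a Python helper and not something the slides mention.

```python
import re

# An unescaped dot is a wildcard, so it also matches characters we did not intend.
loose = re.search(r"lett.me", "lettXme")

# A backslash in front of the dot makes it a literal dot.
strict = re.search(r"lett\.me", "lettXme")
literal = re.search(r"lett\.me", "lett.me")

# | inside a group means OR.
either = re.fullmatch(r"(this|that|the other)", "the other")

# re.escape adds the backslashes for text that should be matched literally.
escaped = re.escape("?+*^$")

print(bool(loose), bool(strict), bool(literal), bool(either), escaped)
```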
  11. 11. RegEx Basics: Quantifiers. A quantifier tells the expression how many times to match the element before it. Text: Ahhhhhhhhhhh. A spider. RegEx: A[h]+ • ? - Zero or one time • + - One or more times • * - Zero or more times • {exactly} - Exactly this many times • {min,max} - Between min and max times
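To see each quantifier at work on the spider text, here is a small sketch assuming Python’s re module. Only A[h]+ comes from the slide; the other patterns are illustrative variations.

```python
import re

text = "Ahhhhhhhhhhh. A spider."

patterns = {
    "zero or one": r"Ah?",
    "one or more": r"A[h]+",  # the pattern from the slide
    "zero or more": r"Ah*",
    "exactly 3": r"Ah{3}",
    "between 2 and 4": r"Ah{2,4}",
}

# Collect every match of every pattern in the text.
matches = {label: re.findall(pattern, text) for label, pattern in patterns.items()}

for label, found in matches.items():
    print(f"{label}: {found}")
```

Note that ? and * also match the lone “A” in “A spider”, because they allow zero repetitions of the h.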
  12. 12. RegEx Basics: Greedy vs. Non-Greedy. Quantifiers are greedy by default: they repeat the preceding element as many times as possible before giving up and continuing with the rest of the RegEx. This leads to unexpected matches. To make the quantifiers *, + and {} non-greedy (lazy), add a question mark after them. Text: <p>test</p> RegEx (Greedy): <.+> matches all of <p>test</p>. RegEx (Lazy): <.+?> matches just <p>.
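The difference between the greedy and lazy patterns is easy to confirm in a few lines, assuming Python’s re module:

```python
import re

html = "<p>test</p>"

# Greedy: .+ grabs as much as it can, so the match runs to the last >.
greedy = re.search(r"<.+>", html).group(0)

# Lazy: .+? grabs as little as it can, so the match stops at the first >.
lazy = re.search(r"<.+?>", html).group(0)

# With the lazy version, each tag is matched separately.
all_lazy = re.findall(r"<.+?>", html)

print(greedy, lazy, all_lazy)
```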
  13. 13. RegEx Basics: Variations / Character Classes []. A variation is a set of literal characters, any one of which can fill a single position. For example: Text: Well then I’m better than you. RegEx: th[ea]n The characters in the variation aren’t a GROUP. What the following RegEx is telling the computer is, “Find any one of: a t, an h, an e, an n, a pipe, a t, an h, an a, or an n.” That’s not what we want. Text: Well then I’m better than you. RegEx: [then|than]
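A quick sketch of the two patterns from this slide, assuming Python’s re module, shows why the second one is not what we want:

```python
import re

text = "Well then I'm better than you."

# th[ea]n: one position that may be an e or an a, so both words match.
words = re.findall(r"th[ea]n", text)

# [then|than]: a single character class, so every match is one character long.
single_chars = re.findall(r"[then|than]", text)

print(words)
print(single_chars)
```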
  14. 14. RegEx Basics: Groups (). In the case above, we could use a group to solve our problem. Text: Well then I’m better than you. RegEx: (then|than) A group isn’t the best answer here, though, because th[ea]n is shorter. Groups are for alternation and/or quantification. Text: I like redredred apples. RegEx: (blue|green|red)+
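Both uses of a group, alternation and quantification, can be sketched as follows, assuming Python’s re module:

```python
import re

# Alternation: the group matches either whole word.
alternation = re.findall(r"(then|than)", "Well then I'm better than you.")

# Quantification: the + applies to the whole group, so "red" may repeat.
repeated = re.search(r"(blue|green|red)+", "I like redredred apples.")

print(alternation)
print(repeated.group(0))  # the full match
print(repeated.group(1))  # the group keeps only its last repetition
```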
  15. 15. RegEx Basics: Variables / Captured Groups $1. When you use a group, it captures the matched text in a numbered variable. The variables count up from $1. You can use them when doing a find-replace. Text: https://www.searchersforbeerfridges.com/?vote_number=9001 RegEx Find: .+?//(.*?)/.* RegEx Replace: $1 New Text: www.searchersforbeerfridges.com
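The same find-replace can be sketched in Python, assuming the re module. One difference to be aware of: Python writes the captured group as \1 in the replacement, where Sublime Text uses $1.

```python
import re

url = "https://www.searchersforbeerfridges.com/?vote_number=9001"
pattern = r".+?//(.*?)/.*"

# The first group captures everything between // and the next slash.
captured = re.match(pattern, url).group(1)

# Replacing the whole match with group 1 leaves only the host name.
replaced = re.sub(pattern, r"\1", url)

print(captured)
print(replaced)
```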
  16. 16. Practical SEO Uses
  17. 17. Practical SEO Uses: Google Analytics – Branded Organic. In Analytics I often want to find branded organic search traffic. Let’s look at the GWT (Google Webmaster Tools) data in Analytics for our fictional client, Lett.Me. Lett.Me has a ton of common mistypings and variations. They get traffic from lm, lm.com, let me, lettme.com, letme.com, let.me, and lett.me. What’s the regular expression that captures all of that?
  18. 18. Practical SEO Uses: Google Analytics – Branded Organic. Here’s the regular expression I came up with. It matches some funky cases like let me.com but that’s fine: RegEx Find: (lm|let[t]?[ ]?[.]?me)(\.com)? You can also remove the square brackets, but I feel like it’s easier to read with them in. Without them the dot has to be escaped with a backslash, and it looks like this: RegEx Find: (lm|let{1,2} ?\.?me)(\.com)? Now just save this RegEx in your reporting document and you’ll never have to type out the whole thing again. Imagine what this could do for reporting on keyword groups!
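Before saving a pattern like this, it is worth running it against the known variations. The sketch below assumes Python’s re module and writes the literal dots with a backslash. The unbranded queries are hypothetical, and fullmatch is a choice made for this sketch so that the whole query has to be a brand variation.

```python
import re

with_brackets = re.compile(r"(lm|let[t]?[ ]?[.]?me)(\.com)?")
without_brackets = re.compile(r"(lm|let{1,2} ?\.?me)(\.com)?")

# The brand variations listed on the previous slide.
branded = ["lm", "lm.com", "let me", "lettme.com", "letme.com", "let.me", "lett.me"]

# Hypothetical queries that should not count as branded.
unbranded = ["letter", "let it be", "film reviews"]

for query in branded + unbranded:
    print(
        query,
        bool(with_brackets.fullmatch(query)),
        bool(without_brackets.fullmatch(query)),
    )
```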
  19. 19. Practical SEO Uses: Trim To Root. Trim to Root using Find Replace. Here’s the list: • http://www.georgebrown.com/www-non-www • http://blog.russian.me/ • https://russian.eu/?pg=2 What’s the RegEx?
  20. 20. Practical SEO Uses: Trim To Root. RegEx Find: ^.*?//(.*?)/.* RegEx Replace: $1
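Run over the whole list at once, the Trim To Root find-replace looks roughly like this in Python. This is a sketch assuming the re module, with \1 standing in for Sublime Text’s $1.

```python
import re

urls = """http://www.georgebrown.com/www-non-www
http://blog.russian.me/
https://russian.eu/?pg=2"""

# re.MULTILINE makes ^ match at the start of every line, as it does in an editor.
roots = re.sub(r"^.*?//(.*?)/.*", r"\1", urls, flags=re.MULTILINE)

print(roots)
```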
  21. 21. Practical SEO Uses: Fixing HTML – Nested Tags. I commonly get improperly formatted HTML. Here’s an example: <h2><b></b><i></i>I Wrote This In Microsoft Word!</h2> <h2></h2> <p>This is a great image!</p> <p><img src="http://site.com/sampleimage.png" /></p> I want to remove all of the empty tags. What’s the RegEx?
  22. 22. Practical SEO Uses: Fixing HTML – Nested Tags. RegEx Find: <[a-z0-9]{1,6}></[a-z0-9]{1,6}> RegEx Replace: (leave empty). Run the replace again if removing one empty tag leaves another empty tag behind.
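A minimal sketch of the empty-tag cleanup, assuming Python’s re module. The helper name remove_empty_tags is hypothetical, and the loop is an addition so that tags which only become empty after an inner tag is removed are also caught.

```python
import re

EMPTY_TAG = re.compile(r"<[a-z0-9]{1,6}></[a-z0-9]{1,6}>")


def remove_empty_tags(html: str) -> str:
    """Strip empty tag pairs, repeating until nothing more changes."""
    while True:
        cleaned = EMPTY_TAG.sub("", html)
        if cleaned == html:
            return cleaned
        html = cleaned


sample = (
    "<h2><b></b><i></i>I Wrote This In Microsoft Word!</h2>\n"
    "<h2></h2>\n"
    "<p>This is a great image!</p>\n"
    '<p><img src="http://site.com/sampleimage.png" /></p>'
)

print(remove_empty_tags(sample))
```

The pattern does not check that the opening and closing tag names are the same, so it is best kept for quick cleanup jobs like this one.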
  23. 23. Practical SEO Uses: Top Level Domains. Find only .bs and .spam top level domains. Here’s the list: • http://www.spam.com/bs • http://bs.com/spam • http://spam.bs.com/balls • http://remove-this.bs/test • http://www.and-this.spam/ What’s the RegEx?
  24. 24. Practical SEO Uses: Top Level Domains. RegEx Find: .*//(.*?)\.(bs|spam)/.* RegEx Replace: $1
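Testing the pattern against the five URLs shows that only the last two match. This sketch assumes Python’s re module and writes the dot before the TLD as \. so that it is matched literally.

```python
import re

urls = [
    "http://www.spam.com/bs",
    "http://bs.com/spam",
    "http://spam.bs.com/balls",
    "http://remove-this.bs/test",
    "http://www.and-this.spam/",
]

pattern = re.compile(r".*//(.*?)\.(bs|spam)/.*")

# Map each matching URL to what group 1 captured (the host without its TLD).
hits = {}
for url in urls:
    match = pattern.match(url)
    if match:
        hits[url] = match.group(1)

print(hits)
```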
  25. 25. Practical SEO Uses: Finding Substrings in Domains. Does the domain contain the words “directory” or “article”? The list: • http://directorylinks.com/spamspam • http://www.spammy.com/link-directory • http://shadyarticles.com/ • http://newyorktimes.com/?article_id=744 • https://bonusarticles.com What’s the RegEx? (If you can match bonusarticles.com without the trailing slash, I salute you!)
  26. 26. Practical SEO Uses: Finding Substrings in Domains. RegEx Find: ^.*?//.*(directory|article).*?(/|\..{2,3}$).*
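The pattern can be checked against the list with a few lines of Python. This is a sketch assuming the re module, with the dot before the TLD written as \. so that the end-of-line branch only fires on a real domain ending.

```python
import re

urls = [
    "http://directorylinks.com/spamspam",
    "http://www.spammy.com/link-directory",
    "http://shadyarticles.com/",
    "http://newyorktimes.com/?article_id=744",
    "https://bonusarticles.com",
]

pattern = re.compile(r"^.*?//.*(directory|article).*?(/|\..{2,3}$).*")

# Keep the URLs where the keyword is followed by a slash or by a TLD at line end.
matches = [url for url in urls if pattern.match(url)]

print(matches)
```

The two URLs that only have the keyword in the path or query string are left out, and bonusarticles.com matches without its trailing slash.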
  27. 27. Practical SEO Uses: Merging Lists. Does the list of URLs contain domains we’ve already disavowed? Say we’re doing a reconsideration request and we don’t want to consider any of the links we’ve already disavowed. So, we have List A, new links with some old links mixed in, that we want cleansed of any of the domains in List B. It’s a whole process. What do you think it is? List A: • http://globeandmail.com/ • http://directorylinks.com/?id=1 • http://spam.com/article • http://mafia-wars.com/torrentz • http://192.233.111/ • http://tomsdiner.net/article • https://thediner.pl/ List B: • http://directorylinks.com/article • http://spam.com/article • http://mafia-wars.com/torrentz • http://192.233.111/
  28. 28. Practical SEO Uses: Merging Lists. First I’d use one of the tricks we learned already to format List B in an easier to manipulate way. The goal is to keep just the domain of each URL in List B. What do you think the RegEx find-replace is to get that?
  29. 29. Practical SEO Uses: Merging Lists. RegEx Find: ^.*?//(.*?)/.* RegEx Replace: $1
  30. 30. Practical SEO Uses: Merging Lists. Great. Now, we’ve learned how to search for substrings (string is a substring of substrings, if that isn’t confusing). How might we turn List B into a set of alternatives that we can search through List A with? A tip: \n is the newline character and you need it. What’s the RegEx? List B is now: • directorylinks.com • spam.com • mafia-wars.com • 192.233.111
  31. 31. Practical SEO Uses: Merging Lists. RegEx Find: \n RegEx Replace: |
  32. 32. Practical SEO Uses: Merging Lists. If you did it right, List B should now be a single line: directorylinks.com|spam.com|mafia-wars.com|192.233.111 What’s the final step we need to be able to search List A with the substrings in List B?
  33. 33. Practical SEO Uses: Merging Lists. Wrap the line in a group, escape the dots, and search List A with it: .*(directorylinks\.com|spam\.com|mafia-wars\.com|192\.233\.111).*
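The whole Merging Lists process can be sketched end to end in Python. It assumes the re module; the variable names are illustrative, and re.escape is an addition that makes sure the dots in the domains are matched literally.

```python
import re

list_a = [
    "http://globeandmail.com/",
    "http://directorylinks.com/?id=1",
    "http://spam.com/article",
    "http://mafia-wars.com/torrentz",
    "http://192.233.111/",
    "http://tomsdiner.net/article",
    "https://thediner.pl/",
]
list_b = [
    "http://directorylinks.com/article",
    "http://spam.com/article",
    "http://mafia-wars.com/torrentz",
    "http://192.233.111/",
]

# Step 1: trim each List B URL to its domain (the Trim To Root find-replace).
domains = [re.sub(r"^.*?//(.*?)/.*", r"\1", url) for url in list_b]

# Step 2: join the domains with | (the slides do this by replacing \n with |).
alternation = "|".join(re.escape(domain) for domain in domains)

# Step 3: wrap the alternation in a group and search List A with it.
disavowed = re.compile(".*(" + alternation + ").*")

# Keep only the links whose domain has not been disavowed yet.
new_links = [url for url in list_a if not disavowed.match(url)]

print(new_links)
```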
  34. 34. Practical SEO Uses: Finding Client Anchor in HTML. Screaming Frog lets you use Regular Expressions in your searches. One use of this feature is finding out whether someone is actually linking to your website, because all legitimate anchors share the same format: <a (any or no attributes) href="any variation of your URL" (any or no attributes)>(possible other tags)anchor text(possible other tags)</a> In the attached HTML document, find all 3 links to Mooz.com. Bonus: Find only the 2 links to Mooz.com that contain the anchor text “Cow Melk” or “Milk.”
  35. 35. Practical SEO Uses: Finding Client Anchor in HTML. RegEx Find (all links): <a.{0,100}href=.{0,100}mooz\.com RegEx Find (bonus): <a.{0,100}href=.{0,100}?mooz\.com(.{0,100}?)(Cow Melk|Milk)
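The attached HTML document is not reproduced here, so the sketch below uses a hypothetical stand-in with the same shape: three links to mooz.com, two of them with the wanted anchor text. It assumes Python’s re module rather than Screaming Frog’s search.

```python
import re

# Hypothetical stand-in for the attached HTML document.
html_lines = [
    '<a href="http://www.mooz.com/">Cow Melk</a>',
    '<a class="nav" href="http://mooz.com/products" rel="nofollow"><b>Milk</b></a>',
    '<a href="http://mooz.com/about">About us</a>',
    '<a href="http://example.com/">Milk substitutes</a>',
    "<p>mooz.com is mentioned here without a link</p>",
]

any_link = re.compile(r"<a.{0,100}href=.{0,100}mooz\.com")
milk_link = re.compile(r"<a.{0,100}href=.{0,100}?mooz\.com(.{0,100}?)(Cow Melk|Milk)")

# All anchors that point at mooz.com.
links = [line for line in html_lines if any_link.search(line)]

# Only the anchors that point at mooz.com and contain the wanted anchor text.
milk_links = [line for line in html_lines if milk_link.search(line)]

print(len(links), len(milk_links))
```

The {0,100} limits keep a match from running on from one anchor into an unrelated part of the page.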
  36. 36. RegEx Puzzles for Homework
  37. 37. RegEx Puzzles for Homework: Resources. Sample HTML: https://docs.google.com/file/d/0B9QXdjV-pBueNi1pSy1HOV9rcjQ/edit?usp=sharing Sample URLs: https://docs.google.com/file/d/0B9QXdjV-pBueVEluY002TklzMnc/edit?usp=sharing
  38. 38. RegEx Puzzles for Homework: Puzzles. Some puzzles: • Show only the domain, no sub-domain, with a find-replace. • Find all links that are obviously from a blog. • Format a list of links as domains in a comma separated list.
  39. 39. RegEx Puzzles for Homework: No Sub-Domains. Show only the domain, no sub-domain, with a find-replace. • http://www.georgebrown.com/www-non-www • http://blog.russian.me/ • https://russian.eu/ • http://screw.you.regex.net/ What’s the RegEx?
  40. 40. RegEx Puzzles for Homework: No Sub-Domains. RegEx Find: ^.*?//(.*\.)*(.*)\.(.{2,3})/.* RegEx Replace: $2.$3
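A sketch of the No Sub-Domains answer, assuming Python’s re module, with the literal dots written as \. and with \2.\3 standing in for Sublime Text’s $2.$3:

```python
import re

urls = [
    "http://www.georgebrown.com/www-non-www",
    "http://blog.russian.me/",
    "https://russian.eu/",
    "http://screw.you.regex.net/",
]

# Group 1 swallows any sub-domains, group 2 is the domain name, group 3 the TLD.
pattern = re.compile(r"^.*?//(.*\.)*(.*)\.(.{2,3})/.*")

domains = [pattern.sub(r"\2.\3", url) for url in urls]

print(domains)
```

Because the TLD is limited to two or three characters, a two-part suffix such as co.uk would come out as co.uk on its own, so this is a sketch for simple lists like the one above.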
  41. 41. RegEx Puzzles for Homework: Blog or RSS. In the attached sample-urls.txt, find all links that are obviously from a blog or RSS feed. What’s the RegEx?
  42. 42. RegEx Puzzles for Homework: Blog or RSS. RegEx Find: .*(/blog|/article|feed\.|/feed).*
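The attached sample-urls.txt is not reproduced here, so this sketch borrows three URLs from the Comma Separated Domains puzzle and adds two hypothetical example.com feed URLs. It assumes Python’s re module.

```python
import re

pattern = re.compile(r".*(/blog|/article|feed\.|/feed).*")

urls = [
    "http://www.buzzstream.com/blog/competitive-link-building.html",
    "http://www.cio.com/article/738249/",
    "http://www.cansinmert.com/",
    "http://feed.example.com/latest",  # hypothetical feed sub-domain
    "http://www.example.com/feed",     # hypothetical feed path
]

# Keep the URLs that contain one of the blog or feed markers.
blog_or_rss = [url for url in urls if pattern.match(url)]

print(blog_or_rss)
```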
  43. 43. RegEx Puzzles for Homework: Comma Separated Domains. Format a list of links as domains in a comma separated list. The links: • http://www.business2community.com/seo • http://www.buzzstream.com/blog/competitive-link-building.html • http://www.cansinmert.com/ • http://www.canuckseo.com/index.php/2010 • http://www.cio.com/article/738249/ • http://www.clicktivist.org/ Should be: www.domain.com, www.domain2.com, etc. What’s the RegEx?
  44. 44. RegEx Puzzles for Homework: Comma Separated Domains. RegEx Find: (|\n).*//(.*?)/.* Replace With: $2, (a comma followed by a space). Then delete the trailing comma.
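The comma separated list can be produced with one substitution plus a cleanup step. This sketch assumes Python’s re module. It uses a lazy (.*?) for the domain, because a greedy (.*) would run on to the last slash and pull part of the path into the result.

```python
import re

links = """http://www.business2community.com/seo
http://www.buzzstream.com/blog/competitive-link-building.html
http://www.cansinmert.com/
http://www.canuckseo.com/index.php/2010
http://www.cio.com/article/738249/
http://www.clicktivist.org/"""

# (|\n) swallows the line break in front of every URL after the first one,
# and group 2 captures the domain up to the first slash.
joined = re.sub(r"(|\n).*//(.*?)/.*", r"\2, ", links)

# Delete the trailing comma and space.
domains = joined.rstrip(", ")

print(domains)
```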
  45. 45. http://www.smbc-comics.com/
  46. 46. Questions?
  47. 47. Thanks for Hanging Out. Stay in Touch: • Twitter: @troyfawkes • Google+: http://gplus.to/TroyFawkes • Email: troy@poweredbysearch.com • www.poweredbysearch.com • www.troyfawkes.com
