White Hat Cloaking

My SMX Advanced Presentation on White Hat Cloaking

Published in: Technology, Design

  1. White Hat Cloaking: Six Practical Applications
       Presented by Hamlet Batista
  2. Why white hat cloaking?
       • "Good" vs. "bad" cloaking is all about your intention
       • Always weigh the risks versus the rewards of cloaking
       • Ask permission, or just don't call it cloaking!
       • Cloaking vs. "IP delivery"
  3. Crash course in white hat cloaking
       1. When to cloak?
       2. How do we cloak?
       3. Practical scenarios and alternatives
       4. How can cloaking be detected?
       5. Risks and next steps
       Practical scenarios where good cloaking makes sense
  4. When is it practical to cloak?
       • Content accessibility
           ◦ Search-unfriendly content management systems
           ◦ Rich media sites
           ◦ Content behind forms
       • Membership sites
           ◦ Free and paid content
       • Site structure improvements
           ◦ Alternative to PR sculpting via "nofollow"
       • Geolocation/IP delivery
       • Multivariate testing
  5. Practical scenario #1: Proprietary website management systems that are not search-engine friendly
       Regular users see
       • URLs with many dynamic parameters
       • URLs with session IDs
       • URLs with canonicalization issues
       • Missing titles and meta descriptions
       Search engine robot sees
       • Search engine friendly URLs
       • URLs without session IDs
       • URLs with a consistent naming convention
       • Automatically generated titles and meta descriptions
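The URL side of scenario #1 can be illustrated with a small sketch. It assumes session IDs travel in query parameters; the names in `SESSION_PARAMS` and the function name `robot_friendly_url` are hypothetical and not taken from the presentation.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names; a real CMS will use its own.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def robot_friendly_url(url: str) -> str:
    """Drop session IDs and put the remaining parameters in a stable order."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    params.sort()  # one consistent ordering avoids duplicate URLs for the same page
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(params), "")
    )
```

A robot requesting `http://Example.com/item?sid=abc123&b=2&a=1` would be given the URL without the session ID and with the parameters sorted.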
  6. Practical scenario #2: Sites built completely in Flash, Silverlight or any other rich media technology
       Search engine robot sees
       • A text representation of all graphical (image) elements
       • A text representation of all motion (video) elements
       • A text transcription of all audio in the rich media content
  7. Practical scenario #3: Membership sites
       Search users see
       • Snippets of premium content on the SERPs
       • A registration form when they land on the site
       Members see
       • The same content search engine robots see
  8. Practical scenario #4: Sites requiring massive site structure changes to improve index penetration
       • Regular users follow a link structure designed for ease of navigation
       • Search engine robots follow a link structure designed for ease of crawling and deeper index penetration of the most important content
       [Diagram: two link structures connecting Step 1 through Step 5]
  9. Practical scenario #5: Sites using geolocation technology
       Regular users see
       • Content tailored to their geographical location and/or language
       Search engine robot sees
       • The same content consistently
  10. Practical scenario #6: Split testing organic search landing pages
       Each regular user sees
       • One of the content experiment alternatives
       Search engine robot sees
       • The same content consistently
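A minimal sketch of the split-testing rule in scenario #6: regular users are bucketed into one of the alternatives, while robots always get the same page. The variant names and the function `pick_landing_page` are hypothetical, and hashing a visitor ID is only one possible way to keep a user in the same bucket.

```python
import hashlib

# Hypothetical experiment arms; the first one is what robots always receive.
VARIANTS = ["control", "variant_b", "variant_c"]


def pick_landing_page(visitor_id: str, is_robot: bool) -> str:
    """Return the variant to serve: fixed for robots, stable per visitor otherwise."""
    if is_robot:
        return VARIANTS[0]
    # Hashing the visitor ID gives the same visitor the same variant on every visit.
    digest = hashlib.sha256(visitor_id.encode("utf-8")).digest()
    return VARIANTS[digest[0] % len(VARIANTS)]
```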
  11. How do we cloak?
       Cloaking is performed with a web server script or module.
       Search robot detection
       • By HTTP user agent
       • By IP address
       • By HTTP cookie test
       • By JavaScript/CSS test
       • By DNS double check
       • By visitor behavior
       • By combining all the techniques
       Content delivery
       • Presenting the equivalent of the inaccessible content to robots
       • Presenting the search-engine friendly content to robots
       • Presenting the content behind forms to robots
  12. Robot detection by HTTP user agent
       A very simple robot detection technique
       Search robot HTTP request:
       66.249.66.1 - - [04/Mar/2008:00:20:56 -0500] "GET /2007/11/13/game-plan-what-marketers-can-learn-from-strategy-games/ HTTP/1.1" 200 61477 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
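As a rough illustration of the user-agent check, the sketch below pulls the user-agent field out of a log line in the format shown on the slide and looks for a robot token. The regular expression, the token list, and the function names are assumptions for illustration; the user agent is trivially spoofed, so this only tells you what a visitor claims to be.

```python
import re
from typing import Optional

# Combined log format; any trailing quoted field (as on the slide) is ignored.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Illustrative, incomplete list of tokens that crawlers put in their user agent.
ROBOT_TOKENS = ("googlebot", "slurp", "msnbot")


def user_agent_of(log_line: str) -> Optional[str]:
    """Return the user-agent field of a log line, or None if it does not parse."""
    match = LOG_RE.match(log_line)
    return match.group("agent") if match else None


def claims_to_be_robot(user_agent: str) -> bool:
    """True if the user agent identifies itself as one of the listed robots."""
    lowered = user_agent.lower()
    return any(token in lowered for token in ROBOT_TOKENS)
```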
  13. Robot detection by HTTP cookie test
       Another simple robot detection technique, but weaker
       Search robot HTTP request:
       66.249.66.1 - - [04/Mar/2008:00:20:56 -0500] "GET /2007/11/13/game-plan-what-marketers-can-learn-from-strategy-games/ HTTP/1.1" 200 61477 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "Missing cookie info"
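The cookie test can be sketched as follows, assuming the site sets a test cookie on the first response and checks for it on later requests. The cookie name `seen_before` and both function names are hypothetical. Users who block cookies look the same as robots here, which is one reason the technique is weak.

```python
from http.cookies import SimpleCookie
from typing import Optional

TEST_COOKIE = "seen_before"  # hypothetical cookie set on the first response


def has_test_cookie(cookie_header: Optional[str]) -> bool:
    """True if the request's Cookie header carries the test cookie."""
    if not cookie_header or cookie_header == "-":  # "-" is how logs show "none"
        return False
    jar = SimpleCookie()
    jar.load(cookie_header)
    return TEST_COOKIE in jar


def classify_by_cookie(cookie_header: Optional[str], is_first_request: bool) -> str:
    """A first request proves nothing; a later one without the cookie is suspect."""
    if is_first_request:
        return "unknown"
    return "likely human" if has_test_cookie(cookie_header) else "possible robot"
```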
  14. Robot detection by JavaScript/CSS test
       Another option for robot detection
       DHTML content
       HTML code:
       <div id="header"><h1><a href="http://www.example.com" title="Example Site">Example site</a></h1></div>
       The CSS code is pretty straightforward: it swaps out anything in the h1 tag in the header with an image.
       CSS code:
       /* CSS Image replacement */
       #header h1 {margin:0; padding:0;}
       #header h1 a {
         display: block;
         padding: 150px 0 0 0;
         background: url(path to image) top right no-repeat;
         overflow: hidden;
         font-size: 1px;
         line-height: 1px;
         height: 0px !important;
         height /**/:150px;
       }
  15. Robot detection by IP address
       A more robust robot detection technique
       Search robot HTTP request:
       66.249.66.1 - - [04/Mar/2008:00:20:56 -0500] "GET /2007/11/13/game-plan-what-marketers-can-learn-from-strategy-games/ HTTP/1.1" 200 61477 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
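An IP-based check might look like the sketch below. The network `66.249.64.0/19` is an assumed example range that happens to contain the address from the log line; crawler ranges change, so a real list has to be kept up to date from a trusted source. The names `KNOWN_ROBOT_NETWORKS` and `ip_in_known_robot_range` are hypothetical.

```python
import ipaddress

# Assumed example range containing 66.249.66.1; not an authoritative list.
KNOWN_ROBOT_NETWORKS = [ipaddress.ip_network("66.249.64.0/19")]


def ip_in_known_robot_range(ip: str) -> bool:
    """True if the visitor's IP falls inside one of the listed robot networks."""
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:  # malformed address in the request or log
        return False
    return any(address in network for network in KNOWN_ROBOT_NETWORKS)
```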
  16. Robot detection by double DNS check
       A more robust robot detection technique
       Search robot HTTP request:
       nslookup
       66.249.66.1
       Name: crawl-66-249-66-1.googlebot.com
       Address: 66.249.66.1
       crawl-66-249-66-1.googlebot.com
       Non-authoritative answer:
       Name: crawl-66-249-66-1.googlebot.com
       Address: 66.249.66.1
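The two lookups above can be sketched in code: a reverse lookup of the visitor's IP, a check that the host name belongs to the crawler's domain, then a forward lookup that must return the original IP. The suffix list and function names are assumptions; the resolver functions are passed in as parameters so the logic can be exercised without network access.

```python
import socket
from typing import Callable, List

# Domain taken from the nslookup output; other engines use other domains.
TRUSTED_SUFFIXES = (".googlebot.com",)


def reverse_lookup(ip: str) -> str:
    """IP address to host name (PTR record)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> List[str]:
    """Host name back to its list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def double_dns_check(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """True only if reverse DNS names a trusted domain and forward DNS confirms the IP."""
    try:
        host = reverse(ip).lower().rstrip(".")
        if not host.endswith(TRUSTED_SUFFIXES):
            return False
        return ip in forward(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

The forward lookup matters because anyone who controls reverse DNS for their own IP block can make it return a name ending in `googlebot.com`.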
  17. Robot detection by visitor behavior
       Robots differ substantially from regular users when visiting a website
  18. Combining the best of all techniques
       • User behavior check: label as a possible robot any visitor with suspicious behavior
       • User agent check: label as a robot anything that identifies as such
       • IP address check: maintain a cache with a list of known search robots to reduce the number of verification attempts
       • Double DNS check: confirm it is a robot by doing a double DNS check; also confirm suspect robots
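One way the combined approach could fit together is sketched below: cheap checks flag candidates, an expensive verification step (such as a double DNS check) confirms them, and a cache keyed by IP avoids repeating that verification. The class name `RobotClassifier`, the token list, and the injected `verify` callable are hypothetical; the cache never expires here, which a real deployment would need to handle.

```python
from typing import Callable, Dict

ROBOT_TOKENS = ("googlebot", "slurp", "msnbot")  # illustrative, incomplete


class RobotClassifier:
    """Flag by user agent or behavior, confirm with `verify`, remember the verdict."""

    def __init__(self, verify: Callable[[str], bool]) -> None:
        self._verify = verify              # e.g. a double DNS check
        self._cache: Dict[str, bool] = {}  # IP -> confirmed verdict

    def is_robot(self, ip: str, user_agent: str, suspicious_behavior: bool = False) -> bool:
        if ip in self._cache:              # known IP: skip verification
            return self._cache[ip]
        claims = any(token in user_agent.lower() for token in ROBOT_TOKENS)
        if not (claims or suspicious_behavior):
            return False                   # ordinary visitor, nothing to confirm
        verdict = self._verify(ip)         # confirm declared and suspect robots
        self._cache[ip] = verdict
        return verdict
```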
  19. Clever cloaking detection
       A clever detection technique is to check the caches at the newest datacenters
       • IP-based detection techniques rely on an up-to-date list of robot IPs
       • Search engines change IPs on a regular basis
       • It is possible to identify those new IPs and check the cache
  20. Risks of cloaking
       Search engines do not want to accept any type of cloaking
       Survival tips
       • The safest way to cloak is to ask for permission from each of the search engines that you care about
       • Refer to it as IP delivery
       • "Cloaking: Serving different content to users than to Googlebot. This is a violation of our webmaster guidelines. If the file that Googlebot sees is not identical to the file that a typical user sees, then you're in a high-risk category. A program such as md5sum or diff can compute a hash to verify that two different files are identical."
       • http://googlewebmastercentral.blogspot.com/2008/06/how-google-defines-ip-delivery.html
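The md5sum comparison mentioned in the quote can be approximated in a few lines. This sketch assumes you have already saved the response body served to a regular user and the one served to the robot; the function names are hypothetical.

```python
import hashlib


def md5_hex(body: bytes) -> str:
    """Hex MD5 digest of a response body, like the output of md5sum."""
    return hashlib.md5(body).hexdigest()


def same_content(user_body: bytes, robot_body: bytes) -> bool:
    """True if the user and the robot received byte-identical content."""
    return md5_hex(user_body) == md5_hex(robot_body)
```

Any difference at all, even whitespace, changes the hash, so a mismatch says the files differ but not by how much; a diff shows where.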
  21. Next Steps
       • Make sure clients understand the risks/rewards of implementing white hat cloaking
       • More information and how to get started
           ◦ How Google defines IP delivery, geolocation and cloaking http://googlewebmastercentral.blogspot.com/2008/06/how-google-defines-ip-delivery.html
           ◦ First Click Free http://googlenewsblog.blogspot.com/2007/09/first-click-free.html
           ◦ Good Cloaking, Evil Cloaking and Detection http://searchengineland.com/070301-065358.php
           ◦ YADAC: Yet Another Debate About Cloaking Happens Again http://searchengineland.com/070304-231603.php
           ◦ Cloaking is OK Says Google http://blog.venture-skills.co.uk/2007/07/06/cloaking-is-ok-says-google/
           ◦ Advanced Cloaking Technique: How to feed password-protected content to search engine spiders http://hamletbatista.com/2007/09/03/advanced-cloaking-technique-how-to-feed-password-protected-content-to-search-engine-spiders/
  22. I would be happy to help. Feel free to contact me.
       • Blog http://hamletbatista.com
       • LinkedIn http://www.linkedin.com/in/hamletbatista
       • Facebook http://www.facebook.com/people/Hamlet_Batista/613808617
       • Twitter http://twitter.com/hamletbatista
       • E-mail [email_address]
