Seo book


A free SEO book that covers the technical side of Search Engine Optimisation.


  1. 1. 1
  2. 2. Contents

I Theory ... 5
  0.1 Why SEO is important ... 5
  0.2 Different needs from SEO ... 5
1 What is a Search Engine? ... 7
  1.1 History of Search Engines ... 7
  1.2 Important Issues ... 8
    1.2.1 Performance ... 8
    1.2.2 Dynamic Data ... 8
    1.2.3 Scalability ... 8
    1.2.4 Spam and Manipulation ... 8
  1.3 How a Search Engine works ... 9
    1.3.1 Text acquisition ... 10
    1.3.2 Duplicate Content Detection ... 10
    1.3.3 Text transformation ... 11
    1.3.4 Index Creation ... 12
    1.3.5 User Interaction ... 12
    1.3.6 Ranking ... 12
    1.3.7 Evaluation ... 12
2 How good can a search engine be? ... 13
  2.1 NP Hard Problems ... 13
  2.2 AI Hard Problems ... 14
  2.3 Competitors ... 15
3 Ranking Factors ... 15
  3.1 On Page Factors ... 16
  3.2 Off Page Factors ... 17
  3.3 Google PageRank Notes ... 18
    3.3.1 Short Description ... 19
    3.3.2 Mathematical Description ... 19
    3.3.3 Interesting Notes on the Original Implementation of PageRank ... 20
    3.3.4 Optimal Linking Strategies ... 21
    3.3.5 Implementation to make computing PageRank faster ... 23
    3.3.6 HITS ... 23
    3.3.7 Is linking out a good thing? ... 23
    3.3.8 TrustRank / Bad Page Rank ... 24
    3.3.9 Improvements to Google's ranking algorithms ... 25
  3. 3. 4 Detecting Spam and Manipulation ... 27
  4.1 Google Webmaster Guidelines ... 27
  4.2 Penalties ... 28
  4.3 Detecting Manipulation in Content ... 28
  4.4 Detecting Manipulation in Links ... 28
  4.5 Other Methods ... 29
II Practice ... 29
5 An Example Campaign ... 30
  5.1 Company Profile ... 30
  5.2 Goals ... 30
  5.3 Competitor Research ... 30
  5.4 Keyword Research ... 30
  5.5 Content Creation ... 31
  5.6 Website Check ... 31
  5.7 Link Building ... 31
  5.8 Analysis ... 31
  4. 4. Preface

This book aims to provide a general overview of how search engines rank documents in practice, the core of which will remain true even as search engines' algorithms are refined.
  5. 5. Part I
Theory

0.1 Why SEO is important

• A higher search engine result will receive far more clicks than a lower one.

For example, if a search was repeated 1000 times by different users, this is typically how many clicks each result would get.

Position   Clicks
1          222
2          63
3          45
4          32
5          26
6          21
7          18
8          16
9          15
10         16

Source: Leaked AOL click data

• Paid adverts have low click-through rates, and get expensive quickly.

Search Engine   % Organic Click Through Rate   % Paid Result Click Through Rate
Google          72                             28
Yahoo           61                             39
MSN             71                             29
AOL             50                             50
Average         63                             37

"88% of online search dollars are spent on paid results, even though 85% of searchers click on organic results." Vanessa Fox, Marketing in the Age of Google, May 3, 2010

0.2 Different needs from SEO

There are many different reasons you may wish to engage in optimising your search results, including:

• Money - Sales for e-commerce sites are directly correlated with traffic.
• Reputation - Some companies go to the extent of pushing negative articles down in the rankings.
  6. 6. • Branding - Coming top of the results pages is impressive to customers, and matters most in industries where reputation is extremely important.
  7. 7. 1 What is a Search Engine?

1.1 History of Search Engines

The first mechanised information retrieval systems were built by the US military to analyse the mass of documents being captured from the Germans. Research was boosted when the UK and US governments funded research to reduce a perceived science gap with the USSR. By the time the internet was becoming commonplace in the early 1990s, information retrieval was at an advanced stage. Complicated methods, primarily statistical, had been developed, and archives of thousands of documents could be searched in seconds.

Web search engines are a special case of information retrieval systems, applied to the massive collection of documents available on the internet. A typical search engine in 1990 was split into two parts: a web spider that traverses the web following links and creating a local index of the pages, then traditional information retrieval methods to search the index for pages relevant to the user's query and order the pages by some ranking function. Many factors influence a person's decision about what is relevant, such as the current task, context and freshness.

In 1998 pages were primarily ranked by their textual content. Since this is entirely controlled by the owner of the page, results were easy to manipulate, and as the Internet became ever more commercialized the noise from spam in SERPs (search engine results pages) made search a frustrating activity. It was also hard to tell websites which more people would want to visit, for example a celebrity's official home page, from less wanted websites with similar content, for example a site.
For these reasons directory sites such as Yahoo were still popular, despite being out of date and making the user work out the relevance.

The innovation of PageRank (named after Larry Page) by Google's founders Larry Page and Sergey Brin, and of a similar algorithm also released in 1998 called Hyperlink-Induced Topic Search (HITS) by Jon Kleinberg, was to use the additional meta information from the link structure of the Internet. A more detailed description of PageRank will follow in [chapter], but for now Google's own description will suffice.

"PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves important weigh more heavily and help to make other pages important."

It is impossible to know how Google has evolved their algorithms since the 1998 paper that launched PageRank, or how an efficient real world implementation differs from the theory. Still, as Google themselves say, "the PageRank algorithm remains the heart of Google's software ... and continues to provide the basis for all of [their] web search tools". The search engines continue to evolve at a blistering pace, improving their ranking algorithms (Google says there are now over 200 ranking factors considered for each search [1]), and indexing a growing Internet more rapidly.

  8. 8. 1.2 Important Issues

Building a system as complex as a modern search engine is all about balancing different positive qualities. For example, you could effectively prevent low quality spam by paying humans to review every document on the web, but the cost would be immense. Or you could speed up your search engine by considering only every other document your spider encounters, but the relevance of results would suffer. Some things, such as getting a computer to analyse a document with the same quality as a human, are theoretically impossible today, but Google in particular is pushing boundaries and getting ever closer.

Search engines have some particular considerations:

1.2.1 Performance

The response time to a user's query must be lightning fast.

1.2.2 Dynamic Data

Unlike a traditional information retrieval system in a library, the pages on the Internet are constantly changing.

1.2.3 Scalability

Search engines need to work with billions of users searching through trillions of documents, distributed across the Earth.

1.2.4 Spam and Manipulation

Actively working against other humans to maintain the relevancy of results is relatively unique to search engines. In a library system you may have an author who creates a long title packed with words their readers may be interested in, but that's about the worst of it. When designing your search engine you are in a constant battle with adversaries who will attempt to reverse engineer your algorithm to find the easiest ways to affect your results. A common term for this relationship is Adversarial Information Retrieval. The relationship between the owner of a Web site trying to rank high on a search engine and the search engine designer is an adversarial relationship in a zero-sum game. That is, assuming the results were better before, every gain for the web site owner is a loss for the search engine designer.
It can be somewhat tricky to classify where your efforts cross the line: from helping a search engine be aware of your web site's content and popularity, which should help to improve a search engine's results, to ranking beyond your means and decreasing the quality of a search engine's results.

[1] See

  9. 9. The practicalities of what search engines consider to be spam, and as importantly what they can detect and fix, will be discussed later.

According to Web Spam Taxonomy [2], approximately 10-15% of indexed content on the web is spam. What is considered spam and duplicate content varies, which makes this statistic hard to verify. There is a core of about 56 million pages [3] that are highly interlinked at the center of the Internet, and are less likely to be spam. Documents further away (in link steps) from this core are more likely to be spam.

Deciding the quality of a document well (say whether it is a page written by an expert in the field, or generated by a computer program using natural language processing) is an AI Complete problem, that is it won't be possible until we have artificial intelligence that can match that of a human.

However, search engines hope to get spam under control by lessening the financial incentive of spam. This quote from a Microsoft Research paper [4] expresses this nicely:

"Effectively detecting web spam is essentially an arms race between search engines and site operators. It is almost certain that we will have to adapt our methods over time, to accommodate for new spam methods that the spammers use. It is our hope that our work will help the users enjoy a better search experience on the web. Victory does not require perfection, just a rate of detection that alters the economic balance for a would-be spammer. It is our hope that continued research on this front can make effective spam more expensive than genuine content."

Google developers for their part describe web spam as follows [5], citing the detrimental impact it has upon users:

"These manipulated documents can be referred to as spam. When a user receives a manipulated document in the search results and clicks on the link to go to the manipulated document, the document is very often an advertisement for goods or services unrelated to the search query or a pornography website or the manipulated document automatically forwards the user on to a website unrelated to the user's query."

1.3 How a Search Engine works

A typical search engine can be split into two parts: indexing, where the Internet is transformed into an internal representation that can be efficiently searched, and the query process, where the index is searched for the user query and documents are ranked and returned to the user in a list.

Indexing

[2] Zoltán Gyöngyi and Hector Garcia-Molina, Stanford University. First International Workshop on Adversarial Information Retrieval on the Web, May 2005
[3] See On Determining Communities in the Web by K Verbeurg
[4] See Detecting Spam Web Pages through Content Analysis by A Ntoulas
[5] See patent 7302645: Methods and systems for identifying manipulated articles
  10. 10. 1.3.1 Text acquisition

A crawler starts at a seed site such as the DMOZ directory, then repeatedly follows links to find documents across the web, storing the content of the pages and associated meta data (such as the date of indexing and which page linked to the site). In a modern search engine the crawler is constantly running, downloading thousands of pages simultaneously, to continuously update and expand the index. A good crawler will cover a large percentage of the pages on the Internet, and visit popular pages frequently to keep its index fresh. A crawler will connect to the web server and use an HTTP request to retrieve the document, if it has changed. On average, Web page updates follow a Poisson process - that is, the crawler can expect the time until the web page next updates to follow an exponential distribution. Crawlers are now also indexing near real time data through varying sources such as access to RSS feeds and the Twitter API, and are able to index a range of formats such as PDFs and Flash. These formats are converted into a common intermediate format such as XML. A crawler can also be asked to update its copy of a page via methods such as a ping or XML sitemap, but the update time will still be up to the crawler.

The document data store holds the text and meta data the crawler retrieves; it must allow for very fast access to a large number of documents. Text can be compressed relatively easily, and pages are typically indexed by a hash of their URL. Google's original patent used a system called BigTable; Google now keeps documents in sections called shards distributed over a range of data centres (this offers performance, redundancy and security benefits).

1.3.2 Duplicate Content Detection

Detecting exact duplicates is easy: remove the boilerplate content (menus etc.), then compare the core text through checksums.
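The checksum comparison for exact duplicates can be sketched in a few lines. This is only an illustration under stated assumptions: boilerplate is taken to be already removed, SHA-256 stands in for whatever checksum a real engine uses, and the function names are hypothetical rather than taken from any search engine.

```python
import hashlib
import re


def core_text_checksum(text):
    """Return a SHA-256 checksum of the text after normalising case and whitespace."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def find_exact_duplicates(pages):
    """Group URLs whose core text produces the same checksum.

    `pages` maps a URL to its core text (boilerplate assumed already removed).
    """
    groups = {}
    for url, text in pages.items():
        groups.setdefault(core_text_checksum(text), []).append(url)
    # Only checksums shared by more than one URL indicate duplicates.
    return {digest: urls for digest, urls in groups.items() if len(urls) > 1}
```

Because only the fixed-length checksums are compared, each new page needs a single dictionary lookup rather than a text comparison against every stored page.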
Detecting near duplicates is harder, particularly if you want to build an algorithm that is fast enough to compare a document against every other document in the index. To perform faster duplicate detection, fingerprints of a document are taken. A simple fingerprinting algorithm for this is outlined here:

  1. Parse the document into words, and remove formatting content such as punctuation and HTML tags.
  2. The words are grouped into groups of words (called n-grams, a 3-gram being 3 words, a 4-gram 4 words etc.).
  3. Some of these n-grams are selected to represent a document.
  4. The selected n-grams are hashed to create a shorter description.
  5. The hash values are stored in a quick look up database.
  6. The documents are compared by looking at overlaps of fingerprints.
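The steps above can be sketched as follows. This is a minimal illustration, not the method of any particular search engine: the selection rule (keep n-grams whose hash is 0 modulo a small number), the use of SHA-1, and the Jaccard overlap measure are all assumptions made for the example. Selection happens after hashing here, which merges steps 3 and 4, and plain in-memory sets stand in for the look up database of step 5.

```python
import hashlib
import re


def words(document):
    """Step 1: strip HTML tags and punctuation, returning lowercase words."""
    without_tags = re.sub(r"<[^>]+>", " ", document)
    return re.findall(r"[a-z0-9]+", without_tags.lower())


def ngrams(tokens, n=3):
    """Step 2: group consecutive words into n-grams."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def hash_ngram(gram):
    """Step 4: hash an n-gram down to a 64-bit integer."""
    digest = hashlib.sha1(gram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint(document, n=3, modulus=4):
    """Steps 3 and 4: keep only the hashes that are 0 modulo `modulus`."""
    hashes = {hash_ngram(gram) for gram in ngrams(words(document), n)}
    return {h for h in hashes if h % modulus == 0}


def overlap(fingerprint_a, fingerprint_b):
    """Step 6: Jaccard overlap of two fingerprints, from 0.0 to 1.0."""
    if not fingerprint_a and not fingerprint_b:
        return 0.0
    shared = fingerprint_a & fingerprint_b
    return len(shared) / len(fingerprint_a | fingerprint_b)
```

A larger `modulus` keeps fewer n-grams per document, which makes the stored fingerprints smaller at the cost of a noisier overlap estimate; `modulus=1` keeps every n-gram.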
  11. 11. Fingerprinting in action

A paper [6] by four Google employees found the following statistics across their index of the web.

Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
Most common trigram in English: "all rights reserved"

Detecting unusual patterns of n-grams can also be used to detect low quality/spam documents [7].

1.3.3 Text transformation

Tokenization is the process of splitting a series of characters up into separate words. These tokens are then parsed to look for tokens such as <a> and </a> to find which parts of the text are plain text, links and such.

• Identifying Content

Sections of documents that are just content are found, in an attempt to ignore boilerplate content such as navigation menus. A simple way is to look for sections where there are few HTML tags; more complicated methods consider the visual layout of the page.

• Stopping

Common words such as "the" and "and" are removed to increase the efficiency of the search engine, resulting in a slight loss in accuracy. In general, the more unusual a word, the better it is at determining if a document is relevant.

[6] See N-gram Statistics in English and Chinese: Similarities and Differences
[7] See
  12. 12. • Stemming

Stemming reduces words to just their stem, for example "computer" and "computing" become "comput". Typically around a 10% improvement is seen in relevance in English, and up to 50% in Arabic. The classic stemmer algorithm is the Porter Stemmer, which works through a series of rules such as "replace sses with ss", which turns "stresses" into "stress".

• Information Extraction

Trying to determine the meaning of text is very difficult in general, but certain words can give clues. For example the phrase "x has worked at y" is useful when building an index of employees.

1.3.4 Index Creation

Document statistics such as the count of words are stored for use in ranking algorithms. An inverted index [8] is created to allow for fast full text searches. The index is distributed across multiple data centres across the globe [9].

1.3.5 User Interaction

The user is provided with an interface in which to give their query. The query is then transformed, using similar techniques to those used on documents, such as stemming, as well as spell checking and expanding the query to find other queries synonymous with the user's query. After ranking the document set, a top set of results is displayed together with snippets to show how they were matched.

1.3.6 Ranking

A scoring function calculates scores for documents. Some parts of the scoring can be performed at query time, others at document processing time.

1.3.7 Evaluation

Users' queries and their actions are logged in detail to improve results. For example, if a user clicks on a result then quickly performs the same search again, it is likely that they clicked a poor result.

[8] An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its document in a set of documents. The purpose of an inverted index is to allow fast full text searches, at a cost of increased processing when a document is added to the database.
[9] A good overview of Google's shard approach is at google-architecture
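To make the inverted index concrete, here is a small sketch that tokenizes documents, drops stop words, and maps each remaining word to the documents containing it. The stop list, the sample documents and the function names are hypothetical, stemming is left out to keep the example short, and a real index would also store positions and document statistics.

```python
import re
from collections import defaultdict

# Assumption: a tiny illustrative stop list; real engines tune much longer ones.
STOP_WORDS = {"a", "and", "for", "of", "the", "to"}


def tokenize(text):
    """Split text into lowercase word tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def build_inverted_index(documents):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every non-stop word in the query."""
    postings = [index.get(token, set()) for token in tokenize(query)]
    if not postings:
        return set()
    return set.intersection(*postings)


documents = {
    1: "Grooming tips for the poodle",
    2: "Poodle puppies for sale",
    3: "Grooming a cat",
}
index = build_inverted_index(documents)
print(search(index, "the poodle"))       # {1, 2}
print(search(index, "poodle grooming"))  # {1}
```

The query is transformed with the same `tokenize` function as the documents, so a lookup only touches the posting sets of the query words instead of scanning every document.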
  13. 13. 2 How good can a search engine be?

There are some very specific limits in computer science as to what a computer program is capable of doing, and these have direct consequences for how search engines can index and rank your web pages. The two core sets of problems are NP-Complete problems, which for large sets of data take too long to solve perfectly, and AI-Complete problems, which can't be done perfectly until we have computers that are as intelligent as people. That doesn't mean search engines can't make approximations: for example, finding the shortest route that visits many stops on a map is an NP-hard problem, yet Google Maps still manages to plot pretty good routes [10].

2.1 NP Hard Problems

Polynomial (P) problems can be solved in polynomial time, that is relatively quickly. NP-hard problems are at least as hard as any problem in NP (nondeterministic polynomial time), and no polynomial-time method of solving them is known, so in practice they can't be solved exactly for any reasonably large set of inputs such as a large number of web pages.

The time taken to solve the NP hard problem (in red) grows extremely quickly as the size of the problem grows.

These concepts become complex quickly, but the key thing to pick up is that if a problem is NP Hard there is no known way to solve it perfectly for something as large as a search engine's index, and approximations will have to be used. There are some NP Hard problems that are of particular interest to SEO:

• The Hamiltonian Path Problem - Detecting a greedy network (i.e. if you interlink your web pages to hoard PageRank) in the structure of a Hamiltonian path [11] is an NP hard problem
• Detecting Page Farms (the set of pages that link to a page) is NP hard [12]

[10]
[11]
[12] See Sketching Landscapes of Page Farms by Bin Zhou and Jian Pei
  14. 14. • Detecting Phrase Level Duplication in a Search Engine's Index [13]

2.2 AI Hard Problems

AI Hard problems require intelligence matching that of a human being to be solved. Examples include the Turing Test (tricking a human into thinking they are talking to a human, not a computer), recognising difficult CAPTCHAs and translating text as well as an expert (who wouldn't be perfect either).

During a question-and-answer session after a presentation at his alma mater, Stanford University, in May 2002, Page said that Google would fulfil its mission only when its search engine was AI-complete, and said something similar in an interview with Newsweek then Playboy.

"I think we're pretty far along compared to 10 years ago," he said. "At the same time, where can you go? Certainly if you had all the world's information directly attached to your brain, or an artificial brain that was smarter than your brain, you'd be better. Between that and today, there's plenty of space to cover." What would a perfect search engine look like? we asked. "It would be the mind of God." [14]

"And, actually, the ultimate search engine, which would understand, you know, exactly what you wanted when you typed in a query, and it would give you the exact right thing back, in computer science we call that artificial intelligence. That means it would be smart, and we're a long way from having smart computers." [15]

Of particular interest to SEO is that fully understanding the meaning of human text is an AI complete problem, and even getting close to understanding words in context is very difficult [16]. This means that automatically telling reasonable quality computer generated text from that of a human expert is tricky. It's not unusual to see websites packed with decent computer generated text (which automatically detecting is an AI complete problem) and single phrases stitched together from a variety of sources (which is an NP complete problem) ranking for Google Trends results.
This is particularly hard to stop because for new news items there are fewer fresh sources available to choose from; this results in search engine poisoning [17]. Any site that receives a large amount of traffic from this will eventually be visited manually by a Google employee, and penalised manually [18].

Google's solution to the very similar machine translation problem is interesting; rather than attempting to build AI they use their massive resources and data stored from web pages and user queries to build a reliable statistical engine - their approach isn't necessarily far smarter than their competitors' but their resources make them the best translator out there.

[13] See Detecting phrase-level duplication on the world wide web by Microsoft Research employees
[14] http: // searchenginewatch. com/ 2156601
[15] http: // tech. fortune. cnn. com/ 2011/ 02/ 17/ is-something-wrong-with-google/
[16]
[17]
[18]

  15. 15. 2.3 Competitors

Although not a classic computer science problem, a big limit to how search engines can treat possible spam is that competitors could attempt to make your website look like it was spamming to lower your ranking, increasing theirs. For example, if your website suddenly receives an influx of low quality links from sites known to link to spam, how would Google know if you naively ordered this or a competitor did?

This is an unsolvable problem, short of non-stop surveillance of all website owners. This is what Google has to say on the matter [19]:

"There's almost nothing a competitor can do to harm your ranking or have your site removed from our index. If you're concerned about another site linking to yours, we suggest contacting the webmaster of the site in question. Google aggregates and organizes information published on the web; we don't control the content of these pages."

I can say from experience that Google bowling most certainly does happen, and there are a couple of experiments written up on the web [20], though it would be very difficult to Google bowl a popular website. Essentially, if a small percentage of links to a site are most likely spam they are just ignored; if a large percentage are likely spam then the links may result in a penalty rather than just being ignored.

It seems likely that poor quality links are increasingly being ignored. The paper Link Spam Alliances from Stanford, the Google founders' alma mater, discusses dated methods of both detecting and punishing potential link spam. Note that link spam isn't the only way that sites can potentially be Google bowled: if your competitor fills your comment section with duplicate content about organ enlargement and links to known phishing sites it is unlikely to help your rankings.
Google now also takes into account users choosing to block sites from results [21], presumably with a negative effect.

3 Ranking Factors

Google engineers update their algorithms daily [22]. They then run many tests to check they have the right balance between all these factors.

The following is from an interview with Google's Udi Manber.

Q: How do you determine that a change actually improves a set of results?

A: We ran over 5,000 experiments last year. Probably 10 experiments for every successful launch. We launch on the order of 100 to 120 a quarter. We have dozens of people working just on the measurement part. We have statisticians who know how to analyze data, we have engineers to build the tools. We have at least 5 or 10 tools where I can go and see here are 5 bad things that happened. Like this particular query got bad results because it didn't find something or the pages were slow or we didn't get some spell correction.

[19]
[20]
[21]
[22]

  16. 16. I have created a spreadsheet that shows how a search engine may calculate the ranking of a trivial set of documents for a particular query; you can view it and try changing things yourself at poodle-a-simple-emulation-of-search-engine-ranking-factors/.

3.1 On Page Factors

• Keywords

Repetitions of the words in the query in the document, particularly in key areas such as the title and headers, are positive signals of relevance. The proximity of the words to each other is important, particularly having the exact query in the document. A very large repetition, particularly in nongrammatical sentences, can be a negative signal of spam. Presence of the query words in the domain and URL are useful signals of relevance. Related phrases to the query are also positive signals of relevance (see Latent Semantic Indexing). The meta keywords HTML tag, <meta name="keywords" content="my, keywords">, is largely ignored by modern search engines [23].

• Quality

A number of different authors on a website, good grammar and spelling, and long pages written at reasonable time intervals are positive signs of high quality content [24].

• Geographical Locality

Mentions of an address close to the user show the document may be geographically relevant to the user, particularly for geographically sensitive queries such as "plumbers in london".

• Freshness

For time-dependent queries, such as news events, recent pages are more likely to be helpful to the user. See Google's Quality Deserves Freshness drive, of which Google's faster indexing Caffeine update was a part.
• Duplicate Content

Large percentages of content duplicated either from the same site or from others is an indicator of poor quality content, and users will only want to see the canonical copy.

[23] See
[24] See
  17. 17. • Adverts

A very large number of adverts can reduce the user experience, and affiliate links are often associated with heavily SEO manipulated websites.

• Outbound Links

Links to spammy or phishing websites, or an unusually large number of outbound links on a number of pages, are common indicators of a page that users will not want to visit [25].

• Spam

An unusual repetition of keywords, particularly outside of sentences, is a sign of spam. Techniques such as hidden text and sneaky JavaScript redirects are relatively easy to detect and punish.

3.2 Off Page Factors

• Site Reliability

Unreliable or slow sites provide a poor user experience, and so will have a penalty applied. You can be warned when this happens if you sign up for Google Webmaster Tools [26].

• Popularity of the Site

Measured from aggregated ISP data that search engines buy, and from search traffic [27].

• Incoming Links / PageRank

The link structure of the internet is a useful pointer to a website's popularity. Anchor text on incoming links related to the query shows a search engine the page is related to the query. Links that remain for a long time from sites that have many links pointing to themselves are rated highly. Links that are in boilerplate areas or sitewide may be ignored. Links that are all identical in anchor text (i.e. blatantly machine generated), from spammy websites (bad neighbourhoods [28]), or thought to be paid for with the intention of manipulating rankings or spam can result in penalties. Links from sites that are most likely owned by the same owner, detected either from Whois data or if the sites are hosted within the same Class C IP range, are likely considered less reliable signals of importance. A normal rate of growth of incoming links is a positive signal, as opposed to bursty starts and stops [29] that indicate link building campaigns [30].

[25] See Improving Web Spam Classifiers Using Link Structure for a very interesting Yahoo patent on detecting spam based on the number of inbound and outbound links
[26] See
[27] See and
[28] See
[29] See
[30] See
  18. 18. • Other indirect signals of a website's popularity

Other data can include mentions in chats, emails and social networks.

• Links from trusted websites

The proximity on the web graph to important, trusted sites. Links from old, high PageRank websites at the centre of the old heavily interconnected internet are useful signals that a website can be trusted and is important [31].

• Links from other sites that rank for the query

Results may be reordered based on how they link to each other.

• Geographical Location

If the geographical location of the server, the website's location according to directories, the top level domain, or the location set in Google Webmaster Tools matches that of the user, it is a signal that the page will be more relevant to the user, particularly for location sensitive searches.

• User Click Data

If users often search again after clicking on the site's result, that is an indicator that the page is not a good match for the query. The personal history of results clicked, and the pattern of related searches, may help indicate what a user is looking for [32].

• Domain Information

Older domains are likely trusted more. Google is a domain registrar so has extensive Whois information, and validates that address information associated with domains is correct.

• Manual Reviews

Google Quality Raters [33] manually review websites and tag them with categories such as essential to query, not relevant to query, or spam.

3.3 Google PageRank Notes

Google's PageRank was the innovation that propelled Google to the top of the search engine pile. Whilst its implementation has changed much since its original description, and many other factors are now taken into account, it is still at the heart of modern search engines so some extra notes will be made on it here.

[31] See and type in for a visual graph
[32] See See
[33] See
  19. 19. 3.3.1 Short Description

The key point is that PageRank considers each link a vote, and links from pages which have many links themselves are considered more important. Or as Google puts it:

"PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results. PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value."

3.3.2 Mathematical Description

It's not essential to have a mathematical understanding of how PageRank is calculated, but for those familiar with basic graph theory and algebra it is useful. You may wish to skip this section, and read a slightly less mathematical description [34]. For a more complete treatment of the mathematics see the original PageRank paper [35], Deeper Inside PageRank by Amy N. Langville and Carl D. Meyer, and this thesis [36]. The following is summarised from Sketching Landscapes of Page Farms [37] by Bin Zhou and Jian Pei:

The Web can be modeled as a directed Web graph G = (V, E), where V is the set of Web pages, and E is the set of hyperlinks. A link from page p to page q is denoted by edge p → q. An edge p → q can also be written as a tuple (p, q).

PageRank measures the importance of a page p by considering how collectively other Web pages point to p directly or indirectly. Formally, for a Web page p, the PageRank score is defined as:

$$PR(p) = d \sum_{p_i \in M(p)} \frac{PR(p_i)}{OutDeg(p_i)} + \frac{1 - d}{N}$$

where M(p) = {q | q → p ∈ E} is the set of pages having a hyperlink pointing to p, OutDeg(p_i) is the out-degree of p_i (i.e., the number of hyperlinks from p_i pointing to some pages other than p_i), N = |V| is the total number of pages, and d is a damping factor (0.85 in the original PageRank implementation) which models the random transitions of the web.
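One common way to evaluate a definition like this is to start from some initial scores and re-apply the equation until the scores stop changing. The sketch below does that for a toy graph. It assumes the normalised form PR(p) = d · Σ PR(p_i)/OutDeg(p_i) + (1 − d)/N, starts from uniform rather than random scores, and spreads the score of pages without outgoing links evenly over all pages; these choices, and the function name, are assumptions for illustration and not a description of Google's implementation.

```python
def pagerank(links, d=0.85, tolerance=1e-9, max_iterations=200):
    """Iterate the PageRank equation on a small graph until the scores settle.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    # Assumption: start from a uniform score for every page.
    scores = {page: 1.0 / n for page in pages}

    for _ in range(max_iterations):
        # Every page receives the (1 - d) / N "random jump" share.
        new_scores = {page: (1.0 - d) / n for page in pages}
        for page in pages:
            # Self-links are ignored, matching the definition of OutDeg.
            targets = [t for t in links.get(page, []) if t != page]
            if targets:
                share = d * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # Assumption: a page with no outgoing links shares its score evenly.
                for target in pages:
                    new_scores[target] += d * scores[page] / n
        change = sum(abs(new_scores[page] - scores[page]) for page in pages)
        scores = new_scores
        if change < tolerance:
            break
    return scores


scores = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
for page in sorted(scores):
    print(page, round(scores[page], 4))
```

In this toy graph page c has no incoming links, so it keeps only the baseline (1 − d)/N, while a and b pass most of their score back and forth; a ends slightly ahead because it also receives c's vote.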
If a damping factor of 0.5 is used then at each page there is a 50/50 chance of the surfer clicking a link or jumping to a random page on the internet. Without the damping factor, rank would drain into "sink" pages that have no outgoing links, and the PageRank of any page with an outgoing link would tend towards 0.

To calculate the PageRank scores for all pages in a graph, one can assign a random PageRank score to each node in the graph, then apply the above equation iteratively until the PageRank scores in the graph converge.

The PageRank shown in the Google Toolbar is on a logarithmic scale out of 10; it is not the actual internal value. For example:

Domain    Calculated PageRank    PageRank displayed in Toolbar
          47                     2
          54093                  5
          84063                  5
          1234567                7
          2364854                7

34 See the Wikipedia article.

3.3.3 Interesting Notes on the Original Implementation of PageRank
From PageRank Uncovered, essential reading for those looking to understand PageRank from an SEO perspective:

• PageRank is a multiplier, applied after relevant results are found
"Remember, PageRank alone cannot get you high rankings. We've mentioned before that PageRank is a multiplier; so if your score for all other factors is 0 and your PageRank is twenty billion, then you still score 0 (last in the results). This is not to say PageRank is worthless, but there is some confusion over when PageRank is useful and when it is not. This leads to many misinterpretations of its worth. The only way to clear up these misinterpretations is to point out when PageRank is not worth while. If you perform any broad search on Google, it will appear as if you've found several thousand results. However, you can only view the first 1000 of them. Understanding why this is so, explains why you should always concentrate on 'on the page' factors and anchor text first, and PageRank last."

• Each page is born with a small amount of PageRank
"A page that is in the Google index has a vote, however small. Thus, the more pages you have in the index the more overall vote you are likely to have. Or, simply put, bigger sites tend to hold a greater total amount of PageRank within their site (as they have more pages to work with)."
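The iterative calculation described in section 3.3.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any search engine computes PageRank: the function name `pagerank`, the three-page example graph, and the choice to spread the rank of pages without outlinks evenly over all pages are all assumptions made for this sketch. It also starts from a uniform score rather than a random one, which does not change the values it converges to.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate the PageRank equation until the scores stop changing.

    links maps each page to the list of pages it links to
    (assumed to contain no self-links or duplicate targets).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # uniform starting scores
    for _ in range(max_iter):
        # Assumption: rank held by pages with no outlinks is shared by all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1 - d) / n + d * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                new[q] += d * rank[p] / len(targets)
        change = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if change < tol:
            break
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
example = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(example)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph C collects votes from both A and B, so it ends up with the highest score, while B, which only receives half of A's vote, ends up with the lowest.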
Note that Google's original algorithm has most likely been amended since to detect and reduce PageRank hoarding, and the generation of PageRank by massive interlinking of auto generated pages. Also, for quicker calculations, an approximation of PageRank which only gives certain seed pages PageRank may be used.

Interestingly, however, there are examples of this working; see "How to get billions of pages indexed in Google". In a related issue, at one point 10% of the German index of MSN Search (now known as Bing) was computer generated content on a single domain.

3.3.4 Optimal Linking Strategies
Deciding how to interlink pages that you own or have influence over is tricky. Interlinking can be a good signal that pages are related and on a certain topic, and it can build PageRank and control PageRank flow. However, heavy interlinking can be a signal of manipulation and spam, and different linking structures can make different sites in your possession rank higher. The mathematics gets tricky fast; here is a quick overview of the literature today:

• Note from Web Spam Taxonomy
Though written about spam farms, the math holds true for good commercial sites too. Essentially this states that maximum PageRank for a target page is achieved by linking only to the target page from forums, blogs etc., then interlinking the network of sites owned (if there are no outlinks on a page, the random surfer will jump to a random page on the Internet).

"1. Inaccessible pages are those that a spammer cannot modify. These are the pages out of reach; the spammer cannot influence their outgoing links. (Note that a spammer can still point to inaccessible pages.)
2. Accessible pages are maintained by others (presumably not affiliated with the spammer), but can still be modified in a limited way by a spammer. For example, a spammer may be able to post a comment to a blog entry, and that comment may contain a link to a spam site.
3. Own pages are maintained by the spammer, who thus has full control over their contents.

We can observe how the presented structure maximizes the total PageRank score of the spam farm, and of page t in particular:
1. All available n own pages are part of the spam farm, maximizing the static score (total PageRank).
2. All m accessible pages point to the spam farm, maximizing the incoming score (incoming PageRank).
3. Links pointing outside the spam farm are suppressed, making PRout (outgoing PageRank) zero.
4. All pages within the farm have some outgoing links, rendering a zero PRsink score component.

Within the spam farm, the score of page t is maximal because:
1. All accessible and own pages point directly to the target, maximizing its incoming score PRin(t).
2. The target points to all other own pages. Without such links, t would had lost a significant part of its score (a non-zero PRsink(t)), and the own pages would had been unreachable from outside the spam farm. Note that it would not be wise to add links from the target to pages outside the farm, as those would decrease the total PageRank of the spam farm."

• From Link Spam Alliances
"The analysis that we have presented show how the PageRank of target pages can be maximized in spam farms. Most importantly, we find that there is an entire class of farm structures that yield the largest achievable target PageRank score. All such optimal farm structures share the following properties:
1. All boosting pages point to and only to the target.
2. All hijacked point to the target.
3. There are some links from the target to one or more boosting pages."

• From Maximizing PageRank via Outlinks
"In this paper we provide the general shape of an optimal link structure for a website in order to maximize its PageRank. This structure with a forward chain and every possible backward link may be not intuitive. At our knowledge, it has never been mentioned, while topologies like a clique, a ring or a star are considered in the literature on collusion and alliance between pages. Moreover, this optimal structure gives new insight into the affirmation of Bianchini et al. that, in order to maximize the PageRank of a website, hyperlinks to the rest of the webgraph should be in pages with a small PageRank and that have many internal hyperlinks. More precisely, we have seen that the leaking pages must be chosen with respect to the mean number of visits before zapping they give to the website, rather than their PageRank."

• From The effect of New Links on PageRank by Xie
"Theorem: The optimal linking strategy for a Web page is to have only one outgoing link pointing to a Web page with a shortest mean first passage time back to the original page.
Conclusions: .... We conclude that having no outgoing link is a bad policy and that the best policy is to link to pages from the same Web community. Surprisingly, a new incoming link might not be good news if a page that points to us gives many other irrelevant links at the same time."

Reading this paper fully, it is only in very particular circumstances that a new incoming link is not good news.
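The claim in section 3.3.4 that a target page should link back into its own pages rather than out of the farm can be checked on a toy graph. The sketch below is a simplified illustration, not the analysis from any of the papers quoted: the four-page web, the page names and the compact `pagerank` helper (which shares the rank of pages without outlinks evenly, an assumption) are all hypothetical.

```python
def pagerank(links, d=0.85, iterations=200):
    """Compact PageRank by repeated application of the basic equation."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = dict.fromkeys(pages, 1.0 / n)
    for _ in range(iterations):
        # Assumption: pages without outlinks share their rank with every page.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = dict.fromkeys(pages, (1 - d) / n + d * dangling / n)
        for p, targets in links.items():
            for q in targets:
                new[q] += d * rank[p] / len(targets)
        rank = new
    return rank


# Hypothetical web: target "t", own pages "b1" and "b2", outside page "e".
farm_closed = {"b1": ["t"], "b2": ["t"], "t": ["b1", "b2"], "e": []}  # t links back into the farm
farm_leaky = {"b1": ["t"], "b2": ["t"], "t": ["e"], "e": []}          # t links outside the farm

closed_score = pagerank(farm_closed)["t"]
leaky_score = pagerank(farm_leaky)["t"]
print(f"target links back into the farm: {closed_score:.3f}")
print(f"target links outside the farm:   {leaky_score:.3f}")
```

With these inputs the target scores noticeably higher when its only outlinks stay inside the farm, which matches the direction of the quoted results. Real link graphs are far larger and search engines apply other signals, so the size of the effect here should not be read as typical.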
3.3.5 Implementation to make computing PageRank faster
There have been a number of proposed improvements to the original PageRank algorithm to improve the speed of calculation 41, and to adapt it to be better at determining quality results. No search engine calculates PageRank as shown in the naive algorithm in the original paper 42.

3.3.6 HITS
HITS is another ranking algorithm that takes into account the pattern of links found throughout the web, and it was developed at around the same time as PageRank, in the late 1990s. HITS treats some pages on the web as authorities, which are good documents on a topic, and others as hubs, which mostly link to authorities.

A page is given a high authority score by being linked to by pages that are recognized as hubs for information. A page is given a high hub score by linking to nodes that are considered to be authorities on the subject.

Unlike PageRank, which is query independent and so computed at indexing time, HITS hub and authority scores are query dependent and so computed (though likely cached) at query time.

3.3.7 Is linking out a good thing?
Whilst TEOMA is the only search engine that uses HITS at its core, its thinking has heavily influenced search engine designers, so it is likely that linking out to high quality authorities can positively influence either a page's ranking (though potentially negatively, if designers want authorities rather than hubs to appear in their results 43), or the importance of the other links it contains. Many webmasters fear linking out to sites as they would rather keep links internal to prevent PageRank flowing out (many webmasters also nofollow links for similar reasons; note that this form of PageRank sculpting no longer works according to Matt Cutts, Google's head of [anti] web spam).

Matt Cutts also said a number of years ago:
"Of course, folks never know when we're going to adjust our scoring. It's pretty easy to spot domains that are hoarding PageRank; that can be just another factor in scoring."
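The hub and authority scoring described in section 3.3.6 can be sketched as a pair of mutually reinforcing updates. This is a minimal illustration on a made-up link graph, assuming a fixed number of iterations and Euclidean normalisation; the function name `hits` and the page names are hypothetical, and a real system would first restrict the graph to pages related to the query.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dictionaries for a small link graph."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = dict.fromkeys(pages, 0.0)
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalise so the scores stay bounded between iterations.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: three directory-style pages linking to two sites.
example_links = {
    "hub1": ["siteA", "siteB"],
    "hub2": ["siteA", "siteB"],
    "hub3": ["siteA"],
}
hub_scores, authority_scores = hits(example_links)
print("authorities:", {p: round(s, 3) for p, s in authority_scores.items()})
print("hubs:", {p: round(s, 3) for p, s in hub_scores.items()})
```

Here siteA is linked to by all three hubs and so receives the highest authority score, while hub3, which links to only one authority, receives a lower hub score than the other two.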
Some search engines are even concerned about people linking out too much. Whilst crawlers can now index a large number of links on a page, a very large number of outbound links often indicates that a site has been hacked with spam links or is machine generated.

"A spammer might manually add a number of outgoing links to well-known pages, hoping to increase the page's hub score. At the same time, the most wide-spread method for creating a massive number of outgoing links is directory cloning" 44.

41 For example, see Computing PageRank using Power Extrapolation and Efficient PageRank Approximation via Graph Aggregation.
42 Matt Cutts has discussed a couple of the implementation details.
43 See Deeper Inside PageRank, discussed earlier.
44 See Web Spam Taxonomy.

3.3.8 TrustRank / Bad Page Rank
It's likely that after results are generated based on relevance, PageRank is applied to help order them, and Trust Rank is then applied to refine that order. A site may lose trust every time it fails some kind of spam test (for example if a large number of reciprocal links, cloaking, duplicate content or fake whois data is found) and gain trust for certain properties (domain age, traffic, being one of a number of important seed sites that are manually tagged as trusted sites). These initial Trust Ranks could then be propagated in a similar way to PageRank, so linking to and from bad neighborhoods would negatively affect the site's Trust Rank through association.

From SEO By The Sea:
"In 2004, a Yahoo whitepaper was published which described how the search engine might attempt to identify web spam by looking at how different pages linked to each other. That paper was mistakenly attributed to Google by a large number of people, most likely because Google was in the process of trademarking the term TrustRank around the same time, but for different reasons. Surprisingly, Google was granted a patent on something it referred to as Trust Rank in 2009, though the concept behind it was different than Yahoo's description of TrustRank. Instead of looking at the ways that different sites linked to each other, Google's Trust Rank works to have pages ranked according to a measure of the trust associated with entities that have provided labels for the documents.
... If you've ever heard or seen the phrase TrustRank before, it's possible that whoever was writing about it, or referring to it, was discussing a paper titled Combating Web Spam with TrustRank (pdf). While the paper was the joint work of researchers from Stanford University and Yahoo!, many writers have attributed it to Google since its publication date in 2004. The confusion over who came up with the idea of TrustRank wasn't helped by Google trademarking the term TrustRank in 2005. That trademark was abandoned by Google on February 29, 2008, according to the records at the US PTO Tess database. However, a patent called Search result ranking based on trust deals with something called trust rank, filed on May 9, 2006."

Google mentions distrust and changes in trust as indicators. Beyond analysing trust itself, analysing how trust varies over time is coming. Fake reviews, sponsored blogs and the influence of e-commerce trust networks are pointed out.

The paper A Cautious Surfer for PageRank comments on why TrustRank shouldn't be overused:
"However, the goal of a search engine is to find good quality results; spam-free is a necessary but not sufficient condition for high quality. If we use a trust-based algorithm alone to simply replace PageRank for ranking purposes, some good quality pages will be unfairly demoted and replaced, for example, by pages within the trusted seed sets, even though they may be much less authoritative. Considered from another angle, such trust-based algorithms propagate trust through paths originating from the seed set; as a result, some good quality pages may get low value if they are not well connected to those seeds."

3.3.9 Improvements to Google's ranking algorithms
There have been a number of notable algorithm changes which caused considerable changes to results pages, though often the effects were later scaled back slightly.

• NoFollow
Matt Cutts and Jason Shellen created the nofollow specification to help limit the effect of, and incentive for, blog spam.
If a search engine comes across a link tagged as nofollow, it will not treat the link as a vote, i.e. as a positive signal in rankings. Areas where untrusted users can post content are often tagged nofollow; roughly 80% of content management systems (the software that websites run on) implement nofollow. The HTML code of a NoFollow link:

<a href="signin.php" rel="nofollow">sign in</a>

• Increasing use of anchor text
Even the original PageRank algorithm took into account the anchor text of links, so links were used to give both a number that indicated the site's popularity and information about the content of a document, and so its relevance for user queries.
• Google Bombing Prevention, 2nd February 2007
Google Bombing is the process of massively linking to a page with a specific anchor text, to give PageRank but more importantly to indicate that the document is related to the anchor text. For example, in 1999 a number of bloggers grouped together to link to a page with the anchor text "more evil than Satan himself". This resulted in Microsoft being placed number one in searches for "more evil than Satan himself" despite not having the phrase anywhere on its page. Detecting a sudden influx of links with identical anchor text is very easy, and in 2007 Google changed their indexing structure so that Google bombs such as "miserable failure" would "typically return commentary, discussions, and articles" about the tactic itself. Matt Cutts said the Google bombs had not "been a very high priority for us. Over time, we've seen more people assume that they are Google's opinion, or that Google has hand-coded the results for these Google-bombed queries. That's not true, and it seemed like it was worth trying to correct that perception." Some Google bombs still work, particularly those targeting unusual phrases, with varied anchor text, over a period of time, within paragraphs of text.

• Florida, November 2003
Results for highly commercial queries, likely identified from the cost of Adwords, became heavily filtered so that more trusted academic websites and fewer commercially optimised websites ranked. Some of these changes resulted in less relevance (for example, a user searching for "buy bricks" probably didn't want to mainly see websites about the process of creating bricks) and were rolled back.

• Bourbon, June 2005
A penalty was applied to sites with unusually fast or bursty patterns of link growth.

• Jagger, October 2005
A penalty was applied to sites with unusually large amounts of reciprocal links, and new methods for detecting hidden text were introduced.
• Big Daddy, December 2005
According to Matt Cutts, the sites punished were those "where our algorithms had very low trust in the inlinks or the outlinks of that site. Examples that might cause that include excessive reciprocal links, linking to spammy neighborhoods on the web, or link buying/selling."
• Caffeine, October 2010
A faster indexing system that changed results little, but allowed for fresher results and some of the later Panda updates.

• Panda, April 2011
A penalty applied to content deemed low quality, detected primarily from user data. Websites which contained masses of articles, focusing on quantity over quality, were often hit.

4 Detecting Spam and Manipulation
You will often hear that your site has to look "natural" to the search engines. Just what "natural" means is hard to define, but essentially it means the profile of a site whose popularity was never engineered or promoted, and was instead based on people luckily coming across it and deciding to recommend it to their friends with links. What's more, you also need to make your site look popular: creating no links to your site yourself will look natural, but you will have no chance of competing with people who do, unless you have the cash to buy large amounts of advertising. This section briefly covers what search engines consider to be acceptable, when and how they can detect violations, and what the potential penalties are.

4.1 Google Webmaster Guidelines
Google have created a page called Webmaster Guidelines to inform users of what they consider to be acceptable methods of promoting your website. Whilst the lines for crossing general principles such as "Would I do this if search engines didn't exist?" are somewhat vague, they do offer some specific notes of what not to do:

• Avoid hidden text or hidden links.
• Don't use cloaking or sneaky redirects.
• Don't send automated queries to Google.
• Don't load pages with irrelevant keywords.
• Don't create multiple pages, sub domains, or domains with substantially duplicate content.
• Don't create pages with malicious behavior, such as phishing or installing viruses, Trojans, or other badware.
• Avoid "doorway" pages created just for search engines, or other "cookie cutter" approaches such as affiliate programs with little or no original content.
• If your site participates in an affiliate program, make sure that your site adds value. Provide unique and relevant content that gives users a reason to visit your site first.

Most of the methods listed above are naive and easy to detect. Google have been fairly successful in aligning successful manipulation with the creation of genuine content, though without any promotion it is unlikely even the best content will be noticed.

4.2 Penalties
Penalties that Google applies to detected manipulation vary in length of time and effect, from small ranking penalties for certain keywords for a page to site wide bans, depending upon the sophistication of the manipulation methods and the quality of the offending site. If you believe you have had one applied, you can submit a Google Reconsideration Request from Google Webmaster Tools, once you have fixed the offending issues.

4.3 Detecting Manipulation in Content
There is a fascinating paper by Microsoft which details a number of methods for detecting spam pages in search engine indexes based on their content. A simple way is to use Bayesian filters (one is included with Ignite SEO to test your content as the search engines would); for example, seeing the phrase "buy pills" would be a strong indicator of spam. Most of the research is on detecting blatantly computer generated lists of keywords, which is fairly easy. Detecting the quality of human written content is very difficult, so unless you are endlessly repeating your keywords, if you are writing your own content you can be reasonably happy with its quality in search engines' eyes.
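To make the Bayesian filter idea from section 4.3 concrete, here is a minimal naive Bayes sketch. It is not the filter shipped with any product or used by any search engine: the function names `train` and `spam_probability`, the four training snippets and the use of Laplace smoothing are all assumptions chosen to keep the example small.

```python
import math
from collections import Counter


def train(documents):
    """documents: list of (text, label) pairs, label being 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in documents:
        word_counts[label].update(text.lower().split())
        doc_counts[label] += 1
    return word_counts, doc_counts


def spam_probability(text, word_counts, doc_counts):
    """Estimate the probability that text is spam using naive Bayes."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    log_scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        for word in text.lower().split():
            # Laplace smoothing so unseen words do not give a zero probability.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        log_scores[label] = score
    # Convert the two log scores back into a probability.
    top = max(log_scores.values())
    weights = {label: math.exp(score - top) for label, score in log_scores.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])


# Tiny, made-up training set purely for illustration.
training = [
    ("buy pills cheap pills online", "spam"),
    ("buy cheap pills now", "spam"),
    ("driving lessons in springfield ohio", "ham"),
    ("how to pass your driving test", "ham"),
]
word_counts, doc_counts = train(training)
print(spam_probability("buy pills", word_counts, doc_counts))
print(spam_probability("driving test", word_counts, doc_counts))
```

A filter like this only reacts to word frequencies, which is why keyword-stuffed pages are easy to catch and well written human content is not.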
The following graphs are cut from Detecting Spam Web Pages through Content Analysis by Microsoft Research employees.

4.4 Detecting Manipulation in Links
Much research has focused on detecting spam pages through their backlinks or outlinks. Yahoo obtained a patent that uses the rate of link growth to detect manipulation. Essentially a constant rate of new backlinks, perhaps with a small growth over time, is expected for a typical site. A saw-tooth pattern of inlinks is a strong indicator of backlink campaigns that start and stop (though it could also be an indicator of, say, a site that releases new software monthly).

In their paper, Fetterly et al. analyse the indegree (incoming links or backlinks) and outdegree (links on the page) distributions of web pages:
"Most web pages have in and outdegrees that follow a powerlaw distribution. Occasionally, however, search engines encounter substantially more pages with the exact same in or outdegrees than what is predicted by the distribution formula. The authors find that the vast majority of such outliers are spam pages."

As discussed in the Trust Rank section earlier, a large number of links from sites that have already been detected as linking to spam (so called untrustworthy hubs) is a negative indicator. Links from unrelated websites, reciprocal links, links outside of content, links from sites that are known to host paid links and many other signals are likely taken into consideration.

Zhang et al. have identified a method for identifying unusually highly interconnected groups of web pages. More methods of identifying manipulative sites are listed in Link Spam Alliances by Gyöngyi and Garcia-Molina.

4.5 Other Methods
If you think a competitor has been using methods that violate the webmaster guidelines, you can report them to Google. It's good practice to ensure that any site you wish to keep for a long time, and expect to get reasonable amounts of traffic, follows the guidelines. Google will sometimes manually review websites without prompting; Google Quality Raters inspect sites for relevance to results but can also tag web pages as spam. Particular markets are inspected more often than others.
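The degree-distribution observation quoted in section 4.4 can be illustrated with a small sketch. This is not the method used by Fetterly et al. or by any search engine: it simply fits a straight line to the log-log degree counts and flags degrees shared by many more pages than the fit predicts. The function name `degree_outliers`, the threshold `factor` and the synthetic data are all assumptions.

```python
import math
from collections import Counter


def degree_outliers(in_degrees, factor=5.0):
    """Return in-degree values shared by far more pages than a fitted power law predicts.

    Needs at least two distinct positive degrees for the line fit to be defined.
    """
    counts = Counter(d for d in in_degrees if d > 0)
    xs = [math.log(degree) for degree in counts]
    ys = [math.log(count) for count in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Least squares fit of log(count) = intercept + slope * log(degree).
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return sorted(
        degree
        for degree, count in counts.items()
        if count > factor * math.exp(intercept + slope * math.log(degree))
    )


# Synthetic data: roughly 1000 / d^2 pages with in-degree d, plus a
# hypothetical link farm of 200 pages that all have exactly 13 inlinks.
sample = [d for d in range(1, 21) for _ in range(round(1000 / d ** 2))] + [13] * 200
print(degree_outliers(sample))
```

On this synthetic sample only the artificially inflated degree stands out. On real data the fit would be noisier, so the threshold would need tuning and any flagged pages would still need further checks.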
Part II
Practice

5 An Example Campaign
Now we've covered the theory, it's time for a real world example of putting it into practice.

5.1 Company Profile
John runs a driving school in Springfield, Ohio. He has a website he has owned for a couple of years, which ranks around the second page for most searches related to driving schools in Ohio and receives about 20 visitors a day, a third from search engines and two thirds from links on local websites.

A quick search for what he imagines would be his main keyword, "driving school Springfield Ohio", has a company directory site at the top, followed by other directories, companies and people asking on forums for recommendations. This mix of relevant small companies' web sites and small pages on big websites indicates the keyword to be of medium difficulty to rank for.

5.2 Goals
John thinks if he can get his site to rank 3rd instead of around the middle of the second page for his core keywords, he will increase his search traffic by around 1000%, his overall traffic by about 300%, and roughly double his sales. He aims to do this over a period of roughly one month.

5.3 Competitor Research
John finds his main competitors by searching, and gets estimates of their traffic sources from traffic estimation sites. A tool such as Ignite SEO can automatically build SEO reports of competitors, listing their paid and organic keywords, demographics and backlinks. Looking at the HTML source code of some of his competitors displays their targeted keywords in the <meta name="keywords" content="keyword1, keyword2"> tag.

5.4 Keyword Research
John takes his initial guesses of what potential customers might search for, along with keywords from his competitors and his existing traffic, and expands this list using the Google Keyword Tool and Google Insights.
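The meta keywords check from section 5.3 is easy to automate. The sketch below uses Python's standard `html.parser` module to pull the keywords out of a page's HTML; the class name `MetaKeywordsParser`, the helper `extract_keywords` and the sample markup are made up for illustration, and fetching the competitor's page is left out.

```python
from html.parser import HTMLParser


class MetaKeywordsParser(HTMLParser):
    """Collects the comma separated values of any <meta name="keywords"> tag."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords.extend(k.strip() for k in content.split(",") if k.strip())


def extract_keywords(html):
    parser = MetaKeywordsParser()
    parser.feed(html)
    return parser.keywords


# Hypothetical competitor page source.
sample_page = (
    '<html><head><meta name="keywords" '
    'content="driving school, driving lessons Ohio"></head></html>'
)
print(extract_keywords(sample_page))
```

Many sites no longer fill in this tag, so an empty result says nothing about which keywords a competitor is actually targeting.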
5.5 Content Creation
John takes his keywords and creates a small amount of content on his website containing them. He then quickly creates a large amount of content and uses it to create sites hosted on free hosting sites, each one targeting a different keyword. The content generator section of Ignite SEO is perfect for this.

5.6 Website Check
Before investing in off site promotion (i.e. link building), it is worth performing a quick check that the site is search engine friendly. Creating an account in Google Webmaster Tools will let you know if Google has any issues indexing your website, and it is worth ensuring navigation isn't over reliant on JavaScript or Flash.

5.7 Link Building
This is the core process that will actually improve John's rankings. By looking at his competitors' backlinks using Yahoo's linkdomain: command, John replicates their links to his website by visiting each site one by one. Using a tool such as Ignite SEO, he can automatically build links to the hosted sites he quickly created in 5.5, without the risk of a link campaign negatively affecting the rankings of his core website. Other signals of quality such as Facebook and Twitter recommendations are built here.

5.8 Analysis
The success of the campaign is measured with a good tracking system such as Google Analytics, as well as by tracking the new incoming links with Google Webmaster Tools and Yahoo's link: command. The results are compared with the goals, and the whole process is refined and repeated.
About the Author
Christopher Doman is a partner of Ignite Research, a firm specialising in software and consultancies for search engine marketing. He holds a BA in Computer Science from the University of Cambridge.