Hi guys, I’m about to stutter and swear my way through a presentation about Search Engine Mechanics. Hopefully you’ll learn something we can use to pass the exam. I’m going to focus on Google search for my presentation because it’s the most popular and also the best documented of the major search engines.
So, before I begin to explain how Google works, we need some example content. Every time you post a tweet…
Post a picture on tumblr,
Or post on your personal blog, you’re likely to be indexed by Google. While I was researching for this presentation I posted this blog to see what would happen.
Later that day I searched for “search engine mechanics related to information architecture”. Google had indexed my post, typo and all. I edited the typo only a couple of minutes after I originally posted the blog, giving me an indication of how quickly Google picked up the post.
Next time Google indexes that page, it will update the listing and correct the typo. So how does Google actually go about indexing a page?
Spiders. These are Google’s robots that follow links from page to page around the web, indexing their content to be queried by Google.
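To make that concrete, here’s a toy sketch of that crawl loop in Python. None of this is Google’s actual code; the `link_graph` dictionary is a made-up stand-in for real HTTP fetching and HTML parsing.

```python
from collections import deque

# Stand-in for the real web: page URL -> (text content, outgoing links).
# In a real spider these would come from HTTP requests and an HTML parser.
link_graph = {
    "/home":  ("welcome page", ["/blog", "/about"]),
    "/blog":  ("search engine mechanics", ["/home"]),
    "/about": ("about me", []),
}

def crawl(start):
    """Breadth-first crawl: index each page, then follow its links."""
    index, seen, queue = {}, {start}, deque([start])
    while queue:
        url = queue.popleft()
        text, links = link_graph[url]
        index[url] = text                  # store the page's content
        for link in links:                 # follow every outgoing link
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index
```

Starting from `/home`, the spider ends up indexing all three pages, because each one is reachable by following links.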
So, for example, Google’s spiders have come to index my website. How does a bot view my page?
So here you can see that Google’s bots essentially see my website as a title, a list of links and some content. In addition, they read the page’s hidden metadata. This page gets added to Google’s index: keywords to one index, and complete pages to another. Once that is complete, they will follow all the links on that page and index those pages, and so on.
For every page indexed, the keywords and metadata are stored in one index; this is used for generic search queries, for example “Dinosaurs”. A second index, containing the full content of each page, is also created.
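A very rough sketch of those two indexes, using a couple of made-up pages, might look like this:

```python
# Made-up example pages: URL -> page text.
pages = {
    "example.com/dinos": "dinosaurs roamed the earth",
    "example.com/blog":  "search engine mechanics",
}

keyword_index = {}   # word -> set of URLs containing it (for generic queries)
page_index = {}      # URL -> full page text (for full-content lookups)

for url, text in pages.items():
    page_index[url] = text
    for word in text.split():
        keyword_index.setdefault(word, set()).add(url)
```

The first structure is an inverted index (word to pages), the second simply stores each page whole.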
Google uses an algorithm called PageRank to rate pages stored in its index. In essence, it’s a numeric value that represents how important an individual page is to the web.
So this is a super basic visualisation of how PageRank works. Each link to a page counts as one “vote” for that page. Popular pages provide stronger votes, as these pages are trusted by Google to link to other good websites. So the page in the top left has a PageRank of 5: it’s got 4 incoming links, one from a more trusted site. The page in the bottom right has a ranking of 4: 3 incoming links, one of those being from a trusted site. In reality the algorithm is much more detailed, and it includes influences from a website’s own internal links, and some hidden calculations based on other factors which Google keeps top secret.
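The public part of the idea, links as votes weighted by the linker’s own rank, can be sketched with the classic power-iteration method. The damping factor and the three-page graph here are illustrative, not Google’s real numbers:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. links: page -> list of pages it links to.
    Assumes every page has at least one outgoing link."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}         # start with equal rank
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # each page splits its rank as "votes" among its links
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new[target] += share
        rank = new
    return rank

# A links to B and C, B links to C, C links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this little graph, C ends up with the highest rank: it collects links from both A and B, and A’s vote is itself strengthened by C linking back to it.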
Another thing that will affect a page’s rank is spam. There is a whole team at Google that concentrates on fighting spam results. That team works on a variety of issues, from hidden text to off-topic pages stuffed with spammy keywords, and link farms. The team spots new spam trends and works to adjust PageRank to give penalties for cheating the system.
Another thing Google has to penalize is pirated content. If a content owner files a DMCA request with Google, they can have illegal content removed from Google’s index. Google also heavily penalizes websites hosting pirated content. If you’ve ever searched for a popular film or album, you might have noticed you’ll be provided links to places to buy content and places to find information about content, but never a torrent link or a rapidshare link without specifically searching for it. The results are not removed from the index, but you will always have to use additional keywords to find what you are looking for.
PageRank, and Google’s 11 secret herbs and spices, are added to the index to make Google’s results even better for the end user.
So this is happening all the time, to make sure that Google searches are giving the best and most relevant results to its users.
Okay, time to google some stuff.
When you submit a Google search, what you are actually searching is the index of the web that Google’s spiders have created, not the web itself.
So you’ve hit search: Google queries the index, and retrieves every page that includes your search terms. If your search terms are something like “Is Norway a third world country?” you’ll get thousands of results.
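A toy version of that retrieval step, using a tiny made-up keyword index, could look like this:

```python
# Toy inverted index: word -> set of pages containing it (made-up data).
keyword_index = {
    "norway":  {"wiki/norway", "travel/norway", "econ/rankings"},
    "third":   {"econ/rankings", "math/fractions"},
    "world":   {"econ/rankings", "wiki/norway"},
    "country": {"wiki/norway", "econ/rankings"},
}

def retrieve(query):
    """Return every indexed page that contains all of the query's known terms."""
    terms = [t for t in query.lower().split() if t in keyword_index]
    if not terms:
        return set()
    pages = set(keyword_index[terms[0]])
    for term in terms[1:]:
        pages &= keyword_index[term]       # intersect the posting sets
    return pages
```

Searching “third world country” intersects three posting sets and narrows thousands of potential pages down to the ones matching every term.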
How does Google know which of the results it returns are the ones you want to see? Google’s software runs through an enormous list of questions, checking some specific criteria, for example: Where on the page do the search terms appear? In the title? In the metadata? As body text? How often do the search terms appear on this page? Does this page have a high PageRank? To name a few. Once the results are sorted…
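That checklist can be sketched as a scoring function. The weights here are completely invented for illustration; Google’s real formula is secret:

```python
def score(page, terms):
    """Toy relevance score mirroring the checklist: title, metadata,
    body frequency, then PageRank. Weights are made up."""
    s = 0.0
    for term in terms:
        if term in page["title"].lower():
            s += 3.0                                          # title matches count most
        if term in page["meta"].lower():
            s += 2.0                                          # then metadata
        s += page["body"].lower().split().count(term) * 0.5   # then body frequency
    return s + page["pagerank"]                               # finally, a PageRank boost
```

Results would then be sorted by this score, so a page with “dinosaurs” in its title outranks one that only mentions it in passing.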
Then Google starts applying content filters.
If you are searching globally, like from the Google homepage, Google will often filter in results from News, Videos and Images to give the user relevant links from across all of its services.
If multiple pages from the same website have a high ranking, they may be combined into a clustered result rather than displaying all of the links individually, making room for other results.
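A minimal sketch of that clustering step, grouping ranked results by host (the URLs are made up):

```python
from urllib.parse import urlparse

def cluster(ranked_urls):
    """Collapse results that share a host into one clustered entry,
    keeping the order of each host's first appearance."""
    clusters = {}
    for url in ranked_urls:
        host = urlparse(url).netloc
        clusters.setdefault(host, []).append(url)
    return clusters

results = cluster([
    "https://a.com/page1",
    "https://b.com/post",
    "https://a.com/page2",
])
```

The two `a.com` pages end up in a single cluster, freeing a slot on the results page for another site.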
There are a few kinds of user personalisation. The first is that sites you have visited before, or visit often, are given priority in your results.
This is an example of Google using information from your account to show you websites that your friends recommend.
This year Google also added “personal” results, which essentially give priority in your results to things your friends have posted about the topic you are searching for. Here we see an addict grasping at the dregs of reddit.
If a term has had a recent boost in popularity then Google often places additional weighting on its search position as it is current and relevant to the user.
For example, these are the results found by searching “Kony” in Google from before the keyword boom this week.
And after Invisible Children’s latest video was released. Now the majority of the displayed links are about Kony2012.
The final results of the search. As you can see, it’s got news site links and videos filtered in, and every result is about KONY2012.
So what if, like Will Smith, you are scared of the robot uprising? Or don’t want sections of your content, perhaps your copyrighted images, appearing in Google’s image search?
Robots dot txt. This is a text file, placed in the root directory of your server, which gives instructions or directions to the robots visiting your site. It uses the Robots Exclusion Protocol.
Here I’ve got three examples of ways you could use robots.txt. The first one is not super useful unless you are intending to go completely under Google’s radar: it uses a wildcard to select all bots, and disallows searching everything in your root directory. The second is more useful. It specifically targets the Google image crawler, and stops it from caching or indexing any of your images. This will save you bandwidth, as well as stopping your images from appearing in image search. The third is an example of how you can use Disallow in more targeted ways: disallowing caching of specific folders and files can let you keep certain information internal to your site, and stop it appearing in your search results.
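Written out, those three examples look roughly like this. Each is a separate robots.txt file, and the paths in the third are placeholders:

```
# Example 1: block every crawler from the whole site
User-agent: *
Disallow: /

# Example 2: stop Google's image crawler from indexing any images
User-agent: Googlebot-Image
Disallow: /

# Example 3: keep specific folders and files out of the index
User-agent: *
Disallow: /internal/
Disallow: /drafts/notes.html
```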
Giving directions to robots is all well and good when they play by the rules/don’t come back in time to kill you. However, some robots can simply ignore your file; generally these bots are looking for security holes in your site or looking to harvest emails to sell.
Right, now’s your chance to pick away at my presentation and make me look like an idiot on camera. If I don’t know the answer, it’s probably classified top secret at Google.