Transcript of "Web2.0.2012 - lesson 8 - Google world"
Web 2.0: blog, wiki, tag, social network. What they are, how to use them and why they are important. Lesson 8: the Google world
This material is distributed under the Creative Commons "Attribution - NonCommercial - Share Alike - 3.0" license, available at http://creativecommons.org/licenses/by-nc-sa/3.0/ . Some of the slides are the result of a welcome long-distance collaboration with Prof. Roberto Polillo, University of Milano-Bicocca ( http://www.rpolillo.it )
Google: searching
Each search engine has three main components:
- Crawler
- Index (database)
- Interface and query software
The crawler is a software program that surfs the net and brings the pages it finds into the index. It also takes note of the links on each page and uses them to gradually reach new pages, which in turn contain new links.
The index is a huge database where the pages are stored with all their metadata and where the words are "inverted": each word becomes a key that points to the pages containing it.
The interface receives the user's request, tries to interpret it and passes it to the "query processor", which works on the index.
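The three components described above can be sketched in a few lines of Python. This is only a toy illustration, not how any real search engine is built: the "web" is a hypothetical in-memory dictionary (`TOY_WEB`), and the function names `crawl`, `build_index` and `query` are invented for this example.

```python
from collections import defaultdict, deque
import re

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("google search engine", ["b.example", "c.example"]),
    "b.example": ("search the web with a crawler", ["c.example"]),
    "c.example": ("the crawler follows links", ["a.example"]),
    "d.example": ("unreachable page about search", []),
}

def crawl(web, seed):
    """Crawler: start from a seed page and follow links to reach new pages."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = web[url]
        pages[url] = text
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_index(pages):
    """Index: "invert" the pages so that each word points to its pages."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def query(index, request):
    """Query processor: return the pages containing every requested word."""
    words = re.findall(r"[a-z0-9]+", request.lower())
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

pages = crawl(TOY_WEB, "a.example")
index = build_index(pages)
print(query(index, "search crawler"))
```

Note that `d.example` never enters the index: no crawled page links to it, so the crawler cannot reach it.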
Google: searching
Searches are usually very short: 20% use a single word, almost 50% are made of two or three words, and only 5% use more than six words.
Searches are also distributed according to a "long tail" curve: approximately 50% of daily searches are unique. Do you know GoogleWhacking?
About 90% of users use the top four engines: Google, Yahoo!, AOL and Bing (Google > 50%).
Traffic on search engines has two peaks: one in the morning (in the office) and one in the evening (once back home).
The approximate cost of acquiring a customer ranges from $70 for mail advertising, through $50 for online advertising and $20 for the yellow pages, down to $8 (!) for search-related links.
Google: "old" searching
First search engines:
- Archie, 1990 (command-line queries on FTP archives)
- Veronica, 1993 (for Gopher; searched document titles only)
- WebCrawler, 1994, the first to index the full text of the pages
First good search engine: AltaVista (1995), born in the DEC laboratories; thanks to the 64-bit Alpha processor it could launch a thousand crawlers simultaneously. In its first year AltaVista answered 4 billion searches! After being sold to Compaq, AltaVista was transformed into a portal.
Yahoo! was born as "Jerry and David's Guide to the World Wide Web", with a directory approach (see archive.org), and became a great success thanks to its link with Netscape. Yahoo! used its own directory service and, for search, relied on external engines: OpenText, AltaVista, then Inktomi and Google.
2009: search agreement between Yahoo! and Microsoft (Bing)
http://ppcblog.com/search-history/
http://www.searchenginehistory.com/
http://performancing.com/search-engine-history/
Google: born
Brin and Page studied at Stanford, where Page wrote his thesis on "the Web as a graph" with Terry Winograd.
The BackRub project (1995) was a system to find links on the Web, then store and republish them for analysis, in order to see which pages pointed to a given page.
In 1996 BackRub began to index the Web and, by interpreting the link graph, to assess the relative importance of sites. This was the origin of the basic concept behind the PageRank algorithm, which takes into account both the number of links a site receives and the number of links received by each of the sites that link to it.
In 1998 Brin and Page described the features of PageRank in the paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine" and founded Google Inc., based in the classic garage.
Google: the algorithm
The secret of Google's success is its algorithm. It is of course kept secret, even though its most important features can be found on the net.
An SEO expert has developed the "Randfish theorem" ( http://www.seomoz.org/ ), which presents a hypothesis about Google's scoring method:
(Keywords used * 0.3)
+ (Domain relevance * 0.25)
+ (Inbound links * 0.25)
+ (User data * 0.1)
+ (Content quality * 0.1)
+ (Manual push)
- (Automatic & manual penalty)
= Google Score
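The hypothesised score is a plain weighted sum, which a short Python sketch makes concrete. The weights come from the formula on this slide; everything else is an assumption made for illustration: the function name `hypothetical_score`, the idea of rating each factor between 0 and 1, and the sample values. Nothing here reflects Google's actual, undisclosed method.

```python
# Weights taken from the "Randfish theorem" hypothesis on the slide.
WEIGHTS = {
    "keyword_use": 0.30,
    "domain_relevance": 0.25,
    "inbound_links": 0.25,
    "user_data": 0.10,
    "content_quality": 0.10,
}

def hypothetical_score(factors, manual_push=0.0, penalty=0.0):
    """Weighted sum of the five factors, plus manual push, minus penalties.

    `factors` maps each name in WEIGHTS to a rating; using a 0..1 scale
    is an assumption of this sketch, not part of the original hypothesis.
    """
    weighted = sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)
    return weighted + manual_push - penalty

# Made-up ratings for an imaginary page.
page = {
    "keyword_use": 0.8,
    "domain_relevance": 0.6,
    "inbound_links": 0.4,
    "user_data": 0.5,
    "content_quality": 0.9,
}
print(round(hypothetical_score(page), 3))
print(round(hypothetical_score(page, penalty=0.2), 3))
```

With these weights, keywords, domain relevance and inbound links together account for 80% of the score before any manual push or penalty is applied.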
Google: the algorithm
Factors in keyword use:
* Keywords in title tag
* Keywords in header tags
* Keywords in the document text
* Keywords in internal links pointing to the page
* Keywords in domain name and / or URL
Google: the algorithm
Domain relevance:
* Registration history
* Domain "age"
* Importance of links pointing to the domain
* Domain relevance to the subject, based on incoming and outgoing links
* Historical use and patterns of links to the domain
Score of incoming links:
* Link "age"
* Quality of the domains sending the link
* Quality of the pages sending the link
* Link text
* Assessment of quantity / weight of the links (PageRank)
* Relevance of the pages sending the link
Google: the algorithm
User data:
* Historical click-through rate (CTR) on search engine results pages
* Time spent by users on the page
* Number of searches for the URL / domain name
* History of visits to / usage of the URL / domain name that Google can monitor (toolbar, wifi, analytics, etc.)
Content quality:
* Possibly rated by hand for the most popular searches and pages
* Provided by Google's internal evaluators
* Automated algorithms to assess the text (quality, readability, etc.)
Google: the algorithm
The original patent (1998): U.S. Patent No. 6,285,999, METHOD FOR NODE RANKING IN A LINKED DATABASE
"A method assigns importance ranks to nodes in a linked database, such as any database of documents containing citations, the world wide web or any other hypermedia database. The rank assigned to a document is calculated from the ranks of documents citing it. In addition, the rank of a document is calculated from a constant representing the probability that a browser through the database will randomly jump to the document. The method is particularly useful in enhancing the performance of search engine results for hypermedia databases, such as the world wide web, whose documents have a large variation in quality."
Inventor: Page; Lawrence (Stanford, CA)
Assignee: The Board of Trustees of the Leland Stanford Junior University (Stanford, CA)
Google: the algorithm
The simplified formula (see http://en.wikipedia.org/wiki/PageRank )
Where:
* PR[A] is the PageRank value of page A
* PR[B] ... PR[n] are the PageRank values of pages B ... n linking to A
* L[B] ... L[n] are the total numbers of links on pages B ... n
* d (damping factor) is the probability that an imaginary surfer who is randomly clicking on links will go on clicking. It is generally assumed that the damping factor is set around 0.85. It represents the percentage of PageRank passing from one page to another
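With the symbols defined above, the simplified formula is commonly written as PR[A] = (1 - d) / N + d * (PR[B] / L[B] + ... + PR[n] / L[n]), where N is the total number of pages; this normalised form is the one given on the cited Wikipedia page, while the original 1998 paper uses (1 - d) without dividing by N. The sketch below applies that formula repeatedly to a small made-up graph. It assumes every page has at least one outgoing link, and the names `pagerank` and `graph` are invented for this example; it is a teaching sketch, not Google's implementation.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate the simplified PageRank formula on a small link graph.

    `links` maps each page to the list of pages it links to.
    Assumption of this sketch: every page has at least one outgoing link.
    """
    pages = list(links)
    n = len(pages)
    pr = {page: 1.0 / n for page in pages}  # start from a uniform rank
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each page B linking here passes on PR[B] / L[B].
            incoming = sum(
                pr[src] / len(links[src])
                for src in pages
                if page in links[src]
            )
            new_pr[page] = (1 - d) / n + d * incoming
        pr = new_pr
    return pr

# Hypothetical four-page web: D links out but nobody links to D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C collects links from three pages and ends up with the highest rank, while D, which receives no links, keeps only the baseline (1 - d) / N.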
Google: the algorithm
PageRank in detail (from www.google.com/corporate/tech.html ):
"PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results. PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value. We have always taken a pragmatic approach to help improve search quality and create useful products, and our technology uses the collective intelligence of the web to determine a page's importance."
Google: the algorithm
Hypertext-Matching Analysis:
"Our search engine also analyzes page content. However, instead of simply scanning for page-based text (which can be manipulated by site publishers through meta-tags), our technology analyzes the full content of a page and factors in fonts, subdivisions and the precise location of each word. We also analyze the content of neighboring web pages to ensure the results returned are the most relevant to a user's query."
Other links about search engines
http://docs.google.com/View?id=dfvwdtqp_1c8x6bmd8
https://docs.google.com/present/view?id=dfvwdtqp_31dqxqk8g9&ndplr=1
https://docs.google.com/present/view?hl=en&id=dfvwdtqp_35hq27gfhk
http://www.wired.com/magazine/2010/02/ff_google_algorithm/all/1
Google
The Google search engine is now the most important access point to the network: http://gs.statcounter.com/#search_engine-ww-monthly-200807-201104
"To google", meaning to search on Google, is now part of common language. You don't know? Ask Google!
Google (BigG!) now offers many services, and a big part of the Web 2.0 world belongs to it: YouTube, Google Earth / Maps / Calendar / Reader, ...
Google has also entered the browser market with Chrome and the mobile market with Android:
http://en.wikipedia.org/wiki/Usage_share_of_web_browsers
http://blog.nielsen.com/nielsenwire/online_mobile/who-is-winning-the-u-s-smartphone-battle/
http://blog.nielsen.com/nielsenwire/consumer/more-us-consumers-choosing-smartphones-as-apple-closes-the-gap-on-android/
Google Dance
Google periodically updates the engine's algorithms to penalize what it considers spam produced by SEM / SEO (Search Engine Marketing / Optimization) specialists: the position in the index is so important that many websites are written containing nothing but links, in order to make the sites that pay for them "climb" the rankings.
There is no doubt that this continuous fight against commercial spamming also serves to "push" the AdWords advertising service.
Other frauds are possible with AdSense, where site owners earn from clicks on the sponsored links on their sites: sometimes robot programs are used, sometimes offshore workers are paid to click on the links (an estimated 30% of advertising budgets is lost this way).
AdSense has helped to create the long tail of advertising, bringing hundreds of thousands of businesses to advertise and thousands of sites to offer advertising space.
https://www.google.com/adsense/static/en/Publishertools.html
Google
In 2007 the Big Brother Award Italy gave Google the dubious prize of "most invasive technology", motivating the decision this way:
"Brin, one of the founders of Google, likes to tell his employees 'Don't be evil', and this became the company slogan. Admiration for Google, for its services and for its success as a company cannot hide the fact that every search, every e-mail and every post on Google Groups is recorded and analyzed, even if anonymously, and that all of this analysis is aimed at profiling the user. Given its size, Google is potentially the most threatening entity in the world for privacy. With the recent purchase of DoubleClick.com, a giant of online advertising and profiling, which enlarges Google's data mining potential, it seems that the motto could now become 'Don't be evil, buy the Devil'."
http://en.wikipedia.org/wiki/Criticism_of_Google
Google AdWords
AdWords (introduced in 2000) is Google's main advertising service and its main source of revenue (> $28 billion in 2010).
Advertisers specify the search words that make their ads appear on the right of the search engine's results page ("sponsored links").
The advertiser pays when the user clicks on the ad (Pay Per Click), and the price per click is determined by complex rules.
The service is managed online: the software does all the work (negotiation, sale, execution).
http://en.wikipedia.org/wiki/AdWords
http://adwords.google.com
http://investor.google.com/financial/tables.html (a big part of Google's income comes from advertising)
Google AdWords
The top queries cover only 3% of the total -> long tail
http://bnoopy.typepad.com/bnoopy/2005/03/the_long_tail_o.html
See Google AdWords Intro.odp
Google AdSense
With this service, Google "manages" advertising space on the web pages of its customers' sites.
Google places ads in the web pages according to criteria of semantic correlation with the pages of the host site.
The host site is paid "per click".
AdSense has brought hundreds of thousands of small businesses to advertise, and thousands of sites to offer advertising space.
Google currently shares 68% of the revenues generated by AdSense with content network partners.
http://en.wikipedia.org/wiki/AdSense
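The pay-per-click split described above is simple arithmetic, sketched here in Python. The 68% share is the figure quoted on this slide; the number of clicks and the price per click are made-up values, and the function name `split_click_revenue` is invented for this example.

```python
PUBLISHER_SHARE = 0.68  # share paid to the host site, as quoted on the slide

def split_click_revenue(clicks, price_per_click, share=PUBLISHER_SHARE):
    """Split what advertisers paid for clicks between host site and Google."""
    gross = clicks * price_per_click
    publisher = gross * share
    return publisher, gross - publisher

# Hypothetical month: 1000 clicks at $0.50 each.
publisher, google = split_click_revenue(1000, 0.50)
print(f"host site: ${publisher:.2f}, Google: ${google:.2f}")
```

In this example the advertisers pay $500 in total, of which $340 goes to the host site and $160 stays with Google.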
Google AdSense (R. Polillo - October 2010)
Google Operating Systems
Android: open-source, Linux-based platform for developing mobile device applications
Google Chrome OS: platform for netbooks / notebooks
"Google Chrome OS is an open source, lightweight operating system that will initially be targeted at netbooks. Later this year we will open-source its code, and netbooks running Google Chrome OS will be available for consumers in the second half of 2010. (...) Google Chrome OS will run on both x86 as well as ARM chips and we are working with multiple OEMs to bring a number of netbooks to market next year. The software architecture is simple: Google Chrome running within a new windowing system on top of a Linux kernel."
http://getchrome.eu/index.php
Google Operating Systems
Android: see Android.ppt
http://www.android.com/about/
Google Chrome OS: first systems in 2011
http://www.google.com/chromeos/features.html
http://www.chromium.org/chromium-os
http://www.chromium.org/chromium-os/chromiumos-design-docs/software-architecture
Google tricks
Google explains what information is collected when you use the search engine and what is done to protect users' privacy:
http://www.youtube.com/watch?v=iPkvNr2cpqg
http://www.google.com/webmasters/docs/search-engine-optimization-starter-guide.pdf
Blog search: http://blogsearch.google.it/
Search history: http://www.google.com/history
Site comparison: http://www.google.com/insights/search/
Other: http://www.google.com/intl/en/options/ and http://labs.google.com/
Exercise 8
- Find the shortest GoogleWhack (one or two words)
- Try some searches on Google, Bing and Yahoo!: report on the differences between them