• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Risks of search engine dependency and its influence on data quality
 

Risks of search engine dependency and its influence on data quality

on

  • 8,269 views

A master thesis about the risks of using the same search engine all the time when looking for information on the Internet.

A master thesis about the risks of using the same search engine all the time when looking for information on the Internet.

Statistics

Views

Total Views
8,269
Views on SlideShare
8,237
Embed Views
32

Actions

Likes
1
Downloads
88
Comments
0

2 Embeds 32

http://moteurs-de-recherches-alternatifs.blogspot.com 31
http://66.102.9.132 1

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Risks of search engine dependency and its influence on data quality Risks of search engine dependency and its influence on data quality Document Transcript

    • Risks of search engine dependency and its influence on data quality Thesis intermediate report submitted for the European Master in Business Studies (EMBS) by Ronan CHARDONNEAU Institut de Management de l'Université de Savoie d'Annecy (FR) Università degli studi di Trento (IT) Universität Kassel (GER) Universidad de León (SP) Date of submission: 26th January, 2009 Master Thesis
    • Disclaimer Before starting this report I inform any readers that this is an intermediate report of a final master thesis which will be finished in June 2009. The work within is the result of months of work since June 2008. To find what inspired me please have a look at my blog (in French) at: http://moteurs-de-recherches-alternatifs.blogspot.com/ The following intermediate report has not taken yet ( at the date of February the 13th 2009) in consideration the pieces of advice from my teachers. Even if there are not many corrections to do this work is then not perfect yet. I removed the acknowledgements part for the public version of the thesis. Lastly I would like to inform any readers of this work that I will normally be graduated in June 2009 and that from this date I will be actively looking for a job or a phd in the field of international e-marketing. I am very flexible and able to move at any place around the world. Please feel free to contact me though my blog by posting a comment or at ronanchardonneau@gmail.com. CHARDONNEAU Ronan - European Master in Business Studies 2007/2009 2/66
    • Ronan CHARDONNEAU Blog: http://moteurs-de-recherches- ronanchardonneau@gmail.com alternatifs.blogspot.com/ Professional experience Educational background □ 2007/2009 • Search Engine Optimizer(Eng-IT) European Master in Business Studies June- September 2008 – Search engine Quadruple degree in international optimizer in English and Italian management (Trento, Annecy, Kassel Edexon Venice - Italy Léon) • International salesman □ 2006/2007 April- august 2007 – Selling IT services A three-year degree in Import-Export on the local foreign market (IUT Quimper) Easteq Shanghai - China □ 2004-2006 Assistant of the director of operations □ A two-year degree in Business and April- June 2006 – Creation of a search Administration: Finance, Accountancy engine for a database and Computering Northwest Folklife Seattle - USA (IUT Nantes) Track and Field coach trainer □ • 2004 2006- 2008 Scientific A-level Mailman □ (Lycée Clemenceau Nantes) Summer 2004-2006 – (La Poste - France) Professional projects Foreign languages □ Market study Possibilities of entering the Italian market □ Trilingual: French, English, Italian for Easteq China Offshore Autonomous in those three languages □ Market study Research of distributors and competitors • Chinese in Germany for Thyssen Krupp Elevators Able to stand basic conversations □ Setting up a prototype computer application for the students □ German and Spanish registration system of the university of Elementary level (A2) Nantes Others Computer skills □ Extra scholars activities • Search Engine Optimization Author of several blogs dealing with trips 6 months of experience in this field and many articles about Web applications. □ Software Suite □ Centers of interests Competent in GIMP, Microsoft and Open Foreign languages, traveling (more than Office. Notions of Linux/Ubuntu 20 countries visited), computers and • Programming technology Notions ofCHARDONNEAU Ronan - European Master in Business Studies 2007/2009 3/66 PHP and XHTML
    • Contents Foreword.......................................................................................................................6 Chapter 1: Introduction of the topic background..........................................................8 1.1 Relevance of the subject...................................................................................10 1.2 Major terms......................................................................................................11 1.3 Focus, goals and structure of the report...........................................................11 Chapter 2: Concept of data quality.............................................................................13 2.1 Data quality definition......................................................................................14 2.2 The importance of data quality.........................................................................15 Chapter 3: Search engines dependency.......................................................................16 3.1 Search engine market configuration.................................................................17 3.1.1 Search engine categories..........................................................................17 3.1.2 Search engine market...............................................................................19 3.1.3 The search engines in the world...............................................................19 3.1.4 The search engine market shares per country...........................................22 3.1.5 The search engines competition...............................................................23 3.1.6 The semantic web.....................................................................................24 3.2 Search engines dependency aspect...................................................................25 3.2.1 Search engines dependency proves..........................................................25 3.2.2 Search engines dependency aspect...........................................................27 3.3 Search engines dependency problems..............................................................28 3.3.1 Privacy issues...........................................................................................29 3.3.2 Looking for other search engines.............................................................30 3.3.3 Search engine awareness..........................................................................30 3.3.4 Other search engines existence awareness...............................................32 3.3.5 Less confident regarding other search engines.........................................33 3.3.5 Less confident regarding other search engines.........................................33 3.3.6 Even the best cannot provide you everything...........................................34 Chapter 4: Risks of search engines dependency and its influence on data quality.....35 4.1 The information has been found but is poor....................................................36 4.2 What the search engines do not tell you...........................................................36 4.3 The best way to get data quality.......................................................................37 CHARDONNEAU Ronan - European Master in Business Studies 2007/2009 4/66
    • 4.3.1 The sub-search engines.............................................................................37 4.3.2 The size of the Internet.............................................................................38 4.3.3 Single search engine Internet coverage....................................................39 4.3.4 Multiple search engine Internet coverage.................................................42 4.3.5 Others search engine Internet coverage....................................................44 4.3.6 A concrete representation of the World Wide Web...................................46 4.4 The gap between search engine dependency and data quality.........................47 Chapter 5: The Google example.................................................................................50 5.1 Google..............................................................................................................51 5.2 Google's success...............................................................................................51 5.3 Google dependency state..................................................................................52 5.4 Google functions..............................................................................................52 5.5 Google added functionalities............................................................................53 5.6 Google success is his weakness.......................................................................53 5.7 Google's disappearance hypothesis..................................................................54 Conclusion..................................................................................................................55 Declaration..................................................................................................................56 List of literature...........................................................................................................57 Afterword....................................................................................................................61 CHARDONNEAU Ronan - European Master in Business Studies 2007/2009 5/66
    • Table of figures Illustration 1: Total Sites Across All Domains August 1995 - January 2009................9 Illustration 2: A light interface: the homepage of the search engine www.ask.fr.......17 Illustration 3: Home page of the Mail.ru portal..........................................................18 Illustration 4: Top 10: Search websites in the world August 2007.............................19 Illustration 5: The most visited website by country....................................................20 Illustration 6: Search engine market shares in Czech Republic..................................22 Illustration 7: Results page of Cuil.............................................................................24 Illustration 8: Search engines figures for France, Source: XitiMonitor .....................28 Illustration 9: Some sub-search engines of Google....................................................37 Illustration 10: Google's Index coverage....................................................................40 Illustration 11: “Powered by Google” search engines coverage.................................41 Illustration 12: Search engines coverage....................................................................43 Illustration 13: “Who is like it” search engine............................................................44 Illustration 14: Dogpile meta search engine...............................................................45 Illustration 15: A representation of the most visited websites....................................46 Illustration 16: Comparison among Google, Yahoo and MSN...................................48 Illustration 17: Google's domination in Europe..........................................................52 Illustration 18: Google's Advanced search engine......................................................53 Illustration 19: Eye Tracking on Google.....................................................................54 CHARDONNEAU Ronan - European Master in Business Studies 6/66
    • Foreword As most of the students who has a computer one of my first move when I wake up is to switch on the computer and to spend my first twenty minutes of the day on the Internet. From there I have a look at the last news, I check my e-mails and eventually exchange some few words with a couple of friends by using online chat applications. I also check my other email account as well as my blogs and analyze the traffic I got during the last few days, to finish this process I consult my advertisement account to see if I got some revenues. I often use as well search engine to look for information which just came up into my mind during the night. In the paragraph you just read was the description of my morning routine on Internet. There is nothing special except that most of the moves I described above are in fact done on two to three major search engines: Google, Yahoo and Microsoft. I hardly ever use Yahoo or Microsoft for search purpose but Google is for sure the website I visit the most to crawl the web but... is Google the Internet? I got the idea to write about: « Risks of search engine dependency and its influence on data quality » not because I was using all those Google applications everyday and was scared about what will happen if I get in troubles with Google such as privacy issues or if Google just closed. I just write about it because one day I found Google results not accurate enough. And from this observation a lot of questions came to my mind: • Is it me who is not good enough at performing research on the Internet? • Is it because no one wrote about the information I am looking for? • Is it because the information is not on the first pages in Google that I have to browse all the pages in order to find it? • Is it because Google is not good enough? • Is it because the information is hidden in some other documents such as PDF, pictures, videos? • Is it because I have to use another way to crawl the web and if yes how? You see here how a simple observation can raise a lot of questions. I hesitated a lot about writing on this topic, the main problem I got was that I was not convinced that there is a potential risk of being search engine dependent. The CHARDONNEAU Ronan - European Master in Business Studies 7/66
    • reason is that companies such as Google are working hard in order to fit Internet users expectations and the vision we get is that they are doing a wonderful work. The problem is that there could be a difference between perception and real facts and this is exactly what I am eager to discover here. Can we measure how huge is the gap between the information we were looking for and the one of search engines as Google are providing us? Search engines are set up to find information on the Internet, information being the basis of any good decisions making we can then understand how important and interesting it is to write on this topic. I hope you will appreciate this reading as much as I did when making my research. Ronan CHARDONNEAU CHARDONNEAU Ronan - European Master in Business Studies 8/66
    • Chapter 1: Introduction of the topic background CHARDONNEAU Ronan - European Master in Business Studies 9/66
    • I will not surprise you if I say that Internet has been created to share information and to communicate with each others. It is hard to evaluate how big is the Internet, estimations among companies are very different, it varies from 15 to some 30 billion Web pages1. The number of websites is increasing everyday and estimated at 185,167,8972 with a constant augmentation since the creation of the world wide web. Illustration 1: Total Sites Across All Domains August 1995 - January 2009 Habits have changed since the creation of the Internet and websites are used now in diverse manners if it comes to be a standard for companies (recognized as a mark of trust, seriousness and quality) it is also a space for many individuals (blog phenomenon). As an example regarding France, in June 2008 14% of French people above 12 year-old which means 22% of French Internet users are authors of a blog or a website3. The banalization of the Internet and the fact that anyone can create his own website for free increase the feeling we have regarding the Internet: a true jungle of information and even sometimes real “dump” regarding information accuracy. 1 cf. Koch, P. / Koch, S. (2009): How big is the Internet?. 2 Netcraft, (2009): January 2009 Web Server Survey. 3 Crédoc, (2008): La diffusion des technologies de l'information et de la communication dans la société française, p.120. CHARDONNEAU Ronan - European Master in Business Studies 10/66
    • Websites can be accessible through three channels: • Direct access (for example you know the website address by heart, you put it in your favorites or you find a website on a business card and you are typing it in the address bar); • External links (you access to a website which has the link of another website, this is the case in most of websites, catalogs, advertisement); • Through Search Engines (you use a dedicated application by typing in some keywords in order to get suggestions of what you are looking for); As you can see from this list if you use only the first two ways to crawl the web it comes to be too rigid and not wide enough. It has been said as well that the first way is disappearing more and more in profit of search engines4. So one could say that there is currently two main ways to crawl the web, from link to link and by using search engine. This last one being indispensable in order to crawl the web properly. More and more information are put on the Internet which makes it come a true jungle. The only way to crawl those information properly is to use search engines. 1.1 Relevance of the subject Internet is becoming more and more our information provider. quot;In 2002 a study from the U.S. Department of Commerce estimated that 36% of the American public over the age of 3 used the Internet to search for product and service information in 2001. This usage represented a substantial increase in information search behavior from 2000, when 26% of the American public reported using the information.quot;5 Internet to search for product and service Between 2000 and 2008 the USA got an increase of Internet users of 130,9 %6 we can then imagine how important is searching information through Internet 4 cf. Ohayon, O. (2008): Google, moteur de recherche ou moteur de navigation?. 5 Peterson, Robert A. / Merino, Maria C. (2003):Consumer Information Search Behavior and the Internet, p.6. 6 Internet World Stats (2008): Internet Usage and Population in North America. CHARDONNEAU Ronan - European Master in Business Studies 11/66
    • nowadays. The number of Internet users is estimated to 1,463,632,361 (world population 6,676,120,288) with a growth rate from 2000-2008 fixed at 305.5 %7. 1.2 Major terms In this thesis you will hear a lot about the following terms: search engines, search engine dependency and data quality. Search engine is the most flexible technology which has been created in order to crawl the web. A search engine is no more than a web application which is processing data. A search engine does not create data it just process some information it has in his index. I describe search engine dependency as the fact that one does not make the choice between one search engine and another. Search engine dependency is very relevant in most of the countries. Most of the people are using only one single search engine when performing requests on the web. I will speak as well about bad habits when dealing with search engines. For example you can be dependent of using a search engine but using it badly. Data quality is the quality of data. Data are of high quality quot;if they are fit for their intended uses in operations, decision making and planning quot;. Alternatively, the data are deemed of high quality if they correctly represent the real-world construct to which they refer. These two views can often be in disagreement, even about the same set of data used for the same purpose8. 1.3 Focus, goals and structure of the report The focus of this work is to show if there are some risks of being search engine dependent and if there are some, is the gap of information significant between a search engine to another? Goals are numerous: • to put in evidence how bad it is to be dependent of an unique search engine; 7 Internet World Stats (2008): INTERNET USAGE STATISTICS The Internet Big Picture. 8 Kaplan, I. (2008): Bad Data Can Cost You Big Time. CHARDONNEAU Ronan - European Master in Business Studies 12/66
    • • to show how important it is to have reliable information; • to show how big is the gap between a general search engine and a specialized one; • to look for alternatives if tomorrow the main search engines disappear; • to show the weaknesses of general search engines; • to show and discover other search engines and put in evidence their usefulness; • to show that general search engines are not well used; • to discover the future expectations people have regarding search engines and what are the coming technologies in this field; The structure of the report should be done as follow: The first idea is to introduce the concept of data quality. We all go on search engines in order to find information whatever it is or at least to find the answer to one of our question (how much does a Paris-Berlin train ticket costs? What is the weather in New York? What is the last result of our favorite soccer team?). The second point is about Data quality which is very interesting in order to understand why search engines exist. The next point is dealing with the world of search engines and the dependency which is outcome from them. Analyzing the world of search engines is fundamental to understand how Internet is not as rational as we could think (search engines may be not the Internet, search engines may be different from a country to another). Then we will have a look on figures which show quite clearly that people are not using several search engines when making research on the Internet but an unique one that I will define as the dependency concept. Once this definition given I will focus on the heart of this work which are the risks that this dependency provide and its influence on data quality. Google being in Europe the most used search engine I will use it as a major example in my last part. CHARDONNEAU Ronan - European Master in Business Studies 13/66
    • Chapter 2: Concept of data quality CHARDONNEAU Ronan - European Master in Business Studies 14/66
    • When I firstly decided to write about the concept of data quality for search engine I did not mean at the beginning that data has to be perfect. I just meant that when I write a request on any search engine I expect to have as results some answers which fit to my expectations. The last example I have in mind is when on the 31th of December a friend of mine looked for the « average height of Korean people » (request were made on Google) and all the results on the page were dealing about the « average size of the sexual organs of Korean people ». I have high doubts that no information regarding the average height of Korean people does not exist on the web it however has been what Google seemed to tell me. Here as you can see in this example it raises a lot of issues regarding data quality. My first expectation regarding data quality was then to get some results which fit to my request. But looking it deeper what is the value of a supposed correct answer that I could have found written by someone like you and I (not specialized in this topic). In fact here everything depends on how professional the data as been written, this is what I am explaining in this chapter. 2.1 Data quality definition “This part needs additional information and improvements and is then not finished yet.” Data quality is defined by four criteria: • Accuracy: it means that the information has to be true so based on real facts. Here we see the importance of having the source of the document. • Timely: the data given has to be dated. The most striking example could be the one with stock exchange, what is the value of a data regarding a currency change without the date? • Meaningful: here it comes back to what I was explaining in the introduction. Does the result fits with my request? What is the color of a frog? The answer is green, is it useful? Yes but is it complete?; • Complete: an information can be for sure useful but will be far more useful if CHARDONNEAU Ronan - European Master in Business Studies 15/66
    • it is complete. Here is then the minimum vital of data quality. 2.2 The importance of data quality “This part needs additional information and improvements and is then not finished yet.” As I mentioned it there are two levels of data quality: • Poor quality one: for example you and I are looking for information whatever it is in order to get a quick answer or even a mere comment. Here two theories are facing each other: An information should be given even if it is possibly wrong. It of ▪ course can be useless if the information is a hoax but will it be in the case of a warning of a terrorist attack? ▪ An information should be given only if it is 100% accurate. Here it is interesting in order to not create polemical situations for nothing. There are no rational choice to make here, sometimes you need to use the first theory and sometimes the other one. I would describe poor quality data as mass consumption information which mean good enough for “lambda” people but not that useful for businesses and even less for researchers. With the increasing of Internet users more and more mass consumption data are created and of course their number is bypassing the one of quality ones. • High quality one: here we find the data which fits the four criteria I described above. High quality data are useful in order to take complex and rational decisions. The Wikipedia Issue: As you may know Wikipedia is an Open Source encyclopedia where CHARDONNEAU Ronan - European Master in Business Studies 16/66
    • everyone can write what they want on a specific subject. His success made in come the biggest encyclopedia of all time. Such a success is recognized as a mark of trust and seriousness for search engines which is characterized by being displayed most of the time in the 5th first result for any request. With most of the time an appearance on the first result. The main issue even if recently it has been added the possibility to add the author name most of the articles are anonymous. It is also possible to write down an article without mentioning any source of information. This is then explained to the reader before starting the article but everything is then dependending on reader's behaviour. Which quality of information does he or she want to have. CHARDONNEAU Ronan - European Master in Business Studies 17/66
    • Chapter 3: Search engines dependency CHARDONNEAU Ronan - European Master in Business Studies 18/66
    • In order to understand well this part we should have firstly a look at the search engine market configuration. Even if for a lot of people search engine is a synonym of Google it may be not true for some other people. I made a strong effort during this work to make it as global as possible. Europe is good as well as the United States but we will never got the all answers of our issues if we always consider these two parts of the world. 3.1 Search engine market configuration 3.1.1 Search engine categories The world of search engines is not as uniformed as we could have think. I until now identified three kind of search engines: • Standard: the most well known search engines such as www.google.com, www.livesearch.com (Microsoft), Ask (http://www.ask.com/?o=312) they are looking for any kind of information through the Internet and are characterized by a very light interface: Illustration 2: A light interface: the homepage of the search engine www.ask.fr • Portals: may be the most complex search engines to analyze on a figures CHARDONNEAU Ronan - European Master in Business Studies 19/66
    • point of view. Portals are characterized by a lot of information on their home page including the search engine function. It is then difficult to make the difference here between the success of the portal and the success of the search engine on this web page. The best example I found is the one of mail.ru which is the most visited website in Russia far before Google. It seems when looking at the search engine statistics that the research made on Mail.ru are not accounted or that no one is using the search function of Mail.ru. So it is sometimes hard to measure the success of some search engines. The most well known portal is Yahoo. Illustration 3: Home page of the Mail.ru portal • Specialized search engines: they to my point of view belong to a subcategory of the first group. The major problem of standard search engines is that they are too big. Specialized search engines are then no more than a special function which is working as a filter. It is then far more easier to find the information you are looking for on a specific website rather than using standard ones. A good example of it can be the one of Ebay, it is of course far more convenient to go on the Ebay website and use the search engine directly from there rather than going on Google and writing a request such as 'Ebay buying socks'. I mentioned it again but in most of the cases specialized search engines are not a revolution they are just one part of standard search engine and are not a new technology by themselves. CHARDONNEAU Ronan - European Master in Business Studies 20/66
    • 3.1.2 Search engine market The search engine market is segmented by an enormous leader: Google, a follower who is far from him: Yahoo and a large amounts of small search engines. The dominant position of Google may stay for years and years and only a technical revolution could really jeopardized him. Illustration 4: Top 10: Search websites in the world August 2007 As you can see here I identified what we could call « The big four » which represents the four major search engines. Then comes three specialized search engines (7,49% all together) but their success are limited to the size of the website they are browsing. Then came in last positions what I would called the dead champions which once have been great and well known search engines but which are nowadays in decline and will one day probably disappear. 3.1.3 The search engines in the world The more I study the E-world and the more I realize that the web is exactly the reflect of our society but in a non-physical aspect. Internet did not break cultural differences and search engines really show it. As we just saw Google represents 6 research out of 10 in the world but does it mean that each country in the world has a population of 60% Google users? To answer this question let's have a look at the following map: CHARDONNEAU Ronan - European Master in Business Studies 21/66
    • Illustration 5: The most visited website by country Source9 As we can see here the world is not covered entirely by Google. We clearly have some Google countries, Yahoo countries, Mail.ru countries and so on and so fourth. We can however notice some striking information such as almost all the American continent is using Google as well as Europe, Northern Africa and Southern Africa, Australia, India. In one word almost all countries which have strong links with the Anglo-Saxon culture. Then comes what I would qualified as a cultural wall which is starting in Eastern Europe and which is finishing in Russia. Here are the ex-soviet countries. I unfortunately have no concrete proves of what I am saying but I suppose that there is a kind of a « boycott of American technologies » and support of Russian technologies. The recent partnership between Yandex (main search engine in Russia) and the browser Firefox raised those suspicions10. Russia is not the only country in this situation, China also. The recent advertisement broadcast by Baidu (the leader search engine in China) shows clearly 9 Alexa the Web information company (2008). 10 cf. Houste, F. (2009): Russie : Yandex sera le moteur de recherche par défaut de Firefox. cf. Schwartz, B. (2009): Firefox Drops Google For Yandex In Russia, But Big Loser May Be Rambler. CHARDONNEAU Ronan - European Master in Business Studies 22/66
    • the will that Chinese public institutions are ready to protect their territory.11 As you can see Asia is the region where Google is the least present. It is also the continent where are gathering a lot of different search engines. I may not emphasize the diversity of the Caribbean area as well as Center Africa which are areas where Internet is not that well implemented which mean then that the battle to take the lead is not finished yet. For example is it really relevant to say that Yahoo is the leader in Cameroon which is a country with less than 500,000 Internet users? I however will highlight that the Pacific area which is containing all the « Tigers » (Taiwan, Thailand, Singapore...) are all in red: Yahoo. The search engine world is then divided in two parts: • The Google world: which is composed of all the Anglo-Saxon countries as well as countries which have strong links with the United States or Great Britain. It is clearly showed regarding India and Australia. We can at the date of today still identify two countries which are not under Google control: Czech Republic and Iceland but actually by looking at the figures and the forecasts it is just a matter of time12. • The Asian – Pacific world: Asia is composed of a lot of countries and of course a lot of cultures. Among them we can identify four players: ◦ Mail.ru which is dominating all the ex-soviet countries; ◦ Baidu which has a total control over China; ◦ Naver, a 100% South Korean product which is the best example that search engines work by culture; ◦ Yahoo which is leader in all the “Tigers” Asian countries. Yahoo being an American technology such as Google, how is it possible that Yahoo is so successful in the Pacific area and not elsewhere? The reason I found is that Yahoo is a shiny portal and that Asian culture on 11 cf. Einhorn, B. (2007): Baidu Thinks It Can Play in Japan. cf. (2007): Baidu au Japon?. cf. Grallet, G. (2009): Baidu, un autre Google s'éveille. 12 cf. Rafat, A. (2008): Czech Portal Seznam Could Fetch $900 Million; Google, Apax, Warburg and Others in Fray. cf. Mar Hauksson, K. (2007): Global search report 2007, p8. CHARDONNEAU Ronan - European Master in Business Studies 23/66
    • Internet recognize a quality website to the number of animations on it 13. I will then add that Japan is a strong pole of Internet with one of the highest rate of Internet integration in the world per capita14. This is why I think Yahoo is so popular in this region. 3.1.4 The search engine market shares per country One of the most complete work I found on this topic after mine is the « Global Search Report 2007 »15. What stroke me the most in all the countries studied in this report as well as in all the report and research I made until now are the search engines market configuration which looks like very often to this: Illustration 6: Search engine market shares in Czech Republic It is very rare to find a country where there is a close competition among search engines. Even if in the High Technology world things change from a day to another you have often the following configuration where the first search engine is leading the game by more than 30 points on its followers. This is typical from the search engines market or you are adopted by a population or you are not. This trend seems quite relevant in the software industry, people seem to look for a standard used by all. This is the case for the Operating System industry, the browser industry, the e-learning industry. The 13 cf. Tobin, R. / Hotchkiss, G. / Lee, P. (2008): Chinese Search Engine Engagement, p28. 14 Internet World Stats. (2008): Internet Usage in Asia. 15 cf. Wilsdon, N. (2007): Global Search Report 2007. CHARDONNEAU Ronan - European Master in Business Studies 24/66
    • explanation I found for the success of search engine within a population is the word to mouth, this is how Google has been so successful isn 't it? How never heard sentences such as « you just have to Google it » Google is even nowadays in dictionaries as a verb16. As a conclusion I would say that in the world of search engines you are first or you are nothing. I have to mention as well that the market has a lot of small local search engines which are if original enough bought by the big ones or if not will disappear quickly (some example are coming in the news every month). The only key of the success on the short term seem to be advertisement but on the long run you need the technology behind in order to compete. 3.1.5 The search engines competition Google has been created in 1998 and at that time search engines were already in place, it did not scared Google and one after the other Google bypassed all of them. In fact among the big four Yahoo is the oldest (1994) and Microsoft the youngest (2003). Even if the battle seems to be finished it will take a lot of time to Google to be the number one in all countries (everything being linked to culture rather than rationality) which in fact is giving hope to its followers. At the time I am writing this thesis discussions are still on the way between Yahoo and Microsoft in order for Microsoft to buy Yahoo search technologies. We can understand how strategic a such acquisition could be. Yahoo having the research knowledge and Microsoft the funds as well as the software ownership. Regarding Baidu we cannot clearly see how they could compete against Google outside of China. As I said previously specialized search engines are limited to the website they are linked to. We could then think about new comers who starting from nothing could beat famous search engines in a small period of time, it could have been the success of some products such as Cuil launched in summer 2008 which received a lot of advertisement through the news17. But search engines is a very ungrateful world 16 cf. Merriam Webster. (2001): Google - Definition from the Merriam-Webster Online Dictionary. 17 cf. Arrington, M. (2008): Cuil On BusinessWeek's Most Successful of 2008 List. Huh?. CHARDONNEAU Ronan - European Master in Business Studies 25/66
    • where visitors are giving no more than one chance: the product works or it does not. Illustration 7: Results page of Cuil This is a point that I discovered very quickly and that you can test by yourself. People want the information as soon as they can. They are ready to test the product but in a certain amount of tries. When you move from Google to another search engine you are often intransigent. At the first result which does not fit your expectations you will go back to Google. But is the search engine wrong or is it because it is responding differently that on what you were used to? In order to conclude this part I would say that with the search engine history we have and the search engine market configuration, I cannot see how Google could lose its position. Until now only one company succeeds to make a such gap in the world of search engine and it is Google itself and it was in a period where everything had to be created on Internet. So I would say that on this field I don't see how Google can be beaten and even worried. A new technology regarding research is however more and more recurrent in this field and is called semantic research. 3.1.6 The semantic web “This part needs additional information and improvements and is then not finished yet.” CHARDONNEAU Ronan - European Master in Business Studies 26/66
    • The semantic web is another way of crawling the net. We all know how to make a search on the Internet isn't it? We just type in some keywords and press the return key in order to get the answer. In this configuration you have to feed the search engine with the request. With semantic Web the concept is a bit different and based on suggesting you the request instead of typing it entirely. Each time that you are starting to type your request a list of suggestions are coming to you. We are recently seeing more and more this technology on the biggest search engines. The purpose is in fact to guide you as best as they can in order to put you on the right track and trying to avoid you to reach the labyrinth of the web. This technology fits one of the main drawback of search engines and that I call « search engine technology awareness » which consists in how to write good requests for search engines. The main drawback of the semantic web is that this is a very new technology which then have a lot to do before reaching his maturity point. Here we are speaking about a maximum length of a decade. We can also complain about the rigidity of the system but it is true that with length and experience this issue could be fixed. It said that Ask is one of the search engine which based a lot of R&D on this new technology but according to me and without being a technician I think that Google can have better results because of its huge database of requests. Future will tell us what is going to happen. 3.2 Search engines dependency aspect As I mentioned it in the introduction I define search engine dependency as the fact that people are swearing only by one search engine when looking for information on the Internet. 3.2.1 Search engines dependency proves “This part needs additional information and improvements and is then not finished yet.” CHARDONNEAU Ronan - European Master in Business Studies 27/66
    • This is not the studies which are lacking on this topic. When looking for information regarding information literacy on the Internet you arrive on different sources of studies and this regarding all the countries of the world. Information literacy is a relevant topic and an issue. I focused on some very recent and francophone research that I found on the Internet regarding Canadian students18, French19 and Belgium students20. I also found information regarding Germany on this topic. Many sources are as well saying that such research have been made in mostly all Europe, China (Hong Kong)21 and the United States22. All the studies I found until now (all done on students panels so literate people) are all saying the same thing: search engine are the first source of information when looking on the Internet and all students seem to have receive not enough training on how to look for information on the Internet. The best study I found on this topic is one made on all the registered PhD students (2,218 with an answer rate of 23,4%) last year (2008) on a whole region of France (not a high technological developed country but far to be the least on a worldwide scale)23. As we can imagine PhD students have a high requirements regarding quality of information. The study shows that 67,5% of the respondents have never received a training regarding how to look for information during their whole stay at the university which could explain the fact that people are running toward search engines directly. Search engines are used in 96% of the cases when performing research (which emphasize the necessity of how to well use those technologies). 94% of them do not use blogs which I take as a good thing (even if the survey is saying the opposite, blogs being written by professionals as well). The most used search engines are Google (85%) and Google Scholar 37% 18 cf. Crepuq. (2003): Etude sur les connaissances en recherche documentaire des étudiants entrant au 1er cycle dans les universités québécoises. 19 cf. Université de Lyon. (2007): De la documentation au plagiat. 20 cf. EduDoc. (2008): Enquête sur les compétences documentaires et informationnelles des étudiants qui accèdent à l'enseignement supérieur en Communauté française de Belgique. 21 cf. Leung, Hon-wing 梁漢榮. (2004): A study of computer science students' conceptions of information literacy and their experiences in information search process and use. 22 cf. Enquiro. (2004): Search Engine Usage in North America. 23 cf. URFIST de Rennes. (2008): ENQUÊTE SUR LES BESOINS DE FORMATION DES DOCTORANTS Á LA MAÎTRISE DE L’INFORMATION SCIENTIFIQUE dans les Ecoles doctorales de Bretagne. CHARDONNEAU Ronan - European Master in Business Studies 28/66
    • (which is a sub search engine of Google). 60% of them do not know what is a meta search engine and only 5% of them use them. 46 % do not know the search engine of their field and only 20% do use them. Those figures are very interesting because they show clearly how people are not adapted to the technology they are using. PhD students should be some of the most search engines awarded people and it seems that for France they are not. They are strictly dependent of a single search engine which is here Google. They know very few of his sub search engine and as written above they do not know how to use the technology. They are also not aware massively about other search engines. 3.2.2 Search engines dependency aspect Search engines dependency can however comes from different ways: • Search engine satisfaction: you are using a specific search engine which give you entire satisfaction, so why should you change?; • Search engine patriotism: you are using this search engine in order to support your local technologies; • Search engine convenience: the search engine is providing you all kind of services which made it very convenient to use or even made the other ones not convenient to use; Of course the trend for all search engine is to go for convenience because it gives to customer everything they need. The main drawback is that for the search engine companies you have then to dedicate less people to your core activity and then there is a risk that your search activity will pay the price. CHARDONNEAU Ronan - European Master in Business Studies 29/66
    • So being search engine dependent means using massively a search engine for one of those reasons and ignoring all the other ones. Search engines dependency reach very high rate in Europe: Illustration 8: Search engines figures for France, Source: XitiMonitor Most of the European countries have like France a strong addiction to Google with more than 90%. What does it concretely mean? Almost all European when making search on the Internet are fed by using the same way to process information. 3.3 Search engines dependency problems At the first sight when using a search engine we are not thinking about all the issues which are coming out from them. We make our research and we get results from this and then we try the results one after the other until finding the one which fits the best our expectations. The first main problem is that when addicted to a specific search engine which normally gave you satisfaction the day when the result will not be the one you want you may think about different possibilities: • The information is not displayed so the information you are looking for does not exist yet; • The request was not good enough so let's try with other keywords; The main issue to highlight is that people are so confident with some search engines that they will not normally look for alternatives or even consider that their CHARDONNEAU Ronan - European Master in Business Studies 30/66
    • favorite search engine can be wrong. People are so confident with their search engine that they are not thinking that the search engine can be wrong. 3.3.1 Privacy issues I am a bit divided on this topic because I consider it as the mass public « scarecrow » which is only good to animate polemical debates for nothing. It finds its explanations in the way that search engines are collecting information. When analyzing search engines we have to consider that it is a free product for all of us (in fact search engines get paid by displaying advertisement on each web page). Each time you are making a research on the Internet the search engine you are using registers the IP number of your computer and of course the research you just made. All these data are of course supposed to be confidential but some are used in order to make some statistics such as how many Internet users from a specific country have visited this website. It can be used for other purposes such as the rank of the most used research and others data such as those. Of course the more information you give and the most they collect so if you open an email account on a search engine for example they will collect your name, address. Until now few are the cases where we got the proof that information collected by search engines have been given to third parties. The most famous one is the one of Yahoo in China which filtered some emails and gives the names of some Chinese journalists who were denouncing things about the Chinese government24. To make it clear until now no mass exploitation of data have been observed and the recent news given by major search engines (Microsoft and Google) are saying that the trend is to eliminate those data as much as possible in the fear of losing confidentiality25. We however have to consider an additional element the more search engine know about what we are looking for and the most they can fit our expectations, so I personally do not think that reducing the collection of data is in 24 cf. Kahn, J. (2005): Yahoo helped Chinese to prosecute journalist. 25 cf. Boucq, I. (2009): Yahoo et vos données persos.... CHARDONNEAU Ronan - European Master in Business Studies 31/66
    • people interest and I will qualify the privacy issue as a global scarecrow in order to bug the major search engines and putting on the first row some alternative ones which on the long run may will not fix those issues. 3.3.2 Looking for other search engines A recurrent question often asked is « Why people are not looking for other search engines? ». Why do people knowledge regarding search engines is so low? If you ask to someone how many search engines he may know he will for sure answer you « Google » maybe « Yahoo » he may add « Microsoft » without telling you its real name which is « Live Search » and it will for sure stops right here. Many others search engines are however present on the market. We should then conclude that people are satisfied by the current search engines. However studies are showing that regarding Google people when getting their results are only considering the first three results and not the others giving far more importance to the first one.26 By just this observation we could then say that there are things to improve in Google's interface. I guess again that people are looking for something new even if it is not their priority. This is why I would like to raise another issue which is search engine awareness. 3.3.3 Search engine awareness Search engine awareness is for me the key issue of this all work and reveal two parts: • Poor search engine awareness regarding how to use a specific search engine; • Poor search engine awareness regarding the existence of other search engines; Both parts are fundamental. The first one deals with what we call search tools. It consists of a combination of keys in order to fit a specific request. If you are on Google typing the following request: search engine dependency is different from: 26 Eye tools. (2009): Eyetools Eyetracking Research. CHARDONNEAU Ronan - European Master in Business Studies 32/66
    • « search engine dependency » In the first case you will look for pages which are containing the words: search engine dependency as well as all the others combinations such as pages with search dependency, engine dependency, search, engine, dependency. Whereas in the other cases you are restricting the pages which include only those three words and in the same order. It exists dozens of those tools per search engine, the most famous ones are called « boolean operators ». Some websites provide the list of those tools27. We have also to consider that all the search engines are not using the same conventions so some tools under search engine A will not work on search engine B. If you are on Yahoo the following request: • title:search engine will give you all the web pages which have for titles the following keywords. This tool is unfortunately not working on Google. This example shows well how search engines are in fact complementary in order to make precise research. Is Google really better than Yahoo? To get a piece of the answer you can make this empirical experience I made a couple of months ago. I did the same request on three different search engines (Google, Yahoo and MS Live Search) and compared the first five results and gave my opinion on each of those results. If the result fit my expectations I gave one point if it did not zero. Google got a four out of five, Yahoo a two out of five and MS Live Search a zero out of five. However all results from a search engine to another was different. So Google won for sure at the time I performed the award of pertinence, however the information I was looking may have been on the two results Yahoo gave me. So is Google better than Yahoo for this example the answer is yes. Does it mean that I should not consider Yahoo? Absolutely not. 27 cf. Koch, P. / Koch, S. (1999-2006): A short and easy search engine tutorial. CHARDONNEAU Ronan - European Master in Business Studies 33/66
    • 3.3.4 Other search engines existence awareness “This part needs additional information and improvements and is then not finished yet.” Here is another issue which is actually not hitting only the competitors of major search engines but also themselves. In order to be recognized in the world of search engines you need very strong guarantees. This year two of them succeeds their advertisement campaign: « Cuil » by claiming that they had a database which is 6 times more than the one of Google. It said also that their success is coming from the fact that the company has been built by a former Google employee but actually Cuil disappears very soon as the change of their CEO. The main reproach which has been done is that the results given were not numerous enough which of course is seen from a very bad eye when you are claiming that it is your main strength. In the world of search engines you cannot be a clown. People give you one time the floor and will not give it to you back if you do not perform as they wish. The other example I can give is the one of Exalead, a French search engine which is supported by the European Union, the purpose behind is to set up an European technology in order to face the American ones. So as you can imagine there is a big support and a huge project behind which make me coming to the bullet point above if you are a search engine which wants just to be correctly advertised you definitely need very strong supports. I today January the 7th 2009 got twice in a day the same observation about researching information on the Internet. I was looking for two different songs but I did not know the title as well as the singer, all I had was some couple of words. I made then a research on Google with the words I knew from the song and added the word « lyrics » as well in order to CHARDONNEAU Ronan - European Master in Business Studies 34/66
    • specify that what I was looking for was a song. I as well added the quotes « » in order to say to Google that I wanted the words in this strict order in order for it to identify better the song I was looking for among the jungle of the web. I wrote then: lyrics « freaky people » the answers I got where dealing with a singer called « Michael Franti and Spearhead » and this on three pages in a row. I tried to set a different request and the result was the same. Then I gave a try to GrabAll which is a double search engine displaying Google and Yahoo on the same screen but on two different columns. I made then the same request and got very surprise to see the name of the Fat Boy Slim band on the first page of Yahoo. This result was on the fourth page of Google instead. Here I have two points to make. Firstly the utility of using a second search engine in order to control the result and secondly the fact that Google was taking in account that I was 100% of what I was looking for. Google was in fact giving me those results because the two words « freaky people » are a part of a title of a song of « Michael Franti and Spearhead » but did I asked for that? No I just asked for lyrics which have the words « freaky people ». The second example I am going to give you is as well quite relevant of how search engine dependency can be dangerous. I was looking this time for lyrics with « lyrics wanna be your doll » what I did not know is that I had it wrong and it was not « lyrics wanna be your doll » but « wanna be your dog ». The fact is that both songs containing those lyrics exist and that Google was displaying me the songs containing the words I just fill in whereas actually Yahoo was presenting me diverse results including the one I wanted. I am not writing here that Yahoo is better than Google, I am just saying that there is a high interest to have a look around in the world of search engines, I could have a look around all night on Google looking for my song that I will have never found because I had it wrong since the beginning. So looking for diversity should be the first reflex when you cannot find what you are looking for in short attempts. 3.3.5 Less confident regarding other search engines “This part needs additional information and improvements and is then not finished yet.” CHARDONNEAU Ronan - European Master in Business Studies 35/66
    • This part is in fact in close link with the example I just gave above. Each search engine has his own methodology when displaying the result. When you get used to get the results in a specific way then when it comes to be different you may think that the results are totally absurd. The problem is that the more you get addicted the more you will have the feeling that other search engines are bad and then will not give them a try. This is why I describe search engine dependency as the fact to be less and less confident regarding other search engines. 3.3.6 Even the best cannot provide you everything “This part needs additional information and improvements and is then not finished yet.” Some functionalities are not included in the biggest search engines which make not you think that they could even exist such as search engines for jobs or people. How to get a visual representation of what I am looking for? And the main problem of the big search engines is that they are too big so even if they have what you are looking for it is not sure that you can even see it. CHARDONNEAU Ronan - European Master in Business Studies 36/66
    • Chapter 4: Risks of search engines dependency and its influence on data quality CHARDONNEAU Ronan - European Master in Business Studies 37/66
    • This part is dealing with the measurement and the proves of what I am writing about. Here I expect to show how search engine dependency is affecting our everyday life and how we could overcome this situation. 4.1 The information has been found but is poor “This part needs additional information and improvements and is then not finished yet.” Let's imagine that you and I just performed a request under a specific search engine and that the answers given do not satisfy you entirely. You found the information but it is not developed enough, not signed, too old... 4.2 What the search engines do not tell you One of my former English teacher taught me one day that there is different ways of not giving information, one is to lie and one is to not say the information. I don't think search engines are lying and cheating even if in the case of Baidu the Chinese search engine there are high suspicions on it28. Some are saying that in order to be well ranked you need to pay Baidu for it and it has been said as well that the Chinese government as a big role to play when displaying the results. The recent Milk scandal in China (Baidu accepted to high ranked unlicensed companies which were providing fake milk in exchange of money) showed how dangerous can be a poor data quality search29. I however have the proof that search engines are not telling you everything when displaying results and that this rule is general for all the major search engines. In Google case, search engines are censored according to the country in which you are making your research from30 and it touches all the country around the world even countries such as France and Germany. Most of the time those censorship acts are for your welfare or to protect national security interest. 28 cf. China Tech News.com. (2007): CCTV: Baidu Search Engine Fraud Exposed?. 29 cf. China Daily. (2008): Baidu cuts revenue forecast on ad scandal. 30 cf. ROSEN, J. (2008): Google’s Gatekeepers. CHARDONNEAU Ronan - European Master in Business Studies 38/66
    • 4.3 The best way to get data quality Here is maybe the most interesting part of my work which consists in giving the solution to the main issue I highlighted since the beginning. Until now I just raised questions regarding the risks of dependency and not showed what you should do in order to explore the web properly. 4.3.1 The sub-search engines As I said previously Google is too big, the bigger it is the more precise your request have to be. Few people knowing the existence of Boolean operators it then comes more and more difficult to get the right information. This is why in fact Google put at your disposal some sub-search engines in order to make your research easier, for example: Google books, Google videos, Google images. We all agree that looking for pictures could be done on the research bar of Google, but it is far more convenient to make it directly from « Google images » because it is displayed better. The main problem is what I described previously: « search engine awareness within a search engine ». Am I aware that Google has the following services? I could sum up it all in a scheme: Illustration 9: Some sub-search engines of Google For a Google search engine dependent everything starts from Google home page. He is then aware here of those specialized search engines in order to make CHARDONNEAU Ronan - European Master in Business Studies 39/66
    • more precise research on the following information: images, maps, news and products. It is however under his own initiative to discover what is going on under those search engines and to discover them. I do not have a proof of what I am saying but I may presume that in order to buy a product more people are going on Google, type eBay or Amazon and go and buy on those websites rather than going on Google products. Google products is however including all those websites in his database so it should more convenient to have a look on Google products first in order to get the best price rather than going individually on eBay and Amazon. We can then symbolize the Internet users awareness of Google search engines according to this scheme. The more we get into deep of those levels and the least the Internet user is aware of it. Level 2 is still available on the first page but with two clicks. Level 3 is no more available on the home page but can be accessible in three clicks. Level 4 is even not accessible from Google main website and that you have too be aware of it in order to access it. Level 5 is hypothetical and constitute the unknown Google projects, often developed under another name. All those Google sub levels have been created in order to make easier research on a specific request. When looking for an image it is far more convenient to pass through Google images rather than the general Google. 4.3.2 The size of the Internet How big is the Internet? Is Google indexing all websites? It is really hard to answer to those questions but however possible to set up some estimations according to some information31. Those sources are saying that in 2005 the size of the Internet is estimated to 5 million terabytes and Google's index to 170 terabytes which would mean that Google is processing only 0,000034%. However the Internet is also containing what we call the invisible web 31 cf. Plesu, A. (2005): How Big Is the Internet?. CHARDONNEAU Ronan - European Master in Business Studies 40/66
    • composed of websites that owners do not want its content indexed as well as websites which are protected by a password. In 2004 this invisible web was estimated to be 500 times bigger than the visible web. It has been said as well that Google is indexing invisible web only recently32. A clever calculation will then give us: Internet size = Visible Web + Invisible Web; Internet size = 501 * Visible Web; Visible Web = Internet size / 501; Visible Web = 5 000 000 / 501; Visible Web = 9980 terabytes; Google Index = 170 / Visible Web; Google Index = 1,7% This estimation is of course only an estimation and could be full of errors. I however find it more useful than no information at all. I would also emphasize the fact that Google has no interest in indexing bad quality websites and that technologies have evolved in the last few years and that this rate should be of course far higher than those 1,7%. Whatever is the final result my point is the following: Google is not the Internet and is not processing all the web... but does all the web need to be indexed? 4.3.3 Single search engine Internet coverage “This part needs additional information and improvements and is then not finished yet.” Here is a more optimistic representation of Google Internet coverage I made in order to show that Google is not the Internet and that even within Google's sphere Internet users cannot access to all the information: 32 cf. Chitu, A. (2008): Google Starts to Index the Invisible Web. CHARDONNEAU Ronan - European Master in Business Studies 41/66
    • Illustration 10: Google's Index coverage Google dependent people are not only using Google when making research but as well Google partners all symbolized by the sign: Powered by Google means according to an IT company called Alacra33: Alacra uses Google Search Appliances to create the Alacra Compliance Web. The use of Google Search Appliances combines the power of Google search technology, including the ability to find the highest quality and most relevant documents, with Alacra's domain expertise in selecting those web sites and pages which are relevant for AML compliance. Which means that here you have a partnership working more or less in the same way that using a Google specialized search engine. 33 Alacra. (2008?): What does quot;Powered By Googlequot; mean?. CHARDONNEAU Ronan - European Master in Business Studies 42/66
    • There are thousands and thousands search engines all specialized in a specific field on the Internet which are using Google technology to search sites such as Tourism, tutorials... By using all those specialized search engines you are crawling better the Google's coverage space. Illustration 11: “Powered by Google” search engines coverage As you can see here on this configuration by using specialized search engines powered by Google you will always browse Google's cyberspace and when we know that Google is not the Internet we can then ask ourselves how we can get the best out of Internet. CHARDONNEAU Ronan - European Master in Business Studies 43/66
    • On January the 11th I went on both websites http://www.aol.fr/ and http://www.google.fr/ and type the following request 'les moteurs de recherches' both results on the first page were identical. The only difference is that Google gave me 3,210,000 results and AOL 290,000 (9% of Google results) so as I developed above AOL is looking at the same place as Google but is applying more filters. Is it really useful considering that AOL is a general search engine? The answer maybe be given in their home page at http://search.aol.com/aol/webhome « The AOL Search engine delivers great search results, enhanced by Google, plus relevant multimedia results delivered on a single page-so you can search less and discover more. »34 If you go on http://www.google.com/coop/cse/ you will have a good example of the search engine powered by Google. You can even create your own one. All is done by using Google technology and you are just applying your own filter by listing the websites where you want Google to look inside. 4.3.4 Multiple search engine Internet coverage “This part needs additional information and improvements and is then not finished yet.” Let's go now deeper in our analysis by taking in account search engine which are not using Google technologies. All developed their own technology and are sometimes better than Google in some fields worst in the other. We will then have something like that: 34 cf. Boswell, W. (2004?): AOL Search: How to Search with AOL Search. CHARDONNEAU Ronan - European Master in Business Studies 44/66
    • Illustration 12: Search engines coverage So here is my point the most the search engines are different and the most it gives you the possibility to discover the web. Here I just put the main actors but we have also to consider that it exists a lot of small search engines which developed their own index and have then their own way to process data. Of course one could say that there is no interest for an European person to process information on a Chinese or even Russian search engine because Chinese search engine should of course be better in looking for Chinese information rather than an European one. It is definitely true if we are speaking about contents such as texts, but when it is dealing with pictures or video the reality should be totally different. On the other hand the day where this European person is looking for Chinese CHARDONNEAU Ronan - European Master in Business Studies 45/66
    • information he should then be aware that he should use the Chinese search engine rather than the Chinese version of the European one. And as we can imagine all the other players: Yahoo, Baidu, Mail.ru have their own Boolean operators and their own sub search engines. So it comes back to the idea of search engine awareness. The more you know about them and the most you are increasing your knowledge about how to get the best out of Internet. 4.3.5 Others search engine Internet coverage “This part needs additional information and improvements and is then not finished yet.” You have a last category of search engine which are the one coming from the semantic web concept which are not based on keywords requests. I found one of those which is called « Who is like it » and gives as results websites similar to the one you just gave him. Illustration 13: “Who is like it” search engine Of course solutions have been found in order to create the perfect search engine including all those technologies. But as you can imagine it is not an easy thing if firstly search engines did not take the same standards as Boolean operators. Secondly the more search engine you combine the more results you will get and it then very complicated to decide which one is more pertinent than the other. CHARDONNEAU Ronan - European Master in Business Studies 46/66
    • A good example of it is the meta search engine called Dogpile which is quite popular in the United States it combines the results of 4 big search engines which are Google, Yahoo, MSN live search and Ask. The main problem is already showed on the advanced search web page of Dogpile, few are the Boolean operators limited to 4. Illustration 14: Dogpile meta search engine The second issue which is relevant is how to decide which results from those search engines are the most relevant. I showed you previously that when making a comparison between Google and Yahoo I found what I wanted on Yahoo on the 8th position. So it is hard to decide which result is the most relevant and to this game you are the only judge of the situation. So the hypothetical search engine should have the following characteristics: including all the technologies ever created regarding search engines (Google's knowledge+Yahoo knowledge+MSN knowledge+...) in order to define the most powerful algorithm ever. All those companies have of course different objectives than being the most rational search engine all are looking to be the number one. This day will may come but it will not be for sure for tomorrow so waiting for it we should understand how to use one by one all those technologies. The solution being then to test the search engines one by one but before testing all of them you have to know that they exist and how to CHARDONNEAU Ronan - European Master in Business Studies 47/66
    • use them but you should at least to know that they exist. 4.3.6 A concrete representation of the World Wide Web In order to make this work easier a Japanese company set up regularly a web map of the most famous websites in the world by referencing them by categories this map could of course be improved but should be a good start: Illustration 15: A representation of the most visited websites This map is available at the following address: http://informationarchitects.jp/ start/ with all the links included towards the websites it is composed of. It is very interesting in order to break the search engine dependency phenomenon. On this map are located all the most famous websites for 2008 you can then see all the most CHARDONNEAU Ronan - European Master in Business Studies 48/66
    • influential websites and where to seek for information. For example if I want to look for a video I may follow the gray line and test all the websites which are on it to find the video I am looking for. In order to conclude this part which actually should not be the core of the all thesis the Web is big and in order to discover it you need to know how to. We could compare it to the real world when living in country A you receive as feedback from country B by different ways (people who are moving from country B to A, the news, the books and documentations you have about country B) but you will never be physically in country B and for this there are some information that you could never get. Of course you can get nearer to those information by crawling more and more the web with your country A search engine (it will be like documenting yourself more and more about country B) but it will never be like being and living in country B. 4.4 The gap between search engine dependency and data quality “This part needs additional information and improvements and is then not finished yet.” In this part I will explain how to put in evidence the gap of information between being search engine dependent of only one search engine and using the most rational tool to search for information on the Internet. It may not be easy to prove it concretely so I may need to prove it by making empirical studies. I succeed to put in evidence so far that: Internet users are search engines addicted; • Internet users have few knowledge about search engines awareness; • I have now to prove that this is bad; • My first point will be to show that search engines are using different technologies provide different results. Here is a comparison I made for three search engines through CHARDONNEAU Ronan - European Master in Business Studies 49/66
    • http://www.thumbshots.org/Products/Thumbshots/Ranking.aspx which shows how many similarities search engines have among them. Here I said if I type in “Universität Kassel” what are the results that those search engines have in common (in blue). And I moreover added the option to highlight me the website of the university of Kassel (in red) which for me is a sign of relevancy of my request. Illustration 16: Comparison among Google, Yahoo and MSN Here as we can see: Google has 7 similarities with Yahoo out of 60; • Google has 10 similarities with MSN out of 60; • Google found two times the website I was looking for; • Yahoo has 4 similarities with MSN out of 60; • Yahoo found one time the website I was looking for; • MSN found 4 times the website I was looking for; • CHARDONNEAU Ronan - European Master in Business Studies 50/66
    • This analysis is then confirming what I was supposing before search engines do not look for information at the same place. Then one could ask about the pertinence of the results, is it worthwhile to display more than one time on one page the website of the university of Kassel? Or is it a sign of relevancy? So here we have an interpretation according to the Internet user. One thing is sure search engines using different technology provide different results. CHARDONNEAU Ronan - European Master in Business Studies 51/66
    • Chapter 5: The Google example CHARDONNEAU Ronan - European Master in Business Studies 52/66
    • “This part needs additional information and improvements and is then not finished yet.” Google is for sure in some parts of the world the best example we can find of the search engine dependency phenomenon I described. 5.1 Google In January 1996 a 24 year-old PhD student called Larry Page studying at the University of Stanford was looking for a theme for his thesis. Encouraged by his supervisor he studied the following topic “exploring the mathematical properties of the World Wide Web“ working in collaboration with another student called Sergey Brin. To make it simple, it is from this work and collaboration which will came up “Google Inc” (officially created in September the 7th 1998). Two months later Google is already included in the Top 100 of world websites of PC magazine (a reference in the United States for computers). Even if Google is formerly a web based application in English it is a worldwide service available on the Internet for all. As his creator (Larry Page) said quot;Google's search engine has always had strong global appealquot;35. Google is nowadays the most famous search engine. In ten years as the Millward Brown report said36 Google will become the most powerful brand in the world. 5.2 Google's success The broadest explanation I found is the following « Google provides for free a useful service that people actively seek out »37. And when you think about it it is definitely true. People are going on Google because they are all looking for that kind of services. But how can it be more successful than its fellows? Here I could say that in general Google is giving better results which may be one reason, but moreover it 35 cf. Chardonneau, R. (2008): International Marketing: Google. 36 cf. Millward Brown. (2008): Top 100 most powerful brands 08. 37 Eternal Dreamer. (2008): Why Google is so Successful?. CHARDONNEAU Ronan - European Master in Business Studies 53/66
    • is providing added services such emails, blogs, news, Advertisement programs. Moreover Google's health as a company has been well preserved by making good choices when making acquisitions and mergers, they internationally developed themselves very well. 5.3 Google dependency state If we take Europe we can see that it is definitely a Google dependent continent: Illustration 17: Google's domination in Europe As we can see here Google (in blue) is not only the most used search engine in those countries it is in fact like the only search engine present at the continental level. 5.4 Google functions Google Boolean operators are numerous but who really use them when performing a research on Internet? A simplified interface of them is available at http://www.google.com/advanced_search?hl=eng but here also who really makes research under it? CHARDONNEAU Ronan - European Master in Business Studies 54/66
    • Illustration 18: Google's Advanced search engine 5.5 Google added functionalities Here is another part I did not developed so far. It deals with other functionalities and services that the search engine do have but do not put sufficiently in advance to the Internet user. So let's say that the advanced research which is located on the home page is already not that much used you can then imagine how much is used a service which is hidden. 5.6 Google success is his weakness The fate of Google is linked also to the one of Search Engine Optimization which is the ability to well index a website on search engines. The main issue is that Google being the most well known search engine a lot of computer scientists tried to understand how Google was working in order to index web pages. The more experiments are made and the more the secret algorithm of Google is known. As we can imagine few are the web masters who are interested to have a website in the latest pages of Google. This is why Google results are not the most rational and are not the reflect of our physical society. This observation is far more important if we consider that some studies38 showed that Internet users are considering only the first 3 results of Google when making research on it. 38 cf. Enquiro. (2008): Eye Tracking Studies. CHARDONNEAU Ronan - European Master in Business Studies 55/66
    • Illustration 19: Eye Tracking on Google 5.7 Google's disappearance hypothesis It is hard to believe that a company such as Google can close his gates one day but everything can happen and moreover in the world of ICT, so let's imagine what will be the consequences of his disappearance. I truly believe that all Internet users will go for its followers Yahoo and Microsoft with exactly the same behaviors. It however said that the patent of Google algorithm belong to the Stanford University and expires in 2011 what will then happen to Google at that time? On January 31th 2009 a good example of this hypothesis has been put in evidence on a global scale. Due to a Google technician mistake during 50 minutes Google search engine was displaying all the results websites as potentially dangerous: CHARDONNEAU Ronan - European Master in Business Studies 56/66
    • According to French analysts this human error decreased the number of Google visitors of more than 90%: CHARDONNEAU Ronan - European Master in Business Studies 57/66
    • The first conclusion we can set is that people are definitely trusting Google. If Google say something is that it is supposed to be true. According to French analysts during that time Internet users did not went for other search engines39. They probably have shut down their computers during that time. However the day after all search engines competitors rose drastically their market shares except for powered Google search engines40. It clearly shows that Internet is stopping during a rather short period of time before that Internet users goes for alternative solutions. They then go for the major alternative search engines. 39 http://www.pcworld.fr/actualite/google-panne-et-degats-collateraux/26021/ 40 http://fr.news.yahoo.com/16/20090203/ttc-le-bug-de-google-revele-la-forte-dep-c2f7783.html CHARDONNEAU Ronan - European Master in Business Studies 58/66
    • Conclusion The purpose of this intermediate report was for me to show my knowledge about the topic, to prove that the problematic is pertinent, interesting and that there is enough to write about it. I wanted to show as well that I have the ideas clear enough for the next coming months. What I keep in mind from this work is that I like the research I am doing, that I am very curious about the final result (how to make the difference when looking for information on the Internet and that the gap between rational search of information and mass search information is abyssal). I am quite conscious that a lot of work has to be done and that a lot of questions have still no answers. However I would like as well to emphasize what I personally did so far and what I learned: I know now how to structure a master thesis. I had until now no knowledge • about it and I learned from scratch how to set up an academical report; I studied since June 2008 the world of search engines and comes to know a • lot about the topic. Once graduated I would like to continue on the field of e- marketing either on a PhD level or in an international e-marketing company. This thesis is then making me more valuable to pretend to one of those two goals; Studying this field brought me additional ideas regarding the e-marketing • field. I wrote done an additional 20-pages report (not included in this intermediate report) about a hypothetical computer application which could give information about how to launch e-marketing campaigns on a global scale; February will be for sure the month which will decide how this master thesis will be achieved. I have concretely more than 3 weeks to dedicate days and nights to this project. The biggest issues I will have to think about are: Chapter 3.1.6: regarding semantic web I did not study yet in detail the topic • CHARDONNEAU Ronan - European Master in Business Studies 59/66
    • and I will have to come back on this point later; Chapter 4: how to measure the gap of information between mass behavior • when looking for information on the Internet and the most rational behavior?; Chapter 5: The Google example; • Re writing the thesis from an informal point of view this time (using the third • person of the singular instead of the first which does not sound very academic); CHARDONNEAU Ronan - European Master in Business Studies 60/66
    • Declaration I certify that this work has been done by myself and only myself. All the sources used for its realization have been well indicated. Ronan CHARDONNEAU European Master in Business Studies Institut de Management de l'Université de Savoie d'Annecy (FR) Università degli studi di Trento (IT) Universität Kassel (GER) Universidad de León (SP) 22th January, 2009 CHARDONNEAU Ronan - European Master in Business Studies 2007/2009 61/66
    • List of literature Alacra. (2008?): What does quot;Powered By Googlequot; mean? URL: http://www.alacra.com/compliance- search/faq.asp Last access: 23/01/2009 Alexa the Web information company. (2008): Top sites by country URL: http://www.alexa.com/site/ds/top_500 Last access: 23/01/2009 Anonymous. (2007): Baidu au Japon? URL: http://asia.2803.com/chine/baidu-au-japon/ Last access: 23/01/2009 Arrington, M. (2008): Cuil On BusinessWeek's Most Successful of 2008 List. Huh? URL: http://www.washingtonpost.com/wp-dyn/content/article/2008/12/29/AR2008122901227.html Last access: 23/01/2009 Battelle, John. (2005): The birth of Google URL: http://www.wired.com/wired/archive/13.08/battelle.html?tw=wn_tophead_4 Last access: 23/01/2009 Boswell, W. (2004?): AOL Search: How to Search with AOL Search URL: http://websearch.about.com/od/enginesanddirectories/a/aol_search_2.htm Last access: 23/01/2009 Boucq, I. (2009): Yahoo et vos données persos... URL: http://www.erenumerique.fr/yahoo_et_vos_donnees_persos_-news-15162.html Last access: 23/01/2009 Chardonneau, R. (2008): International Marketing: Google URL: http://storage.canalblog.com/24/01/274197/31730758.pdf Last access: 23/01/2009 China Tech News. (2007): CCTV: Baidu Search Engine Fraud Exposed? URL: http://www.chinatechnews.com/2007/05/31/5459-cctv-baidu-search-engine-fraud-exposed/ Last access: 23/01/2009 China Daily. (2008): Baidu cuts revenue forecast on ad scandal URL: http://www.chinadaily.com.cn/bizchina/2008-12/14/content_7302788.htm Last access: 23/01/2009 Chitu, A. (2008): Google Starts to Index the Invisible Web URL: http://googlesystem.blogspot.com/2008/04/google-starts-to-index-invisible-web.html Last access: 23/01/2009 Crédoc, (2008): La diffusion des technologies de l'information et de la communication dans la société française URL: www.arcep.fr/uploads/tx_gspublication/etude-credoc-2008-101208.pdf Last acess 23/01/2009 Crepuq. (2003): Etude sur les connaissances en recherche documentaire des étudiants entrant au 1er cycle dans les universités québécoises URL: http://www.crepuq.qc.ca/documents/bibl/formation/etude.pdf Last access: 23/01/2009 EduDoc. (2008): Enquête sur les compétences documentaires et informationnelles des étudiants qui accèdent à l'enseignement supérieur en Communauté française de Belgique URL: http://www.edudoc.be/synthese.pdf Last access: 23/01/2009 CHARDONNEAU Ronan - European Master in Business Studies 2007/2009 62/66
    • Einhorn, B. (2007): Baidu Thinks It Can Play in Japan URL: http://www.businessweek.com/globalbiz/content/feb2007/gb20070215_649662.htm? chan=globalbiz_asia_technology Last access: 23/01/2009 Enquiro. (2008): Eye Tracking Studies URL:http://www.enquiroresearch.com/eyetracking-report.aspx Last access: 23/01/2009 Enquiro. (2004): Search Engine Usage in North America URL: http://pages.enquiroresearch.com/search-engine-usage-in-na.html? source=Search_Engine_Usage_In_North_America_whitepaper Last access: 19/01/2009 Eternal Dreamer. (2008): Why Google is so Successful? URL: http://crumja.wordpress.com/2008/05/20/why-google-is-so-successful/ Last access: 23/01/2009 Eye tools. (2009): Eyetools Eyetracking Research URL: http://www.eyetools.com/inpage/research_google_eyetracking_heatmap.htm Last access: 23/01/2009 Grallet, G. (2009): Baidu, un autre Google s'éveille URL: http://www.lexpress.fr/actualite/high- tech/baidu-un-autre-google-s-eveille_734826.html Last access: 23/01/2009 Houste, F. (2009): Russie: Yandex sera le moteur de recherche par défaut de Firefox URL: http://www.search-engine-feng-shui.com/2009/01/russie-yandex-sera-le-moteur-de-recherche-par- defaut-de-firefox/ Last access: 23/01/2009 Internet World Stats (2008): Internet Usage and Population in North America URL: http://www.internetworldstats.com/stats14.htm Last access: 23/01/2009 Internet World Stats (2008): INTERNET USAGE STATISTICS The Internet Big Picture URL:http://www.internetworldstats.com/stats.htm Last access: 23/01/2009 Internet World Stats. (2008): Internet Usage in Asia URL:http://www.internetworldstats.com/stats3.htm Last access: 23/01/2009 Kahn, J. (2005): Yahoo helped Chinese to prosecute journalist URL: http://www.iht.com/articles/2005/09/07/business/yahoo.php Last access: 23/01/2009 Kaplan, I. (2008): Bad Data Can Cost You Big Time URL: http://www.federationofcredit.com/base/document/Newsletter/IKaplanSept08.html Last access:23/01/2009 Koch, P. / Koch, S. (2009): How big is the Internet? URL: http://www.pandia.com/sew/383- websize.html. Last access: 19/01/2009 Koch, P. / Koch, S. (1999-2006): A short and easy search engine tutorial URL: http://www.pandia.com/goalgetter/index.html Last access: 19/01/2009 Leung, Hon-wing 梁漢榮. (2004): A study of computer science students' conceptions of information literacy and their experiences in information search process and use URL: http://sunzi.lib.hku.hk/hkuto/record/B29597730 Last access: 19/01/2009 Lipsman, A. (2007): 61 Billion Searches Conducted Worldwide in August/Google Ranks as Top Global Search Property URL: http://www.comscore.com/press/release.asp?press=1802 Last CHARDONNEAU Ronan - European Master in Business Studies 2007/2009 63/66
    • access: 23/01/2009 Mar Hauksson, K. (2007): Global search report 2007 URL: http://www.e3internet.com/downloads/global-search-report-2007.pdf Last access: 23/01/2009 Merriam Webster. (2001): Google - Definition from the Merriam-Webster Online Dictionary URL: http://www.merriam-webster.com/dictionary/google Last access: 23/01/2009 Millward Brown. (2008): Top 100 most powerful brands 08 URL: http://www.millwardbrown.com/Sites/optimor/Media/Pdfs/en/BrandZ/BrandZ-2008-Report.pdf Last access: 23/01/2009 Netcraft (2009): January 2009 Web Server Survey URL:http://news.netcraft.com/archives/2009/01/16/ january_2009_web_server_survey.html Last access 23/01/2009 Ohayon, O. (2008): Google, moteur de recherche ou moteur de navigation? URL: http://fr.techcrunch.com/2008/10/30/fr-google-moteur-de-recherche-ou-moteur-de-navigation/ Last access: 23/01/2009 Peterson, Robert A. / Merino, Maria C. (2003): Consumer Information Search Behavior and the Internet URL: http://www3.interscience.wiley.com/cgi-bin/fulltext/102525298/PDFSTART Last access: 23/01/2009 Plesu, A. (2005): How Big Is the Internet? URL: http://news.softpedia.com/news/How-Big-Is-the- Internet-10177.shtml Last access: 23/01/2009 Rafat, A. (2008): Czech Portal Seznam Could Fetch $900 Million; Google, Apax, Warburg and Others in Fray URL: http://www.washingtonpost.com/wp- dyn/content/article/2008/08/15/AR2008081502517.html Last access: 23/01/2009 Rosen, J. (2008): Google’s Gatekeepers URL: http://www.nytimes.com/2008/11/30/magazine/30google-t.html? _r=2&partner=rss&emc=rss&pagewanted=all Last access: 23/01/2009 Schwartz, B. (2009): Firefox Drops Google For Yandex In Russia, But Big Loser May Be Rambler URL:http://searchengineland.com/firefox-drops-google-for-yandex-in-russia-but-big-loser-may- be-rambler-16107 Last access: 23/01/2009 Tobin, R. / Hotchkiss, G. / Lee, P. (2008): Chinese Search Engine Engagement URL: http://app.marketo.com/lp/enquiro/chinese-search-engine-engagement.html? source=Chinese_Search_Engine_Engagement_whitepaper Last access: 23/01/2009 Université de Lyon. (2007): De la documentation au plagiat URL: http://www.compilatio.net/files/sixdegres-univ-lyon_enquete-plagiat_sept07.pdf Last access: 23/01/2009 URFIST de Rennes. (2008): ENQUÊTE SUR LES BESOINS DE FORMATION DES DOCTORANTS Á LA MAÎTRISE DE L’INFORMATION SCIENTIFIQUE dans les Ecoles doctorales de Bretagne URL: http://www.uhb.fr/urfist/files/Synthese_Enquete_SCD-URFIST.pdf Last access: 23/01/2009 Wilsdon, N. (2007): Global Search Report 2007 URL: http://www.e3internet.com/downloads/global- search-report-2007.pdf Last access: 23/01/2009 Xiti Monitor. (2008): Baromètre des moteurs - Novembre 2008 CHARDONNEAU Ronan - European Master in Business Studies 2007/2009 64/66
    • Encore un mois porteur pour Google URL: http://www.xitimonitor.com/fr-fr/barometre-des- moteurs/barometre-des-moteurs-novembre-2008/index-1-1-6-152.html Last access: 23/01/2009 CHARDONNEAU Ronan - European Master in Business Studies 2007/2009 65/66
    • Afterword The first idea of Larry Page (one of the creator of Google) when making his PhD thesis was to « explore the mathematical properties of the World Wide Web, understanding its link structure as a huge graph »41. Without knowing it I went a bit in the same direction evaluating how to get the best out of the web. I hope I will keep moving forward on this topic and maybe one day set up a correct theory on the subject. 41 Battelle, John. (2005): The birth of Google. CHARDONNEAU Ronan - European Master in Business Studies 2007/2009 66/66