Topics in Advanced Information Retrieval


Published in: Technology, News & Politics

Carrie Hall, 7179881
University of Manchester

Abstract. In this essay I will discuss the current nature of search engines and potential future improvements to search, with particular emphasis on searching in the public domain.

“I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents’ people have touted for ages will finally materialize” - Tim Berners-Lee, 1999

Tim Berners-Lee was not the first person to envisage a world in which information was easily accessible to both humans and machines; the idea was introduced by Vannevar Bush in the 1940s with his concept of a Memex machine[1]. Then, as now, the challenge lay in how to organise an increasing amount of varied data so that it is easily retrievable. This data is growing at an enormous rate; every two minutes a new bio-medical article is published[2], 48 hours of video is uploaded to YouTube[3], and nearly 100,000 tweets are sent[4]. Combine this with the number of web searches that are made (Google had 4 million searches every two minutes in 2009[5]) and it becomes clear that search is an important part of our use of the web.

The aim of all search engines is to return accurate and complete results to a user’s query[6]. A simplified version of how a general-purpose search works is as follows. The search engine crawls the web fetching pages and indexes them based on the content of the page. The keywords from a user query are then matched to these indexes and the user is shown a relevance-ranked list of pages that should contain what they are looking for.
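The index-and-match steps just described can be sketched in a few lines. This is a minimal illustration only: the page texts, the URLs, and the scoring rule (summing raw term counts) are assumptions made for the example, and real engines use far richer ranking signals.

```python
from collections import defaultdict

# Hypothetical "crawled" pages: URL -> page text (illustrative data only).
PAGES = {
    "example.org/daffodil": "daffodil bulbs flower in spring",
    "example.org/garden": "spring garden flower care and garden tools",
    "example.org/apple": "apple fruit trees flower in spring",
}

def build_index(pages):
    """Map each term to {url: number of times the term occurs on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term][url] = index[term].get(url, 0) + 1
    return index

def search(index, query):
    """Rank pages by the total count of query terms they contain."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

INDEX = build_index(PAGES)
print(search(INDEX, "garden flower"))
# ['example.org/garden', 'example.org/apple', 'example.org/daffodil']
```

The garden page ranks first because it contains both query terms; the other two pages contain only "flower" and tie.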
Precision and recall can be used to measure the usefulness of results, where precision is the fraction of retrieved documents that are relevant and recall is the fraction of relevant documents that are retrieved. There is a trade-off between precision and recall[7], which implies that the aim of being both accurate and complete is impractical.

It can be contended that problems can arise in every aspect of search. These are split into the following broad sections in order to address them:

  1. User Input – what did the user type in?
  2. User Meaning – what did the user mean by what they typed?
  3. Formulating results – what did the user want returned?

User Input

There is a need to handle misspellings and incorrect grammar. Choosing how to effectively interpret the words and phrases used within the search, which includes the need to handle multiple languages, can be even more challenging.
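One simple way to handle misspellings is to replace each query word with the closest term the engine already knows. The sketch below uses Python's standard `difflib` module for the string matching; the vocabulary and the 0.75 similarity cutoff are assumptions chosen for illustration, not how any particular search engine works.

```python
import difflib

# Hypothetical vocabulary of terms the engine has already indexed.
VOCABULARY = ["daffodil", "retrieval", "ontology", "semantic", "search"]

def suggest(term, vocabulary=VOCABULARY, cutoff=0.75):
    """Return the closest known term, or the original term if nothing is close."""
    matches = difflib.get_close_matches(term.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else term

def correct_query(query):
    """Apply suggest() to every word of the query."""
    return " ".join(suggest(word) for word in query.split())

print(correct_query("semantc serch"))  # semantic search
```

Words with no sufficiently similar vocabulary entry are left unchanged, so an unfamiliar but correctly spelled term is not silently rewritten.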
User Meaning

Understanding the intention of the user can be difficult, particularly if the user does not provide enough information in their query or if the query is ambiguous; for example, the word ‘apple’ could refer to Apple Inc or to the fruit depending on context.

Formulating results

Arguably the biggest potential problem lies with the number of pages that are retrieved. Even if relevant pages are found, they can easily be hidden amongst thousands of mildly relevant or irrelevant results[8].

Search engines are improving continually, and some work has been undertaken to make them more efficient, reliable and useful.

User input can be improved by making it easier for people to find what they are looking for regardless of what they type in. Two examples are the ‘Suggest’ and ‘Google Instant’ features introduced towards the end of 2009. Google Instant in particular is interesting because it is not search-as-you-type as it might appear; rather it is search-before-you-type. This means that the results for the most likely search, given what the user has typed so far, are shown[9]. Google claims this feature is useful because people type slowly but read quickly[10], but it has received negative responses from more advanced users who claim that it is slow and causes more frustration than a regular search[11]. These features do improve the user experience of search, but they do not change the fundamental nature of search insofar as the results are still a set of documents which the user needs to navigate manually for the information they want.

To aid in user meaning, Google has introduced advanced features to its search, from automatic calculations, unit conversions and dictionary definitions to more complex searches using wildcards[13] (see footnote 1).
This means that simple queries can be resolved without needing to present the user with a series of documents.

Search engines can decrease the number of irrelevant documents retrieved for a query by making it more difficult for web sites to improve their ranking by ‘faking’ the content of their site through keyword-spamming and hidden text[14, 15]. Another potential way to improve results is by searching for synonyms of the words the user originally typed; for example, “computer table” will also bring back results for “computer desk”. WordNet is an English lexical database designed for finding lexical concepts from keywords programmatically, so it could help achieve this purpose[16]. Several years ago Google acquired Oingo, a meaning-based search engine which implements synonym matching, which suggests they wanted to integrate this capability into their search engine[17]. A forum post from 2009 claimed that results for ‘vegetarian recipes’ were being shown when ‘vegan recipes’ was searched for[18], but upon recent exploration this does not seem to occur, which could indicate that Google’s search algorithm ‘learns’ from user query data.

1 Wildcard: a character that can be used as a substitute for any of a class of characters in a search, thereby greatly increasing the flexibility and efficiency of searches[12]. In the context of a Google search this means that a search for ‘Isaac Newton discovered *’ would return constructive results.
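The synonym matching discussed above amounts to expanding one query into several. The following is a minimal sketch under stated assumptions: the synonym table is hand-written and hypothetical, where a real system might derive it from a lexical database such as WordNet.

```python
from itertools import product

# Hypothetical synonym table; a real system might derive this from a lexical
# database such as WordNet rather than a hand-written dictionary.
SYNONYMS = {
    "table": ["desk"],
    "desk": ["table"],
}

def expand_query(query, synonyms=SYNONYMS):
    """Return every variant of the query formed by swapping in known synonyms."""
    options = [[word] + synonyms.get(word, []) for word in query.lower().split()]
    return [" ".join(choice) for choice in product(*options)]

print(expand_query("computer table"))
# ['computer table', 'computer desk']
```

Each variant can then be run as an ordinary keyword search and the results merged. The number of variants grows multiplicatively with the number of words that have synonyms, so a practical system would need to limit or rank the expansions.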
In addition to using synonyms, another way that search engines can expand the user query programmatically is by using stemming. The variation found in words is too great for a simple term-matching algorithm to handle[19]. Consider irregular verbs or plurals: if a user typed ‘daffodils’ they would likely want to find pages containing the term ‘daffodil’. Stemming means that words are reduced to their base form, which is then searched for and usually retrieves more results[19]. There are many algorithms which take this kind of approach; Microsoft’s new Bing search, for example, uses a form of n-gram algorithm which breaks down words into a subsequence of items such as syllables or phonemes[20], and matches documents on that basis.

In other areas, such as the e-science domain, simple keyword-matching searches are not sufficient. The e-science community has driven a lot of advances in the Semantic Web as they collect vast amounts of diverse data and often need quick and correct results when searching this data. In this type of environment failure is less readily tolerated.

In these complex domains the use of ontologies is prevalent, which, it has been argued, proves that the Semantic Web is achievable[21]. Ontologies can be defined as “explicit formal specifications of the terms in the domain and relations among them”[22]. In essence they break down a complex domain into a hierarchical structure with rich information about the interactions between objects/resources. They are used in the context of the Semantic Web as they are machine-readable and are used with an inference engine to obtain information about resources in the domain[6].

An example of a large ontology in use today is the National Cancer Institute’s Thesaurus and Ontology, which contains over 90,000 concepts and is still growing[23]. It contains descriptions of genes, diseases, drugs and chemicals, anatomy, organisms, and proteins as well as the semantic relationships between them.
It uses the Web Ontology Language (OWL) to describe its resources. OWL is a W3C recommendation which builds on RDF using XML and is used to express hierarchies and relationships between resources (ontologies)[24].

Noesis is a semantic search engine that uses domain ontologies specifically designed for environmental scientists. Noesis allows the expert user to filter their search based on attributes of the object they are searching for. An example of this would be searching for sea grass and being able to filter by taxonomy, location or water type[25]. Results outside the ontology (e.g. moisturisers containing sea grass) are removed from the results. Filtering would be an important feature for users to have as it would mean being able to use the search engine to find things such as flights, car hire and insurance, rather than needing to go to several sites.

Ontologies can be used in the context of information retrieval in three ways[26]. Firstly they can be used by domain experts to represent their domain knowledge, secondly by web users to annotate web resources more efficiently and thirdly by end-users searching the web with queries based on the ontology. Web resources that are semantically related in the ontology will be retrieved, but this can mean that end-users need to have a basic understanding of the ontology used.

Another criticism of using ontologies is that relying on users to create metadata is difficult: there are not experts in every subject who are willing to create an Ontology of Everything, and doing so would take a significant amount of time[27]. This could be made more
possible with the use of automatic annotation, which is problematic in itself as it is difficult to verify.

The Resource Description Framework (RDF) is a language used with OWL and is specifically designed for describing resources on the web, which is why it is often cited as one of the key languages in the future of the Semantic Web[24]. It standardises the way things are described (such as price), which could lead to search engines being able to filter results more effectively. It does this by storing data in three parts called triples. A triple contains a subject, predicate and object, which use URIs (see footnote 2) to identify resources. An example of an RDF triple is:

  [resource]          [property]     [value]
  The secret agent    is             Niki Devgood
  [subject]           [predicate]    [object]

An interesting counter-argument to the idea that RDF will be a big player in the Semantic Web involves looking at statistics from Google Trends. Evidence from a report in 2006 suggests that RDF is not a popular search; in fact more people search Google for ‘Prolog’ and ‘Fortran’[27]. An updated report confirms the reduction of interest in this technology (see Figure 1). This report also indicates there is more interest in other technologies, such as AJAX (see footnote 3), judging by the number of books and blog posts on the subject.

There has been recent work on a search engine that queries the web like a database[28]. The idea is that the end queries are simple to use but the results are complex. It is still in the early stages and only has 10 entities (namely Person, Company, University, City, Novel, Player, Club, Song, Film and Award), but it works by telling the system which entities to look for and what the relations are between them[29].
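To make the triple structure concrete, the sketch below stores a handful of subject-predicate-object triples and answers pattern queries over them, including a small two-step join of the kind an entity-and-relation query needs. Everything here is hypothetical: the `ex:` names stand in for full URIs, and the `worksFor` and `locatedIn` facts are invented purely to have something to join on.

```python
# Hypothetical triples: (subject, predicate, object). Real RDF would identify
# each resource with a full URI; short "ex:" names are used here for brevity.
TRIPLES = [
    ("ex:SecretAgent", "ex:is", "ex:NikiDevgood"),
    ("ex:NikiDevgood", "ex:worksFor", "ex:Agency"),
    ("ex:Agency", "ex:locatedIn", "ex:London"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

def located_employers(triples, person):
    """Join two patterns: person worksFor ?org, and ?org locatedIn ?place."""
    results = []
    for _, _, org in match(triples, subject=person, predicate="ex:worksFor"):
        for _, _, place in match(triples, subject=org, predicate="ex:locatedIn"):
            results.append((org, place))
    return results

print(match(TRIPLES, predicate="ex:is"))
print(located_employers(TRIPLES, "ex:NikiDevgood"))
```

The join in `located_employers` is only a toy analogue of asking for entities and the relations between them; it is not the query language of the project cited above.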
The following example query is taken from the project website and demonstrates the complex questions that can be answered using this approach; the actual query is shown in Figure 2 in the appendix:

“Find an Australian actress, an Academy Award winning film, and a Grammy Award winning song, where the actress stars the film and the song is the theme of that film”

Results returned:
Nicole Kidman, Batman Forever, Kiss from a Rose
Melanie Griffith, Titanic (1997 film), My Heart Will Go On
Mia Farrow, Midnight Cowboy, Without You

This kind of complex result would have taken several queries and a lot of time for a human to produce using a current search engine, but due to the complex way in which the query needs to be formulated it may never reach mainstream use.

It is important to take a moment to explore the human motivations for using search, and how people go about doing it. The ‘Principle of least effort’[30] implies that information-seeking users will use the most convenient search method available to them; thus it can be argued that this search engine would deter users due to its complex and multipart interface. The Wolfram Alpha search engine uses a ‘Google-style’ query box, which may encourage more users to use it. Advanced users (such as those in the e-science domain) have a lot of experience with complex query structuring, so they might be averse to using this sort of open-ended query input as it does not match their mental model of a complex search engine.

2 URI: Uniform Resource Identifier, used to uniquely identify resources
3 AJAX: Asynchronous JavaScript and XML

It has been suggested that there is an overlap between knowledge representation and model-driven architecture[31] and that this can be used as a backbone to power a new semantic web. The goal of model-driven architecture (MDA) is to separate business applications from the technologies used for implementation. It relies heavily on metadata created by the Meta-Object Facility (MOF). MDA technologies could be used as a foundation for ontology modelling in the Semantic Web as both MDA and Semantic Web languages contain similar constructs (such as subsumption relations and relationships between classes)[31]. Furthermore UML (see footnote 4) and MOF are graphical, and it may be more straightforward for experts to create ontologies using these tools than it would be with a knowledge representation language. The role of the Semantic Web would then be to reason about these resources, and it would not be concerned with the complex task of managing the models. The Object Management Group (OMG) has created an Ontology Definition Metamodel specification which enables modelling of ontologies using UML tools[32].

As previously mentioned, the results of simple queries (like mathematical calculations and unit conversions) can be pulled from sources and displayed to the user. A more semantic search engine could try to do this for more complex results, such as “winner of 2011 australian tennis open”[33], rather than simply identifying potentially useful pages by matching the query entered. Two examples of search engines attempting this are Sensebot and Wolfram Alpha, although both of these search engines only cover a relatively small, selected domain.
They do however show that it is possible to have search results without a list of documents.

So far a lot of emphasis has been placed on finding textual answers to user queries, but what about semantic searching over images or video? In fact this is already appearing: nachofoto have created a semantic, time-based vertical image search engine[34], although it is only in beta version and only contains information that is trending on the web.

Another way in which search engines could be more intelligent is in the way they decide which results to filter when ambiguous searches are made, by asking the user a question, e.g. ‘Are you searching for the company or the fruit?’ when presented with the term ‘apple’[35]. They could learn from collecting large numbers of user responses in order to guess which result a user is looking for. The previously mentioned Wolfram Alpha interprets this search as the company Apple but provides a very user-friendly way to change the interpretation to one of several others that are available.

One of the biggest areas of growth in recent years is that of the social web. The social web refers to web sites that are driven by user participation such as Wikipedia, YouTube and Flickr[36]. This is a good form of knowledge sharing, especially in newer areas of interest that do not have a defined structure that could be mapped into an ontology. User participation can be used to create and update metadata which will aid in the retrieval of items of interest. Therefore as more people contribute the system becomes more useful, and this added information may be previously unknown[36].

4 UML: Unified Modelling Language, see
TipTop is a social search engine that provides sentiment analysis of a subject and classifies opinions as positive or negative[33, 37]. Twitter in particular is heavily used by this search engine. Twitter has been shown to influence public collective decision making and even predict economic changes. In particular a recent study showed that integrating Twitter sentiment into stock market prediction increased accuracy from 73.3% to 86.7%[38]. Using sentiment analysis could assist search engines when deciding how to rank documents or which information to present to a user first.

Conclusion

The amount of content on the web is immense and it is growing rapidly. It can be argued that ontologies are very effective for domain-specific tasks where the users are experts, but an ontology that covers the whole World Wide Web seems unlikely. Ontologies need to be decided upon, implemented and maintained, all of which are very large tasks in themselves. It has been suggested that the only way the Semantic Web can succeed is by using several ontologies from different communities[21] which, although feasible, will be demanding and time-consuming.

Motivation for searching is another factor in deciding how a search engine should function: more advanced users who rely on the search engine to perform their job are likely to influence how the search engine evolves. It is these users who have pushed changes in areas such as e-science, which could suggest that these areas will continue to evolve and adapt to the growing amount of data. When developing any sort of software system, often the most difficult thing is getting the users to use it. In this respect, it is unlikely that any new public search engine will become widespread enough to overtake Google, Microsoft and Yahoo.

Furthermore, can a computer really mimic human behaviour? Just as the content on the web is changing rapidly, language and human behaviour also evolve and change.
Perhaps by the time an ontology or a natural language processor is developed that is complex enough to be used on the general web, it will be out of date with both the users and the data. Furthermore, if all users sent well-formed information queries to a search engine then many problems would be solved. However users tend to type only a few words and expect a result[39]. After all, two searches with few keywords are likely to take the same amount of time as one long query, and the long query lacks the advantage of showing immediate results.

It is these subtleties in human behaviour that may be near to impossible for a computer system to interpret, at least in the near future, but enough advances have been made to suggest that something resembling a Semantic Web is certainly achievable in the long term.

Word count: 2951
Appendix

Figure 1

Figure 2
References

1. Bush, V., As We May Think. Atlantic Magazine, 1945.
2. Nenadic, G., Data Integration and Analysis. 2010.
3. SiteImpulse. YouTube Facts & Figures. 2010; Available from:
4. Hachman, M., Twitter Tops 2 Billion Tweets Per Month, in PCMag. 2010.
5. comScore. comScore Reports Global Search Market Growth of 46 Percent in 2009. 2009; Available from: rows_46_Percent_in_2009.
6. Movv, S., Noesis: A Semantic Search Engine and Resource Aggregator for Atmospheric Science. American Geophysical Union, 2006.
7. Buckland, M., The relationship between Recall and Precision. Journal of the American Society for Information Science, 1994. Volume 45, Issue 1: p. 12-19.
8. Antoniou, G., A Semantic Web Primer. 2004, Massachusetts: Massachusetts Institute of Technology.
9. Google, Google Instant, behind the scenes, in Google Blog. 2009.
10. Google. About Google Instant. 2009; Available from:
11. Mello, J., Google Instant: Pros and Cons, in PCWorld. 2009.
12. Linux Information Project. How to Use Wildcards. 2006; Available from:
13. Google. Search Features. 2011; Available from:
14. Google, Google does not use the keywords meta tag in web ranking, in Google Webmaster Blog. 2009.
15. Google. Webmaster Guidelines. 2011; Available from:
16. Miller, G., WordNet: A Lexical Database for English. Communications of the ACM, 1995. Volume 38, Issue 11.
17. Radhakrishnan, A., Oingo Meaning Engine, Semantic Search & Google, in Search Engine Journal. 2007.
18. Google. Vegan vs Vegetarian. 2009; Available from: =en.
19. Hull, D., Stemming Algorithms. Journal of the American Society for Information Science, 1995.
20. Gao, J., A Comparative Study of Bing Web N-gram Language Models for Web Search and Natural Language Processing. Proceedings of the 33rd Annual ACM SIGIR Conference, 2010.
21. Staab, S., The Semantic Web Revisited. IEEE Intelligent Systems, 2006. Issue May/June.
22. Gruber, T., A translation approach to portable ontology specifications. Knowledge Acquisition, 1993.
23. Jennifer Golbeck, G.F., Frank Hartel, Jim Hendler, Jim Oberthaler, Bijan Parsia, The National Cancer Institute's Thesaurus and Ontology. 2003.
24. Altova. What is the Semantic Web? 2005.
25. Physorg, Semantic science search engine knows that there is a difference. 2009.
26. Corby, O., Querying the Semantic Web with Corese Search Engine. 2004.
27. Zambonini, D., The 7 (f)laws of the Semantic Web, in 2006.
28. Physorg, Grant to help researchers build better search engines. 2010.
29. Li, C. Entity-Relationship Queries. 2011; Available from:
30. Zipf, G., Human Behaviour and the Principle of Least Effort. 1949.
31. Frankel, D., The Model Driven Semantic Web. Proceedings of First International Workshop on the Model-Driven Semantic Web, 2004.
32. Object Management Group, Ontology Definition Metamodel. 2009.
33. Hendler, J., Web 3.0: The Dawn of Semantic Search. IEEE Computer Society, 2010. Volume 43, Issue 1: p. 77-80.
34. Zaino, J., Semantic Image Search: Next Up For a Major Search Engine? 2011.
35. Physorg, Smarter than Google?, M. Breedveld, Editor. 2010.
36. Gruber, T., Collective knowledge systems: Where the Social Web meets the Semantic Web. Journal of Web Semantics, 2007. Volume 6.
37. TipTop. TipTop Search FAQs. 2009; Available from:
38. Bollen, J., Twitter mood predicts the stock market. Journal of Computational Science, 2011. Volume 1.
39. Zaino, J., Exploring Search, in 2010.