Information retrieval basics_v1.0


Published in: Education, Technology, Design

  • Here is what a search environment for a company employee looks like
  • This slide is needed in case some people are not familiar with how an IR system works. It shows a very simplified standard architecture; in some scenarios, some of these components may be absent. Depending on the level of the participants, you may spend some time explaining how each component works.
  • Currency: the timeliness of the information. When was the information published or posted? Has the information been revised or updated? Is the information current or out of date for your topic? Are the links functional?
    Relevance: the importance of the information for your needs. Does the information relate to your topic or answer your question? Who is the intended audience? Is the information at an appropriate level (i.e. not too elementary or advanced for your needs)? Have you looked at a variety of sources before determining this is one you will use? Would you be comfortable using this source for a research paper?
    Authority: the source of the information. Who is the author/publisher/source/sponsor? Are the author's credentials or organizational affiliations given? What are the author's qualifications to write on the topic? Is there contact information, such as a publisher or e-mail address? Does the URL reveal anything about the author or source? Examples: .com (commercial), .edu (educational), .gov (U.S. government), .org (nonprofit organization), or .net (network).
    Accuracy: the reliability, truthfulness, and correctness of the content. Where does the information come from? Is the information supported by evidence? Has the information been reviewed or refereed? Can you verify any of the information in another source or from personal knowledge? Does the language or tone seem unbiased and free of emotion? Are there spelling, grammar, or other typographical errors?
    Purpose: the reason the information exists. What is the purpose of the information: to inform, teach, sell, entertain, or persuade? Do the authors/sponsors make their intentions or purpose clear? Is the information fact, opinion, or propaganda? Does the point of view appear objective and impartial? Are there political, ideological, cultural, religious, institutional, or personal biases?
    By scoring each category on a scale from 1 to 10 (1 = worst, 10 = best possible), you can grade each site on a 50-point scale for quality: 45 - 50 Excellent | 40 - 44 Good | 35 - 39 Average | 30 - 34 Borderline Acceptable | Below 30 Unacceptable
  • Subject directories can help one find more in-depth information on a certain subject than a plain search engine. Whether one is looking for medical or academic articles or is just plain curious, one way to find information is a basic search engine; however, if one is searching for information on a specific topic and wants direct, to-the-point information, one needs a subject directory. Which ones to choose, and why, can be a difficult question, so I compiled a list of the most commonly used ones and a few hidden gems I found on the internet.
    Librarians' Internet Index (LII) – Over 20,000 articles compiled by public librarians, with completely reliable sources.
    INFOMINE (Infomine.) – Over 250,000 articles compiled by academic librarians, all from reliable sources. We are talking college-level information here. Want an A or a raise? This is a great site for well-researched information, all of it written by …
    (About.) – With nearly 2 million articles, this is one of the leading subject directories. The articles are written by people with experience in the area in which they write.
    Google Directory (Google Directory) – With well over 5 million articles, this is by far the leader in subject directories. It is enhanced by the Google search engine, which means more results on the chosen research topic.
    Yahoo Directory (Yahoo Directory.) – With just over 4 million articles, Yahoo offers lots of useful information. The only drawback is that this subject directory works best with popular topics, not vague ones.
  • The Million Book Project (or the Universal Library) was a book digitization project led by the Carnegie Mellon University School of Computer Science and University Libraries.[1] Working with government and research partners in India (Digital Library of India) and China, the project scanned books in many languages, using OCR to enable full-text searching, and provided free-to-read access to the books on the web. As of 2007, the project had completed the scanning of 1 million books and made the entire database accessible.
    Internet Archive is a non-profit digital library with the stated mission of "universal access to all knowledge."[2][3] It offers permanent storage of and free public access to collections of digitized materials, including websites, music, moving images, and nearly three million public-domain books; as of October 2012 it held over 10 petabytes of cultural material.[4]
    CiteSeer was a public search engine and digital library for scientific and academic papers, primarily in the fields of computer and information science. It became public in 1998 and had many new features unavailable in academic search engines at that time.
    The arXiv (pronounced "archive", as if the "X" were the Greek letter Chi, χ) is an archive for electronic preprints of scientific papers in the fields of mathematics, physics, astronomy, computer science, quantitative biology, statistics, and quantitative finance, which can be accessed online. In many fields of mathematics and physics, almost all scientific papers are self-archived on the arXiv. On October 3, 2008, it passed the half-million article milestone.[2] The preprint archive turned 20 years old on August 14, 2011.[3] By 2012 the submission rate had grown to more than 7000 per month.[4]
  • Web 2.0
  • The Deep Web (also called the Deepnet, the Invisible Web, the Undernet or the hidden Web) is World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines. It should not be confused with the dark Internet, the computers that can no longer be reached via the Internet, or with the distributed filesharing network Darknet, which could be classified as a smaller part of the Deep Web.
    Mike Bergman, founder of BrightPlanet and credited with coining the phrase,[1] said that searching on the Internet today can be compared to dragging a net across the surface of the ocean: a great deal may be caught in the net, but there is a wealth of information that is deep and therefore missed.[2] Most of the Web's information is buried far down on dynamically generated sites, and standard search engines do not find it. Traditional search engines cannot "see" or retrieve content in the deep Web: those pages do not exist until they are created dynamically as the result of a specific search. The deep Web is several orders of magnitude larger than the surface Web.[3]
    Deep Web resources fall into the following categories:
    Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge.
    Unlinked content: pages which are not linked to by other pages, which may prevent Web crawling programs from accessing the content. This content is referred to as pages without backlinks (or inlinks).
    Private Web: sites that require registration and login (password-protected resources).
    Contextual Web: pages with content varying for different access contexts (e.g., ranges of client IP addresses or previous navigation sequence).
    Limited access content: sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard, CAPTCHAs, or no-cache Pragma HTTP headers, which prohibit search engines from browsing them and creating cached copies).[8]
    Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions.
    Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines.
  • Business analytics (BA) refers to the skills, technologies, applications and practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning.[1] Business analytics focuses on developing new insights and understanding of business performance based on data and statistical methods. In contrast, business intelligence traditionally focuses on using a consistent set of metrics to both measure past performance and guide business planning, which is also based on data and statistical methods.
    Business analytics makes extensive use of data, statistical and quantitative analysis, explanatory and predictive modeling,[2] and fact-based management to drive decision making. Analytics may be used as input for human decisions or may drive fully automated decisions. Business intelligence is querying, reporting, OLAP, and "alerts".
  • The Semantic Web is a collaborative movement led by the international standards body, the World Wide Web Consortium (W3C).[1] The standard promotes common data formats on the World Wide Web. By encouraging the inclusion of semantic content in web pages, the Semantic Web aims at converting the current web, dominated by unstructured and semi-structured documents, into a "web of data". The Semantic Web stack builds on the W3C's Resource Description Framework (RDF).[2]
    According to the W3C, "The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries."[2]
    YAGO2s is a huge semantic knowledge base, derived from Wikipedia, WordNet and GeoNames. Currently, YAGO2s has knowledge of more than 10 million entities (like persons, organizations, cities, etc.) and contains more than 120 million facts about these entities.
  • In computing, linked data describes a method of publishing structured data so that it can be interlinked and become more useful. It builds upon standard Web technologies such as HTTP and URIs, but rather than using them to serve web pages for human readers, it extends them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried.[1]
    Tim Berners-Lee, director of the World Wide Web Consortium, coined the term in a design note discussing issues around the Semantic Web project.[2] However, the idea is very old and is closely related to concepts including database network models, citations between scholarly articles, and controlled headings in library catalogs.
    Tim Berners-Lee gave a presentation on linked data at the TED 2009 conference.[4] In it, he restated the linked data principles as three "extremely simple" rules:
    1. All kinds of conceptual things, they have names now that start with HTTP.
    2. I get important information back. I will get back some data in a standard format which is kind of useful data that somebody might like to know about that thing, about that event.
    3. I get back that information it's not just got somebody's height and weight and when they were born, it's got relationships. And when it has relationships, whenever it expresses a relationship then the other thing that it's related to is given one of those names that starts with HTTP.
    FOAF (an acronym of Friend of a friend) is a machine-readable ontology describing persons, their activities and their relations to other people and objects. Anyone can use FOAF to describe him or herself. FOAF allows groups of people to describe social networks without the need for a centralised database. FOAF is a descriptive vocabulary expressed using the Resource Description Framework (RDF) and the Web Ontology Language (OWL). Computers may use these FOAF profiles to find, for example, all people living in Europe, or to list all people both you and a friend of yours know.[1][2] This is accomplished by defining relationships between people. Each profile has a unique identifier (such as the person's e-mail addresses, a Jabber ID, or a URI of the homepage or weblog of the person), which is used when defining these relationships.
    The GeoNames geographical database is available for download free of charge under a Creative Commons attribution license. It contains over 10 million geographical names and consists of over 8 million unique features, of which 2.8 million are populated places, and 5.5 million alternate names. All features are categorized into one of nine feature classes and further subcategorized into one of 645 feature codes. The data is accessible free of charge through a number of web services and a daily database export. GeoNames already serves up to over 30 million web service requests per day.
  • Open data is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. The goals of the open data movement are similar to those of other "Open" movements such as open source, open hardware, open content, and open access. The philosophy behind open data has been long established (for example in the Mertonian tradition of science), but the term "open data" itself is recent, gaining popularity with the rise of the Internet and World Wide Web and, especially, with the launch of open-data government initiatives such as Data.gov.
    Open data is often focused on non-textual material such as maps, genomes, connectomes, chemical compounds, mathematical and scientific formulae, medical data and practice, bioscience and biodiversity. Problems often arise because these are commercially valuable or can be aggregated into works of value. Access to, or re-use of, the data is controlled by organisations, both public and private. Control may be through access restrictions, licenses, copyright, patents and charges for access or re-use. Advocates of open data argue that these restrictions are against the communal good and that these data should be made available without restriction or fee. In addition, it is important that the data are re-usable without requiring further permission, though the types of re-use (such as the creation of derivative works) may be controlled by a license.
    Data.gov is a U.S. government website launched in late May 2009 by the then Federal Chief Information Officer (CIO) of the United States, Vivek Kundra. According to its website, "The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government."[1]
    Open Data Commons is the home of a set of legal tools to help you provide and use Open Data.
    D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3's emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.
  • Recommender systems or recommendation systems (sometimes replacing "system" with a synonym such as platform or engine) are a subclass of information filtering system that seek to predict the 'rating' or 'preference' that a user would give to an item (such as music, books, or movies) or social element (e.g. people or groups) they had not yet considered, using a model built from the characteristics of an item (content-based approaches) or the user's social environment (collaborative filtering approaches).[1][2]
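The CRAAP scoring scheme in the notes above (five categories, each scored 1 to 10, summed to a 50-point grade) is simple arithmetic that can be sketched in a few lines of Python. The function name `craap_grade` and the dictionary keys are assumptions made for this illustration; only the categories and grade bands come from the notes.

```python
from typing import Dict, Tuple

# The five CRAAP categories named in the notes.
CRITERIA = ("currency", "relevance", "authority", "accuracy", "purpose")

# Grade bands from the notes: (lowest total in the band, label), highest first.
GRADE_BANDS = (
    (45, "Excellent"),
    (40, "Good"),
    (35, "Average"),
    (30, "Borderline Acceptable"),
)


def craap_grade(scores: Dict[str, int]) -> Tuple[int, str]:
    """Sum the five 1-10 category scores and map the total to a grade label."""
    for criterion in CRITERIA:
        if criterion not in scores:
            raise ValueError(f"missing score for {criterion!r}")
        if not 1 <= scores[criterion] <= 10:
            raise ValueError(f"{criterion!r} must be scored from 1 to 10")
    total = sum(scores[criterion] for criterion in CRITERIA)
    for lower_bound, label in GRADE_BANDS:
        if total >= lower_bound:
            return total, label
    return total, "Unacceptable"


if __name__ == "__main__":
    example = {"currency": 8, "relevance": 9, "authority": 7,
               "accuracy": 8, "purpose": 9}
    print(craap_grade(example))
```

The scores themselves remain a human judgment; the sketch only makes the summing and banding explicit.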
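The Deep Web note above lists the Robots Exclusion Standard as one way sites limit crawler access. As a rough illustration of how a well-behaved crawler might honour it, the sketch below uses Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the example.org URLs are invented for this example.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: everything is crawlable except /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.org/public/page.html",
                "https://example.org/private/report.html"):
        print(url, is_crawlable(ROBOTS_TXT, "ExampleBot", url))
```

Pages excluded this way stay out of a compliant search engine's index, which is why they count as part of the Deep Web even though they are publicly reachable.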
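The linked data note above says FOAF profiles let a computer "list all people both you and a friend of yours know" by defining relationships between identified people. A minimal sketch of that idea, using plain Python tuples as subject-predicate-object triples rather than a real RDF library, might look as follows. The `foaf:knows` property URI is part of the FOAF vocabulary; the people under example.org and the helper names are hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject URI, predicate URI, object URI)

# The "knows" property from the FOAF vocabulary.
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"

# Hypothetical people, each named by an HTTP URI as the linked data rules ask.
ALICE = "http://example.org/people/alice"
BOB = "http://example.org/people/bob"
CAROL = "http://example.org/people/carol"
DAVE = "http://example.org/people/dave"

TRIPLES: Set[Triple] = {
    (ALICE, FOAF_KNOWS, CAROL),
    (ALICE, FOAF_KNOWS, DAVE),
    (BOB, FOAF_KNOWS, CAROL),
}


def known_by(triples: Set[Triple], person: str) -> Set[str]:
    """Return everyone that `person` is stated to know."""
    return {obj for subj, pred, obj in triples
            if subj == person and pred == FOAF_KNOWS}


def mutual_acquaintances(triples: Set[Triple], a: str, b: str) -> Set[str]:
    """Return the people that both `a` and `b` know."""
    return known_by(triples, a) & known_by(triples, b)


if __name__ == "__main__":
    print(mutual_acquaintances(TRIPLES, ALICE, BOB))
```

Because every person is named by a URI, triples published by different sites can be merged into one set and queried together, which is the point of the third rule.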
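The recommender systems note above mentions collaborative filtering, which predicts a user's rating for an unseen item from the ratings of other users. One common textbook variant, shown here only as a sketch, weights other users' ratings by cosine similarity to the target user. The user names, item names, and ratings are invented, and real systems use considerably more elaborate models.

```python
from math import sqrt
from typing import Dict, Optional

Ratings = Dict[str, Dict[str, float]]  # user -> {item -> rating}


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two rating vectors; unrated items count as 0."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_rating(ratings: Ratings, user: str, item: str) -> Optional[float]:
    """Similarity-weighted average of other users' ratings for `item`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        if similarity <= 0:
            continue
        weighted_sum += similarity * other_ratings[item]
        weight_total += similarity
    if weight_total == 0:
        return None  # nobody comparable has rated this item
    return weighted_sum / weight_total


# Hypothetical ratings on a 1-5 scale.
RATINGS: Ratings = {
    "ann": {"book_a": 5, "book_b": 1},
    "bob": {"book_a": 5, "book_b": 1, "book_c": 4},
    "cat": {"book_a": 1, "book_b": 5, "book_c": 2},
}

if __name__ == "__main__":
    print(predict_rating(RATINGS, "ann", "book_c"))
```

In this data, ann's tastes match bob's more closely than cat's, so the prediction for `book_c` lands nearer bob's rating of 4 than cat's rating of 2.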
  • Information retrieval basics_v1.0

    1. Center for the Study of New Media and Society, www.newmediacenter.ru. Information Retrieval Basics. Sergey Chernov
    2. Information search in action… Vladimir Pekhtin, Alexey Navalny, Doct_z (5/24/2013, Sergey Chernov, Information Retrieval Basics)
    3. Public data
    4. Resources and achievements: search engines; databases for property owners in Europe & USA; list of Deputies of State Duma; man-hours invested in manual search and exploration. Results: 500+ news, 150 articles, 20 interviews and videos; Pekhtin resigned from Committee of Ethics
    5. Outline for today: Sources of Information; Search strategies and tools; Search Cases; Assignments and Q&A Session
    6. Outline for today: Sources of Information; Search strategies and tools; Search Cases; Assignments and Q&A Session
    7. Information in numbers: Facebook – 900 mln users; Twitter – 500 mln; Flickr – 50 mln; Delicious – 5 mln; Web – 1 trln
    8. Information Retrieval: Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
    9. Information Domains: Desktop; Enterprise Web (Intranet); Public Web (Internet). Sources shown: DVD, Disk, FShare, DB, Web, CMS, E-mail, People, Web Sites, Online Libraries, Online Shops, Social Networks
    10. Information Retrieval System
        Crawler: downloads/collects the data
        Indexer: processes the data and builds the inverted index
        Ranker: evaluates user queries against the index and computes a list of (ranked) results
        Display: organizes and displays the results to the user, facilitates navigation through the result set
    11. User Needs [Broder 2002, Rose and Levinson 2004] (Sec. 19.4.1)
        Informational – want to learn about something (e.g. "Low hemoglobin")
        Navigational – want to go to that page (e.g. "United Airlines")
        Transactional – want to do something (web-mediated): access a service (e.g. "Seattle weather"); downloads (e.g. "Mars surface images"); shop (e.g. "Canon S410")
        Gray areas: find a good hub (e.g. "Car rental Brasil"); exploratory search, "see what's there"
    12. How far do people look for results? (Source: WhitePaper_2006_SearchEngineUserBehavior.pdf)
    13. How to evaluate results? CRAAP (California State University, Chico)
        Currency: How old is the material? Does the age matter? History – better old info, medicine – fresh stuff.
        Relevance: How well does it fit? Does it answer my question? Detailed enough?
        Authority: Who wrote it? Is the author qualified to write? What about contact information?
        Accuracy: Is it supported by evidence? Refereed? Verifiable? Unbiased? Clearly written?
        Purpose: What can you infer about the authors' message? Is it fact, opinion or propaganda?
    14. Where to search?: Web; Subject directories; Intranet and Desktop; Digital libraries; Social platforms; Databases and Hidden Web; Business analytics; Wikipedia; Photo stocks; Open datasets and Linked Data; Open Gov Data
    15. Web
    16. Subject directories
    17. Intranet
    18. Desktop
    19. Digital libraries
    20. Social platforms
    21. Databases and Hidden Web
    22. Business Analytics
    23. Wikipedia
    24. Photo stocks
    25. Linked Data
    26. Open Data
    27. Outline for today: Sources of Information; Search strategies and tools; Search Cases; Assignments and Q&A Session
    28. Search is a journey. Is that all?
    29. Search is a journey
    30. Search is a journey
    31. Search is a journey
    32. Search is a journey
    33. Exploratory search
        Lookup: question answering; fact retrieval; known-item search; navigational search. Lasts for seconds.
        Exploratory search (investigate, learn): knowledge acquisition; comprehension; comparison; discovery; serendipity; incremental search; driven by uncertainty; non-linear behavior; result analysis. Lasts for hours.
    34. Exploratory behavior
        Learn: about the search topic; about the collection
        Reformulate query: broadening; narrowing; changing the focus
        Socialize: looking for experts; collaborative search
    35. Search tools: Web search engines; Personalized search; Faceted search; Review services; Geo-services; Question answering; Scientific search; Domain-specific search; Recommender systems
    36. Web search engine: query suggestions; snippets
    37. Web search engine (2)
    38. Web search engine (3)
        Search for pages that link to a URL – "link:" operator
        Search for pages that are similar to a URL – "related:" operator
        Search for results from specific sites – "site:" operator, e.g. site:strelkainstitute.com
    39. Personalized search
        Personalization is the modeling of a user's preferences from previous interactions: queries, click-through analysis, eye tracking …
        Personalized search is usually implemented as: re-ranking and filtering of the search results; personalized query expansion
    40.
    41. Faceted search: It's about Result Analysis! (labels: facet, facet values)
    42. Faceted search (2): It's about Query Reformulation!
    43. Review services
    44. Geo-services
    45. Question answering
    46. Scientific search
    47. Scientific Search (2)
    48. Domain-specific search
    49. Recommender systems
    50. Outline for today: Sources of Information; Search strategies and tools; Search Cases; Assignments and Q&A Session
    51. Case 1: finding a research paper
    52. Case 2: planning a trip
    53. Case 3: looking for an expert
    54. Case 4: market analysis
    55. Outline for today: Sources of Information; Search strategies and tools; Search Cases; Assignments and Q&A Session
    56. Practical assignment: construct 3 information needs relevant to your everyday experience (preparing for an interview, choosing a learning course, doing homework, etc.); search for the information using the maximum number of sources and tools; share your experience
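Slide 10 describes an IR system as a pipeline: an indexer builds an inverted index, and a ranker evaluates queries against it to produce a ranked result list. A toy version of the indexer and ranker stages might look like the sketch below. The documents are invented, and ranking by raw term frequency is a deliberate simplification chosen for brevity, not the method of any particular search engine.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

InvertedIndex = Dict[str, Dict[str, int]]  # term -> {doc_id -> term count}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Dict[str, str]) -> InvertedIndex:
    """Indexer: map every term to the documents that contain it."""
    index: InvertedIndex = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return dict(index)


def search(index: InvertedIndex, query: str) -> List[Tuple[str, int]]:
    """Ranker: score documents by summed term frequency, best first."""
    scores: Counter = Counter()
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Sort by descending score, then by doc_id so ties are deterministic.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical mini-collection standing in for crawled pages.
DOCS = {
    "d1": "Seattle weather forecast",
    "d2": "Mars surface images and Mars weather",
    "d3": "Car rental",
}

if __name__ == "__main__":
    index = build_index(DOCS)
    print(search(index, "mars weather"))
```

The inverted index is what makes lookup fast: a query touches only the postings of its own terms instead of scanning every document.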
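Slide 38 lists the "link:", "related:" and "site:" query operators. Composing such queries is plain string assembly, as the sketch below shows. The helper names and the `q` parameter name are assumptions for illustration, and support for each operator varies between search engines and over time, so check the engine's own documentation before relying on one.

```python
from typing import Optional
from urllib.parse import urlencode


def build_query(terms: str = "",
                site: Optional[str] = None,
                link: Optional[str] = None,
                related: Optional[str] = None) -> str:
    """Combine free-text terms with the operators named on slide 38."""
    parts = [terms.strip()] if terms.strip() else []
    if site:
        parts.append(f"site:{site}")        # restrict results to one site
    if link:
        parts.append(f"link:{link}")        # pages that link to a URL
    if related:
        parts.append(f"related:{related}")  # pages similar to a URL
    return " ".join(parts)


def encode_query(query: str, param: str = "q") -> str:
    """URL-encode the query as a single parameter (the name is assumed)."""
    return urlencode({param: query})


if __name__ == "__main__":
    query = build_query("urban design", site="strelkainstitute.com")
    print(query)
    print(encode_query(query))
```

Note that the operator and its value are written without a space between them.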
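Slides 41 and 42 present faceted search as a tool for both result analysis (seeing how results spread over facet values) and query reformulation (narrowing by picking a value). Both steps can be sketched over a small in-memory result set. The items, facet names, and function names below are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Item = Dict[str, str]

# A hypothetical result set; "brand" and "type" act as facets.
ITEMS: List[Item] = [
    {"name": "camera_1", "brand": "BrandA", "type": "compact"},
    {"name": "camera_2", "brand": "BrandA", "type": "dslr"},
    {"name": "camera_3", "brand": "BrandB", "type": "compact"},
]


def facet_counts(items: List[Item], facet: str) -> Dict[str, int]:
    """Result analysis: count how many results carry each facet value."""
    return dict(Counter(item[facet] for item in items if facet in item))


def refine(items: List[Item], **selected: str) -> List[Item]:
    """Query reformulation: keep only results matching every selected value."""
    return [item for item in items
            if all(item.get(facet) == value
                   for facet, value in selected.items())]


if __name__ == "__main__":
    print(facet_counts(ITEMS, "brand"))
    print(refine(ITEMS, brand="BrandA", type="compact"))
```

After each refinement the counts can be recomputed on the narrowed set, which gives the user the next round of choices.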