INFORMATION RETRIEVAL
A Look into the Science of Web Search Engines
Muhammad Atif Qureshi
Reach me on Twitter: @matifq     Email: maqureshi@iba.edu.pk

Contents
  • Story Mode Learning
  • Learning by Imagination
  • Appendix

Story Mode Learning
(Borrowed from Prof. Jimmy Lin, University of Maryland, Scientist at Twitter)

Information Retrieval Systems
  • Information
      • What is “information”?
  • Retrieval
      • What do we mean by “retrieval”?
      • What are the different types of information needs?
  • Systems
      • How do computer systems fit into the human information-seeking process?

What is Information?
  • What do you think?
  • There is no “correct” definition
  • Cookie Monster’s definition: “news or facts about something”
  • Different approaches:
      • Philosophy
      • Psychology
      • Linguistics
      • Electrical engineering
      • Physics
      • Computer science
      • Information science

Dictionary says…
  • Oxford English Dictionary
      • information: informing, telling; thing told, knowledge, items of knowledge, news
      • knowledge: knowing familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known
  • Random House Dictionary
      • information: knowledge communicated or received concerning a particular fact or circumstance; news

Intuitive Notions
  • Information must
      • Be something, although the exact nature (substance, energy, or abstract concept) is not clear
      • Be “new”: repetition of previously received messages is not informative
      • Be “true”: false or counterfactual information is “mis-information”
      • Be “about” something
  Robert M. Losee. (1997) A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), 254-269.

Three Views of Information
  • Information as process
  • Information as communication
  • Information as message transmission and reception

One View
  • Information = characteristics of the output of a process
      • Tells us something about the process and the input
  • Information-generating processes do not occur in isolation
  [Diagram: Input → Process → Output; chained processes: Input → Process1 → Output/Input → Process2 → Output → …]

Where’s the human?
  • If a tree falls in the forest, and no one is around to hear it, is information transmitted?
  • In the “information as process” view: yes, but that’s not very interesting to us
  • We’re concerned about information for human consumption
      • Transmission of information from one person to another
      • Recording of information
      • Reconstruction of stored information

Another View
  • Information science is characterized by “the deliberate (purposeful) structure of the message by the sender in order to affect the image structure of the recipient”
      • This implies that the sender has knowledge of the recipient's structure
  • Text = “a collection of signs purposefully structured by a sender with the intention of changing image-structure of a recipient”
  • Information = “the structure of any text which is capable of changing the image-structure of a recipient”

Transfer of Information
  • Communication = transmission of information
  [Diagram: Thoughts → Thoughts (Telepathy?); Words → Words (Writing); Sounds → Sounds (Speech); Encoding on the sender side, Decoding on the recipient side]

Information Theory
  • Better called “communication theory”
  • Developed by Claude Shannon in the 1940s
  • Concerned with the transmission of electrical signals over wires
      • How do we send information quickly and reliably?
  • Underlies modern electronic communication:
      • Voice and data traffic…
      • Over copper, fiber optic, wireless, etc.
  • Famous result: Channel Capacity Theorem
  • Formal measure of information in terms of entropy
      • Information = “reduction in surprise”

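The idea of information as “reduction in surprise” can be made concrete with a small sketch. The function names `surprisal` and `entropy` are chosen here for illustration and are not from the slides; the code only assumes the standard definitions, where an event of probability p carries -log2(p) bits and entropy is the average of that quantity.

```python
import math
from collections import Counter

def surprisal(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

def entropy(symbols):
    """Shannon entropy, in bits per symbol, of an observed sequence."""
    counts = Counter(symbols)
    total = len(symbols)
    return sum((c / total) * surprisal(c / total) for c in counts.values())

print(entropy("aaaa"))    # a single repeated symbol: nothing new is learned
print(entropy("abab"))    # two equally likely symbols
print(surprisal(1 / 8))   # a rarer event carries more bits
```

A sequence that repeats one symbol has zero entropy, which matches the intuitive notion that repeating a previously received message is not informative.
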
The Noisy Channel Model
  • Communication = producing the same message at the destination that was sent at the source
      • The message must be encoded for transmission across a medium (called the channel)
      • But the channel is noisy and can distort the message
  • Semantics (meaning) is irrelevant
  [Diagram: Source → message → Transmitter → channel (with noise) → Receiver → message → Destination]

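A minimal simulation of the model above, assuming a binary symmetric channel (each bit is flipped independently with a fixed probability). The helper names `encode`, `channel`, and `decode` are hypothetical and only mirror the transmitter, channel, and receiver boxes of the diagram.

```python
import random

def encode(message):
    """Transmitter: turn text into a list of bits (8 per UTF-8 byte)."""
    return [(byte >> i) & 1
            for byte in message.encode("utf-8")
            for i in range(7, -1, -1)]

def channel(bits, flip_probability, rng):
    """Noisy channel: flip each bit independently with the given probability."""
    return [bit ^ 1 if rng.random() < flip_probability else bit for bit in bits]

def decode(bits):
    """Receiver: pack bits back into bytes and then into text."""
    data = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        data.append(byte)
    return data.decode("utf-8", errors="replace")

rng = random.Random(0)  # fixed seed so a run can be repeated
sent = "information retrieval"
print(decode(channel(encode(sent), 0.0, rng)))   # noiseless channel
print(decode(channel(encode(sent), 0.05, rng)))  # noisy channel, may be distorted
```

Note that `channel` never looks at what the bits mean, which is the sense in which semantics is irrelevant to this model.
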
A Synthesis
  • Information retrieval as communication over time and space, across a noisy channel
  [Diagram: Sender (Source) → message → Encoding/Transmitter (indexing/writing) → channel/storage (with noise) → Decoding/Receiver (retrieval/reading) → message → Recipient (Destination)]

“Retrieval?”
  • “Fetch something” that’s been stored
  • Recover a stored state of knowledge
  • Search through stored messages to find some messages relevant to the task at hand
  [Diagram: Sender → message → Encoding (indexing/writing) → storage (with noise) → Decoding (retrieval/reading) → message → Recipient]

What is IR?
  • Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user
  Nicholas J. Belkin. (1980) Anomalous States of Knowledge as a Basis for Information Retrieval. Canadian Journal of Information Science, 5, 133-143.

Types of Information Needs
  • Retrospective
      • “Searching the past”
      • Different queries posed against a static collection
      • Time invariant
  • Prospective
      • “Searching the future”
      • Static query posed against a dynamic collection
      • Time dependent

Retrospective Searches (I)
  • Ad hoc retrieval: find documents “about this”
      • Identify positive accomplishments of the Hubble telescope since it was launched in 1991.
      • Compile a list of mammals that are considered to be endangered, identify their habitat and, if possible, specify what threatens them.
  • Known item search
      • Find Jimmy Lin’s homepage.
      • What’s the ISBN number of “Modern Information Retrieval”?
  • Directed exploration
      • Who makes the best chocolates?
      • What video conferencing systems exist for digital reference desk services?

Retrospective Searches (II)
  • Question answering
      • “Factoid”
          • Who discovered Oxygen?
          • When did Hawaii become a state?
          • Where is Ayer’s Rock located?
          • What team won the World Series in 1992?
      • “List”
          • What countries export oil?
          • Name U.S. cities that have a “Shubert” theater.
      • “Definition”
          • Who is Aaron Copland?
          • What is a quasar?

Prospective “Searches”
  • Filtering
      • Make a binary decision about each incoming document
      • Spam or not spam?
  • Routing
      • Sort incoming documents into different bins
      • Categorize news headlines: World? Nation? Metro? Sports?

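The difference between filtering and routing can be sketched with a toy keyword matcher. Everything here is an illustrative assumption (the term lists, the bin names beyond those on the slide, and the overlap-count scoring); real filters are typically learned classifiers rather than hand-written term sets.

```python
def is_spam(document, spam_terms):
    """Filtering: one binary decision per incoming document."""
    words = set(document.lower().split())
    return bool(words & spam_terms)

def route(document, bins):
    """Routing: send a document to the bin whose terms it shares most."""
    words = set(document.lower().split())
    scores = {name: len(words & terms) for name, terms in bins.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# Hypothetical standing "queries": they stay fixed while documents arrive.
SPAM_TERMS = {"lottery", "winner", "prize"}
BINS = {
    "World": {"summit", "treaty", "embassy"},
    "Sports": {"cricket", "match", "goal"},
}

stream = [
    "You are a lottery winner",
    "Cricket match ends in a draw",
    "Leaders sign treaty at summit",
]
for doc in stream:
    if is_spam(doc, SPAM_TERMS):
        print("spam:", doc)
    else:
        print(route(doc, BINS), "<-", doc)
```

The standing term sets play the role of the static query, and the `stream` list stands in for the dynamic collection.
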
What types of information?
  • Text (documents and portions thereof)
  • XML and structured documents
  • Images
  • Audio (sound effects, songs, etc.)
  • Video
  • Source code
  • Applications/Web services

Content-Based Search
  • This is a relatively new concept!
  • What else would you search on?
  • What’s more effective?
  • Why is this hard in many applications?

Interesting Examples
  • Google image search: http://images.google.com/
  • Google video search: http://video.google.com/
  • Query by humming: http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming.html

What about databases?
  • What are examples of databases?
      • Banks storing account information
      • Retailers storing inventories
      • Universities storing student grades
  • What exactly is a (relational) database?
      • Think of them as a collection of tables
      • They model some aspect of “the world”

A (Simple) Database Example
  • Student Table
  • Department Table
  • Course Table
  • Enrollment Table

Database Queries
  • What would you want to know from a database?
      • What classes is John Arrow enrolled in?
      • Who has the highest grade in LBSC 690?
      • Who’s in the history department?
      • Of all the non-CLIS students taking LBSC 690 who have a last name shorter than six characters and were born on a Monday, who has the longest email address?

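The first three questions above can be written as exact, formally defined queries. The sketch below uses Python's built-in `sqlite3` module with an invented two-table schema and made-up rows; the slide's actual table contents are not available, so every column name and value here (apart from the names "John Arrow" and "LBSC 690") is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, first TEXT, last TEXT, dept TEXT);
    CREATE TABLE enrollment (student_id INTEGER, course TEXT, grade REAL);
    INSERT INTO student VALUES (1, 'John', 'Arrow', 'HIST'), (2, 'Jane', 'Doe', 'CLIS');
    INSERT INTO enrollment VALUES (1, 'LBSC 690', 88.0), (1, 'HIST 100', 91.0),
                                  (2, 'LBSC 690', 95.0);
""")

# What classes is John Arrow enrolled in?
classes = [row[0] for row in conn.execute(
    "SELECT e.course FROM enrollment e JOIN student s ON s.id = e.student_id "
    "WHERE s.first = ? AND s.last = ? ORDER BY e.course", ("John", "Arrow"))]

# Who has the highest grade in LBSC 690?
top = conn.execute(
    "SELECT s.first, s.last FROM enrollment e JOIN student s ON s.id = e.student_id "
    "WHERE e.course = ? ORDER BY e.grade DESC LIMIT 1", ("LBSC 690",)).fetchone()

# Who's in the history department?
history = conn.execute(
    "SELECT first, last FROM student WHERE dept = ?", ("HIST",)).fetchall()

print(classes, top, history)
```

Each query has exactly one correct answer for a given set of tables, which is the contrast with information retrieval that the next slide draws.
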
Databases vs. IR
  • What we’re retrieving
      • IR: Mostly unstructured. Free text with some metadata.
      • Databases: Structured data. Clear semantics based on a formal model.
  • Queries we’re posing
      • IR: Vague, imprecise information needs (often expressed in natural language).
      • Databases: Formally (mathematically) defined queries. Unambiguous.
  • Results we get
      • IR: Sometimes relevant, often not.
      • Databases: Exact. Always correct in a formal sense.
  • Interaction with system
      • IR: Interaction is important.
      • Databases: One-shot queries.
  • Other issues
      • IR: Issues downplayed.
      • Databases: Concurrency, recovery, atomicity are all critical.

The Big Picture
  • The four components of the information retrieval environment:
      • User
      • Process
      • System
      • Collection
  [Slide annotations: “What computer geeks care about!” and “What we care about!”]

The Information Retrieval Cycle
  [Diagram: Source Selection → Resource → Query Formulation → Query → Search → Ranked List → Selection → Documents → Examination → Documents → Delivery; feedback loops: query reformulation, vocabulary learning, relevance feedback; source reselection]

Supporting the Search Process
  [Diagram: the same cycle (Source Selection → Resource → Query Formulation → Query → Search → Ranked List → Selection → Documents → Examination → Documents → Delivery), with the system side added: Acquisition → Collection → Indexing → Index → Search]

Simplification?
  • Is this itself a vast simplification?
  [Diagram: the information retrieval cycle again, with its feedback loops: query reformulation, vocabulary learning, relevance feedback; source reselection]

Tackling the IR Challenge
  • Divide and conquer!
  • Strategy: use encapsulation to limit complexity
  • Approach:
      • Define interfaces (input and output) for each component
      • Define the functions performed by each component
      • Study each component in isolation
      • Repeat the process within components as needed
  • Make sure that this decomposition makes sense
  • Result: a hierarchical decomposition

Where do we make the cut?
  • Study the IR black box in isolation
      • Simple behavior: in goes a query, out come documents
      • Optimize the quality of the documents that come out
  • Study everything else around the black box
      • Put the human back in the loop!
  [Diagram: Query → Search → Ranked List]

The IR Black Box
  [Diagram: Documents and Query go into the black box; Hits come out]

Inside The IR Black Box
  [Diagram: Query → Representation Function → Query Representation; Documents → Representation Function → Document Representation → Index; Query Representation and Index → Comparison Function → Hits]

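The boxes in the diagram above can be sketched as three small functions. This is only one possible instantiation, assuming the simplest choices: the representation function is lowercase whitespace tokenization, the index maps each term to the documents containing it, and the comparison function counts shared terms. The names `represent`, `build_index`, and `compare` are hypothetical.

```python
from collections import defaultdict

def represent(text):
    """Representation function: text -> set of lowercase terms."""
    return set(text.lower().split())

def build_index(documents):
    """Index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in represent(text):
            index[term].add(doc_id)
    return index

def compare(query_terms, index):
    """Comparison function: rank documents by number of shared terms."""
    overlap = defaultdict(int)
    for term in query_terms:
        for doc_id in index.get(term, ()):
            overlap[doc_id] += 1
    # Highest overlap first; ties broken by document id for stable output.
    return sorted(overlap, key=lambda d: (-overlap[d], d))

documents = {
    1: "web search engines rank pages",
    2: "databases answer exact queries",
    3: "search engines index web pages",
}
index = build_index(documents)
hits = compare(represent("web search"), index)
print(hits)
```

The same representation function is applied to both the query and the documents, so that the comparison happens between like things.
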
The Central Problem in IR
  • Do these represent the same concepts?
  [Diagram: Information Seeker → Concepts → Query Terms; Authors → Concepts → Document Terms]

Learning by Imagination

Imagine a System
  • We have thousands of web pages; what makes these web pages different?
  • Maybe the different key terms or keywords occurring in them (e.g., sports, education, video sharing)

Realize Query Needs
  • What do we expect when we query?
      • A query can be a single word (no order), a collection of words, i.e., a free sentence (order does not matter), or a strict phrase (order matters, e.g., "I love Pakistan")
  • How do we manage the data of web pages?
      • A bag-of-words data structure with or without the positions of words/terms (simply, a posting list of words/terms)
  • What’s the best match?
      • We have many matching results, but in what order should they appear?

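The query types and the posting list with positions described above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and lowercase matching; the function names and the three sample pages are invented for the example.

```python
from collections import defaultdict

def build_positional_index(pages):
    """Map term -> {page_id: [positions]} (a posting list with positions)."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for position, term in enumerate(text.lower().split()):
            index[term][page_id].append(position)
    return index

def free_text_query(index, query):
    """Pages containing every query term, in any order."""
    terms = query.lower().split()
    postings = [set(index[t]) if t in index else set() for t in terms]
    return set.intersection(*postings) if postings else set()

def phrase_query(index, query):
    """Pages containing the query terms consecutively, in order."""
    terms = query.lower().split()
    matches = set()
    for page_id in free_text_query(index, query):
        starts = index[terms[0]][page_id]
        if any(all(start + i in index[t][page_id] for i, t in enumerate(terms))
               for start in starts):
            matches.add(page_id)
    return matches

pages = {
    1: "I love Pakistan",
    2: "Pakistan I love cricket",
    3: "love for all",
}
index = build_positional_index(pages)
print(free_text_query(index, "love Pakistan"))  # order does not matter
print(phrase_query(index, "I love Pakistan"))   # order matters
```

Without the stored positions, only the free-text query could be answered; the phrase query needs to know where each term occurs.
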
Order of Matching Results
  • How could we rank web pages?
      • Via the query's content matching score against web pages, i.e., content-based methods
      • Via the importance of web pages, i.e., link-based methods

What does Content Tell?
  • Content information:
      • Rare terms give more information than frequent terms, as common terms do not differentiate well between the content of documents (information entropy)
  • So what do common words amount to?
      • Stop words (the extreme case, e.g., it, a, the) or words of lesser importance (e.g., the word "science" inside scientific documents)

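The claim that rare terms carry more information can be illustrated with inverse document frequency (IDF) and a TF-IDF cosine score. The sketch assumes one common IDF variant, log(N / df), and raw term counts; other weightings exist, and the three tiny documents are made up for the example.

```python
import math
from collections import Counter

def idf(term, documents):
    """Inverse document frequency: log(N / number of docs containing term)."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0

def tfidf_vector(tokens, documents):
    """Weight each term by (count in this text) * idf over the collection."""
    counts = Counter(tokens)
    return {term: tf * idf(term, documents) for term, tf in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

documents = [
    "the science of web search".split(),
    "the science of cooking".split(),
    "the history of the web".split(),
]
query_vector = tfidf_vector("web search".split(), documents)
scores = [cosine(query_vector, tfidf_vector(doc, documents)) for doc in documents]
ranking = sorted(range(len(documents)), key=lambda i: -scores[i])

print(idf("the", documents), idf("search", documents))
print(ranking)
```

The word "the" occurs in every document, so its weight is zero and it behaves like a stop word, while "search" occurs in only one document and dominates the score.
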
Ranking Methods
  • Content-based methods:
      • Examples: TF-IDF with cosine similarity, BM25, etc.
  • Link-based methods:
      • Examples: PageRank, HITS, etc.

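As an illustration of a link-based method, here is a minimal power-iteration sketch of PageRank. It assumes a damping factor of 0.85, a fixed number of iterations, and that every link target is itself a key of the graph; the four-page graph is invented for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict: page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the random jump).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            targets = outlinks or pages  # dangling page: spread rank evenly
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: C is linked to by A, B, and D.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page D has no incoming links and ends up with only the baseline share, while C collects rank from three pages; no page content is consulted at all.
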
What is More in Ranking?
  • What other measures can we take to rank better?
      • Combine content-based methods with link-based methods
      • Learn to rank from user click-through data (apply machine learning)
      • Learn from the social web (apply social science theories)

Lots of Web Pages
  • How about scalability?
  • We have too many words; can we limit them?
      • Example: Is "studying" conceptually different from "study" or "studies"? Maybe not (a technique called stemming could reduce all of them to the single concept "study")
  • Stemming may not be sufficient; how about clustering web pages into topics? (e.g., the terms study, science, arts, university, school, and college would form a single concept, a topic that may be called "education")

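The stemming example above can be sketched with a toy suffix stripper. This is deliberately naive and is not the Porter stemmer or any standard algorithm; the suffix rules are assumptions chosen only to cover the three word forms on the slide, and they will mangle many other English words.

```python
def naive_stem(word):
    """Strip a few common English suffixes; a toy, not a real stemmer."""
    word = word.lower()
    rules = (("ies", "y"), ("ying", "y"), ("ing", ""), ("es", ""), ("s", ""))
    for suffix, replacement in rules:
        # Require at least three characters to remain after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

vocabulary = ["Studying", "study", "studies"]
stems = {naive_stem(word) for word in vocabulary}
print(stems)
```

Three vocabulary entries collapse into one, which is how stemming shrinks the index.
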
Is it sufficient?
  • Can we feel confident about how a web search engine works?
      • No, this was just a summary for the day

Guess what you will see next?

Our search engine
  • Yes, we are making it

Appendix

Outline
  • What is Research?
  • How to prepare yourself for IR research?

What is Research?

What is Research?
  • Research
      • Discover new knowledge
      • Seek answers to questions
  • Basic research
      • Goal: Expand man’s knowledge (e.g., which genes control the social behavior of honey bees?)
      • Often driven by curiosity (but not always)
      • High impact examples: relativity theory, DNA, …
  • Applied research
      • Goal: Improve the human condition (i.e., improve the world) (e.g., how to cure cancers?)
      • Driven by practical needs
      • High impact examples: computers, transistors, vaccinations, …
  • The boundary is vague; the distinction isn’t important

Why Research?
  [Diagram labels: Funding; Curiosity; Utility of Applications; Advancement of Technology; Amount of knowledge; Application Development; Applied Research; Basic Research]

Where’s IR Research?
  [Diagram labels: Information Science; Computer Science; Funding; Quality of Life; Utility of Applications; Advancement of Technology; Amount of knowledge; Application Development; Applied Research; Basic Research]

Research Process
  • Identification of the topic (e.g., Web search)
  • Hypothesis formulation (e.g., algorithm X is better than Y = state-of-the-art)
  • Experiment design (measures, data, etc.) (e.g., retrieval accuracy on a sample of web data)
  • Test hypothesis (e.g., compare X and Y on the data)
  • Draw conclusions and repeat the cycle of hypothesis formulation and testing if necessary (e.g., Y is better only for some queries, now what?)

Typical IR Research Process
  • Look for a high-impact topic (basic or applied)
  • New problem: define/frame the problem
  • Identify weaknesses of existing solutions, if any
  • Propose new methods
  • Choose data sets (often a main challenge)
  • Design evaluation measures (can be very difficult)
  • Run many experiments (need to have clear research hypotheses)
  • Analyze results and repeat the steps above if necessary
  • Publish research results

Research Methods
  • Exploratory research: identify and frame a new problem (e.g., “a survey/outlook of personalized search”)
  • Constructive research: construct a (new) solution to a problem (e.g., “a new method for expert finding”)
  • Empirical research: evaluate and compare existing solutions (e.g., “a comparative evaluation of link analysis methods for web search”)
  • The “E-C-E cycle”: exploratory → constructive → empirical → exploratory → …

Types of Research Questions and Results
  • Exploratory (Framework): What’s out there?
  • Descriptive (Principles): What does it look like? How does it work?
  • Evaluative (Empirical results): How well does a method solve a problem?
  • Explanatory (Causes): Why does something happen the way it happens?
  • Predictive (Models): What would happen if xxx?

Solid and High Impact Research
  • Solid work:
      • A clear hypothesis (research question) with a conclusive result (either positive or negative)
      • Clearly adds to our knowledge base (what can we learn from this work?)
      • Implications: a solid, focused contribution is often better than a non-conclusive broad exploration
  • High impact = high-importance-of-problem * high-quality-of-solution
      • high impact = open up an important problem
      • high impact = close a problem with the best solution
      • high impact = major milestones in between
      • Implications: question the importance of the problem, and don’t just be satisfied with a good solution; make it the best

How to Prepare Yourself for IR Research?

What Does it Take to Do Research?
  • Curiosity: allows you to ask questions
  • Critical thinking: allows you to challenge assumptions
  • Learning: takes you to the frontier of knowledge
  • Persistence: so that you don’t give up
  • Respect for data and truth: ensures your research is solid
  • Communication: allows you to publish your work
  • …

Learning about IR (1/2)
  • Start with an IR textbook (e.g., Manning et al., Grossman & Frieder, a forthcoming book from UMass, …)
  • Then read “Readings in IR” by Karen Sparck Jones and Peter Willett
  • And read the papers recommended in the following article: http://www.sigir.org/forum/2005D/2005d_sigirforum_moffat.pdf
  • Read other papers published in recent IR/IR-related conferences

Learning about IR (2/2)
  • Getting more focused
      • Choose your favorite sub-area (e.g., retrieval models)
      • Extend your knowledge about related topics (e.g., machine learning, statistical modeling, optimization)
  • Stay at the frontier:
      • Keep monitoring the literature in both IR and related areas
  • Broaden your view:
      • Keep an eye on industry activities
          • Read about industry trends
          • Try out novel prototype systems
      • Funding trends
          • Read requests for proposals

Critical Thinking
  • Develop a habit of asking questions, especially why questions
  • Always try to make sense of what you have read/heard; don’t let any question pass by
  • Get used to challenging everything
  • Practical advice
      • Question every claim made in a paper or a talk (can you argue the other way?)
      • Try to write two opposite reviews of a paper (one mainly to argue for accepting the paper and the other for rejecting it)
      • Force yourself to challenge one point in every talk that you attend and raise a question

Respect Data and Truth
  • Be honest with the experiment results
      • Don’t throw away negative results!
      • Try to learn from negative results
  • Don’t twist data to fit your hypothesis; instead, let the hypothesis choose data
  • Be objective in data analysis and interpretation; don’t mislead readers
  • Aim at understanding/explanation instead of just good results
  • Be careful not to over-generalize (for both good and bad results); you may be far from the truth

Communications
  • General communication skills:
      • Oral and written
      • Formal and informal
      • Talk to people with different levels of background
  • Be clear, concise, accurate, and adaptive (elaborate with examples, summarize by abstraction)
  • English proficiency
  • Get used to talking to people from different fields

Persistence
  • Work only on topics that you are passionate about
  • Work only on hypotheses that you believe in
  • Don’t draw negative conclusions prematurely and give up easily
      • Positive results may be hidden in negative results
      • In many cases, negative results don’t completely reject a hypothesis
  • Be comfortable with criticisms of your work (learn from negative reviews of a rejected paper)
  • Think of possibilities for repositioning a work

Optimize Your Training
  • Know your strengths and weaknesses
      • Strong in math vs. strong in system development
      • Creative vs. thorough
      • …
  • Train yourself to fix weaknesses
  • Find strategic partners
  • Position yourself to take advantage of your strengths

Thank You
  • Reach me on Twitter: @matifq
  • Email me: maqureshi@iba.edu.pk

Information Retrieval

  • 1.
    INFORMATION RETRIEVALA Lookinto the Science of Web Search Engines1Reach me on Twitter: @matifq Email maqureshi@iba.edu.pkMuhammad AtifQureshi
  • 2.
    ContentsStory Mode LearningLearningby ImaginationAppendix2Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 3.
    Story Mode Learning(Borrowedfrom Prof. Jimmy Lin,University of Maryland, Scientist in Twitter)3Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 4.
    Information Retrieval SystemsInformationWhatis “information”?RetrievalWhat do we mean by “retrieval”?What are different types information needs?SystemsHow do computer systems fit into the human information seeking process?4Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 5.
    What is Information?Whatdo you think?There is no “correct” definitionCookie Monster’s definition: “news or facts about something”Different approaches:PhilosophyPsychologyLinguisticsElectrical engineeringPhysicsComputer scienceInformation science5Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 6.
    Dictionary says…Oxford EnglishDictionaryinformation: informing, telling; thing told, knowledge, items of knowledge, newsknowledge: knowing familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is knownRandom House Dictionaryinformation: knowledge communicated or received concerning a particular fact or circumstance; news6Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 7.
    Intuitive NotionsInformation mustBesomething, although the exact nature (substance, energy, or abstract concept) is not clear;Be “new”: repetition of previously received messages is not informativeBe “true”: false or counterfactual information is “mis-information”Be “about” somethingRobert M. Losee. (1997) A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), 254-269.7Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 8.
    Three Views ofInformationInformation as processInformation as communicationInformation as message transmission and reception8Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 9.
    One ViewInformation =characteristics of the output of a processTells us something about the process and the inputInformation-generating process do not occur in isolationInputOutputProcessInputOutputInputOutputProcess1Process2InputOutput…9Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 10.
    Where’s the human?Ifa tree falls in the forest, and no one is around to hear it, is information transmitted?In the “information as process”: Yes, but that’s not very interesting to usWe’re concerned about information for human consumptionTransmission of information from one person to anotherRecording of informationReconstruction of stored information10Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 11.
    Another ViewInformation scienceis characterized by “the deliberate (purposeful) structure of the message by the sender in order to affect the image structure of the recipient”This implies that the sender has knowledge of the recipient's structureText = “a collection of signs purposefully structured by a sender with the intention of changing image-structure of a recipient”Information = “the structure of any text which is capable of changing the image-structure of a recipient”11Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 12.
    Transfer of InformationCommunication= transmission of informationThoughtsThoughtsTelepathy?WordsWordsWritingSoundsSoundsSpeechEncodingDecoding12Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 13.
    Information TheoryBetter called“communication theory”Developed by Claude Shannon in 1940’sConcerned with the transmission of electrical signals over wiresHow do we send information quickly and reliably?Underlies modern electronic communication:Voice and data traffic…Over copper, fiber optic, wireless, etc.Famous result: Channel Capacity TheoremFormal measure of information in terms of entropyInformation = “reduction in surprise”13Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 14.
    The Noisy ChannelModelCommunication = producing the same message at the destination that was sent at the sourceThe message must be encoded for transmission across a medium (called channel)But the channel is noisy and can distort the messageSemantics (meaning) is irrelevantchannelReceivermessageTransmitternoiseSourceDestinationmessage14Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 15.
    A SynthesisInformation retrievalas communication over time and space, across a noisy channelSenderRecipientEncodingDecodingTransmitterReceiverchannelstoragemessagemessageindexing/writingretrieval/readingnoiseSourceDestinationmessagemessagenoise15Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 16.
    “Retrieval?”“Fetch something” that’sbeen storedRecover a stored state of knowledgeSearch through stored messages to find some messages relevant to the task at handEncodingDecodingstorageSenderRecipientmessagemessageindexing/writingRetrieval/readingnoise16Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 17.
    What is IR?Informationretrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human userAnomalous States of Knowledge as a Basis for Information Retrieval. (1980) Nicholas J. Belkin. Canadian Journal of Information Science, 5, 133-143.17Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 18.
    Types of InformationNeedsRetrospective“Searching the past”Different queries posed against a static collectionTime invariantProspective“Searching the future”Static query posed against a dynamic collectionTime dependent18Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 19.
    Retrospective Searches (I)Adhoc retrieval: find documents “about this”Known item searchDirected explorationIdentify positive accomplishments of the Hubble telescope since it was launched in 1991.Compile a list of mammals that are considered to be endangered, identify their habitat and, if possible, specify what threatens them.Find Jimmy Lin’s homepage.What’s the ISBN number of “Modern Information Retrieval”?Who makes the best chocolates?What video conferencing systems exist for digital reference desk services?19Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 20.
    Retrospective Searches (II)QuestionansweringWho discovered Oxygen?When did Hawaii become a state?Where is Ayer’s Rock located?What team won the World Series in 1992?“Factoid”What countries export oil?Name U.S. cities that have a “Shubert” theater.“List”Who is Aaron Copland?What is a quasar?“Definition”20Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 21.
    Prospective “Searches”FilteringMake abinary decision about each incoming documentRoutingSort incoming documents into different bins?Spam or not spam?Categorize news headlines: World? Nation? Metro? Sports?21Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 22.
    What types ofinformation?Text (Documents and portions thereof)XML and structured documentsImagesAudio (sound effects, songs, etc.) VideoSource codeApplications/Web services22Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 23.
    Content-Based SearchThis isa relative new concept!What else would you search on?What’s more effective?Why is this hard in many applications?23Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 24.
    Interesting ExamplesGoogle imagesearchGoogle video searchQuery by humminghttp://images.google.com/http://video.google.com/http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming.html24Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 25.
    What about databases?Whatare examples of databases?Banks storing account informationRetailers storing inventoriesUniversities storing student gradesWhat exactly is a (relational) database?Think of them as a collection of tablesThey model some aspect of “the world”25Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 26.
    A (Simple) DatabaseExampleStudent TableDepartment TableCourse TableEnrollment Table26Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 27.
    Database QueriesWhat wouldyou want to know from a database?What classes is John Arrow enrolled in?Who has the highest grade in LBSC 690?Who’s in the history department?Of all the non-CLIS students taking LBSC 690 with a last name shorter than six characters and were born on a Monday, who has the longest email address?27Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 28.
    Databases vs. IR28IRDatabasesWhatwe’re retrievingMostly unstructured. Free text with some metadata.Structured data. Clear semantics based on a formal model.Queries we’re posingVague, imprecise information needs (often expressed in natural language).Formally (mathematically) defined queries. Unambiguous.Results we getSometimes relevant, often not.Exact. Always correct in a formal sense.Interaction with systemInteraction is important.One-shot queries.Other issuesIssues downplayed.Concurrency, recovery, atomicity are all critical.Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 29.
    The Big PictureThefour components of the information retrieval environment:UserProcessSystemCollectionWhat computer geeks care about!What we care about!29Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 30.
    The Information RetrievalCycleResourceQueryRanked ListDocumentsquery reformulation,vocabulary learning,relevance feedbackDocumentssource reselectionSourceSelectionQueryFormulationSearchSelectionExaminationDelivery30Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 31.
    Supporting the SearchProcessSourceSelectionResourceQueryFormulationQuerySearchRanked ListSelectionIndexingDocumentsIndexExaminationAcquisitionDocumentsCollectionDelivery31Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 32.
    Simplification?ResourceQueryRanked ListDocumentsquery reformulation,vocabularylearning,relevance feedbackDocumentssource reselectionSourceSelectionIs this itself a vast simplification?QueryFormulationSearchSelectionExaminationDelivery32Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 33.
    Tackling the IR Challenge
      Divide and conquer!
      - Strategy: use encapsulation to limit complexity
      - Approach:
        - Define interfaces (input and output) for each component
        - Define the functions performed by each component
        - Study each component in isolation
        - Repeat the process within components as needed
        - Make sure that this decomposition makes sense
      - Result: a hierarchical decomposition
  • 34.
    Where do we make the cut?
      - Study the IR black box in isolation
        - Simple behavior: in goes a query, out come documents
        - Optimize the quality of the documents that come out
      - Study everything else around the black box
        - Put the human back in the loop!
      [Figure: Query → Search → Ranked List]
  • 35.
    The IR Black Box
      [Figure: Documents and Query go into the black box; Hits come out.]
  • 36.
    Inside The IR Black Box
      [Figure: Query → Representation Function → Query Representation; Documents → Representation Function → Document Representation → Index; Query Representation and Index → Comparison Function → Hits.]
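The components named on this slide can be sketched as three small functions. This is only a toy reading of the diagram: the bag-of-words representation, the overlap-count comparison, and all function names and sample documents are assumptions made for illustration, and the "index" here is simply a table of stored document representations.

```python
from collections import Counter

def represent(text):
    """Representation function: text -> bag of lowercase words."""
    return Counter(text.lower().split())

def compare(query_rep, doc_rep):
    """Comparison function: count of query term occurrences matched in the document."""
    return sum(min(count, doc_rep[term]) for term, count in query_rep.items())

def build_index(documents):
    """Index: document representations stored by document id."""
    return {doc_id: represent(text) for doc_id, text in documents.items()}

def search(query, index):
    """Return hits as (doc_id, score), best first; zero-score documents are dropped."""
    query_rep = represent(query)
    hits = [(doc_id, compare(query_rep, rep)) for doc_id, rep in index.items()]
    hits = [hit for hit in hits if hit[1] > 0]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))

# Hypothetical sample documents.
docs = {
    "d1": "web search engines rank web pages",
    "d2": "databases answer exact queries",
    "d3": "search engines index documents",
}
index = build_index(docs)
print(search("web search", index))
```

Both the query and the documents pass through the same representation function, so the comparison function only ever sees two objects of the same kind.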
  • 37.
    The Central Problem in IR
      [Figure: Information Seeker: Concepts → Query Terms. Authors: Concepts → Document Terms.]
      Do these represent the same concepts?
  • 38.
    Learning by Imagination
  • 39.
    Imagine a System
      We have thousands of web pages. What makes these web pages different from one another?
      - Maybe the different key terms (keywords) that occur in them (e.g., sports, education, video sharing)
  • 40.
    Realize Query Needs
      - What do we expect when we query?
        - A query can be a single word (no order), a collection of words, i.e., a free-text sentence (order does not matter), or a strict phrase (order matters, e.g., "I love Pakistan")
      - How do we manage the data of web pages?
        - A bag-of-words data structure, with or without the positions of words/terms (simply, a posting list of words/terms)
      - What’s the best match?
        - We have many matching results, but in what order should they appear?
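A minimal sketch of a posting list that keeps word positions, so it can answer both free-text queries (order does not matter) and strict phrase queries (order matters). The function names and the three sample pages are assumptions made for illustration; tokenization is a plain whitespace split.

```python
from collections import defaultdict

def build_positional_index(pages):
    """Map each term to {page_id: [positions]} (a positional posting list)."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(page_id, []).append(position)
    return index

def match_all_terms(index, query):
    """Free-text query: pages containing every query term, in any order."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings)

def match_phrase(index, phrase):
    """Phrase query: pages where the terms occur consecutively, in order."""
    terms = phrase.lower().split()
    hits = set()
    for page_id in match_all_terms(index, phrase):
        starts = index[terms[0]][page_id]
        if any(all(start + offset in index[term][page_id]
                   for offset, term in enumerate(terms))
               for start in starts):
            hits.add(page_id)
    return hits

# Hypothetical sample pages.
pages = {
    "p1": "I love Pakistan",
    "p2": "Pakistan I love",
    "p3": "I love cricket",
}
index = build_positional_index(pages)
print(sorted(match_all_terms(index, "love Pakistan")))
print(sorted(match_phrase(index, "I love Pakistan")))
```

Without the stored positions, pages p1 and p2 would be indistinguishable for the phrase query, which is why the positional variant is needed when order matters.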
  • 41.
    Order of Matching Results
      How could we rank web pages?
      - Via the matching score of the query content against web pages, i.e., content-based methods
      - Via the importance of web pages, i.e., link-based methods
  • 42.
    What does Content Tell?
      Content information:
      - Rare terms give more information than frequent terms, because common terms do not differentiate well between the content of documents (information entropy)
      - So what do common words become? Stop words (the extreme case, e.g., it, a, the) or words of lesser importance (e.g., the word "science" inside scientific documents)
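One common way to turn this intuition into a number is inverse document frequency, idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. The sketch below assumes that plain, unsmoothed form; the function name and the three sample documents are made up for illustration.

```python
import math

def inverse_document_frequency(documents):
    """Return idf(term) = log(N / df(term)) for every term in the collection."""
    n_docs = len(documents)
    doc_freq = {}
    for text in documents:
        # Use a set so a term is counted once per document.
        for term in set(text.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Hypothetical sample documents.
documents = [
    "the science of web search",
    "the science of physics",
    "the history of cricket",
]
idf = inverse_document_frequency(documents)
for term in ("the", "science", "cricket"):
    print(term, round(idf[term], 3))
```

A word that appears in every document ("the") gets weight zero, a word that is common within the collection ("science") gets a small weight, and a rare word ("cricket") gets the largest weight.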
  • 43.
    Ranking Methods
      - Content-based methods:
        - Examples: tf-idf with cosine similarity, BM25, etc.
      - Link-based methods:
        - Examples: PageRank, HITS, etc.
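As an illustration of the link-based family, here is a simplified power-iteration sketch of PageRank. The damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the tiny link graph are all assumptions chosen for the example, not details given in the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page "c" receives links from three pages and ends up most important, while "d", which nobody links to, keeps only the baseline share.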
  • 44.
    What is More in Ranking?
      What other measures can we take to rank better?
      - Combining content-based methods with link-based methods
      - How about learning to rank from user click-through data (apply machine learning)?
      - How about learning from the social web (apply social science theories)?
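One very simple way to combine the two families is a weighted sum of normalized scores. This is only a sketch under stated assumptions: the max-normalization, the weight of 0.7, and the sample scores are invented for illustration, and real systems typically learn such a combination rather than fixing it by hand.

```python
def normalize(scores):
    """Scale scores into [0, 1] by dividing by the maximum (if positive)."""
    top = max(scores.values(), default=0.0)
    return {page: (value / top if top > 0 else 0.0) for page, value in scores.items()}

def combine(content_scores, link_scores, content_weight=0.7):
    """Rank pages by a weighted sum of normalized content and link scores."""
    content = normalize(content_scores)
    link = normalize(link_scores)
    pages = set(content) | set(link)
    combined = {
        page: content_weight * content.get(page, 0.0)
        + (1.0 - content_weight) * link.get(page, 0.0)
        for page in pages
    }
    return sorted(combined, key=lambda page: (-combined[page], page))

# Hypothetical scores for three pages.
content_scores = {"p1": 2.0, "p2": 1.8, "p3": 0.5}
link_scores = {"p1": 0.1, "p2": 0.5, "p3": 0.4}
print(combine(content_scores, link_scores))
print(combine(content_scores, link_scores, content_weight=1.0))
```

With content alone, p1 ranks first; once link importance is mixed in, p2 overtakes it because it matches the query almost as well and is far better linked.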
  • 45.
    Lots of Web Pages
      How about scalability?
      - We have too many words; can we limit them?
      - Example: is "studying" conceptually different from "study" or "studies"? Maybe not (a technique called stemming could reduce them all to the single concept "study")
      - Stemming may not be sufficient; then how about clustering web pages into topics (e.g., the terms study, science, arts, university, school, college would form a single concept, a topic that may be called "education")?
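A toy suffix-stripping sketch of the stemming idea. The three rules below are invented for this example and are far cruder than a real stemmer such as Porter's; they are only enough to show how several surface forms collapse into one index term.

```python
def toy_stem(word):
    """Strip a few common English suffixes (illustrative rules only)."""
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"      # studies -> study
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]            # studying -> study
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]            # pages -> page
    return word

words = ["Studying", "study", "studies", "pages", "class"]
print([toy_stem(w) for w in words])

# The vocabulary shrinks once variants share a stem.
vocabulary = {toy_stem(w) for w in ["studying", "study", "studies"]}
print(vocabulary)
```

Three distinct words end up as a single vocabulary entry, which is the reduction in index size the slide is pointing at.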
  • 46.
    Is it sufficient?
      Can we feel confident about how a Web search engine works?
      - No, it was just a summary for the day
  • 47.
    Guess what you would see next?
  • 48.
    Our search engine
      Yes, we are making it
  • 49.
    Appendix
  • 50.
    Outline
      - What is Research?
      - How to prepare yourself for IR research?
  • 51.
    What is Research?
  • 52.
    What is Research?
      - Research
        - Discover new knowledge
        - Seek answers to questions
      - Basic research
        - Goal: Expand man’s knowledge (e.g., which genes control social behavior of honey bees?)
        - Often driven by curiosity (but not always)
        - High impact examples: relativity theory, DNA, …
      - Applied research
        - Goal: Improve human condition (i.e., improve the world) (e.g., how to cure cancers?)
        - Driven by practical needs
        - High impact examples: computers, transistors, vaccinations, …
      - The boundary is vague; distinction isn’t important
  • 53.
    Why Research?
      [Figure: a diagram relating Basic Research, Applied Research, and Application Development to Curiosity, Funding, Utility of Applications, Advancement of Technology, and Amount of knowledge.]
  • 54.
    Where’s IR Research?
      [Figure: a diagram placing Information Science and Computer Science among Basic Research, Applied Research, and Application Development, with the labels Funding, Quality of Life, Utility of Applications, Advancement of Technology, and Amount of knowledge.]
  • 55.
    Research Process
      - Identification of the topic (e.g., Web search)
      - Hypothesis formulation (e.g., algorithm X is better than Y = state of the art)
      - Experiment design (measures, data, etc.) (e.g., retrieval accuracy on a sample of web data)
      - Test hypothesis (e.g., compare X and Y on the data)
      - Draw conclusions and repeat the cycle of hypothesis formulation and testing if necessary (e.g., Y is better only for some queries, now what?)
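The "compare X and Y on the data" step can be sketched as a per-query comparison. The choice of precision at k as the accuracy measure, the cutoff k = 3, and all rankings and relevance judgments below are hypothetical, made up only to show the shape of such an experiment.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def compare_systems(runs_x, runs_y, judgments, k=3):
    """Return per-query (P@k of X, P@k of Y) and the two means over queries."""
    per_query = {}
    for query, relevant in judgments.items():
        per_query[query] = (
            precision_at_k(runs_x[query], relevant, k),
            precision_at_k(runs_y[query], relevant, k),
        )
    mean_x = sum(px for px, _ in per_query.values()) / len(per_query)
    mean_y = sum(py for _, py in per_query.values()) / len(per_query)
    return per_query, mean_x, mean_y

# Hypothetical relevance judgments and system outputs.
judgments = {"q1": {"d1", "d2"}, "q2": {"d5"}}
runs_x = {"q1": ["d1", "d2", "d9"], "q2": ["d7", "d8", "d6"]}
runs_y = {"q1": ["d9", "d1", "d8"], "q2": ["d5", "d7", "d8"]}

per_query, mean_x, mean_y = compare_systems(runs_x, runs_y, judgments)
for query, (px, py) in sorted(per_query.items()):
    print(query, round(px, 3), round(py, 3))
print("mean", round(mean_x, 3), round(mean_y, 3))
```

In this made-up data X wins on q1 and Y wins on q2, so the averages tie; looking at per-query results rather than only the mean is what surfaces the "better only for some queries" situation.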
  • 56.
    Typical IR Research Process
      - Look for a high-impact topic (basic or applied)
        - New problem: define/frame the problem
        - Identify weaknesses of existing solutions, if any
      - Propose new methods
      - Choose data sets (often a main challenge)
      - Design evaluation measures (can be very difficult)
      - Run many experiments (need to have clear research hypotheses)
      - Analyze results and repeat the steps above if necessary
      - Publish research results
  • 57.
    Research Methods
      - Exploratory research: Identify and frame a new problem (e.g., “a survey/outlook of personalized search”)
      - Constructive research: Construct a (new) solution to a problem (e.g., “a new method for expert finding”)
      - Empirical research: Evaluate and compare existing solutions (e.g., “a comparative evaluation of link analysis methods for web search”)
      - The “E-C-E cycle”: exploratory → constructive → empirical → exploratory → …
  • 58.
    Types of Research Questions and Results
      - Exploratory (Framework): What’s out there?
      - Descriptive (Principles): What does it look like? How does it work?
      - Evaluative (Empirical results): How well does a method solve a problem?
      - Explanatory (Causes): Why does something happen the way it happens?
      - Predictive (Models): What would happen if xxx?
  • 59.
    Solid and High Impact Research
      - Solid work:
        - A clear hypothesis (research question) with a conclusive result (either positive or negative)
        - Clearly adds to our knowledge base (what can we learn from this work?)
        - Implications: a solid, focused contribution is often better than a non-conclusive broad exploration
      - High impact = high-importance-of-problem * high-quality-of-solution
        - high impact = open up an important problem
        - high impact = close a problem with the best solution
        - high impact = major milestones in between
        - Implications: question the importance of the problem, and don’t just be satisfied with a good solution; make it the best
  • 60.
    How to Prepare Yourself for IR Research?
  • 61.
    What it Takes to do Research?
      - Curiosity: allows you to ask questions
      - Critical thinking: allows you to challenge assumptions
      - Learning: takes you to the frontier of knowledge
      - Persistence: so that you don’t give up
      - Respect for data and truth: ensures your research is solid
      - Communication: allows you to publish your work
      - …
  • 62.
    Learning about IR (1/2)
      - Start with an IR textbook (e.g., Manning et al., Grossman & Frieder, a forthcoming book from UMass, …)
      - Then read “Readings in IR” by Karen Sparck Jones and Peter Willett
      - And read the papers recommended in the following article: http://www.sigir.org/forum/2005D/2005d_sigirforum_moffat.pdf
      - Read other papers published in recent IR and IR-related conferences
  • 63.
    Learning about IR (2/2)
      - Getting more focused
        - Choose your favorite sub-area (e.g., retrieval models)
        - Extend your knowledge about related topics (e.g., machine learning, statistical modeling, optimization)
      - Stay at the frontier:
        - Keep monitoring the literature in both IR and related areas
      - Broaden your view:
        - Keep an eye on industry activities
          - Read about industry trends
          - Try out novel prototype systems
        - Funding trends
          - Read requests for proposals
  • 64.
    Critical Thinking
      - Develop a habit of asking questions, especially why questions
      - Always try to make sense of what you have read/heard; don’t let any question pass by
      - Get used to challenging everything
      - Practical advice
        - Question every claim made in a paper or a talk (can you argue the other way?)
        - Try to write two opposite reviews of a paper (one mainly to argue for accepting the paper and the other for rejecting it)
        - Force yourself to challenge one point in every talk that you attend and raise a question
  • 65.
    Respect Data and Truth
      - Be honest with the experiment results
        - Don’t throw away negative results!
        - Try to learn from negative results
      - Don’t twist data to fit your hypothesis; instead, let the hypothesis choose data
      - Be objective in data analysis and interpretation; don’t mislead readers
      - Aim at understanding/explanation instead of just good results
      - Be careful not to over-generalize (for both good and bad results); you may be far from the truth
  • 66.
    Communications
      - General communication skills:
        - Oral and written
        - Formal and informal
        - Talk to people with different levels of background
      - Be clear, concise, accurate, and adaptive (elaborate with examples, summarize by abstraction)
      - English proficiency
      - Get used to talking to people from different fields
  • 67.
    Persistence
      - Work only on topics that you are passionate about
      - Work only on hypotheses that you believe in
      - Don’t draw negative conclusions prematurely and give up easily
        - Positive results may be hidden in negative results
        - In many cases, negative results don’t completely reject a hypothesis
      - Be comfortable with criticisms about your work (learn from negative reviews of a rejected paper)
      - Think of possibilities of repositioning a work
  • 68.
    Optimize Your Training
      - Know your strengths and weaknesses
        - strong in math vs. strong in system development
        - creative vs. thorough
        - …
      - Train yourself to fix weaknesses
      - Find strategic partners
      - Position yourself to take advantage of your strengths
  • 69.
    Thank You
      Reach me on Twitter: @matifq
      Email me: maqureshi@iba.edu.pk