Broad introduction to information retrieval and web search, used for teaching at the Yahoo Bangalore Summer School 2013. The slides are a mash-up of my own and other people's presentations.
This 2-hour lecture was held at Amsterdam University of Applied Sciences (HvA) on October 16th, 2013. It presents a basic overview of core technologies used by ICT companies such as Google, Twitter, or Facebook. The lecture does not require a strong technical background and stays at a conceptual level.
INTRODUCTION TO INFORMATION RETRIEVAL
This lecture will introduce the information retrieval problem and its terminology, and provide a history of IR. In particular, the history of the web and its impact on IR will be discussed. Special attention will be given to the concept of relevance in IR and the critical role it has played in the development of the field. The lecture will end with a conceptual explanation of the IR process, its relationships with other domains, and current research developments.
INFORMATION RETRIEVAL MODELS
This lecture will present the models that have been used to rank documents according to their estimated relevance to user-given queries, where the most relevant documents are shown ahead of those less relevant. These models form the basis for many of the ranking algorithms used in past and present search applications. The lecture will describe IR models such as Boolean retrieval, vector space, probabilistic retrieval, language models, and logical models. Relevance feedback, a technique that implicitly or explicitly modifies user queries in light of the user's interaction with retrieval results, will also be discussed, as it is particularly relevant to web search and personalization.
The vector space model, or term vector model, is an algebraic model for representing text documents as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing, and relevancy ranking. Its first use was in the SMART Information Retrieval System.
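As a toy illustration of the vector space model (not the SMART system itself; the documents and weighting choices below are invented for the example), the following Python sketch builds TF-IDF term vectors and compares documents by cosine similarity:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build a sparse TF-IDF vector (term -> weight) for each tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [["web", "search", "engine"],
        ["web", "information", "retrieval"],
        ["cooking", "recipes"]]
vecs = tf_idf_vectors(docs)
```

Documents sharing weighted terms ("web" above) score higher against each other than documents with no overlap, which is the basis of relevancy ranking in this model.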
Slides for the iDB summer school (Sapporo, Japan) http://db-event.jpn.org/idb2013/
Typically, web mining approaches have focused on enhancing or learning about user seeking behavior, from query-log analysis and click-through usage, to employing the web graph structure for ranking, to detecting spam or web page duplicates. Lately, there is a trend toward mining web content semantics and dynamics in order to enhance search capabilities, by either providing direct answers to users or allowing for advanced interfaces or capabilities. In this tutorial we will look into different ways of mining textual information from web archives, with a particular focus on how to extract and disambiguate entities, and how to put them to use in various search scenarios. Further, we will discuss how web dynamics affect information access and how to exploit them in a search context.
Introduction to Enterprise Search. A two-hour class to introduce Enterprise Search. It covers:
The problems enterprise search can solve
History of (web) search
How do we search and find?
Current state of Enterprise Search + stats
Technical concept
Information quality
Feedback cycle
Five dimensions of Findability
This presentation was provided by Marydee Ojala of Information Today during the NISO event "The Impact of the Interface: Traditional and Non Traditional Content," held on November 20, 2019.
Designing Structure Part II: Information Architecture - Christina Wodtke
Part two of Designing Structure for my General Assembly class on User Experience is about Information Architecture. We cover why classification is important, types of classification, and trends in IA.
Search & Recommendation: Birds of a Feather? - Toine Bogers
In just a little over half a century, the field of information retrieval has experienced spectacular growth and success, with IR applications such as search engines becoming a billion-dollar industry in the past decades. Recommender systems have seen an even more meteoric rise to success with wide-scale application by companies like Amazon, Facebook, and Netflix. But are search and recommendation really two different fields of research that address different problems with different sets of algorithms in papers published at distinct conferences?
In my talk, I want to argue that search and recommendation are more similar than they have been treated in the past decade. By looking more closely at the tasks and problems that search and recommendation try to solve, at the algorithms used to solve these problems and at the way their performance is evaluated, I want to show that there is no clear black and white division between the two. Instead, search and recommendation are part of a much more fluid continuum of methods and techniques for information access.
(Keynote at "Mind The Gap '14" workshop at the iConference 2014 in Berlin, Germany)
This presentation has been given at many SharePoint conferences around the world. It focuses on preparing us for the new Managed Metadata Services in SharePoint 2010, and on putting together good practices to understand our metadata and deliver the most effective strategy.
Information Discovery and Search Strategies for Evidence-Based Research - David Nzoputa Ofili
This event was on May 2, 2017 at Wesley University, Ondo State, Nigeria. I trained the university's staff (academic and non-academic) on "Information Discovery and Search Strategies for Evidence-Based Research" in an information/digital literacy session.
Slides from Enterprise Search & Analytics Meetup @ Cisco Systems - http://www.meetup.com/Enterprise-Search-and-Analytics-Meetup/events/220742081/
Relevancy and Search Quality Analysis - By Mark David and Avi Rappoport
The Manifold Path to Search Quality
To achieve accurate search results, we must come to an understanding of the three pillars involved.
1. Understand your data
2. Understand your customers’ intent
3. Understand your search engine
The first path passes through Data Analysis and Text Processing.
The second passes through Query Processing, Log Analysis, and Result Presentation.
Everything learned from those explorations feeds into the final path of Relevancy Ranking.
Search quality is focused on end users finding what they want -- technical relevance is sometimes irrelevant! Working with the short head (very frequent queries) has the highest return on investment for improving the search experience: tuning the results, for example, to emphasize recent documents or de-emphasize archived documents, detecting near-duplicates, exposing diverse results for ambiguous queries, using synonyms, and guiding search via best bets and auto-suggest. Long-tail analysis can reveal user intent by detecting patterns, discovering related terms, and identifying the most fruitful results via aggregated behavior. All of this feeds back into regression testing, which provides reliable metrics to evaluate the changes.
By merging these insights, you can improve the quality of the search overall, in a scalable and maintainable fashion.
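The short-head idea above can be sketched in a few lines of Python; the query log and the coverage threshold here are invented for illustration:

```python
from collections import Counter

def short_head(query_log, coverage=0.5):
    """Return the most frequent queries that together account for at least
    the given fraction of total query volume (the 'short head')."""
    counts = Counter(query_log)
    total = sum(counts.values())
    head, covered = [], 0
    for query, freq in counts.most_common():
        if covered / total >= coverage:
            break
        head.append(query)
        covered += freq
    return head

log = ["weather", "weather", "weather", "news", "news",
       "obscure error 0x80070005", "rare query"]
```

Tuning effort spent on the `short_head(log)` queries reaches the most users, while the remaining long tail is better mined in aggregate for patterns and related terms.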
Information Retrieval 1: Introduction to IR - Vaibhav Khanna
Information retrieval (IR) is the activity of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing.
This tutorial gives an overview of how search engines and machine learning techniques can be tightly coupled to address the need for building scalable recommender or other prediction-based systems. Typically, most such systems architect retrieval and prediction in two phases. In Phase I, a search engine returns the top-k results based on constraints expressed as a query. In Phase II, the top-k results are re-ranked in another system according to an optimization function that uses a supervised trained model. However, this approach presents several issues, such as the possibility of returning sub-optimal results due to the top-k limit at query time, as well as the presence of inefficiencies in the system due to the decoupling of retrieval and ranking.
To address these issues, the authors created ML-Scoring, an open-source framework that tightly integrates machine learning models into Elasticsearch, a popular search engine. ML-Scoring replaces the default information retrieval ranking function with a custom supervised model, trained through Spark, Weka, or R, that is loaded as a plugin in Elasticsearch. This tutorial will not only review basic methods in information retrieval and machine learning, but will also walk through practical examples, from loading a dataset into Elasticsearch, to training a model in Spark, Weka, or R, to creating the ML-Scoring plugin for Elasticsearch. No prior experience is required in any system listed (Elasticsearch, Spark, Weka, R), though some programming experience is recommended.
Similar to Introduction to Information Retrieval
Entity Linking via Graph-Distance Minimization - Roi Blanco
Entity-linking is a natural-language-processing task that consists in identifying strings of text that refer to a particular item in some reference knowledge base.
One instance of entity-linking can be formalized as an optimization problem on the underlying concept graph, where the quantity to be optimized is the average distance between chosen items.
Inspired by this application, we define a new graph problem which is a natural variant of the Maximum Capacity Representative Set. We prove that our problem is NP-hard for general graphs; nonetheless, it turns out to be solvable in linear time under some more restrictive assumptions. For the general case, we propose several heuristics: one of these tries to enforce the above assumptions while the others try to optimize similar easier objective functions; we show experimentally how these approaches perform with respect to some baselines on a real-world dataset.
Slides used for the keynote at the event Big Data & Data Science http://eventos.citius.usc.es/bigdata/
Some slides are borrowed from random Hadoop/big data presentations.
Influence of Timeline and Named-entity Components on User Engagement - Roi Blanco
Nowadays, successful applications are those containing features that captivate and engage users. Using an interactive news retrieval system as a use case, in this paper we study the effect of timeline and named-entity components on user engagement. This is in contrast with previous studies, where the importance of these components was studied from a retrieval-effectiveness point of view. Our experimental results show significant improvements in user engagement when named-entity and timeline components were installed. Further, we investigate whether we can predict user-centred metrics through users' interaction with the system. Results show that we can successfully learn a model that predicts all dimensions of user engagement and whether users will like the system or not. These findings could guide systems toward a more personalised user experience, tailored to users' preferences.
Beyond document retrieval using semantic annotations - Roi Blanco
Traditional information retrieval approaches deal with retrieving full-text documents in response to a user's query. However, applications that go beyond the "ten blue links" and make use of additional information to display and interact with search results are becoming increasingly popular and are being adopted by all major search engines. In addition, recent advances in text extraction allow for inferring semantic information about particular items present in textual documents. This talk presents how enhancing a document with structures derived from shallow parsing can convey a different user experience in search and browsing scenarios, and what challenges we face as a consequence.
Large knowledge bases consisting of entities and relationships between them have become vital sources of information for many applications. Most of these knowledge bases adopt the Semantic-Web data model RDF as a representation model. Querying these knowledge bases is typically done using structured queries utilizing graph-pattern languages such as SPARQL. However, such structured queries require some expertise from users, which limits the accessibility of such data sources. To overcome this, keyword search must be supported. In this paper, we propose a retrieval model for keyword queries over RDF graphs. Our model retrieves a set of subgraphs that match the query keywords, and ranks them based on statistical language models. We show that our retrieval model outperforms state-of-the-art IR and DB models for keyword search over structured data, using experiments over two real-world datasets.
Extending BM25 with multiple query operators - Roi Blanco
Traditional probabilistic relevance frameworks for information retrieval refrain from taking positional information into account, due to the hurdles of developing a sound model while avoiding an explosion in the number of parameters. Nonetheless, the well-known BM25F extension of the successful Okapi ranking function can be seen as an embryonic attempt in that direction. In this paper, we proceed along the same line, defining the notion of a virtual region: a virtual region is a part of the document that, like a BM25F field, can provide (larger or smaller, depending on a tunable weighting parameter) evidence of the relevance of the document; differently from BM25F fields, though, virtual regions are generated implicitly by applying suitable (usually, but not necessarily, position-aware) operators to the query. This technique fits nicely in the eliteness model behind BM25 and provides a principled explanation of BM25F; it specializes to BM25(F) for some trivial operators, but has a much more general appeal. Our experiments (both on standard collections, such as TREC, and on web-like repertoires) show that the use of virtual regions is beneficial for retrieval effectiveness.
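For context, the classic BM25 ranking function that BM25F and this work build on can be sketched as follows; this is a standard textbook formulation (using one common IDF variant), not the paper's virtual-region extension, and the toy collection is invented:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, docs, k1=1.2, b=0.75):
    """Score one tokenized document against a query using classic BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n          # average document length
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    tf = Counter(doc)
    score = 0.0
    for t in query_terms:
        if tf[t] == 0:
            continue
        idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
        # Term-frequency saturation (k1) and document-length normalization (b)
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

docs = [["okapi", "bm25", "ranking"],
        ["probabilistic", "relevance", "model"],
        ["okapi", "retrieval"]]
```

BM25F extends this by combining length-normalized, field-weighted term frequencies (title, body, anchor text) before saturation; the paper's virtual regions generalize those static fields to regions generated by query operators.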
Energy-Price-Driven Query Processing in Multi-center Web Search Engines - Roi Blanco
Concurrently processing thousands of web queries, each with a response time under a fraction of a second, necessitates maintaining and operating massive data centers. For large-scale web search engines, this translates into high energy consumption and a huge electric bill. This work takes the challenge to reduce the electric bill of commercial web search engines operating on data centers that are geographically far apart. Based on the observation that energy prices and query workloads show high spatio-temporal variation, we propose a technique that dynamically shifts the query workload of a search engine between its data centers to reduce the electric bill. Experiments on real-life query workloads obtained from a commercial search engine show that significant financial savings can be achieved by this technique.
Effective and Efficient Entity Search in RDF data - Roi Blanco
Triple stores have long provided RDF storage as well as data access using expressive, formal query languages such as SPARQL. The new end users of the Semantic Web, however, are mostly unaware of SPARQL and overwhelmingly prefer imprecise, informal keyword queries for searching over data. At the same time, the amount of data on the Semantic Web is approaching the limits of the architectures that provide support for the full expressivity of SPARQL. These factors combined have led to an increased interest in semantic search, i.e. access to RDF data using Information Retrieval methods. In this work, we propose a method for effective and efficient entity search over RDF data. We describe an adaptation of the BM25F ranking function for RDF data, and demonstrate that it outperforms other state-of-the-art methods in ranking RDF resources. We also propose a set of new index structures for efficient retrieval and ranking of results. We implement these results using the open-source MG4J framework.
Caching Search Engine Results over Incremental Indices - Roi Blanco
A Web search engine must update its index periodically to incorporate changes to the Web. We argue in this paper that index updates fundamentally impact the design of search engine result caches, a performance-critical component of modern search engines. Index updates lead to the problem of cache invalidation: invalidating cached entries of queries whose results have changed. Naive approaches, such as flushing the entire cache upon every index update, lead to poor performance and in fact, render caching futile when the frequency of updates is high. Solving the invalidation problem efficiently corresponds to predicting accurately which queries will produce different results if re-evaluated, given the actual changes to the index.
To obtain this property, we propose a framework for developing invalidation predictors and define metrics to evaluate invalidation schemes. We describe concrete predictors using this framework and compare them against a baseline that uses a cache invalidation scheme based on time-to-live (TTL). Evaluation over Wikipedia documents using a query log from the Yahoo! search engine shows that selective invalidation of cached search results can lower the number of unnecessary query evaluations by as much as 30% compared to a baseline scheme, while returning results of similar freshness. In general, our predictors enable fewer unnecessary invalidations and fewer stale results compared to a TTL-only scheme for similar freshness of results.
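The TTL baseline compared against above can be sketched as a minimal result cache in which any entry older than its time-to-live counts as a miss; this is a generic illustration, not the paper's predictor framework:

```python
class TTLResultCache:
    """Minimal search-result cache with time-to-live (TTL) invalidation."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.store = {}  # query -> (results, insertion time)

    def put(self, query, results, now):
        self.store[query] = (results, now)

    def get(self, query, now):
        """Return cached results, or None if the entry is absent or expired."""
        entry = self.store.get(query)
        if entry is None:
            return None
        results, inserted = entry
        if now - inserted > self.ttl:
            del self.store[query]  # expired: evict and treat as a miss
            return None
        return results
```

The weakness the paper targets is visible here: an entry either expires (possibly while still fresh, forcing an unnecessary re-evaluation) or survives (possibly stale after an index update), regardless of whether its results actually changed.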
We study the problem of finding sentences that explain the relationship between a named entity and an ad-hoc query, which we refer to as entity support sentences. This is an important sub-problem of entity ranking which, to the best of our knowledge, has not been addressed before. In this paper we give the first formalization of the problem, how it can be evaluated, and present a full evaluation dataset. We propose several methods to rank these sentences, namely retrieval-based, entity-ranking based and position-based. We found that traditional bag-of-words models perform relatively well when there is a match between an entity and a query in a given sentence, but they fail to find a support sentence for a substantial portion of entities. This can be improved by incorporating small windows of context sentences and ranking them appropriately.
Connector Corner: Automate dynamic content and events by pushing a button - DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... - Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Neuro-symbolic is not enough, we need neuro-*semantic* - Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Essentials of Automations: Optimizing FME Workflows with Parameters - Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... - UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
GraphRAG is All You Need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
2. Acknowledgements
• Many of these slides were taken from other presentations
– P. Raghavan, C. Manning, H. Schutze IR lectures
– Mounia Lalmas’s personal stash
– Other random slide decks
• Textbooks
– Ricardo Baeza-Yates, Berthier Ribeiro Neto
– Raghavan, Manning, Schutze
– … among other good books
• Many online tutorials, many online tools available (full toolkits)
3. Big Plan
• What is Information Retrieval?
– Search engine history
– Examples of IR systems (you might not have known!)
• Is IR hard?
– Users and human cognition
– What is it like to be a search engine?
• Web Search
– Architecture
– Differences between Web search and IR
– Crawling
6. Information Retrieval
Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
Introduction to Information Retrieval
7. Information Retrieval (II)
• What do we understand by documents? How do
we decide what is a document and what is not?
• What is an information need? What types of
information needs can we satisfy automatically?
• What is a large collection? Which environments
are suitable for IR?
8. Basic assumptions of Information Retrieval
• Collection: A set of documents
– Assume it is a static collection
• Goal: Retrieve documents with information that is
relevant to the user’s information need and helps
the user complete a task
9. Key issues
• How to describe information resources or information-bearing
objects in ways that they can be effectively used
by those who need to use them?
– Organizing/Indexing/Storing
• How to find the appropriate information resources or
information-bearing objects for someone’s (or your own)
needs?
– Retrieving / Accessing / Filtering
10. Unstructured data
Unstructured data?
• A structured query against a table:
SELECT * from HOTELS
where city = Bangalore and $$$ < 2
CITY      $$$  name
Bangalore 1.5  Cheapo one
Barcelona 1    EvenCheapoer
• An unstructured query: “Cheap hotels in Bangalore”
41. IR issues
• Find out what the user needs
… and do it quickly
• Challenges: user intention, accessibility, volatility,
redundancy, lack of structure, low quality, different data
sources, volume, scale
• The main bottleneck is human cognition, not
computational power
42. IR is mostly about relevance
• Relevance is the core concept in IR, but nobody has a good
definition
• Relevance = useful
• Relevance = topically related
• Relevance = new
• Relevance = interesting
• Relevance = ???
• However we still want relevant information
43. • Information needs must be expressed as a query
– But users don’t often know what they want
• Problems
– Verbalizing information needs
– Understanding query syntax
– Understanding search engines
44. Understanding(?) the user
• Real need: I am a hungry tourist in Barcelona, and I want to
find a place to eat; however I don’t want to spend a lot of money
• Verbalized need: I want information on places with cheap food
in Barcelona
• Query: Info about bars in Barcelona
• Typed query: [ Bar celona ]
• At each step things can go wrong: misconception,
mistranslation, misformulation
45. Why is this hard?
• Documents/images/video/speech/etc. are complex. We
need some representation
• Semantics
– What do words mean?
• Natural language
– How do we say things?
• Computers cannot deal with these easily
46. … and even harder
• Context
• Opinion
Funny? Talented? Honest?
48. What is it like to be a search engine?
• How can we figure out what you’re trying to do?
• The signal can sometimes be rather weak!
[ jaguar ]
[ iraq ]
[ latest release Thinkpad drivers touchpad ]
[ ebay ]
[ first ]
[ google ]
[ brittttteny spirs ]
49. Search is a multi-step process
• Session search
– Verbalize your query
– Look for a document
– Find your information there
– Refine
• Teleporting
– Go directly to the site you like
– Formulating the query is too hard, you trust
the final site more, etc.
50. • Someone told me that in the mid-1800’s, people often would carry
around a special kind of notebook. They would use the notebook to
write down quotations that they heard, or copy passages from books
they’d read. The notebook was an important part of their education,
and it had a particular name.
– What was the name of the notebook?
Examples from Dan Russell
52. More tasks …
• Going beyond a search engine
– Using images / multimedia content
– Using maps
– Using other sources
• Think of how to express things differently (synonyms)
– A friend told me that there is an abandoned city in the waters of San Francisco
Bay. Is that true? If it IS true, what was the name of the supposed city?
• Exploring a topic further in depth
• Refining a question
– Suppose you want to buy a unicycle for your Mom or Dad. How would you find
it?
• Looking for lists of information
– Can you find a list of all the groups that inhabited California at the time of the
missions?
53. IR tasks
• Known-item finding
– You want to retrieve some data that you know exists
– What year was Peter Mika born?
• Exploratory seeking
– You want to find some information through an iterative process
– Not a single answer to your query
• Exhaustive search
– You want to find all the information possible about a particular issue
– Issuing several queries to cover the user information need
• Re-finding
– You want to find an item you have found already
54. Scale
• >300TB of print data produced per year
– +Video, speech, domain-specific information (>600PB per year)
• IR has to be fast + scalable
• Information is dynamic
– News, web pages, maps, …
– Queries are dynamic (you might even change your information needs while
searching)
• Cope with data and searcher change
– This introduces tensions in every component of a search engine
55. Methodology
• Experimentation in IR
• Three fundamental types of IR research:
– Systems (efficiency)
– Methods (effectiveness)
– Applications (user utility)
• Empirical evaluation plays a critical role across all three types
of research
56. Methodology (II)
• Information retrieval (IR) is a highly applied scientific
discipline
• Experimentation is a critical component of the scientific
method
• Poor experimental methodologies are not scientifically
sound and should be avoided
58. [Diagram: the search process – a task gives rise to an info
need, which is verbalized and expressed as a query to a search
engine over a corpus; the engine returns results, which feed
back into query refinement]
59. [Diagram: search engine architecture – a user interface sends
queries through query interpretation to matching and ranking
against an index and metadata; the index is built from a document
collection by crawling, text processing, document interpretation,
indexing, and some general voodoo]
64. Web Search
• Basic search technology shared with IR systems
– Representation
– Indexing
– Ranking
• Scale (in terms of data and users) changes the game
– Efficiency/architectural design decisions
• Link structure
– For data acquisition (crawling)
– For ranking (PageRank, HITS)
– For spam detection
– For extending document representations (anchor text)
• Adversarial IR
• Monetization
65. User Needs
• Need
– Informational – want to learn about something (~40% / 65%)
– Navigational – want to go to that page (~25% / 15%)
– Transactional – want to do something (web-mediated) (~35% / 20%)
• Access a service
• Downloads
• Shop
– Gray areas
• Find a good hub
• Exploratory search “see what’s there”
Example queries: [Low hemoglobin], [United Airlines],
[Seattle weather], [Mars surface images], [Canon S410],
[Car rental Brasil]
66. How far do people look for results?
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
67. Users’ empirical evaluation of results
• Quality of pages varies widely
– Relevance is not enough
– Other desirable qualities (non IR!!)
• Content: Trustworthy, diverse, non-duplicated, well maintained
• Web readability: display correctly & fast
• No annoyances: pop-ups, etc.
• Precision vs. recall
– On the web, recall seldom matters
• What matters
– Precision at 1? Precision above the fold?
– Comprehensiveness – must be able to deal with obscure queries
• Recall matters when the number of matches is very small
• User perceptions may be unscientific, but are significant
over a large aggregate
68. Users’ empirical evaluation of engines
• Relevance and validity of results
• UI – Simple, no clutter, error tolerant
• Trust – Results are objective
• Coverage of topics for ambiguous queries
• Pre/Post process tools provided
– Mitigate user errors (auto spell check, search assist,…)
– Explicit: Search within results, more like this, refine ...
– Anticipative: related searches
• Deal with idiosyncrasies
– Web specific vocabulary
• Impact on stemming, spell-check, etc.
– Web addresses typed in the search box
• “The first, the last, the best and the worst …”
69. The Web document collection
• No design/co-ordination
• Distributed content creation, linking,
democratization of publishing
• Content includes truth, lies, obsolete
information, contradictions …
• Unstructured (text, html, …), semi-structured
(XML, annotated photos), structured
(Databases)…
• Scale much larger than previous text collections
… but corporate records are catching up
• Growth – slowed down from initial “volume
doubling every few months” but still expanding
• Content can be dynamically generated
70. Basic crawler operation
• Begin with known “seed” URLs
• Fetch and parse them
– Extract URLs they point to
– Place the extracted URLs on a queue
• Fetch each URL on the queue and repeat
71. Crawling picture
[Diagram: crawling starts from seed pages; URLs crawled and
parsed expand into the unseen Web, while the URL frontier holds
URLs that have been discovered but not yet fetched]
72. Simple picture – complications
• Web crawling isn’t feasible with one machine
– All of the above steps distributed
• Malicious pages
– Spam pages
– Spider traps – including dynamically generated
• Even non-malicious pages pose challenges
– Latency/bandwidth to remote servers vary
– Webmasters’ stipulations
• How “deep” should you crawl a site’s URL hierarchy?
– Site mirrors and duplicate pages
• Politeness – don’t hit a server too often
73. What any crawler must do
• Be Polite: Respect implicit and explicit
politeness considerations
– Only crawl allowed pages
– Respect robots.txt
• Be Robust: Be immune to spider traps
and other malicious behavior from
web servers
– Be efficient
74. What any crawler should do
• Be capable of distributed operation: designed to
run on multiple distributed machines
• Be scalable: designed to increase the crawl rate
by adding more machines
• Performance/efficiency: permit full use of
available processing and network resources
75. What any crawler should do
• Fetch pages of “higher quality” first
• Continuous operation: Continue fetching
fresh copies of a previously fetched page
• Extensible: Adapt to new data formats,
protocols
76. Updated crawling picture
[Diagram: as before – seed pages, URLs crawled and parsed, URL
frontier, unseen Web – but now multiple crawling threads consume
the URL frontier in parallel]
78. Document views
[Figure: one document, “Sailing in Greece” by B. Smith, shown
under four views – content view (index terms: sailing, greece,
mediterranean, fish, sunset), data view (Author = “B. Smith”,
Crdate = “14.12.96”, Ladate = “11.07.02”), structure view (head,
title, author, chapters, sections), and layout view]
79. What is a document: document views
• Content view is concerned with representing the content
of the document; that is, what is the document about.
• Data view is concerned with factual data associated with
the document (e.g. author names, publishing date)
• Layout view is concerned with how documents are
displayed to the users; this view is related to user interface
and visualization issues.
• Structure view is concerned with the logical structure of
the document (e.g. a book being composed of chapters,
themselves composed of sections, etc.)
80. Indexing language
• An indexing language:
– Is the language used to describe the content of
documents (and queries)
– And it usually consists of index terms that are derived
from the text (automatic indexing), or arrived at
independently (manual indexing), using a controlled
or uncontrolled vocabulary
– Basic operation: is this query term present in this
document?
81. Generating document representations
• The building of the indexing language, that is generating
the document representation, is done in several steps:
– Character encoding
– Language recognition
– Page segmentation (boilerplate detection)
– Tokenization (identification of words)
– Term normalization
– Stopword removal
– Stemming
– Others (doc. expansion, etc.)
82. Generating document representations: overview
documents → tokens (tokenization) → tokens without stop-words
(remove noisy words) → stems (reduce to stems) → terms (index terms)
+ others: e.g. thesaurus, more complex processing
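The pipeline can be sketched in Python. Everything here is a simplified assumption for illustration: the stop-word list is a tiny sample, and `stem` is a crude suffix chopper, not a real stemmer like Porter’s.

```python
import re

STOPWORDS = {"the", "a", "an", "and", "to", "be", "in", "of"}  # tiny sample list

def stem(token):
    # Crude suffix chopping in the spirit of Porter's algorithm;
    # a real stemmer has many more rules and conditions.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_terms(document):
    tokens = re.findall(r"[a-z0-9]+", document.lower())   # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]    # stop-word removal
    return [stem(t) for t in tokens]                      # stemming

print(index_terms("The sailors were sailing to Greece"))
```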
83. Parsing a document
• What format is it in?
– pdf/word/excel/html?
• What language is it in?
• What character set is in use?
– (ISO-8818, UTF-8, …)
But these tasks are often done heuristically …
84. Complications: Format/language
• Documents being indexed can include docs from many
different languages
– A single index may contain terms from many languages.
• Sometimes a document or its components can contain
multiple languages/formats
– French email with a German pdf attachment.
– A French email quoting clauses from an English-language
contract
• There are commercial and open source libraries that can
handle a lot of this stuff
85. Complications: What is a document?
We return from our query “documents” but there are often
interesting questions of grain size:
What is a unit document?
– A file?
– An email? (Perhaps one of many in a single mbox file)
• What about an email with 5 attachments?
– A group of files (e.g., PPT or LaTeX split over HTML pages)
86. Tokenization
• Input: “Friends, Romans and Countrymen”
• Output: Tokens
– Friends
– Romans
– Countrymen
• A token is an instance of a sequence of characters
• Each such token is now a candidate for an index entry, after
further processing
• But what are valid tokens to emit?
87. Tokenization
• Issues in tokenization:
– Finland’s capital → Finland AND s? Finlands? Finland’s?
– Hewlett-Packard → Hewlett and Packard as two tokens?
• state-of-the-art: break up hyphenated sequence?
• co-education
• lowercase, lower-case, lower case?
• It can be effective to get the user to put in possible hyphens
– San Francisco: one token or two?
• How do you decide it is one token?
88. Numbers
• 3/20/91 Mar. 12, 1991 20/3/91
• 55 B.C.
• B-52
• My PGP key is 324a3df234cb23e
• (800) 234-2333
• Often have embedded spaces
• Older IR systems may not index numbers
But often very useful: think about things like looking up error
codes/stacktraces on the web
• Will often index “meta-data” separately
Creation date, format, etc.
89. Tokenization: language issues
• French
– L'ensemble one token or two?
• L ? L’ ? Le ?
• Want l’ensemble to match with un ensemble
– Until at least 2003, it didn’t on Google
» Internationalization!
• German noun compounds are not segmented
– Lebensversicherungsgesellschaftsangestellter
– ‘life insurance company employee’
– German retrieval systems benefit greatly from a compound splitter
module
– Can give a 15% performance boost for German
90. Tokenization: language issues
• Chinese and Japanese have no spaces between words:
– 莎拉波娃现在居住在美国东南部的佛罗里达。
– Not always guaranteed a unique tokenization
• Further complicated in Japanese, with multiple alphabets
intermingled
– Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
Katakana Hiragana Kanji Romaji
End-user can express query entirely in hiragana!
91. Tokenization: language issues
• Arabic (or Hebrew) is basically written right to left, but with certain items
like numbers written left to right
• Words are separated, but letter forms within a word form complex
ligatures
[Example: an Arabic sentence in which the text runs right to left
while the embedded numbers run left to right]
‘Algeria achieved its independence in 1962 after 132 years of
French occupation.’
• With Unicode, the surface presentation is complex, but the stored
form is straightforward
92. Stop words
• With a stop list, you exclude from the dictionary entirely the commonest
words. Intuition:
– They have little semantic content: the, a, and, to, be
– There are a lot of them: ~30% of postings for top 30 words
• But the trend is away from doing this:
– Good compression techniques means the space for including stop words in a system
can be small
– Good query optimization techniques mean you pay little at query time for including
stop words.
– You need them for:
• Phrase queries: “King of Denmark”
• Various song titles, etc.: “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
93. Normalization to terms
• Want: matches to occur despite superficial differences in the
character sequences of the tokens
• We may need to “normalize” words in indexed text as well as query words
into the same form
– We want to match U.S.A. and USA
• Result is terms: a term is a (normalized) word type, which is an entry in
our IR system dictionary
• We most commonly implicitly define equivalence classes of terms by, e.g.,
– deleting periods to form a term
• U.S.A., USA → USA
– deleting hyphens to form a term
• anti-discriminatory, antidiscriminatory → antidiscriminatory
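This kind of equivalence classing can be sketched as one small normalization function (illustrative only; a real normalizer handles many more cases, e.g. accents and case exceptions from the next slides):

```python
def normalize(token):
    """Equivalence-class a token by lowercasing and deleting periods/hyphens."""
    return token.lower().replace(".", "").replace("-", "")

print(normalize("U.S.A."))                      # -> usa
print(normalize("anti-discriminatory"))         # -> antidiscriminatory
print(normalize("U.S.A.") == normalize("USA"))  # -> True
```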
94. Normalization: other languages
• Accents: e.g., French résumé vs. resume.
• Umlauts: e.g., German: Tuebingen vs. Tübingen
– Should be equivalent
• Most important criterion:
– How are your users likely to write their queries for these words?
• Even in languages that standardly have accents, users often may not type
them
– Often best to normalize to a de-accented term
• Tuebingen, Tübingen, Tubingen → Tubingen
95. Case folding
• Reduce all letters to lower case
– exception: upper case in mid-sentence?
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
– Often best to lower case everything, since users will use lowercase
regardless of ‘correct’ capitalization…
• Longstanding Google example: [fixed in 2011…]
– Query C.A.T.
– #1 result is for “cats” (well, Lolcats) not Caterpillar Inc.
96. Normalization to terms
• An alternative to equivalence classing is to do asymmetric
expansion
• An example of where this may be useful
– Enter: window Search: window, windows
– Enter: windows Search: Windows, windows, window
– Enter: Windows Search: Windows
• Potentially more powerful, but less efficient
97. Thesauri and soundex
• Do we handle synonyms and homonyms?
– E.g., by hand-constructed equivalence classes
• car = automobile color = colour
– We can rewrite to form equivalence-class terms
• When the document contains automobile, index it under
car-automobile (and vice-versa)
– Or we can expand a query
• When the query contains automobile, look under car as
well
• What about spelling mistakes?
– One approach is Soundex, which forms equivalence classes of
words based on phonetic heuristics
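A simplified Soundex can be sketched as follows (illustrative; the full algorithm has extra rules, e.g. special treatment of h/w between letters with the same code):

```python
# Letter-to-digit codes used by Soundex; vowels and h/w/y get no code.
CODES = {c: d for d, letters in
         {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
          "4": "l", "5": "mn", "6": "r"}.items()
         for c in letters}

def soundex(word):
    """Simplified Soundex: first letter + up to three digit codes."""
    word = word.lower()
    digits = [CODES.get(c, "") for c in word]
    result = word[0].upper()
    prev = digits[0]
    for d in digits[1:]:
        if d and d != prev:       # skip uncoded letters and adjacent repeats
            result += d
        prev = d
    return (result + "000")[:4]   # pad with zeros, keep 4 characters

print(soundex("Robert"))  # -> R163
print(soundex("Rupert"))  # -> R163, same class despite the spelling
```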
98. Lemmatization
• Reduce inflectional/variant forms to base form
• E.g.,
– am, are, is → be
– car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be
different color
• Lemmatization implies doing “proper” reduction to
dictionary headword form
99. Stemming
• Reduce terms to their “roots” before indexing
• “Stemming” suggests crude affix chopping
– language dependent
– e.g., automate(s), automatic, automation all reduced to automat.
Before stemming: “for example compressed and compression are
both accepted as equivalent to compress.”
After stemming: “for exampl compress and compress ar both
accept as equival to compress”
100. – Affix removal
• remove the longest affix: {sailing, sailor} => sail
• simple and effective stemming
• a widely used such stemmer is Porter’s algorithm
– Dictionary-based using a look-up table
• look for stem of a word in table: play + ing => play
• space is required to store the (large) table, so often not practical
101. Stemming: some issues
• Detect equivalent stems:
– {organize, organise}: e as the longest affix leads to {organiz,
organis}, which should lead to one stem: organis
– Heuristics are therefore used to deal with such cases.
• Over-stemming:
– {organisation, organ} reduced into org, which is incorrect
– Again heuristics are used to deal with such cases.
102. Porter’s algorithm
• Commonest algorithm for stemming English
– Results suggest it’s at least as good as other stemming options
• Conventions + 5 phases of reductions
– phases applied sequentially
– each phase consists of a set of commands
– sample convention: Of the rules in a compound command, select
the one that applies to the longest suffix.
103. Typical rules in Porter
• sses → ss
• ies → i
• ational → ate
• tional → tion
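Applying these sample rules with the longest-suffix convention can be sketched as below (a toy with just the four rules above, not the full five-phase Porter algorithm):

```python
# Sample Porter-style rewrite rules (suffix -> replacement). Within a
# compound command, the rule matching the longest suffix wins.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_rules(word):
    best = None
    for suffix, repl in RULES:
        if word.endswith(suffix) and (best is None or len(suffix) > len(best[0])):
            best = (suffix, repl)
    if best:
        suffix, repl = best
        return word[: -len(suffix)] + repl
    return word

print(apply_rules("caresses"))    # -> caress
print(apply_rules("ponies"))      # -> poni
print(apply_rules("relational"))  # -> relate  ("ational" beats "tional")
print(apply_rules("conditional")) # -> condition
```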
104. Language-specificity
• The above methods embody transformations that are
– Language-specific, and often
– Application-specific
• These are “plug-in” addenda to the indexing process
• Both open source and commercial plug-ins are
available for handling these
105. Does stemming help?
• English: very mixed results. Helps recall for some queries but
harms precision on others
– E.g., operative (dentistry) ⇒ oper
• Definitely useful for Spanish, German, Finnish, …
– 30% performance gains for Finnish!
106. Others: Using a thesaurus
• A thesaurus provides a standard vocabulary for indexing
(and searching)
• More precisely, a thesaurus provides a classified
hierarchy for broadening and narrowing terms
bank: 1. Finance institute
2. River edge
– if a document is indexed with bank, then index it with
“finance institute” or “river edge”
– need to disambiguate the sense of bank in the text: e.g. if
money appears in the document, then choose “finance
institute”
• A widely used online thesaurus: WordNet
107. Information storage
• Whole topic on its own
• How do we keep fresh copies of the web manageable by a cluster of
computers, and answer millions of queries in milliseconds?
– Inverted indexes
– Compression
– Caching
– Distributed architectures
– … and a lot of tricks
• Inverted indexes: cornerstone data structure of IR systems
– For each term t, we must store a list of all documents that contain t.
– Identify each doc by a docID, a document serial number
– Index construction is tricky (can’t hold all the information needed in memory)
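A minimal in-memory inverted index can be sketched as below (a toy illustration: real systems build the index out of core, compress postings, and assign docIDs themselves):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a sorted postings list of (docID, term frequency)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():     # naive tokenization for the sketch
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {t: sorted(p.items()) for t, p in index.items()}

docs = {1: "new home sales", 2: "home sales rise", 3: "rise in new sales"}
index = build_index(docs)
print(index["sales"])  # -> [(1, 1), (2, 1), (3, 1)]
print(index["new"])    # -> [(1, 1), (3, 1)]
```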
109. • Most basic form:
– Document frequency
– Term frequency
– Document identifiers
term  term id  df  postings as (docID, tf)
a     1        4   (1,2), (2,5), (10,1), (11,1)
as    2        3   (1,3), (3,4), (20,1)
110. • Indexes contain more information
– Position in the document
• Useful for “phrase queries” or “proximity queries”
– Fields in which the term appears in the document
– Metadata …
– All that can be used for ranking
Example posting: (1, 2, [1,1], [2,10]), … – docID 1, tf 2, with
an occurrence in field 1 (the title) at position 1 and one in
field 2 at position 10
111. Queries
• How do we process a query?
• Several kinds of queries
– Boolean
• Chicken AND salt
• Gnome OR KDE
• Salt AND NOT pepper
– Phrase queries
– Ranked
112. List Merging
• “Exact match” queries
– Chicken AND curry
– Locate Chicken in the dictionary
– Fetch its postings
– Locate curry in the dictionary
– Fetch its postings
– Merge both postings
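The merge step can be sketched in Python; because both postings lists are kept sorted by docID, the intersection takes a single linear pass (the postings values below are made-up docIDs):

```python
def intersect(p1, p2):
    """Merge two sorted docID lists in O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer on the smaller docID
        else:
            j += 1
    return answer

chicken = [1, 3, 5, 8, 13]   # postings for "chicken" (toy docIDs)
curry = [2, 3, 8, 9]         # postings for "curry"
print(intersect(chicken, curry))  # -> [3, 8]
```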
116. Models of information retrieval
• A model:
– abstracts away from the real world
– uses a branch of mathematics
– possibly: uses a metaphor for searching
117. Short history of IR modelling
• Boolean model (±1950)
• Document similarity (±1957)
• Vector space model (±1970)
• Probabilistic retrieval (±1976)
• Language models (±1998)
• Linkage-based models (±1998)
• Positional models (±2004)
• Fielded models (±2005)
118. The Boolean model (±1950)
• Exact matching: data retrieval (instead of
information retrieval)
– A term specifies a set of documents
– Boolean logic to combine terms / document sets
– AND, OR and NOT: intersection, union, and
difference
119. Statistical similarity between documents (±1957)
• The principle of similarity
"The more two representations agree in given elements and their
distribution, the higher would be the probability of their representing
similar information”
(Luhn 1957)
“It is here proposed that the frequency of word [term] occurrence in an
article [document] furnishes a useful measurement of word [term]
significance”
121. Zipf’s law
• Relative frequencies of terms.
• In natural language, there are a few very frequent terms and very many
very rare terms.
• Zipf’s law: The ith most frequent term has frequency proportional to 1/i .
• cfi ∝ 1/i, i.e. cfi = K/i where K is a normalizing constant
• cfi is collection frequency: the number of occurrences of the term ti in the
collection.
• Zipf’s law holds for different languages
122. Zipf consequences
• If the most frequent term (the) occurs cf1 times
– then the second most frequent term (of) occurs cf1/2 times
– the third most frequent term (and) occurs cf1/3 times …
• Equivalent: cfi = K/i where K is a normalizing factor, so
– log cfi = log K - log i
– Linear relationship between log cfi and log i
• Another power law relationship
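This log-log linearity is easy to check numerically (K below is an assumed frequency for the most common term, purely for illustration):

```python
import math

K = 1_000_000  # assumed collection frequency of the most frequent term

def zipf_cf(i, K=K):
    """Collection frequency of the i-th most frequent term under Zipf's law."""
    return K / i

# log cf_i = log K - log i: a straight line with slope -1 in log-log space.
pts = [(math.log10(i), math.log10(zipf_cf(i))) for i in (1, 10, 100, 1000)]
slopes = [round((y2 - y1) / (x2 - x1), 6)
          for (x1, y1), (x2, y2) in zip(pts, pts[1:])]
print(slopes)  # -> [-1.0, -1.0, -1.0]
```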
124. Luhn’s analysis - Observation
[Figure: frequency of terms f plotted against terms by rank order
r, with an upper and a lower cut-off – common terms above the
upper cut-off, rare terms below the lower cut-off, and
significant terms in between]
Resolving power of significant terms:
ability of terms to discriminate document content;
peaks at the rank order position half way between the two cut-offs
125. Luhn’s analysis - Implications
• Common terms are not good at representing document
content
– partly implemented through the removal of stop words
• Rare words are also not good at representing document
content
– usually nothing is done
– Not true for every “document”
• Need a means to quantify the resolving power of a term:
– associate weights to index terms
– tf×idf approach
126. Ranked retrieval
• Boolean queries are good for expert users with precise
understanding of their needs and the collection.
– Also good for applications: Applications can easily consume
1000s of results.
• Not good for the majority of users.
– Most users incapable of writing Boolean queries (or they are,
but they think it’s too much work).
– Most users don’t want to wade through 1000s of results.
• This is particularly true of web search.
127. Feast or Famine
• Boolean queries often result in either too few (=0) or too
many (1000s) results.
• Query 1: “standard user dlink 650” → 200,000 hits
• Query 2: “standard user dlink 650 no card found”: 0 hits
• It takes a lot of skill to come up with a query that produces
a manageable number of hits.
– AND gives too few; OR gives too many
128. Ranked retrieval models
• Rather than a set of documents satisfying a query expression,
in ranked retrieval, the system returns an ordering over the
(top) documents in the collection for a query
• Free text queries: Rather than a query language of operators
and expressions, the user’s query is just one or more words in
a human language
• In principle, there are two separate choices here, but in
practice, ranked retrieval has normally been associated with
free text queries and vice versa
129. Feast or famine: not a problem in ranked retrieval
• When a system produces a ranked result set, large result sets
are not an issue
– Indeed, the size of the result set is not an issue
– We just show the top k ( ≈ 10) results
– We do not overwhelm the user
– Premise: the ranking algorithm works
130. Scoring as the basis of ranked retrieval
• We wish to return in order the documents most likely to
be useful to the searcher
• How can we rank-order the documents in the collection
with respect to a query?
• Assign a score – say in [0, 1] – to each document
• This score measures how well document and query
“match”.
131. Query-document matching scores
• We need a way of assigning a score to a query/document
pair
• Let’s start with a one-term query
• If the query term does not occur in the document: score
should be 0
• The more frequent the query term in the document, the
higher the score (should be)
• We will look at a number of alternatives for this.
132. Bag of words model
• Vector representation does not consider the ordering of
words in a document
• John is quicker than Mary and Mary is quicker than John
have the same vectors
• This is called the bag of words model.
133. Term frequency tf
• The term frequency tf(t,d) of term t in document d is defined
as the number of times that t occurs in d.
• We want to use tf when computing query-document match
scores. But how?
• Raw term frequency is not what we want:
– A document with 10 occurrences of the term is more
relevant than a document with 1 occurrence of the term.
– But not 10 times more relevant.
• Relevance does not increase proportionally with term
frequency.
134. Log-frequency weighting
• The log frequency weight of term t in d is:
w(t,d) = 1 + log10 tf(t,d), if tf(t,d) > 0; 0 otherwise
• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
• Score for a document-query pair: sum over terms t in both q and d:
score(q,d) = Σ t∈q∩d (1 + log10 tf(t,d))
• The score is 0 if none of the query terms is present in the document.
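In Python, this weighting and score are a direct transcription (the term counts in `doc_tf` are made-up illustrations):

```python
import math

def log_tf_weight(tf):
    """w(t,d) = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

def score(query_terms, doc_tf):
    """Sum log-tf weights over the terms shared by query and document."""
    return sum(log_tf_weight(doc_tf[t]) for t in query_terms if t in doc_tf)

doc_tf = {"best": 1, "car": 10, "insurance": 1000}  # made-up term counts
print(log_tf_weight(10))                                        # -> 2.0
print(round(score(["car", "insurance", "cheap"], doc_tf), 3))   # -> 6.0
```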
135. Document frequency
• Rare terms are more informative than frequent terms
– Recall stop words
• Consider a term in the query that is rare in the collection (e.g.,
arachnocentric)
• A document containing this term is very likely to be relevant to
the query arachnocentric
• → We want a high weight for rare terms like arachnocentric.
136. Document frequency, continued
• Frequent terms are less informative than rare terms
• Consider a query term that is frequent in the collection (e.g., high,
increase, line)
• A document containing such a term is more likely to be relevant than a
document that does not
• But it’s not a sure indicator of relevance.
• → For frequent terms, we want high positive weights for words like high,
increase, and line
• But lower weights than for rare terms.
• We will use document frequency (df) to capture this.
137. idf weight
• dft is the document frequency of t: the number of documents that contain t
– dft is an inverse measure of the informativeness of t
– dft ≤ N
• We define the idf (inverse document frequency) of t by:
idft = log10 (N / dft)
– We use log10 (N/dft) instead of N/dft to “dampen” the effect of idf.
138. Effect of idf on ranking
• Does idf have an effect on ranking for one-term queries, like
– iPhone
• idf has no effect on ranking one term queries
– idf affects the ranking of documents for queries with at least
two terms
– For the query capricious person, idf weighting makes
occurrences of capricious count for much more in the final
document ranking than occurrences of person.
139. tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its
idf weight:
w(t,d) = (1 + log10 tf(t,d)) × log10 (N / dft)
• Best known weighting scheme in information retrieval
– Note: the “-” in tf-idf is a hyphen, not a minus sign!
– Alternative names: tf.idf, tf x idf
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
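A direct transcription in Python (N and the tf/df values below are made-up illustrations):

```python
import math

def tf_idf(tf, df, N):
    """w(t,d) = (1 + log10 tf(t,d)) * log10(N / df(t)); 0 for an absent term."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

N = 1_000_000  # assumed number of documents in the collection
# At equal tf, a rare term contributes far more weight than a frequent one:
print(round(tf_idf(tf=10, df=100, N=N), 3))      # rare term     -> 8.0
print(round(tf_idf(tf=10, df=100_000, N=N), 3))  # frequent term -> 2.0
```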
140. Score for a document given a query
Score(q,d) = Σ t∈q∩d tf-idf(t,d)
• There are many variants
– How “tf” is computed (with/without logs)
– Whether the terms in the query are also weighted
– …
141. Documents as vectors
• So we have a |V|-dimensional vector space
• Terms are axes of the space
• Documents are points or vectors in this space
• Very high-dimensional: tens of millions of dimensions when
you apply this to a web search engine
• These are very sparse vectors - most entries are zero.
142. Statistical similarity between documents (±1957)
• Vector product
– If the vector has binary components, then the product
measures the number of shared terms
– Vector components might be "weights"
score(q,d) = Σ k∈matching terms qk · dk
143. Why distance is a bad idea
The Euclidean distance between q and d2 is large even though the
distribution of terms in the query q and the distribution of terms
in the document d2 are very similar.
144. Vector space model (±1970)
• Documents and
queries are vectors in
a high-dimensional
space
• Geometric measures
(distances, angles)
145. Vector space model (±1970)
• Cosine of an angle:
– close to 1 if angle is small
– 0 if vectors are orthogonal
cos(d,q) = (d · q) / (‖d‖ ‖q‖) = Σ_{k=1}^{m} d_k q_k / ( √(Σ_{k=1}^{m} d_k²) × √(Σ_{k=1}^{m} q_k²) )
• With length-normalized vectors n(v) = v / ‖v‖ this is simply:
cos(d,q) = n(d) · n(q)
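A sketch of the cosine measure, representing the very sparse document and query vectors as term→weight dicts so zero entries are never stored; the weights are hypothetical:

```python
import math

def cosine(d, q):
    """cos(d,q) = (d . q) / (|d| |q|), with vectors given as
    {term: weight} dicts -- only nonzero entries are kept."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    nd = math.sqrt(sum(w * w for w in d.values()))
    nq = math.sqrt(sum(w * w for w in q.values()))
    if nd == 0 or nq == 0:
        return 0.0
    return dot / (nd * nq)

# Same direction -> cosine 1; no shared terms -> cosine 0
assert abs(cosine({"cat": 2.0}, {"cat": 5.0}) - 1.0) < 1e-9
assert cosine({"cat": 1.0}, {"dog": 1.0}) == 0.0
```

This is why the angle works where Euclidean distance fails: a document and a query with the same term distribution but very different lengths still have cosine close to 1.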
146. Vector space model (±1970)
• PRO: Nice metaphor, easily explained;
Mathematically sound: geometry;
Great for relevance feedback
• CON: Need term weighting (tf-idf);
Hard to model structured queries
147. Probabilistic IR
• An IR system has an uncertain understanding of user’s queries and
makes uncertain guesses on whether a document satisfies a query
or not.
• Probability theory provides a principled foundation for reasoning
under uncertainty.
• Probabilistic models build upon this foundation to estimate how
likely it is that a document is relevant for a query.
147
148. Event Space
• Query representation
• Document representation
• Relevance
• Event space
• Conceptually there might be pairs with same q and d,
but different r
• Sometimes we also include user u, context c, etc.
148
149. Probability Ranking Principle
• Robertson (1977)
– “If a reference retrieval system’s response to each
request is a ranking of the documents in the collection
in order of decreasing probability of relevance to the
user who submitted the request, where the
probabilities are estimated as accurately as possible
on the basis of whatever data have been made
available to the system for this purpose, the overall
effectiveness of the system to its user will be the best
that is obtainable on the basis of those data.”
• Basis for probabilistic approaches for IR
149
150. Dissecting PRP
• Probability of relevance
• Estimated accurately
• Based on whatever data available
• Best possible accuracy
– The perfect IR system!
– Assumes relevance is independent of the other
documents in the collection
150
151. Relevance?
• What is ?
– Isn’t it decided by the user? her opinion?
• User doesn’t mean a human being!
– We are working with representations
– ... or parts of the reality available to us
• 2/3 keywords, no profile, no context ...
– relevance is uncertain
• depends on what the system sees
• may be marginalized over all the
unseen context/profiles
151
152. Retrieval as binary classification
• For every (q,d), r takes two values
– Relevant and non-relevant documents
– can be extended to multiple values
• Retrieve using Bayes’ decision
– PRP is related to the Bayes error rate (lowest
possible error rate for a class)
– How do we estimate this probability?
152
153. PRP ranking
• How to represent the random variables?
• How to estimate the model’s parameters?
153
154. • d is a binary vector
• Multiple Bernoulli variables
• Under MB, we can decompose into a
product of probabilities, with likelihoods:
154
155. If the terms are not in the query:
Otherwise we need estimates for them!
155
156. Estimates
• Assign new weights for query terms based on relevant/non-relevant
documents
• Give higher weights to important terms:
                      Relevant     Non-relevant     Total
Documents with t         r            n − r           n
Documents without t    R − r      N − n − R + r     N − n
Total                    R            N − R            N
156
157. Robertson-Sparck Jones weight
157
Relevant docs with t
Relevant docs without t
Non-relevant docs with t
Non-relevant docs without t
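Under the standard formulation with 0.5 smoothing (to avoid zero counts), the Robertson-Sparck Jones weight can be sketched from the contingency table; the counts below are hypothetical:

```python
import math

def rsj_weight(r, n, R, N):
    """Robertson-Sparck Jones weight from the contingency table:
    r: relevant docs containing t,  n: docs containing t,
    R: relevant docs,               N: docs in the collection.
    The +0.5 terms smooth away zero counts."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# A term concentrated in the relevant set gets a large positive weight
w = rsj_weight(r=8, n=20, R=10, N=1000)
assert w > 0
```

A term that appears in most relevant documents but few non-relevant ones gets a high weight, which is exactly the "give higher weights to important terms" idea from the previous slide.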
158. Estimates without relevance info
• If we pick a relevant document, words are equally likely to be
present or absent
• Non-relevant can be approximated with the collection as a
whole
158
160. Modeling TF
• Naïve estimation: separate probability for every
outcome
• BIR had only two parameters, now we have plenty
(~many outcomes)
• We can plug in a parametric estimate for the term
frequencies
• For instance, a Poisson mixture
160
161. Okapi BM25
• Same ranking function as before but with new
estimates. Models term frequencies and
document length.
• Words are generated by a mixture of two
Poissons
• Assumes an eliteness variable (elite ~ word
occurs unusually frequently, non-elite ~ word
occurs as expected by chance).
161
163. BM25
• In order to approximate the formula, Robertson and Walker came up
with:
• Two model parameters
• Very effective
• The more words in common with the query the better
• Repetitions less important than different query words
– But more important if the document is relatively long
163
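A minimal sketch of the BM25 ranking function under a common parameterization (k1 for tf saturation, b for length normalization); the counts, collection sizes, and the exact idf variant used here are assumptions for illustration:

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_len, df, N,
               k1=1.2, b=0.75):
    """Okapi BM25: saturating tf contribution, softened by
    document length relative to the collection average."""
    score = 0.0
    for t in query_terms:
        if t not in doc_tf or t not in df:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        tf = doc_tf[t]
        denom = tf + k1 * (1 - b + b * doc_len / avg_len)
        score += idf * tf * (k1 + 1) / denom
    return score

# Hypothetical counts: repetitions help, but with diminishing returns
s1 = bm25_score(["pizza"], {"pizza": 1}, 100, 100, {"pizza": 50}, 10000)
s2 = bm25_score(["pizza"], {"pizza": 10}, 100, 100, {"pizza": 50}, 10000)
assert 0 < s1 < s2 < 10 * s1  # saturation: far less than 10x
```

The assertion illustrates the slide's point: repetitions matter less than matching different query words, because the tf contribution saturates.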
164. Generative Probabilistic Language Models
• The generative approach – A generator which produces
events/tokens with some probability
– Probability distribution over strings of text
– URN Metaphor – a bucket of different colour balls (10 red, 5
blue, 3 yellow, 2 white)
• What is the probability of drawing a yellow ball? 3/20
• what is the probability of drawing (with replacement) a red ball and a
white ball? ½ × 1/10 = 1/20
– IR Metaphor: Documents are urns, full of tokens (balls) of
different terms (colors)
165. What is a language model?
• How likely is a string of words in a “language”?
– P1(“the cat sat on the mat”)
– P2(“the mat sat on the cat”)
– P3(“the cat sat en la alfombra”)
– P4(“el gato se sentó en la alfombra”)
• Given a model M and an observation s we want
– Probability of getting s through random sampling from M
– A mechanism to produce observations (strings) legal in M
• User thinks of a relevant document and then picks some keywords
to use as a query
165
166. Generative Probabilistic Models
• What is the probability of producing the query from a document? p(q|d)
• Referred to as query-likelihood
• Assumptions:
• The probability of a document being relevant is strongly correlated with
the probability of a query given a document, i.e. p(d|r) is correlated
with p(q|d)
• User has a reasonable idea of the terms that are likely to appear in the
“ideal” document
• User’s query terms can distinguish the “ideal” document from the rest
of the corpus
• The query is generated as a representative of the “ideal” document
• System’s task is to estimate for each of the documents in the collection,
which is most likely to be the “ideal” document
167. Language Models (1998/2001)
• Let’s assume we point blindly, one at a time, at 3 words
in a document
– What is the probability that I, by accident, pointed at the words
“Master”, “computer” and “Science”?
– Compute the probability, and use it to rank the documents.
• Words are “sampled” independently of each other
– Joint probability decomposed into a product of marginals
– Estimation of probabilities just by counting
• Higher-order models or unigrams?
– Parameter estimation can be very expensive
168. Standard LM Approach
• Assume that query terms are drawn identically and
independently from a document
169. Estimating language models
• Usually we don’t know M
• Maximum Likelihood Estimate of
– Simply use the number of times the query term occurs in
the document divided by the total number of term
occurrences.
• Zero Probability (frequency) problem
169
170. Document Models
• Solution: Infer a language model for each document,
where
• Then we can estimate
• Standard approach is to use the probability of a term to
smooth the document model.
• Interpolate the ML estimator with general language
expectations
171. Estimating Document Models
• Basic Components
– Probability of a term given a document (maximum likelihood estimate)
– Probability of a term given the collection
– tf(t,d) is the number of times term t occurs in document d (term frequency)
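A sketch of the smoothed query-likelihood score built from the two components above; this uses linear (Jelinek-Mercer) interpolation with a hypothetical weight lam, and toy counts:

```python
import math

def query_loglik(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """log p(q|d) with linear interpolation smoothing:
    p(t|d) = lam * tf(t,d)/|d| + (1 - lam) * cf(t)/|C|.
    The collection model keeps unseen query terms from zeroing the score."""
    logp = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len
        p_coll = coll_tf.get(t, 0) / coll_len
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0:
            return float("-inf")  # term absent from the whole collection
        logp += math.log(p)
    return logp

# Toy collection: smoothing rescues the query term "science",
# which is absent from the document but present in the collection
doc = {"master": 2, "computer": 1}
coll = {"master": 10, "computer": 20, "science": 5}
score = query_loglik(["master", "computer", "science"], doc, 3, coll, 1000)
assert score > float("-inf")
```

Without smoothing, the single missing term "science" would give the document a probability of zero, which is exactly the zero-frequency problem from the previous slide.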
173. Implementation as vector product
p(t|D) = tf(t,D) / Σ_{t'} tf(t',D)        p(t) = df(t) / Σ_{t'} df(t')
Recall: score(q,d) = q · d, with q_k = tf(k,q) and
d_k = log( 1 + ( tf(k,d) / df(k) ) × ( Σ_t df(t) / Σ_t tf(t,d) ) )
– tf(k,d) / df(k): tf.idf of term k in document d
– 1 / Σ_t tf(t,d): inverse length of d
– Σ_t df(t): term importance
– the log(1 + …) expresses the odds of the probability of matching text
174. Document length normalization
• Probabilistic models assume causes for documents differing in
length
– Scope
– Verbosity
• In practice, document length softens the term frequency
contribution to the final score
– We’ve seen it in BM25 and LMs
– Usually with a tunable parameter that regulates the
amount of softening
– Can be a function of the deviation of the average
document length
– Can be incorporated into vanilla tf-idf
174
175. Other models
• Modeling term dependencies (positions) in the language
modeling framework
– Markov Random Fields
• Modeling matches (occurrences of words) in different
parts of a document -> fielded models
– BM25F
– Markov Random Fields can account for this as well
175
176. More involved signals for ranking
• From document understanding to query
understanding
• Query rewrites (gazetteers, spell correction),
named entity recognition, query suggestions,
query categories, query segmentation ...
• Detecting query intent, triggering verticals
– direct target towards answers
– richer interfaces
176
177. Signals for Ranking
• Signals for ranking: matches of query terms in
documents, query-independent quality measures,
CTR, among others
• Probabilistic IR models are all about counting
– occurrences of terms in documents, in sets of
documents, etc.
• How to aggregate efficiently a large number of
“different” counts
– coming from the same terms
– no double counts!
177
178. Searching for food
• New York’s greatest pizza
‣ New OR York’s OR greatest OR pizza
‣ New AND York’s AND greatest AND pizza
‣ New OR York OR great OR pizza
‣ “New York” OR “great pizza”
‣ “New York” AND “great pizza”
‣ York < New AND great OR pizza
• among many more.
178
179. “Refined” matching
• Extract a number of virtual regions in the document
that match some version of the query (operators)
– Each region provides a different evidence of
relevance (i.e. signal)
• Aggregate the scores over the different regions
• Ex. :“at least any two words in the query appear
either consecutively or with an extra word between
them”
179
181. Remember BM25
• Term (tf) independence
• Vague Prior over terms not
appearing in the query
• Eliteness - topical model that
perturbs the word distribution
• 2-poisson distribution of term
frequencies over relevant and non-relevant
documents
181
182. Feature dependencies
• Class-linearly dependent (or affine) features
– add no extra evidence/signal
– model overfitting (vs capacity)
• Still, it is desirable to enrich the model with more
involved features
• Some features are surprisingly correlated
• Positional information requires a large number of
parameters to estimate
• Potentially up to
182
183. Query concept segmentation
• Queries are made up of basic conceptual units,
comprising many words
– “Indian summer victor herbert”
• Spurious matches: “san jose airport” -> “san jose
city airport”
• Model to detect segments based on generative
language models and Wikipedia
• Relax matches using factors of the max ratio
between span length and segment length
183
184. Virtual regions
• Different parts of the document
provide different evidence of
relevance
• Create a (finite) set of (latent)
artificial regions and re-weight
184
185. Implementation
• An operator maps a query to a set of queries,
which could match a document
• Each operator has a weight
• The average term frequency in a document is
185
186. Remarks
• Different saturation (eliteness) function?
– learn the real functional shape!
– log-logistic is good if the class-conditional
distributions are drawn from an exp. family
• Positions as variables?
– kernel-like method or exp. #parameters
• Apply operators on a per query or per query class
basis?
186
187. Operator examples
• BOW: maps a raw query to the set of queries
whose elements are the single terms
• p-grams: set of all p-gram of consecutive terms
• p-and: all conjunctions of p arbitrary terms
• segments: match only the “concepts”
• Enlargement: some words might sneak in
between the phrases/segments
187
189. ... not that far away
term frequency
link information
query intent information
editorial information
click-through information
geographical information
language information
user preferences
document length
document fields
other gazillion sources of information
189
190. Dictionaries
• Fast look-up
– Might need specific structures to scale up
• Hash tables
• Trees
– Tolerant retrieval (prefixes)
– Spell checking
• Document correction (OCR)
• Query misspellings (did you mean … ?)
• (Weighted) edit distance – dynamic programming
• Jaccard overlap (index character k-grams)
• Context sensitive
• http://norvig.com/spell-correct.html
– Wild-card queries
• Permuterm index
• K-gram indexes
190
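The edit-distance computation mentioned above can be sketched with the classic dynamic program (here the unweighted Levenshtein variant, keeping only one previous row):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming: minimum number
    of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# "did you mean ...?" style check: a transposition costs two edits here
assert edit_distance("retreival", "retrieval") == 2
assert edit_distance("google", "google") == 0
```

A spell checker would compare the query term against candidate dictionary terms (pre-filtered, e.g. with character k-grams and Jaccard overlap) and suggest the closest one.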
191. Hardware basics
• Access to data in memory is much faster than access to data on disk.
• Disk seeks: No data is transferred from disk while the disk head is being
positioned.
• Therefore: Transferring one large chunk of data from disk to memory is
faster than transferring many small chunks.
• Disk I/O is block-based: Reading and writing of entire blocks (as opposed
to smaller chunks).
• Block sizes: 8KB to 256 KB.
191
192. Hardware basics
• Many design decisions in information retrieval are based on the
characteristics of hardware
• Servers used in IR systems now typically have several GB of main memory,
sometimes tens of GB.
• Available disk space is several (2-3) orders of magnitude larger.
• Fault tolerance is very expensive: It is much cheaper to use many regular
machines rather than one fault tolerant machine.
192
194. MapReduce
• The index construction algorithm we just described is an instance of
MapReduce.
• MapReduce (Dean and Ghemawat 2004) is a robust and conceptually
simple framework for distributed computing …
• … without having to write code for the distribution part.
• They describe the Google indexing system (ca. 2002) as consisting of a
number of phases, each implemented in MapReduce.
• Open source implementation Hadoop
– Widely used throughout industry
194
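The map/reduce split can be sketched in-process for index construction; the function names and the single-machine simulation of the shuffle/sort step are illustrative, not the Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id, text):
    # emit (term, doc_id) pairs, one per token, as in index construction
    for term in text.lower().split():
        yield (term, doc_id)

def reduce_phase(pairs):
    # group pairs by term and collect sorted postings lists
    pairs = sorted(pairs)  # stands in for the framework's shuffle/sort
    return {term: sorted({d for _, d in group})
            for term, group in groupby(pairs, key=itemgetter(0))}

docs = {1: "new york pizza", 2: "great pizza"}
pairs = [p for doc_id, text in docs.items()
         for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
assert index["pizza"] == [1, 2]
```

In a real deployment the map and reduce tasks run on many machines, and the framework handles partitioning, sorting, and fault tolerance.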
195. MapReduce
• Index construction was just one phase.
• Another phase: transforming a term-partitioned index
into a document-partitioned index.
– Term-partitioned: one machine handles a subrange of
terms
– Document-partitioned: one machine handles a
subrange of documents
• Most search engines use a document-partitioned index for
better load balancing, etc.
195
196. Distributed IR
• Basic process
– All queries sent to a director machine
– Director then sends messages to many index servers
• Each index server does some portion of the query processing
– Director organizes the results and returns them to the user
• Two main approaches
– Document distribution
• by far the most popular
– Term distribution
196
197. Distributed IR (II)
• Document distribution
– each index server acts as a search engine for a small fraction of
the total collection
– director sends a copy of the query to each of the index servers,
each of which returns the top k results
– results are merged into a single ranked list by the director
• Collection statistics should be shared for effective ranking
197
198. Caching
• Query distributions similar to Zipf
• About half of each day’s queries are unique, but some are very popular
– Caching can significantly improve effectiveness
• Cache popular query results
• Cache common inverted lists
– Inverted list caching can help with unique queries
– Cache must be refreshed to prevent stale data
198
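A minimal sketch of a query-result cache with least-recently-used eviction; the class name and capacity are hypothetical:

```python
from collections import OrderedDict

class QueryCache:
    """LRU cache for popular query results; with a Zipf-like query
    distribution a small cache absorbs a large share of the traffic."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, query):
        if query not in self.store:
            return None
        self.store.move_to_end(query)  # mark as recently used
        return self.store[query]

    def put(self, query, results):
        self.store[query] = results
        self.store.move_to_end(query)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

cache = QueryCache(capacity=2)
cache.put("pizza", ["d1", "d2"])
cache.put("new york", ["d3"])
cache.get("pizza")             # touch: "pizza" is now most recent
cache.put("weather", ["d4"])   # evicts "new york"
assert cache.get("new york") is None
assert cache.get("pizza") == ["d1", "d2"]
```

The same structure works for caching inverted lists; in either case entries must be invalidated when the index is refreshed to avoid serving stale data.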
199. Others
• Efficiency (compression, storage, caching,
distribution)
• Novelty and diversity
• Evaluation
• Relevance feedback
• Learning to rank
• User models
– Context, personalization
• Sponsored Search
• Temporal aspects
• Social aspects
199
Not only is the data different, but also the queries, and the results we get from it!
To the surprise of many, the search box has become the preferred method of information access.
Customers ask: Why can’t I search my database in the same way?
Archie is a tool for indexing FTP archives, allowing people to find specific files. It is considered to be the first Internet search engine.
In the summer of 1993, no search engine existed for the web, just catalogs.
One of the first "all text" crawler-based search engines was WebCrawler, which came out in 1994. Unlike its predecessors, it allowed users to search for any word in any webpage, which has become the standard for all major search engines since. It was also the first one widely known by the public. Also in 1994, Lycos (which started at Carnegie Mellon University) was launched and became a major commercial endeavor.
In 1996, Netscape was looking to give a single search engine an exclusive deal as the featured search engine on Netscape's web browser. There was so much interest that instead Netscape struck deals with five of the major search engines: for $5 million a year, each search engine would be in rotation on the Netscape search engine page. The five engines were Yahoo!, Magellan, Lycos, Infoseek, and Excite.[7][8]
Google adopted the idea of selling search terms in 1998, from a small search engine company named goto.com. This move had a significant effect on the SE business, which went from struggling to one of the most profitable businesses in the internet.[6]
Aardvark was a social search service that connected users live with friends or friends-of-friends who were able to answer their questions, also known as a knowledge market. It was bought by Google in 2010.
Kaltix Corp., commonly known as Kaltix is a personalized search engine company founded at Stanford University in June 2003 by Sep Kamvar, Taher Haveliwala and Glen Jeh.[1][2] It was acquired by Google in September 2003.
How do we communicate with search engines
Information needs must be expressed as a query
– But users don’t often know what they want
ASK
Hypothesis Belkin et al (1982)
Proposed a model called Anomalous State of Knowledge
ASK
hypothesis:
– difficult for people to define exactly what their information need is, because that information is a gap in their knowledge
- Search Engines should look for information that fills those gaps
Interesting ideas, little practical impact (yet)
Under specified
Ambiguous
Context sensitive
represent different types of search
– E.g. decision making
– background search
– fact search
Need to have fairly deep knowledge...
– What sites are possible
– What’s in a given site (what’s likely to be there)
– Authority of source / site
– Index structure (time, place, person, ...) what kinds of searches?
– How to read a SERP critically
Commonplace book
Start with the simplest search you can think of:
[ upper lip indentation ]
If it’s not right, you can always modify it.
• When I did this, I clicked on the first result, which took me to Yahoo Answers. There’s a nice article there about something called the philtrum.
Ghost town vs abandoned
1750
Search for images with creative commons attributions
The need is verbalized mentally
Queries and documents must share a (at least comparable if not the same) representation
SCC – strongly connected component
IN – pages not discovered yet
OUT – sites that contain only in-host link
Tendrils – can’t reach or be reached from the SCC
creation of indefinitely deep directory structures like http://foo.com/bar/foo/bar/foo/bar/foo/bar/.....
dynamic pages like calendars that produce an infinite number of pages for a web crawler to follow.
pages filled with a large number of characters, crashing the lexical analyzer parsing the page.
pages with session-id's based on required cookies.
Data: this type of data is conventionally dealt with by a database management system.
Structure: With this view, documents are not treated as flat entities, so a document and its components (e.g. sections) can be retrieved
How do we arrive to the content representation of a document?
Nontrivial issues. Requires some design decisions.
Nontrivial issues. Requires some design decisions.
Matches are then more likely to be relevant, and since the documents are smaller it will be much easier for the user to find the relevant passages in the document. But why stop there? We could treat individual sentences as mini-documents. It becomes clear that there is a precision/recall tradeoff here. If the units get too small, we are likely to miss important passages because terms were distributed over several mini-documents, while if units are too large we tend to get spurious matches and the relevant information is hard for the user to find.
The problems with large document units can be alleviated by use of explicit or implicit proximity search
A simple strategy is to just split on all non-alphanumeric characters – bad
you always want to do the exact same tokenization of document and query words, generally by processing queries with the same tokenize
Conceptually, splitting on white space can also split what should be regarded as a single token. This occurs most commonly with names (San Francisco, Los Angeles) but also with borrowed foreign phrases (au fait)
Index numbers -> (One answer is using n-grams: IIR ch. 3)
Methods of word segmentation vary from having a large vocabulary and taking the longest vocabulary match with some heuristics for unknown words to the use of machine learning sequence models, such as hidden Markov models or conditional random fields, trained over hand-segmented words
No unique tokenization + completely different interpretation of a sequence depending on where you split
Nevertheless: “Google ignores common words and characters such as where, the, how, and other digits and letters which slow down your search without improving the results.” (Though you can explicitly ask for them to remain.)
Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens. The most standard way to normalize is to implicitly create equivalence classes, which are normally named after one member of the set. For instance, if the tokens anti-discriminatory and antidiscriminatory are both mapped onto the term antidiscriminatory, in both the document text and queries, then searches for one term will retrieve documents that contain either.
The advantage of just using mapping rules that remove characters like hyphens is that the equivalence classing to be done is implicit, rather than being fully calculated in advance: the terms that happen to become identical as the result of these rules are the equivalence classes. It is only easy to write rules of this sort that remove characters. Since the equivalence classes are implicit, it is not obvious when you might want to add characters. For instance, it would be hard to know to turn antidiscriminatory into anti-discriminatory.
An alternative to creating equivalence classes is to maintain relations between not normalized tokens. This method can be extended to hand-constructed lists of synonyms such as car and automobile, a topic we discuss further in
Too much equivalence classing
Why not the reverse?
Also stemmers based on n-grams
For example trigrams: information => {inf, nfo, for, etc}
caresses
parties
separational -> separate
factional -> faction
Compression
Cache pressure
The distribution of term frequencies is similar for different texts of significantly large size.
Heaps’ law gives the vocabulary size in collections.
Positional indexes are helpful, but we’ll ignore them for now
(Salton & McGill 1983)
The classifier that assigns a vector x to the class with the highest posterior is called the Bayes classifier.
The error associated with this classifier is called the Bayes error. This is the lowest possible error rate for any classifier over the distribution of all examples and for a chosen hypothesis space
A complete probability distribution over documents
− defines a likelihood for any possible document d (observation)
− P(relevant) via P(document): P(R|d) ∝ P(d|R)·P(R)
− can “generate” synthetic documents that will share some properties of the original collection
Not all IR models do this – it is possible to estimate P(R|d) directly, e.g. with logistic regression
Assumptions: one relevance value for every word w
Words are conditionally independent given R – false, but it lowers the number of parameters
All words absent are equally likely to be observed in relevant and not relevant classes
One relevance status value per word
empty document (all words absent) is equally likely
to be observed in relevant and non-relevant classes (provides a natural zero) - practical reason, only score terms that appear in the query (TAT)
Doesn’t model word dependence. Doesn’t account for document length. Doesn’t model word frequencies
Now D_t = d_t account for the number of times we observe the term in the document (we have a vector of frequencies)
Can be seen as a probabilistic automaton
They originate from probabilistic models of language generation developed for automatic speech recognition systems in the early 1980's (see e.g. Rabiner 1990). Automatic speech recognition systems combine probabilities of two distinct models: the acoustic model and the language model. The acoustic model might for instance produce the following candidate texts in decreasing order of probability: “food born thing”, “good corn sing”, “mood morning”, and “good morning”. Now, the language model would determine that the phrase “good morning” is much more probable, i.e., it occurs more frequently in English than the other phrases. When combined with the acoustic model, the system is able to decide that “good morning” was the most likely utterance, thereby increasing the system's performance.
For information retrieval, language models are built for each document. By following this approach, the language model of the book you are reading now would assign an exceptionally high probability to the word “retrieval”, indicating that this book would be a good candidate for retrieval if the query contains this word.
For some applications we want strings like P3 to be highly probable as well; in IR, a unigram model gives P1 = P2.
Veto terms
Originally multiple Bernoulli; the multinomial is widely used now
accounts for multiple word occurrences in the query (primitive) – well understood: lots of research in related fields (and now in IR) – possibility for integration with ASR/MT/NLP (same event space)
Discounting methods
Problem with all discounting methods:
– discounting treats unseen words equally (add or subtract ε) – some words are more frequent than others
Essentially, the data model and retrieval function are one and the same
Different ways of smoothing; Dirichlet prior smoothing is particularly popular