This is a high-level summary of Gumshoe, the enterprise search engine built by our group, which currently powers IBM intranet search. It was a SIGIR 2011 Industrial Track keynote talk.
Google search vs Solr search for Enterprise search - Veera Shekar
- The document discusses implementing enterprise search and compares Google Search to alternative options. It covers topics like how search engines work, the architecture of Google Search, and top 5 requirements for enterprise search implementation.
- For each requirement, it identifies disadvantages of using Google Search and discusses alternative implementation options that may perform better like Apache Solr, Endeca, and Autonomy.
- The overall conclusion is that no single search engine fulfills all enterprise needs, and custom application development is often required to fully meet requirements, allowing the use of various tools.
Enterprise Search in the Big Data Era: Recent Developments and Open Challenges - Yunyao Li
These are the slides used in our 3-hour tutorial at VLDB 2014.
Yunyao Li, Ziyang Liu, Huaiyu Zhu: Enterprise Search in the Big Data Era: Recent Developments and Open Challenges. PVLDB 7(13): 1717-1718 (2014)
Abstract:
Enterprise search allows users in an enterprise to retrieve desired information through a simple search interface. It is widely viewed as an important productivity tool within an enterprise. While Internet search engines have been highly successful, enterprise search remains notoriously challenging due to a variety of unique challenges, and is being made more so by the increasing heterogeneity and volume of enterprise data. On the other hand, enterprise
search also presents opportunities to succeed in ways beyond current Internet search capabilities. This tutorial presents an organized overview of these challenges and opportunities, and reviews the state-of-the-art techniques for building a reliable and high quality enterprise search engine, in the context of the rise of big data.
This document discusses options for integrating external data into SharePoint, including Business Connectivity Services (BCS). BCS allows SharePoint to connect to external data sources and make that data accessible via external lists. However, BCS has limitations and its future is uncertain. New options like Power BI and Logic Apps provide more flexibility for building applications that integrate external data without relying on BCS. Hybrid BCS enables accessing on-premises data from SharePoint Online by publishing data through an on-premises gateway.
An Introduction to Graph: Database, Analytics, and Cloud Services - Jean Ihm
Graph analysis employs powerful algorithms to explore and discover relationships in social network, IoT, big data, and complex transaction data. Learn how graph technologies are used in applications such as fraud detection for banking, customer 360, public safety, and manufacturing. This session will provide an overview and demos of graph technologies for Oracle Cloud Services, Oracle Database, NoSQL, Spark and Hadoop, including PGX analytics and PGQL property graph query language.
Presented at Analytics and Data Summit, March 20, 2018
The document describes a method for focused crawling to retrieve structured data from web pages. It involves using an online classifier trained on URL features to identify pages containing structured data. A bandit-based selection strategy is used to balance exploration and exploitation. Experiments show the adaptive approach retrieves 26% more relevant pages than static classification, and 66% more when focused on a specific objective. Decaying the bandit randomness over time improved results further. The method was able to retrieve hundreds of millions of structured data pages from billions of web pages.
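The bandit-based selection strategy described above can be illustrated with a minimal epsilon-greedy sketch. This is not the paper's implementation; the class, the bucket-based arms, and the decay scheme are illustrative assumptions.

```python
import random

class EpsilonGreedyCrawler:
    """Toy sketch of bandit-based URL frontier selection with decaying
    randomness. Arms are URL 'buckets' (e.g. grouped by site or URL
    pattern); a reward of 1 means a fetched page contained structured
    data. All names here are illustrative, not from the paper."""

    def __init__(self, arms, epsilon=0.3, decay=0.99):
        self.epsilon = epsilon      # exploration probability
        self.decay = decay          # shrinks epsilon after each selection
        self.pulls = {a: 0 for a in arms}
        self.wins = {a: 0 for a in arms}

    def select_arm(self):
        # Explore with probability epsilon; otherwise exploit the bucket
        # with the best observed success rate so far.
        if random.random() < self.epsilon:
            arm = random.choice(list(self.pulls))
        else:
            arm = max(self.pulls,
                      key=lambda a: self.wins[a] / (self.pulls[a] or 1))
        self.epsilon *= self.decay  # decay randomness over time
        return arm

    def update(self, arm, relevant):
        # Record the classifier's verdict for the fetched page.
        self.pulls[arm] += 1
        self.wins[arm] += int(relevant)
```

As epsilon decays, selection shifts from exploration toward exploiting buckets that have historically yielded structured-data pages, mirroring the improvement the experiments attribute to decaying the bandit randomness.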
This document provides an overview of Amundsen, an open source data discovery and metadata platform developed by Lyft. It begins with an introduction to the challenges of data discovery and outlines Amundsen's architecture, which uses a graph database and search engine to provide metadata about data resources. The document discusses how Amundsen impacts users at Lyft by reducing time spent searching for data and discusses the project's community and future roadmap.
Video: https://www.youtube.com/watch?v=Rt2oHibJT4k
Technologies such as Hadoop have addressed the "Volume" problem of Big Data, and technologies such as Spark have recently addressed the "Velocity" problem – but the "Variety" problem is largely unaddressed – there is a lot of manual "data wrangling" to manage data models.
These manual processes do not scale well. Not only is the variety of data increasing, but the rate of change in data definitions is increasing as well. We can’t keep up. NoSQL data repositories can handle storage, but we need effective models of the data to fully utilize it.
This talk will present tools and a methodology to manage Big Data Models in a rapidly changing world. This talk covers:
Creating Semantic Metadata Models of Big Data Resources
Graphical UI Tools for Big Data Models
Tools to synchronize Big Data Models and Application Code
Using NoSQL Databases, such as Amazon DynamoDB, with Big Data Models
Using Big Data Models with Hadoop, Storm, Spark, Giraph, and Inference
Using Big Data Models with Machine Learning to generate Predictive Models
Developer Collaborative/Coordination processes using Big Data Models and Git
Managing change – Big Data Models with rapidly changing Data Resources
Engineering patterns for implementing data science models on big data platforms - Hisham Arafat
A discussion of practically implementing data science models on big data platforms from an engineering perspective, and an eye-opener on the engineering factors involved in designing a working solution. We use a simple text-mining example on social media analytics for brand marketing. At first it appears to be a simple solution, but if you dig into the implementation aspects of even a simple analytics model, you discover the degree of complexity in each part of the solution. An abstraction of the key Big Data advantages is very helpful for selecting appropriate Big Data technology components from a very large landscape. Two referenced examples are given: one using the Lambda Architecture, and an unusual approach to image processing using the Big Data abstraction provided.
Implementing BCS-Business Connectivity Services - SharePoint 2013 - Office 365 - Shahzad S
BCS enables accessing external data from SharePoint and Office applications. It involves three phases - groundwork, SharePoint, and Office. Architectures include server-side only in SharePoint, client-side in Office, on-premises, cloud-only, and hybrid. Solutions can be built using Visual Studio or SharePoint Designer connecting to databases, web services, .NET assemblies, and OData sources. Security, performance, and limitations require consideration.
Amundsen is a metadata-driven application developed by Lyft to solve data discovery challenges. It provides a search-based UI and uses a distributed architecture with various microservices to index and serve metadata from multiple sources. Key components include a metadata service using Neo4j, a search service using Elasticsearch, and a frontend. The tool has been hugely successful at Lyft and is now open source. Future work includes expanding metadata coverage and integrating with other tools.
Solution architecture for big data projects
Presented at SQL Saturday Atlanta May 18, 2013
Text mining is projected to dominate data mining, and the reasons are evident: we have more text available than numeric data. Microsoft introduced a new technology to SQL Server 2012 called Semantic Search. This session's detailed description and demos give you important information for the enterprise implementation of Tag Index and Document Similarity Index. The demos include a web-based Silverlight application, and content documents from Wikipedia. We'll also look at strategy tips for how to best leverage the new semantic technology with existing Microsoft data mining.
Big data architectures and the data lake - James Serra
The document provides an overview of big data architectures and the data lake concept. It discusses why organizations are adopting data lakes to handle increasing data volumes and varieties. The key aspects covered include:
- Defining top-down and bottom-up approaches to data management
- Explaining what a data lake is and how Hadoop can function as the data lake
- Describing how a modern data warehouse combines features of a traditional data warehouse and data lake
- Discussing how federated querying allows data to be accessed across multiple sources
- Highlighting benefits of implementing big data solutions in the cloud
- Comparing shared-nothing, massively parallel processing (MPP) architectures to symmetric multi-processing (SMP) architectures
Relecura is a software platform that provides features for patent and portfolio analysis including searching, bucketing, trend analysis, taxonomy building, and collaboration tools. Key features allow users to initiate searches, import portfolios, bucket patents through auto, training or query methods, analyze growth trends, tag and share documents, map priority and citations, and generate automated reports. The platform also offers mobile access, custom dashboards, and support services.
The document discusses Lyft's data discovery tool called Amundsen. It provides an overview of Amundsen's architecture including its use of a graph database and Elasticsearch for metadata storage and search. It describes the challenges of data discovery that Amundsen addresses like time spent searching for data. The document outlines Amundsen's key components like its databuilder, metadata and search services. It discusses Amundsen's impact and popularity at Lyft and its open source community. Future roadmap plans include additional metadata types and deeper integrations with other tools.
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ... - South London Geek Nights
The document provides an overview of NoSQL databases, including what NoSQL means, the rise of NoSQL as an alternative to relational databases, different classifications of NoSQL databases, pros and cons, use cases, and real-world examples. It discusses how NoSQL databases provide more flexible schemas and scalability than relational databases for applications like logging, shopping carts, and user preferences, while relational databases remain better for transactions and business critical data. The presenter then demonstrates CouchDB as one example of a NoSQL database.
This document discusses how insurance companies use MongoDB. It provides examples of how MongoDB allows insurance companies to create a single customer view, consolidate data from multiple disparate systems, and distribute claims information globally in real-time. MongoDB provides a flexible schema, automatic replication of data, and the ability to query data locally for improved customer experience, risk analysis, fraud detection, and claims processing. The document highlights several insurance companies that have adopted MongoDB to unify customer data, modernize legacy systems, and power new data-driven applications and services.
Slides: NoSQL Data Modeling Using JSON Documents – A Practical Approach - DATAVERSITY
After three decades of relational data modeling, everyone’s pretty comfortable with schemas, tables, and entity-relationships. As more and more Global 2000 companies choose NoSQL databases to power their Digital Economy applications, they need to think about how to best model their data. How do they move from a constrained, table-driven model to an agile, flexible data model based on JSON documents?
This webinar is intended for architects and application developers who want to learn about new JSON document data modeling approaches, techniques, and best practices. This webinar will show you how to get started building a JSON document data model, how to migrate a table-based data model to JSON documents, and how to optimize your design to enable fast query performance.
This webinar will provide practical, experience-based advice and best practices for modeling JSON documents, including:
- When to embed or not embed objects in your JSON document
- Data modeling using a practical data access pattern approach
- Indexing your JSON documents
- Querying your data using N1QL (SQL for JSON)
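The embed-vs-reference decision above can be sketched with plain JSON-style documents. The order/customer schema, field names, and totals here are hypothetical, invented for illustration; they are not from the webinar.

```python
# Two ways to model the same order data as JSON documents.

# Embedded: line items live inside the order document. A good fit when
# items are always read together with the order and the array stays small.
order_embedded = {
    "type": "order",
    "orderId": "o-1001",
    "customer": {"name": "Ada", "city": "London"},
    "items": [
        {"sku": "A-1", "qty": 2, "price": 9.99},
        {"sku": "B-7", "qty": 1, "price": 24.50},
    ],
}

# Referenced: items are separate documents linked by order id. A good fit
# when items change independently or the array would grow without bound.
order_ref = {"type": "order", "orderId": "o-1001", "customerId": "c-42"}
items_ref = [
    {"type": "item", "orderId": "o-1001", "sku": "A-1", "qty": 2, "price": 9.99},
    {"type": "item", "orderId": "o-1001", "sku": "B-7", "qty": 1, "price": 24.50},
]

def order_total(items):
    # Same aggregation works against either shape once the items are in hand.
    return sum(i["qty"] * i["price"] for i in items)
```

Either shape yields the same order total; the difference is in access patterns, document growth, and how many reads a query needs, which is exactly the trade-off the webinar's data-access-pattern approach weighs.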
Large Scale Graph Analytics with RDF and LPG Parallel Processing - Cambridge Semantics
Analytics that traverse large portions of large graphs have been problematic for both RDF and LPG graph engines. In this webinar Barry Zane, former co-founder of Netezza, Paraccel and SPARQL City and current VP of Engineering at Cambridge Semantics, discusses the native parallel-computing approach taken in AnzoGraph to yield interactive, scalable performance for RDF and LPG graphs.
Webinar: How to Drive Business Value in Financial Services with MongoDB - MongoDB
Huge upheaval in the finance industry has led to a major strain on existing IT infrastructure and systems. New finance industry regulation has meant increased volume, velocity and variability of data. This coupled with cost pressures from the business has led these institutions to seek alternatives. Top tier institutions like MetLife have turned to MongoDB because of the enormous business value it enables.
In this session, hear how MongoDB enabled these successful real world examples:
Single View of a Customer - 3 months and $2M for a single view of a customer across 50 source systems
Reference Data Management - $40M in cost savings from migrating to MongoDB for reference data management
Private cloud - MongoDB as a PaaS across a tier 1 bank for enabling agility for operations, not just the developer
The use cases are specific to financial services but the patterns of usage - agility, scale, global distribution - will be applicable across many industries.
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Connected World - Cambridge Semantics
Thomas Cook, director of sales, Cambridge Semantics, offers a primer on graph database technology and the rapid growth of knowledge graphs at Data Summit 2020 in his presentation titled "AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Connected World".
The Data Lake and Getting Businesses the Big Data Insights They Need - Dunn Solutions Group
Do terms like "Data Lake" confuse you? You’re not alone. With all of the technology buzzwords flying around today, it can be hard to keep up with and clearly understand each of them. However, a data lake is definitely something worth taking the time to understand. Leveraging data lake technology, companies are finally able to keep all of their disparate information and streams of data in one secure location, ready for consumption at any time – this includes structured, unstructured, and semi-structured data. For more information on our Big Data Consulting Services, don’t hesitate to visit us online at: http://bit.ly/2fvV5rR
Incorporating the Data Lake into Your Analytic Architecture - Caserta
Joe Caserta, President at Caserta Concepts presented at the 3rd Annual Enterprise DATAVERSITY conference. The emphasis of this year's agenda is on the key strategies and architecture necessary to create a successful, modern data analytics organization.
Joe Caserta presented Incorporating the Data Lake into Your Analytics Architecture.
For more information on the services offered by Caserta Concepts, visit out website at http://casertaconcepts.com/.
Data Lakes are early in the Gartner hype cycle, but companies are getting value from their cloud-based data lake deployments. Break through the confusion between data lakes and data warehouses and seek out the most appropriate use cases for your big data lakes.
Has your app taken off? Are you thinking about scaling? MongoDB makes it easy to horizontally scale out with built-in automatic sharding, but did you know that sharding isn't the only way to achieve scale with MongoDB?
In this webinar, we'll review three different ways to achieve scale with MongoDB. We'll cover how you can optimize your application design and configure your storage to achieve scale, as well as the basics of horizontal scaling. You'll walk away with a thorough understanding of options to scale your MongoDB application.
Topics covered include:
- Scaling Vertically
- Hardware Considerations
- Index Optimization
- Schema Design
- Sharding
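The sharding topic in the list above can be illustrated with a toy hash-based routing function. This is an assumption-laden sketch of the general idea, not MongoDB's actual hashed-sharding implementation, which uses its own hash function and chunk ranges rather than a simple modulo.

```python
import hashlib

def shard_for(key, n_shards):
    """Route a document to a shard by hashing its shard key.
    Illustrative only: real MongoDB hashed sharding assigns hash
    ranges to chunks and balances chunks across shards."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % n_shards

# Documents with the same shard key always land on the same shard,
# while distinct keys spread across the cluster.
docs = [{"user_id": u} for u in ("alice", "bob", "carol", "dan")]
placement = {d["user_id"]: shard_for(d["user_id"], 3) for d in docs}
```

The sketch shows why shard-key choice matters: a high-cardinality key spreads writes evenly, while a low-cardinality key would funnel everything to a few shards, which is part of what the webinar weighs against vertical scaling and schema-level optimizations.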
Using Compass to Diagnose Performance Problems - MongoDB
Speaker: Brian Blevins, Technical Services Engineer, MongoDB
Level: 200 (Intermediate)
Track: Performance
Since the performance of your application drives engagement and revenue, it can make or break the success of your organization. You can use the Compass graphical client from MongoDB to visualize your database schema, collect information on optimization opportunities and make database changes to improve performance. In this talk, we will briefly introduce Compass and then delve into the features supporting database performance optimization. The talk will combine instruction on the use of Compass with recommendations for performance best practices. We will also review the detection and resolution of slow queries and excessive network utilization. After attending the talk, audience members will have a better understanding of the capabilities of Compass, including how those capabilities can be used to find and correct performance bottlenecks in MongoDB databases. This session is designed for those with limited MongoDB experience. Attendees should have a basic understanding of MongoDB’s schema design, the server/database/collection layout, and how their application accesses and uses the MongoDB database.
What You Will Learn:
- Identify excessive network utilization, adjust queries appropriately and use Compass to confirm results.
- Understand how the Compass graphical client can help you improve performance in your MongoDB deployment.
- Use Compass real time statistics to identify slow queries and recognize when a query is a good candidate for adding an index.
Engineering patterns for implementing data science models on big data platformsHisham Arafat
Discussion of practically implementing data science models on big data platforms from engineering perspective. An eye opener on the engineering factors associated with designing and working solution. We use a simple text mining example on social media analytics for brand marketing. At the first while, it seems simple solution however if you go deeply and think on implementation aspects of even a simple analytics model, you can discover the degree of complexity at each part of the solution. An Abstraction of the Big Data key advantages would be very helpful to select appropriate Big Data technology components out of very large landscape. Two examples with reference are given for using Lambda Architecture and unusual way of image processing using Big Data abstraction provided.
Implementing BCS-Business Connectivity Services - Sharepoint 2013- Office 365Shahzad S
BCS enables accessing external data from SharePoint and Office applications. It involves three phases - groundwork, SharePoint, and Office. Architectures include server-side only in SharePoint, client-side in Office, on-premises, cloud-only, and hybrid. Solutions can be built using Visual Studio or SharePoint Designer connecting to databases, web services, .NET assemblies, and OData sources. Security, performance, and limitations require consideration.
Amundsen is a metadata-driven application developed by Lyft to solve data discovery challenges. It provides a search-based UI and uses a distributed architecture with various microservices to index and serve metadata from multiple sources. Key components include a metadata service using Neo4j, a search service using Elasticsearch, and a frontend. The tool has been hugely successful at Lyft and is now open source. Future work includes expanding metadata coverage and integrating with other tools.
Solution architecture for big data projects
solution architecture,big data,hadoop,hive,hbase,impala,spark,apache,cassandra,SAP HANA,Cognos big insights
Presented at SQL Saturday Atlanta May 18, 2013
Text mining is projected to dominate data mining, and the reasons are evident: we have more text available than numeric data. Microsoft introduced a new technology to SQL Server 2012 called Semantic Search. This session's detailed description and demos give you important information for the enterprise implementation of Tag Index and Document Similarity Index. The demos include a web-based Silverlight application, and content documents from Wikipedia. We'll also look at strategy tips for how to best leverage the new semantic technology with existing Microsoft data mining.
Big data architectures and the data lakeJames Serra
The document provides an overview of big data architectures and the data lake concept. It discusses why organizations are adopting data lakes to handle increasing data volumes and varieties. The key aspects covered include:
- Defining top-down and bottom-up approaches to data management
- Explaining what a data lake is and how Hadoop can function as the data lake
- Describing how a modern data warehouse combines features of a traditional data warehouse and data lake
- Discussing how federated querying allows data to be accessed across multiple sources
- Highlighting benefits of implementing big data solutions in the cloud
- Comparing shared-nothing, massively parallel processing (MPP) architectures to symmetric multi-processing (
Relecura is a software platform that provides features for patent and portfolio analysis including searching, bucketing, trend analysis, taxonomy building, and collaboration tools. Key features allow users to initiate searches, import portfolios, bucket patents through auto, training or query methods, analyze growth trends, tag and share documents, map priority and citations, and generate automated reports. The platform also offers mobile access, custom dashboards, and support services.
The document discusses Lyft's data discovery tool called Amundsen. It provides an overview of Amundsen's architecture including its use of a graph database and Elasticsearch for metadata storage and search. It describes the challenges of data discovery that Amundsen addresses like time spent searching for data. The document outlines Amundsen's key components like its databuilder, metadata and search services. It discusses Amundsen's impact and popularity at Lyft and its open source community. Future roadmap plans include additional metadata types and deeper integrations with other tools.
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...South London Geek Nights
The document provides an overview of NoSQL databases, including what NoSQL means, the rise of NoSQL as an alternative to relational databases, different classifications of NoSQL databases, pros and cons, use cases, and real-world examples. It discusses how NoSQL databases provide more flexible schemas and scalability than relational databases for applications like logging, shopping carts, and user preferences, while relational databases remain better for transactions and business critical data. The presenter then demonstrates CouchDB as one example of a NoSQL database.
This document discusses how insurance companies use MongoDB. It provides examples of how MongoDB allows insurance companies to create a single customer view, consolidate data from multiple disparate systems, and distribute claims information globally in real-time. MongoDB provides a flexible schema, automatic replication of data, and the ability to query data locally for improved customer experience, risk analysis, fraud detection, and claims processing. The document highlights several insurance companies that have adopted MongoDB to unify customer data, modernize legacy systems, and power new data-driven applications and services.
Text mining is projected to dominate data mining, and the reasons are evident: we have more text available than numeric data. Microsoft introduced a new technology to SQL Server 2012 called Semantic Search. This session's detailed description and demos give you important information for the enterprise implementation of Tag Index and Document Similarity Index. The demos include a web-based Silverlight application, and content documents from Wikipedia. We'll also look at strategy tips for how to best leverage the new semantic technology with existing Microsoft data mining.
Slides: NoSQL Data Modeling Using JSON Documents – A Practical ApproachDATAVERSITY
After three decades of relational data modeling, everyone’s pretty comfortable with schemas, tables, and entity-relationships. As more and more Global 2000 companies choose NoSQL databases to power their Digital Economy applications, they need to think about how to best model their data. How do they move from a constrained, table-driven model to an agile, flexible data model based on JSON documents?
This webinar is intended for architects and application developers who want to learn about new JSON document data modeling approaches, techniques, and best practices. This webinar will show you how to get started building a JSON document data model, how to migrate a table-based data model to JSON documents, and how to optimize your design to enable fast query performance.
This webinar will provide practical, experience-based advice and best practices for modeling JSON documents, including:
- When to embed or not embed objects in your JSON document
- Data modeling using a practical data access pattern approach
- Indexing your JSON documents
- Querying your data using N1QL (SQL for JSON)
Large Scale Graph Analytics with RDF and LPG Parallel ProcessingCambridge Semantics
Analytics that traverse large portions of large graphs have been problematic for both RDF and LPG graph engines. In this webinar Barry Zane, former co-founder of Netezza, Paraccel and SPARQL City and current VP of Engineering at Cambridge Semantics, discusses the native parallel-computing approach taken in AnzoGraph to yield interactive, scalable performance for RDF and LPG graphs.
Webinar: How to Drive Business Value in Financial Services with MongoDBMongoDB
Huge upheaval in the finance industry has led to a major strain on existing IT infrastructure and systems. New finance industry regulation has meant increased volume, velocity and variability of data. This, coupled with cost pressures from the business, has led these institutions to seek alternatives. Top-tier institutions like MetLife have turned to MongoDB because of the enormous business value it enables.
In this session, hear how MongoDB enabled these successful real world examples:
Single View of a Customer - 3 months and $2M for a single view of a customer across 50 source systems
Reference Data Management - $40M in cost savings from migrating to MongoDB for reference data management
Private cloud - MongoDB as a PaaS across a tier 1 bank for enabling agility for operations, not just the developer
The use cases are specific to financial services but the patterns of usage - agility, scale, global distribution - will be applicable across many industries.
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn...Cambridge Semantics
Thomas Cook, director of sales, Cambridge Semantics, offers a primer on graph database technology and the rapid growth of knowledge graphs at Data Summit 2020 in his presentation titled "AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Connected World".
The Data Lake and Getting Businesses the Big Data Insights They NeedDunn Solutions Group
Do terms like "Data Lake" confuse you? You're not alone. With all of the technology buzzwords flying around today, it can be hard to keep up with and clearly understand each of them. However, a data lake is definitely worth taking the time to understand. Leveraging data lake technology, companies are finally able to keep all of their disparate information and streams of data in one secure location, ready for consumption at any time; this includes structured, unstructured, and semi-structured data. For more information on our Big Data Consulting Services, don't hesitate to visit us online at: http://bit.ly/2fvV5rR
Incorporating the Data Lake into Your Analytic ArchitectureCaserta
Joe Caserta, President at Caserta Concepts presented at the 3rd Annual Enterprise DATAVERSITY conference. The emphasis of this year's agenda is on the key strategies and architecture necessary to create a successful, modern data analytics organization.
Joe Caserta presented Incorporating the Data Lake into Your Analytics Architecture.
For more information on the services offered by Caserta Concepts, visit our website at http://casertaconcepts.com/.
Data Lakes are early in the Gartner hype cycle, but companies are getting value from their cloud-based data lake deployments. Break through the confusion between data lakes and data warehouses and seek out the most appropriate use cases for your big data lakes.
Has your app taken off? Are you thinking about scaling? MongoDB makes it easy to horizontally scale out with built-in automatic sharding, but did you know that sharding isn't the only way to achieve scale with MongoDB?
In this webinar, we'll review three different ways to achieve scale with MongoDB. We'll cover how you can optimize your application design and configure your storage to achieve scale, as well as the basics of horizontal scaling. You'll walk away with a thorough understanding of options to scale your MongoDB application.
Topics covered include:
- Scaling Vertically
- Hardware Considerations
- Index Optimization
- Schema Design
- Sharding
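The sharding topic above rests on shard-key choice. A minimal sketch (pure Python, not MongoDB itself; shard count and key names are made up) of why a hashed shard key spreads a monotonically increasing key evenly across shards, avoiding the hot-shard problem of range-sharding on something like `_id`:

```python
import hashlib

# Assumption for illustration: a 4-shard cluster and user-id keys.
NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Mimic a hashed shard key: hash the value, then bucket it."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Insert 10,000 monotonically increasing keys and count per shard.
counts = [0] * NUM_SHARDS
for i in range(10_000):
    counts[shard_for(f"user-{i}")] += 1

print(counts)  # roughly 2,500 per shard
```

With a range-based key, the same monotonically increasing inserts would all land on the last shard; hashing trades range-query locality for even write distribution.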
Using Compass to Diagnose Performance Problems MongoDB
Speaker: Brian Blevins, Technical Services Engineer, MongoDB
Level: 200 (Intermediate)
Track: Performance
Since the performance of your application drives engagement and revenue, it can make or break the success of your organization. You can use the Compass graphical client from MongoDB to visualize your database schema, collect information on optimization opportunities and make database changes to improve performance. In this talk, we will briefly introduce Compass and then delve into the features supporting database performance optimization. The talk will combine instruction on the use of Compass with recommendations for performance best practices. We will also review the detection and resolution of slow queries and excessive network utilization. After attending the talk, audience members will have a better understanding of the capabilities of Compass, including how those capabilities can be used to find and correct performance bottlenecks in MongoDB databases. This session is designed for those with limited MongoDB experience. Attendees should have a basic understanding of MongoDB’s schema design, the server/database/collection layout, and how their application accesses and uses the MongoDB database.
What You Will Learn:
- Identify excessive network utilization, adjust queries appropriately and use Compass to confirm results.
- Understand how the Compass graphical client can help you improve performance in your MongoDB deployment.
- Use Compass real time statistics to identify slow queries and recognize when a query is a good candidate for adding an index.
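The slow-query triage that Compass assists with can be sketched by hand against profiler-style entries (the entries below are simulated; field names mimic MongoDB's `system.profile` output, but thresholds and namespaces are illustrative):

```python
# Queries slower than this threshold are worth investigating.
SLOW_MS = 100

# Simulated profiler entries: a fast indexed query, a slow full
# collection scan, and a fast collection scan on a tiny collection.
profile = [
    {"op": "query", "ns": "app.users",  "millis": 8,   "planSummary": "IXSCAN {email: 1}"},
    {"op": "query", "ns": "app.orders", "millis": 450, "planSummary": "COLLSCAN"},
    {"op": "query", "ns": "app.events", "millis": 15,  "planSummary": "COLLSCAN"},
]

# A slow query doing a full collection scan is the classic candidate
# for a new index.
index_candidates = [
    e for e in profile
    if e["millis"] >= SLOW_MS and e["planSummary"] == "COLLSCAN"
]
print([e["ns"] for e in index_candidates])
```

Compass surfaces the same signal graphically through its real-time stats and explain-plan views, without writing this kind of script.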
Using Compass to Diagnose Performance Problems in Your ClusterMongoDB
Speaker: Brian Blevins, Technical Services Engineer, MongoDB
Date/Time: June 20, 1:50 PM
Track: Performance
The document discusses leveraging SharePoint 2013's web content management (WCM) features to build a dynamic, search-driven intranet. It describes how to create a taxonomy and term store to enable search-based navigation and filtering. Display templates and web parts like the content search web part (CSWP) can be used to build targeted, contextual experiences. The approach allows building dynamic team sites, applications for customer service, research portals, and more based on indexed content and user behavior analytics.
THAT Conference 2021 - State-of-the-art Search with Azure Cognitive SearchBrian McKeiver
In person at THAT Conference 2021 - How to add AI/machine learning to your website search through Azure Cognitive Search and its brand-new semantic search. Join the session to learn why semantic, AI-powered search improves the quality of search results.
Search Engine Optimisation and Growth Hacking| John Caldwell | CreatorSEOEnterprise Ireland
This document discusses search engine optimization (SEO) and growth hacking techniques for growing an international business online. It covers the search landscape, how search engines work, keywords, Google algorithms, international SEO best practices, and using data to guide growth strategies. The key aspects of SEO discussed are developing a digital marketing strategy, focusing on relevant keywords and content, optimizing websites for findability and conversions, and analyzing data to improve performance over time through testing and experimentation.
This document discusses a proposed search engine optimization (SEO) system. It includes an abstract describing SEO and its goals. The scope section discusses how SEO is commonly used to improve search engine rankings. The proposed system would allow users to search for content by keyword and refine results. It would display search results across different formats. The system requirements, design, testing approach, and screenshots are also outlined. In conclusion, the document states that SEO is an ongoing process that requires constant adaptation to changes in technology and search engine algorithms.
Data Model for Mainframe in Splunk: The Newest Feature of IronstreamPrecisely
Valuable mainframe data is often the missing piece in a holistic infrastructure view within Splunk. But if you're not a mainframe expert, knowing which data sources, fields and calculations are needed to get results within Splunk can be a challenge. Even those with mainframe knowledge can sometimes struggle.
With Syncsort Ironstream® you can easily capture the elements you need in real-time – and Ironstream's new Mainframe Data Model makes it easier than ever to work with complex mainframe metrics in Splunk.
View this webinar on-demand to learn more about this new feature, as well as how to:
• See categorized mainframe metrics in easily understood terms
• Get results faster – no need to research data sources, fields and calculations
• Broaden access to more team members – without the need for deep mainframe knowledge
• Use built-in Splunk tooling to get up and running quickly
• Realize valuable ROI sooner and eliminate the mainframe blind spot
Presented at JavaOne 2013, Tuesday September 24.
"Data Modeling Patterns" co-created with Ian Robinson.
"Pitfalls and Anti-Patterns" created by Ian Robinson.
The document discusses global SEO performance tracking. It recommends tracking key performance indicators (KPIs) like keywords, landing pages, competitive domain and page authorities, internal and external links, and crawl stats. The presentation provides tips on keyword analysis including segmentation of head, body, and tail keywords. It also suggests tools for competitive analysis, site analysis, and tracking changes in keywords, landing pages, authorities, links, and crawl stats. The overall message is that tracking the right metrics across all geographies is essential to measure performance and make more money.
Microsoft Search Strategy Today - Exploring Office 365 Search in Real LifeJoel Oleson
You don't have to wait to get started with Microsoft Search. In this deck we discuss what is new and coming and how to apply strategies for what has been released.
This is from the Business Accelerator Marketing Academy. It takes an in-depth look at on-page and off-page ranking factors for search engine optimization and introduces a few of the tools that are available. We also introduce Google Analytics: how to navigate the platform and read the data.
MongoDB World 2018: How an Idea Becomes a MongoDB FeatureMongoDB
The document describes the software development lifecycle used by the MongoDB Database Engineering Team. It involves carefully scoping projects, designing features, implementing code, testing, and getting acceptance from product management. Key aspects include establishing consensus during scoping, addressing downstream impacts, writing comprehensive tests, and continuously improving processes over time.
TLC2018 Thomas Haver: Transform with Enterprise AutomationAnna Royzman
Thomas Haver explains how to build a robust automation solution across the enterprise to improve application quality and testing efficiency and lower operational costs. At Test Leadership Congress 2018, he shows how to leverage all current resources to achieve this goal without affecting project delivery time.
http://testleadershipcongress-ny.com
The document discusses various techniques for optimizing and scaling MongoDB deployments. It covers topics like schema design, indexing, monitoring workload, vertical scaling using resources like RAM and SSDs, and horizontal scaling using sharding. The key recommendations are to optimize the schema and indexes first before scaling, understand the workload, and ensure proper indexing when using sharding for horizontal scaling.
Oracle apps crm online training , oracle crm certification coursesmagnificsmile
This document provides information about Oracle Apps CRM online training courses offered by Magnific Training. It includes their contact information and locations in India and other countries. It also summarizes their approach to implementing a global CRM single instance, including consolidating data, using best-in-practice BI technology, and establishing governance processes to drive value. Visualizations and dashboards are highlighted to provide role-based intelligence for functions like sales, marketing, IT, and finance.
The document provides an overview of data science including definitions, careers, applications and tools. It defines data science, describes the typical steps in a data science project including understanding the problem, acquiring and preparing data, analyzing data, modeling data, visualizing results and deploying solutions. It also discusses careers in data engineering, data analysis, machine learning engineering and as a data scientist. Finally, it covers popular tools and frameworks used in data science like Anaconda, Jupyter Notebooks and examples of data science applications.
The document discusses the business case for website speed optimization. It notes that both users and search engines prefer faster sites, and cites studies showing that even small improvements in speed (e.g. 1 second faster load time) can increase key metrics like conversions by 2-14%. The document provides examples of companies like Walmart and Obama's 2012 campaign that saw increased revenue and donations from speed optimizations. It acknowledges IT concerns about speed work but argues a methodology is needed to prioritize and implement optimizations.
The document discusses technical SEO best practices for improving a website's performance and visibility in search engines. It provides tips for conducting a technical audit to identify and resolve issues, optimizing site speed, ensuring search engines have full access to content, and building good SEO practices into development processes. The document also outlines common technical SEO risks and solutions for working with large volumes of content.
Similar to Building Search Systems for the Enterprise (20)
Meaning Representations for-Natural Languages Design, Models, and Application...Yunyao Li
COLING-LREC'2024 Tutorial "Meaning Representations for Natural Languages: Design, Models and Applications"
Instructors: Julia Bonn, Jeffrey Flanigan, Jan Hajič, Ishan Jindal, Yunyao Li and Nianwen Xue
Abstract: This tutorial introduces a research area that has the potential to create linguistic resources and build computational models that provide critical components for interpretable and controllable NLP systems. While large language models have shown a remarkable ability to generate fluent text, the black-box nature of these models makes it difficult to know where to tweak them to fix errors, at least for now. For instance, LLMs are known to hallucinate, and there is no mechanism in these models to provide only factually correct answers. Addressing this issue requires, first, that the models have access to a body of verifiable facts and, second, that they use those facts effectively. Interpretability and controllability in NLP systems are critical in high-stakes applications such as the medical domain. There has been a steady accumulation of semantically annotated, increasingly richer resources, which can now be derived with high accuracy from raw texts. Hybrid models can be used to extract verifiable facts at scale to build controllable and interpretable systems, to ground human-robot interaction (HRI) systems, to support logical reasoning, or to operate in extremely low-resource settings. This tutorial will provide an overview of these semantic representations, the computational models trained on them, and the practical applications built with these representations, as well as future directions.
The Role of Patterns in the Era of Large Language ModelsYunyao Li
Slides for my keynote at PAN-DL Workshop (Pattern-based Approaches to NLP in the Age of Deep Learning) at EMNLP'2023 (December. 6, 2023).
In this talk, I share our initial learnings from constructing, growing and serving large knowledge graphs.
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-LoopYunyao Li
Keynote talk at HILDA'2023 at SIGMOD on June 18, 2023.
Abstract: The ability to build large-scale knowledge bases that capture and extend the implicit knowledge of human experts is the foundation for many AI systems. We use an ontology-driven approach for the building, growing and serving of such knowledge bases. This approach relies on several well-known building blocks: document conversion, natural language processing, entity resolution, data transformation and fusion. In this talk, I will discuss a wide range of real-world challenges related to building these blocks and present our work to address these challenges via better human-machine cooperation.
Meaning Representations for Natural Languages: Design, Models and ApplicationsYunyao Li
EMNLP'2022 Tutorial "Meaning Representations for Natural Languages: Design, Models and Applications"
Instructors: Jeffrey Flanigan, Ishan Jindal, Yunyao Li, Tim O’Gorman, Martha Palmer
Abstract:
We propose a cutting-edge tutorial that reviews the design of common meaning representations, SoTA models for predicting meaning representations, and the applications of meaning representations in a wide range of downstream NLP tasks and real-world applications. Reporting by a diverse team of NLP researchers from academia and industry with extensive experience in designing, building and using meaning representations, our tutorial has three components: (1) an introduction to common meaning representations, including basic concepts and design challenges; (2) a review of SoTA methods on building models for meaning representations; and (3) an overview of applications of meaning representations in downstream NLP tasks and real-world applications. We will also present qualitative comparisons of common meaning representations and a quantitative study on how their differences impact model performance. Finally, we will share best practices in choosing the right meaning representation for downstream tasks.
Natural language understanding is a fundamental task in artificial intelligence. English understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Discovery. However, scaling existing products/services to support additional languages remains an open challenge. In this talk, we will discuss the open challenges in supporting universal natural language understanding. We will share our work from the past few years in addressing these challenges. We will also showcase how universal semantic representation of natural languages can enable cross-lingual information extraction in concrete domains (e.g. compliance) and show ongoing efforts toward seamlessly scaling existing NLP capabilities across languages with minimal effort.
Invited talk at Document Intelligence workshop at KDD'2021.
Harvesting information from complex documents, such as financial reports and scientific publications, is critical to building AI applications for business and research. Such documents are often in PDF format, with critical facts and data conveyed in tables and graphs. Extracting this information is essential to deriving insights from these documents. At IBM Research, we have a rich agenda in this area that we call Deep Document Understanding. In this talk, I will focus on our research on Deep Table Understanding: extracting and understanding tables from PDF documents. I will introduce key challenges in table extraction and understanding and how we address them, from acquiring data at scale to enable deep neural network models, to building, customizing and evaluating such models. I will also describe how our work enables real-world use cases in domains such as finance and life science. Finally, I will briefly present TableQA, an important downstream task enabled by Deep Table Understanding.
Explainability for Natural Language ProcessingYunyao Li
Final deck for our popular tutorial on "Explainability for Natural Language Processing" at KDD'2021. See links below for downloadable version (with higher resolution) and recording of the live tutorial.
Title: Explainability for Natural Language Processing
Presenter: Marina Danilevsky, Shipi Dhanorkar, Yunyao Li and Lucian Popa and Kun Qian and Anbang Xu
Website: http://xainlp.github.io/
Recording: https://www.youtube.com/watch?v=PvKOSYGclPk&t=2s
Downloadable version with higher resolution: https://drive.google.com/file/d/1_gt_cS9nP9rcZOn4dcmxc2CErxrHW9CU/view?usp=sharing
@article{kdd2021xaitutorial,
title={Explainability for Natural Language Processing},
author= {Danilevsky, Marina and Dhanorkar, Shipi and Li, Yunyao and Popa, Lucian and Qian, Kun and Xu, Anbang},
journal={KDD},
year={2021}
}
Abstract:
This lecture-style tutorial, which mixes in an interactive literature-browsing component, is intended for the many researchers and practitioners working with text data and on applications of natural language processing (NLP) in data science and knowledge discovery. The focus of the tutorial is on the issues of transparency and interpretability as they relate to building models for text and their applications to knowledge discovery. As black-box models have gained popularity for a broad range of tasks in recent years, both the research and industry communities have begun developing new techniques to render them more transparent and interpretable. Reporting from an interdisciplinary team of social science, human-computer interaction (HCI), and NLP/knowledge management researchers, our tutorial has two components: an introduction to explainable AI (XAI) in the NLP domain and a review of the state-of-the-art research; and findings from a qualitative interview study of individuals working on real-world NLP projects as they are applied to various knowledge extraction and discovery tasks at a large, multinational technology and consulting corporation. The first component will introduce core concepts related to explainability in NLP. Then, we will discuss explainability for NLP tasks and report on a systematic literature review of the state-of-the-art literature in AI, NLP and HCI conferences. The second component reports on our qualitative interview study, which identifies practical challenges and concerns that arise in real-world development projects that require the modeling and understanding of text data.
Explainability for Natural Language ProcessingYunyao Li
This document provides an outline for a tutorial on explainability for natural language processing. It begins with an introduction to the topic and then outlines the current state of explainable AI research for NLP. Specifically, it discusses the literature review methodology, different types of explanations, techniques for generating and presenting explanations, common operations to enable explainability like visualization techniques and communication paradigms, and the evaluation of explanations. The tutorial aims to provide an overview of the key concepts and challenges in explainable AI as it relates to natural language processing applications and models.
Human in the Loop AI for Building Knowledge Bases Yunyao Li
The ability to build large-scale domain-specific knowledge bases that capture and extend the implicit knowledge of human experts is the foundation for many AI systems. We use an ontology-driven approach for the creation, representation and consumption of such domain-specific knowledge bases. This approach relies on several well-known building blocks: natural language processing, entity resolution, data transformation and fusion. I will present several human-in-the-loop tools that target domain experts (rather than programmers) to extract the domain knowledge from the human expert and map it into the "right" models or algorithms. I will also share successful use cases in several domains, including Compliance, Finance, and Healthcare: by using these tools we can match the level of accuracy achieved by manual efforts, but at a significantly lower cost and much higher scale and automation.
Slides for talk given at Women in Engineering on March 20, 2021.
Abstract:
Natural language understanding is a fundamental task in artificial intelligence. English understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Discovery. However, scaling existing products/services to support additional languages remains an open challenge. In this talk, we will discuss the open challenges in supporting universal natural language understanding. We will share our work from the past few years in addressing these challenges. We will also showcase how universal semantic representation of natural languages can enable cross-lingual information extraction in concrete domains (e.g. compliance) and show ongoing efforts toward seamlessly scaling existing NLP capabilities across languages with minimal effort.
Explainability for Natural Language ProcessingYunyao Li
Tutorial at AACL'2020 (http://www.aacl2020.org/program/tutorials/#t4-explainability-for-natural-language-processing).
More recent version: https://www.slideshare.net/YunyaoLi/explainability-for-natural-language-processing-249912819
Title: Explainability for Natural Language Processing
@article{aacl2020xaitutorial,
title={Explainability for Natural Language Processing},
author= {Dhanorkar, Shipi and Li, Yunyao and Popa, Lucian and Qian, Kun and Wolf, Christine T and Xu, Anbang},
journal={AACL-IJCNLP 2020},
year={2020}
}
Presenter: Shipi Dhanorkar, Christine Wolf, Kun Qian, Anbang Xu, Lucian Popa and Yunyao Li
Video: https://www.youtube.com/watch?v=3tnrGe_JA0s&feature=youtu.be
Abstract:
We propose a cutting-edge tutorial that investigates the issues of transparency and interpretability as they relate to NLP. Both the research community and industry have been developing new techniques to render black-box NLP models more transparent and interpretable. Reporting from an interdisciplinary team of social science, human-computer interaction (HCI), and NLP researchers, our tutorial has two components: an introduction to explainable AI (XAI) and a review of the state-of-the-art for explainability research in NLP; and findings from a qualitative interview study of individuals working on real-world NLP projects at a large, multinational technology and consulting corporation. The first component will introduce core concepts related to explainability in NLP. Then, we will discuss explainability for NLP tasks and report on a systematic literature review of the state-of-the-art literature in AI, NLP, and HCI conferences. The second component reports on our qualitative interview study which identifies practical challenges and concerns that arise in real-world development projects which include NLP.
Towards Universal Language Understanding (2020 version)Yunyao Li
The document discusses challenges and approaches for developing universal semantic understanding across languages. It describes generating semantic role labeling resources for many languages using parallel corpora and crowdsourcing techniques. The goal is to develop cross-lingual models and representations that can understand the semantics of text in different languages.
Towards Universal Semantic Understanding of Natural LanguagesYunyao Li
Keynote talk at TextXD 2019(https://www.textxd.org)
Abstract:
Understanding the semantics of natural language is a fundamental task in artificial intelligence. English semantic understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Compare and Comply. However, scaling existing products/services to support additional languages remains an open challenge. In this demo, we will present Polyglot, a multilingual semantic parser capable of semantically parsing sentences in 9 different languages from 4 different language groups into the same unified semantic representation. We will also showcase how such universal semantic understanding of natural languages can enable cross-lingual information extraction in concrete domains (e.g. insurance and compliance) and show promise toward seamlessly scaling existing NLP capabilities across languages with minimal effort.
An In-depth Analysis of the Effect of Text Normalization in Social MediaYunyao Li
This document proposes a taxonomy for characterizing different types of text normalization edits based on their level of granularity and impact on downstream natural language processing applications like syntactic parsing, named-entity recognition, and text-to-speech synthesis. It examines the effects of coarse-grained edit types like addition, replacement, and removal as well as more fine-grained types like corrections to verb forms, determiners, capitalization, contractions, slang, and Twitter-specific terms. The analysis found that parsing is impacted by most normalization operations, entity recognition depends strongly on replacements, and speech synthesis benefits broadly from normalization but requires handling domain-specific terms.
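A few of the fine-grained edit types from the taxonomy (slang replacement, contraction expansion, capitalization repair) can be illustrated with a toy normalizer; the lexicons below are tiny made-up examples, not the paper's resources:

```python
# Toy lexicons standing in for the real normalization resources.
SLANG = {"u": "you", "gr8": "great", "pls": "please"}
CONTRACTIONS = {"can't": "cannot", "won't": "will not"}

def normalize(text: str) -> str:
    """Apply replacement-type edits token by token, then repair casing."""
    out = []
    for tok in text.split():
        low = tok.lower()
        low = CONTRACTIONS.get(low, low)  # contraction expansion
        low = SLANG.get(low, low)         # slang replacement
        out.append(low)
    sent = " ".join(out)
    return sent[:1].upper() + sent[1:]    # sentence-initial capitalization

print(normalize("u can't be gr8 every day"))
```

The paper's point is that downstream tasks react differently to each edit type: a parser cares about most of them, while named-entity recognition depends mainly on the replacement edits.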
Exploiting Structure in Representation of Named Entities using Active LearningYunyao Li
This document presents a framework called LUSTRE that uses active learning to efficiently learn structured representations of named entities with minimal human effort. LUSTRE iteratively selects the most informative entity mentions for user labeling, infers mapping rules from these labels, and updates its model. It outperforms baselines on various entity types, generalizes well to new domains, and reduces manual effort by learning from few labeling iterations. The learned structured representations improve downstream tasks like relation extraction by enabling matching of entity variations.
K-SRL: Instance-based Learning for Semantic Role LabelingYunyao Li
This document summarizes a research paper on instance-based learning for semantic role labeling (SRL). It presents a simple but effective k-nearest neighbors approach using composite features that outperforms previous SRL systems on both in-domain and out-of-domain evaluation. The approach models global argument constraints and addresses SRL challenges like heavy-tailed label distributions and low-frequency exceptions through explicit representation of local biases in composite features.
The document discusses generating high quality proposition banks for multilingual semantic role labeling. It presents frames for common verbs like "buy", "like", and "give" in English. It then shows how frames were generated for these same verbs in other languages like Chinese, French, and German by projecting annotations across languages. The document concludes by introducing a two-step approach used to curate the projected frame mappings, which involves filtering incorrect mappings and grouping usage synonyms. This led to the creation of a publicly released Universal Proposition Bank version 1.0 with curated frame mappings for several languages.
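The annotation projection step can be sketched in a few lines: copy SRL labels from an English sentence onto a word-aligned translation. The sentence, labels, and alignment below are made up for illustration:

```python
# English sentence with gold SRL labels (A0 = agent, A1 = patient).
en_tokens = ["John", "bought", "a", "car"]
en_labels = ["A0", "buy.01", "O", "A1"]

# Word-aligned German translation; alignment maps EN index -> DE index.
de_tokens = ["John", "kaufte", "ein", "Auto"]
alignment = {0: 0, 1: 1, 2: 2, 3: 3}

# Project each English label through the alignment onto German.
de_labels = ["O"] * len(de_tokens)
for en_i, de_i in alignment.items():
    de_labels[de_i] = en_labels[en_i]

print(list(zip(de_tokens, de_labels)))
```

The curation step described in the document then filters projections where the alignment or the frame mapping is wrong.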
The document discusses a multilingual strategy for information extraction using a system called PolyglotIE. The strategy involves defining language-independent rules in Annotation Query Language (AQL) that can be executed over multilingual text to extract semantic role labels. This allows information extraction to work immediately across languages with minimal effort. The rules are defined over universal proposition banks containing frames and roles that are shared across languages.
Natural Language Data Management and Interfaces: Recent Development and Open ...Yunyao Li
Slides deck for SIGMOD 2017 Tutorial.
ABSTRACT:
The volume of natural language text data has been rapidly increasing over the past two decades, due to factors such as the growth of the Web, the low cost associated with publishing, and progress on the digitization of printed texts. This growth, combined with the proliferation of natural language systems for searching and retrieving information, provides tremendous opportunities for studying areas where database systems and natural language processing systems overlap. This tutorial explores the two areas of overlap most relevant to the database community: (1) managing
natural language text data in a relational database, and (2) developing natural language interfaces to databases. The tutorial presents state-of-the-art methods, related systems, research opportunities and challenges covering both areas.
Polyglot: Multilingual Semantic Role Labeling with Unified LabelsYunyao Li
Poster for our ACL paper "Polyglot: Multilingual Semantic Role Labeling with Unified Labels".
Abstract:
We present POLYGLOT, a semantic role labeling system capable of semantically parsing sentences in 9 different languages from 4 different language groups. A core differentiator is that this system predicts English Proposition Bank labels for all supported languages. This means that
for instance a Japanese sentence will be tagged with the same labels as an English sentence with similar semantics would be. This is made possible by training the system with target language data that was automatically labeled with English PropBank labels using an annotation projection approach. We give an overview of our system, the automatically produced training data, and discuss possible applications
and limitations of this work. We present a demonstrator that accepts sentences in English, German, French, Spanish, Japanese, Chinese, Arabic, Russian and Hindi and
outputs a visualization of its shallow semantics.
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsDianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations, for
seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...Alex Pruden
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX model have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new type of licensing works and what benefits it brings you. Above all, you surely want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We will explain how to solve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some practices that can lead to unnecessary expenses, e.g. using a person document instead of a mail-in for shared mailboxes. We will show you such cases and their solutions. And of course we will explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder will introduce you to this new world. It will give you the tools and know-how to keep track of everything. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
These topics will be covered
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
5th LF Energy Power Grid Model Meet-up SlidesDanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Microsoft Teams session or in person at TU/e, located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving
Manufacturing custom quality metal nameplates and badges involves several standard operations. Processes include sheet prep, lithography, screening, coating, punch press and inspection. All decoration is completed in the flat sheet with adhesive and tooling operations following. The possibilities for creating unique durable nameplates are endless. How will you create your brand identity? We can help!
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, which go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. On a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
Discover top-tier mobile app development services, offering innovative solutions for iOS and Android. Enhance your business with custom, user-friendly mobile applications.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyScyllaDB
Freshworks creates AI-boosted business software that helps employees work more efficiently and effectively. Managing data across multiple RDBMS and NoSQL databases was already a challenge at their current scale. To prepare for 10X growth, they knew it was time to rethink their database strategy. Learn how they architected a solution that would simplify scaling while keeping costs under control.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Principle of conventional tomography-Bibash Shahi ppt..pptx
Building Search Systems for the Enterprise
1. Building Search Systems
for the Enterprise
IBM Research – Almaden
ACM SIGIR 2011
Beijing, China
(on behalf of Shivakumar Vaithyanathan)
Yunyao Li
2. outline
• Search for the Enterprise
• Programmable Search (overview)
• Backend Analytics
• Search Runtime
• Foundations and Principles
• Concluding Remarks
3. Experience at IBM Internal Search
• IBM deployed a commercially available search engine
– Implementing standard IR techniques
• Search quality went down over time to the point that
search results were unacceptable!
Success (≥ 1 relevant result): 14% on top-1, 23% on
top-5, 34% on top-50! [Zhu et al., WWW’07]
So, they implemented various solutions…
For the administrators managing the engine,
the exposed knobs were insufficient
4. Attempts to Improve Search
• Enhanced link analysis by
incorporating external links
to/from external WWW
• Creative hacks: added fake
terms to documents & queries
– # terms per document determined by
“popularity”: how much TF increase
required for the needed rank boost?
• Hard-coded custom results for
the top 1200+ queries
Didn’t help…
Quality went down!
Maintenance nightmare:
heuristics need to be updated
upon each nontrivial change in
term statistics/ranking parameters
Even bigger nightmare!
How to deal with continuously
changing terminology?
5. What are the Problems?
Network Station Manager search
Product names change: Thin Client Manager
Continually changing terminology!
Domain-specific meaning!
Paula Summa search
bring Paula Summa from
employee directories
per diem search
Domain-specific repetitions!
popcorn search
conference call!
These problems are not specific
to enterprise search… but:
• Result 1: IBM Travel: Per Diem
• Result 2: IBM Travel: Per Diem Rates
• Result 3: IBM Travel: National perdiems
• Result 25: IBM Travel: Per Diem Policy
…
6. The Enterprise Challenge!
Domain-specific meaning! Domain-specific repetitions!
Generic search solution that is
customizable and maintainable in every
domain
Simple customization with reasonable effort!
Programmable Search
Ongoing search-quality management
Continually changing terminology!
7. outline
• Search for the Enterprise
• Programmable Search (overview)
• Backend Analytics
• Search Runtime
• Foundations and Principles
• Concluding Remarks
8. Programmable Search: Main Idea
• Goals:
– Transparency
• Know “precisely” why every result item is being brought back
• Understand how changes in content/intents affect search
– Maintainability and “Debugability”
• Ranking logic is guided by explicit rules
• Properly react to changes in content/intents
• Building blocks:
– Deep analytics on documents
– Domain-specific analysis of queries
– Transparent customizable rule-driven ranking
runtime rules
backend analytics
interpretations
9. Distributed Analytics Platform
Crawling, information extraction, token generation (TG), indexing
Search runtime
Index
Index and rule
update services
backend analytics
runtime rules
interpretations
Implementation Architecture
backend
frontend
10. outline
• Search for the Enterprise
• Programmable Search (overview)
• Backend Analytics
• Search Runtime
• Foundations and Principles
• Concluding Remarks
11. Backend Analytics: 3 Parts
Local Analysis (per-page analysis)
Global Analysis (cross-page analysis)
Token Generation (TG)
index
12. Local Analysis
• Categorizing pages
– Label pages by custom categories
• IBM examples: HR, person, IT help, ISSI, sales information,
marketing, corporate standards, legal & IP-law, …
– Geo classification
• Associate documents with the relevant countries & regions
• Annotating pages
– Identify HomePage annotation for people, projects,
communities, …
Simply knowing where a page is physically hosted is not enough
(example: Czech Republic hosts all pages for IBM in Europe)
13. G J Chaitin Home Page
Homepage Identification
Title Extraction
Matching title patterns
Titles
Dictionary Match
Home Page for
G J Chaitin
• http://w3.ibm.com/hr/idp/
• http://w3-03.ibm.com/isc/index.html
• http://chis.at.ibm.com/
URL Extraction
URLs
Matching URL patterns
Homepage for: idp isc chis
Employee directory
… many more …
Intranet page
More details in
[Zhu et al., WWW’07]
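The title- and URL-pattern matching above can be sketched as follows; the regexes, the `homepage_annotations` helper, and its output format are illustrative assumptions, not the system's actual patterns.

```python
# Hypothetical sketch of homepage identification: a page is tagged as a
# homepage when its title matches a "Home Page for X" pattern or its URL
# matches a known URL pattern. All regexes and names here are illustrative.
import re

TITLE_PATTERNS = [
    re.compile(r"^(?P<name>.+?)\s+home\s*page$", re.IGNORECASE),
    re.compile(r"^home\s*page\s+(?:for|of)\s+(?P<name>.+)$", re.IGNORECASE),
]
# Capture the last directory segment of an intranet URL, e.g. ".../hr/idp/".
URL_PATTERN = re.compile(r"^https?://[^/]*ibm\.com/(?:.*/)?(?P<app>[\w-]+)/$")

def homepage_annotations(title, url):
    """Return homepage annotations extracted from a page's title and URL."""
    found = []
    for pattern in TITLE_PATTERNS:
        match = pattern.match(title.strip())
        if match:
            found.append(("homepage_of", match.group("name").strip()))
            break
    match = URL_PATTERN.match(url)
    if match:
        found.append(("homepage_for_app", match.group("app")))
    return found

print(homepage_annotations("Home Page for G J Chaitin",
                           "http://w3.ibm.com/hr/idp/"))
# -> [('homepage_of', 'G J Chaitin'), ('homepage_for_app', 'idp')]
```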
14. IBM Confidential
Among the 38 pages with the exact same title,
which is the best for “Paula Summa”?
Role of Global Analysis
15. Person
Title
Token Generation (TG)
Annotated values Index content
Ching-Tien T. (Howard) Ho
Ho Ching-Tien Tien Ho Ho, Tien
Howard Ho Ching-Tien H. ...
Global Technology Services
TG
personNameTG
Howard Ho Ching Tien ...
gts Global Technology Services
Global Technology Technology
Services Global Technology ...
GlobalTechnologyServices
nGramTG
spaceTG
acronymTG
nGramTG
…
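A minimal sketch of the token generators named on this slide; the function names mirror spaceTG, acronymTG, and nGramTG, but the implementations below are illustrative assumptions, not the engine's actual TG code.

```python
# Each token generator (TG) maps an annotated value to index tokens.

def space_tg(value):
    # Whitespace-separated tokens: "Global Technology Services" -> 3 tokens.
    return value.split()

def acronym_tg(value):
    # First letters of each word: "Global Technology Services" -> "gts".
    return ["".join(word[0] for word in value.split()).lower()]

def ngram_tg(value, n=2):
    # Contiguous word n-grams: "Global Technology", "Technology Services".
    words = value.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Each (annotation, TG) pair is indexed as its own field, so the runtime
# can later weigh a Title+acronymTG hit differently from a Title+nGramTG hit.
title = "Global Technology Services"
for field, tokens in {
    "Title+spaceTG": space_tg(title),
    "Title+acronymTG": acronym_tg(title),
    "Title+nGramTG": ngram_tg(title),
}.items():
    print(field, tokens)
```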
16. outline
• Search for the Enterprise
• Programmable Search (overview)
• Backend Analytics
• Search Runtime
• Foundations and Principles
• Concluding Remarks
18. Phase 3: Result Construction
Phase 2: Relevance Ranking
Phase 1: Query Semantics
query search rewrite rules
queries
interpretations
partially ordered interpretations
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results final results
re-ranking rules
Runtime Flow in More Detail
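The three-phase flow above can be sketched as a small pipeline; `run_query`, the toy index, and the rule signatures are hypothetical stand-ins for the real components.

```python
# Minimal sketch of the three-phase runtime flow: `search` is any callable
# mapping an interpretation to (score, doc) pairs; every rule is a plain
# Python function. All names here are illustrative.

def run_query(query, rewrite_rules, search, grouping_rules, reranking_rules):
    # Phase 1: query semantics -- rewrite rules turn the keyword query into
    # a set of interpretations, without touching the index.
    interpretations = [query]
    for rewrite in rewrite_rules:
        interpretations += rewrite(query)

    # Phase 2: relevance ranking -- execute every interpretation against
    # the index, then aggregate the partially ordered results.
    results = []
    for interpretation in interpretations:
        results += search(interpretation)
    results.sort(key=lambda pair: pair[0], reverse=True)

    # Phase 3: result construction -- apply the admin-supplied grouping
    # and re-ranking rules to produce the final result list.
    for rule in grouping_rules + reranking_rules:
        results = rule(results)
    return [doc for _, doc in results]

# Toy usage: one rewrite rule and a two-entry "index".
docs = {"germany hr": [(2.0, "w3/hr/de")], "ibm germany": [(1.0, "w3/geo/de")]}
country_hr = lambda q: ["germany hr"] if q == "ibm germany" else []
print(run_query("ibm germany", [country_hr], lambda q: docs.get(q, []), [], []))
# -> ['w3/hr/de', 'w3/geo/de']
```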
19. Runtime Rules: Pattern-Action Language
Query Pattern | Queries Matching | Possible Action
EQUALS [r=ibm|information|info] [d=COUNTRY]
• ibm germany
• info india
→ Rewrite into “[country] hr”
(e.g., germany hr)
ENDS_WITH installation
• acrobat installation
• db2 on aix installation
→ Replace installation with ISSI
(e.g., acrobat ISSI)
CONTAINS directions to [d=SITE]
• driving directions to almaden
• directions to watson from jfk
→ Pages of “siteserv” category
should be ranked higher
STARTS_WITH [d=PERSON]
• john kelly biography
• steve mills announcement
→ Group together pages that
represent blog entries
Query pattern → Action:
the pattern expression is matched against the
keyword query; the action is performed on a match
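The pattern-action idea can be sketched with the EQUALS and ENDS_WITH rules from the table above; the dictionary contents and helper functions are simplified assumptions, not the actual rule language.

```python
# Hedged sketch of pattern-action runtime rules.

COUNTRY = {"germany", "india"}  # stand-in for the [d=COUNTRY] dictionary

def token_matches(pattern_token, word):
    # "[d=COUNTRY]" matches via dictionary lookup; "a|b|c" matches any
    # of the listed alternatives (a literal token is the 1-alternative case).
    if pattern_token == "[d=COUNTRY]":
        return word in COUNTRY
    return word in pattern_token.split("|")

def equals(pattern, query):
    words = query.lower().split()
    return len(words) == len(pattern) and all(
        token_matches(p, w) for p, w in zip(pattern, words))

def ends_with(pattern, query):
    words = query.lower().split()
    return len(words) >= len(pattern) and all(
        token_matches(p, w) for p, w in zip(pattern, words[-len(pattern):]))

# Rule 1: EQUALS [ibm|information|info] [d=COUNTRY] -> rewrite to "<country> hr"
def rewrite_country_hr(query):
    if equals(["ibm|information|info", "[d=COUNTRY]"], query):
        return query.lower().split()[-1] + " hr"
    return query

# Rule 2: ENDS_WITH installation -> replace "installation" with "ISSI"
def rewrite_installation(query):
    if ends_with(["installation"], query):
        return query.lower().rsplit(" ", 1)[0] + " ISSI"
    return query

print(rewrite_country_hr("ibm germany"))             # -> germany hr
print(rewrite_installation("acrobat installation"))  # -> acrobat ISSI
```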
21. What’s Best for Benefits?
The most important IBM page for benefits
changes over time: currently it is netbenefits
23. Interpretations
Scenario: An IBM employee wants
to download Lotus Symphony 1.3
Runtime interpretation:
download symphony 1.3 → category=issi software=symphony 1.3
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results final results
re-ranking rules
rewrite rules
queries
interpretations
partially ordered interpretations
download symphony 1.3 search
24. People with
first name Jim
How can we avoid pages
from the people category?
java jim
Complex Rules
25. java jim and not in person category
Complex Rules
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results final results
re-ranking rules
interpretations
partially ordered interpretations
rewrite rules
queries
java search
27. Person
Title
Recall: Token Generation (TG)
Annotated values Index content
Ching-Tien T. (Howard) Ho
Global Technology Services
TG
personNameTG
Howard Ho Ching Tien ...
gts Global Technology Services
Global Technology Technology
Services Global Technology ...
GlobalTechnologyServices
nGramTG
spaceTG
acronymTG
nGramTG
…
Ho Ching-Tien Tien Ho Ho, Tien
Howard Ho Ching-Tien H. ...
Person + personNameTG
Person + nGramTG
Title + acronymTG
Title + spaceTG
Title + nGramTG
28. Annotation + TG Relevance Bucket
Howard Ho Ching Tien ...
GlobalTechnologyServices
…
Person + personNameTG
Person + nGramTG
Title + acronymTG
Title + spaceTG
Title + nGramTG
query search
Relevance buckets
• Buckets are ranked
– Based on annotation type
– Based on TG quality
• A page can belong to
multiple buckets
• Within each bucket,
ranking is by
conventional IR
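A sketch of how relevance buckets might order results, assuming each match carries its (annotation + TG) bucket and a conventional IR score; the bucket ranks and the sort key are illustrative assumptions.

```python
# Each (annotation, TG) pair is a bucket with a fixed rank; results are
# ordered by their best bucket first, then by conventional IR score within
# the bucket. Bucket names follow the slide.

BUCKET_RANK = {
    "Person+personNameTG": 0,  # highest-quality match evidence
    "Person+nGramTG": 1,
    "Title+acronymTG": 2,
    "Title+spaceTG": 3,
    "Title+nGramTG": 4,
}

def rank_results(matches):
    """matches: (page, bucket, ir_score) triples; a page may appear in
    several buckets -- keep its best (lowest-numbered) bucket."""
    best = {}
    for page, bucket, ir_score in matches:
        key = (BUCKET_RANK[bucket], -ir_score)
        if page not in best or key < best[page]:
            best[page] = key
    return sorted(best, key=best.get)

matches = [
    ("pageA", "Title+nGramTG", 9.0),        # high IR score, weak bucket
    ("pageB", "Person+personNameTG", 2.0),  # low IR score, strong bucket
    ("pageA", "Title+acronymTG", 5.0),
]
print(rank_results(matches))  # -> ['pageB', 'pageA']
```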
31. Grouping Rules
• Grouping rules define how search results should
be grouped together
• Search administrators can improve the diversity
of search results (on the 1st page)
– Based on their familiarity with the data sources
Query pattern → Group pages of the same category
per diem → travel, you-and-ibm
ANY → ISSI, IT Help Central, Forum,
Bluepedia, Media Library, …
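A sketch of a grouping rule, assuming each result carries a category label; keeping only the first hit per listed category stands in for collapsing a group's remaining pages behind its first-page representative.

```python
# For the categories listed in a grouping rule, only the first hit per
# category stays in the list, so the first result page remains diverse.

def group_by_category(results, categories):
    grouped, seen = [], set()
    for page, category in results:
        if category in categories and category in seen:
            continue  # folded into the group already represented above
        seen.add(category)
        grouped.append((page, category))
    return grouped

results = [
    ("IBM Travel: Per Diem", "travel"),
    ("IBM Travel: Per Diem Rates", "travel"),
    ("You and IBM: Per Diem Policy", "you-and-ibm"),
    ("IBM Travel: National perdiems", "travel"),
]
print(group_by_category(results, {"travel", "you-and-ibm"}))
```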
32. Need first-page diversity
Flooding with Similar Pages
33. Grouping Rule to the Rescue
per diem travel, you-and-ibm
final results
re-ranking rules
interpretations
partially ordered interpretations
rewrite rules
queries
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results
per diem search
34. • Re-ranking rules adjust ranking of
search results based on categories
• Example: search administrator specifies the
important sources of “hot/current topics”
Re-ranking Rules
Hot topics (smarter planet, cloud computing, centennial, …)
→ Rank these categories higher: Bluepedia, News, About-IBM
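A sketch of a hot-topics re-ranking rule; the trigger terms and boosted categories mirror the slide, while the stable-partition mechanics are an assumption.

```python
# When the query matches the rule's "hot topics" pattern, pages from the
# admin-listed categories move ahead of the rest; order is otherwise
# preserved (a stable partition).

HOT_TOPIC_TERMS = {"smarter planet", "cloud computing", "centennial"}
BOOSTED_CATEGORIES = {"Bluepedia", "News", "About-IBM"}

def rerank_hot_topics(query, results):
    if query.lower() not in HOT_TOPIC_TERMS:
        return results
    boosted = [r for r in results if r[1] in BOOSTED_CATEGORIES]
    rest = [r for r in results if r[1] not in BOOSTED_CATEGORIES]
    return boosted + rest

results = [("hr-page", "HR"), ("news-page", "News"), ("wiki-page", "Bluepedia")]
print(rerank_hot_topics("cloud computing", results))
```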
35. Bluepedia
Technical News
Re-ranking Rule for Hot Topics
Homepages of “About IBM”
Hot topics (smarter planet, cloud computing, centennial, …)
→ Rank these categories higher: Bluepedia, News, About-IBM
36. Re-ranking Rules for Person Queries
[d=PERSON]
executive_corner, media_library,
organization_chart, files
media_library
executive_corner
interpretations
partially ordered interpretations
rewrite rules
queries
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results final results
re-ranking rules
Paula Summa search
38. What Administrators Need…
Recap:
• Search administrators have major problems
with an opaque search engine
• Programmable search provides
– Customization to the specific domain
– Ongoing search-quality management
Okay… but:
The proof of the pudding is in the eating!
The people in charge of search are not the SIGIR audience; they are IT admins; hence, all they can do is apply these hacks and hardcoded results.
“It may be the case that a day before, Thin Client Manager meant something else; so, intents change overnight as well.”
So we have different types of tokenization applied to the different types of annotated items; for each annotation type and TG type, the result is stored in a separate part of the index. In a few slides, I will explain how we use that during runtime.
In phase 1, we manipulate the search query, add variants and so on, without touching the index. The result is a set of queries. Next, in phase 2, we run the queries against the index and apply ranking, by a combination of conventional IR and relevance buckets that I will describe shortly. In phase 3, we build the final result by invoking the grouping and re-ranking rules supplied by the admins.
This slide gives a more detailed view of the runtime flow, showing where the three phases are. Next, I will discuss the different actions in the boxes here.