These are the slides used in our 3-hour tutorial at VLDB 2014.
Yunyao Li, Ziyang Liu, Huaiyu Zhu: Enterprise Search in the Big Data Era: Recent Developments and Open Challenges. PVLDB 7(13): 1717-1718 (2014)
Abstract:
Enterprise search allows users in an enterprise to retrieve desired information through a simple search interface. It is widely viewed as an important productivity tool within an enterprise. While Internet search engines have been highly successful, enterprise search remains notoriously challenging due to a variety of unique challenges, and is being made more so by the increasing heterogeneity and volume of enterprise data. On the other hand, enterprise
search also presents opportunities to succeed in ways beyond current Internet search capabilities. This tutorial presents an organized overview of these challenges and opportunities, and reviews the state-of-the-art techniques for building a reliable and high quality enterprise search engine, in the context of the rise of big data.
Building Search Systems for the Enterprise (Yunyao Li)
This is a nice high-level summary of Gumshoe, the enterprise search engine built by our group, which currently powers IBM intranet search. One of the SIGIR 2011 Industrial Track keynote talks.
Invited Talk at Modern Data Management Systems Summit on August 29-30, 2014 at Tsinghua University in Beijing, China.
http://ise.thss.tsinghua.edu.cn/MDMS/English/program.jsp
Abstract:
Modern enterprises are increasingly relying on complex analyses on large data sets to drive business decisions. Tasks such as root cause analysis from system logs and lead generation based on social media, customer retention and digital marketing are rapidly gaining importance. These applications generally consist of three major analytic phases: text analytics, semi-structured data processing (joins, group-by, aggregation), and statistical/predictive modeling. The size of the datasets in conjunction with the complexity of the analysis necessitates large-scale distributed processing of the analytical algorithms. At IBM we are building tools and technologies based on declarative languages to support each of these analytic phases. The declarative nature of the language abstracts away the need for programmer-optimization. Furthermore, the syntax of these languages is designed to appeal to the corresponding communities. As an example for statistical modeling, we expose a high-level language with syntax similar to R -- a very popular statistical processing language.
In this talk I will give an overview of some real-world big data applications we are currently working on and use that to motivate the need for declarative analytics consisting of the three major phases discussed above. I will then describe, in some detail, declarative systems for text analytics along with a discussion on speeds, feeds and comparisons.
Natural language understanding is a fundamental task in artificial intelligence. English understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Discovery. However, scaling existing products and services to support additional languages remains an open challenge. In this talk, we will discuss the open challenges in supporting universal natural language understanding and share our work over the past few years in addressing them. We will also showcase how a universal semantic representation of natural languages can enable cross-lingual information extraction in concrete domains (e.g. compliance), and present ongoing efforts towards seamlessly scaling existing NLP capabilities across languages with minimal effort.
SystemT: Declarative Information Extraction (Yunyao Li)
Slides used for my talk "SystemT: Declarative Information Extraction" at the event "University of Oregon Big Opportunities with Big Data Meeting" on August 8, 2014 (http://bigdata.uoregon.edu).
Human-in-the-Loop AI for Building Knowledge Bases (Yunyao Li)
The ability to build large-scale domain-specific knowledge bases that capture and extend the implicit knowledge of human experts is the foundation for many AI systems. We use an ontology-driven approach for the creation, representation and consumption of such domain-specific knowledge bases. This approach relies on several well-known building blocks: natural language processing, entity resolution, and data transformation and fusion. I will present several human-in-the-loop tools that target domain experts (rather than programmers), extracting domain knowledge from the human expert and mapping it into the "right" models or algorithms. I will also share successful use cases in several domains, including compliance, finance, and healthcare: using these tools, we can match the level of accuracy achieved by manual efforts, but at a significantly lower cost and much higher scale and automation.
Video: https://www.youtube.com/watch?v=Rt2oHibJT4k
Technologies such as Hadoop have addressed the "Volume" problem of Big Data, and technologies such as Spark have recently addressed the "Velocity" problem. But the "Variety" problem remains largely unaddressed: a great deal of manual "data wrangling" is still needed to manage data models.
These manual processes do not scale well. Not only is the variety of data increasing; the rate of change in data definitions is increasing as well. We can't keep up. NoSQL data repositories can handle storage, but we need effective models of the data to fully utilize it.
This talk will present tools and a methodology to manage Big Data Models in a rapidly changing world. This talk covers:
Creating Semantic Metadata Models of Big Data Resources
Graphical UI Tools for Big Data Models
Tools to synchronize Big Data Models and Application Code
Using NoSQL Databases, such as Amazon DynamoDB, with Big Data Models
Using Big Data Models with Hadoop, Storm, Spark, Giraph, and Inference
Using Big Data Models with Machine Learning to generate Predictive Models
Developer Collaborative/Coordination processes using Big Data Models and Git
Managing change – Big Data Models with rapidly changing Data Resources
The most profitable insurance organizations will outperform competitors in key areas such as personalized customer service, claims processing, subrogation recovery, fraud detection and product innovation. This requires thinking beyond the traditional data warehouse to the data fabric, an emerging data management architecture.
In this webinar Andy Sohn, Senior Advisor at NewVantage Partners, and Bob Parker, Senior Director for Insurance at Cambridge Semantics, explore the role of the data discovery and integration layer in an enterprise data fabric for the Insurance industry. These are their slides.
Presented at SQL Saturday Atlanta May 18, 2013
Text mining is projected to dominate data mining, and the reasons are evident: we have more text available than numeric data. Microsoft introduced a new technology to SQL Server 2012 called Semantic Search. This session's detailed description and demos give you important information for the enterprise implementation of Tag Index and Document Similarity Index. The demos include a web-based Silverlight application, and content documents from Wikipedia. We'll also look at strategy tips for how to best leverage the new semantic technology with existing Microsoft data mining.
Speaker: Philippe Mizrahi - Associate Product Manager - Lyft
Abstract: Philippe Mizrahi works on Lyft’s data discovery and metadata engine, Amundsen. With the help of a Neo4j graph database, Amundsen has improved Lyft’s data discovery by reducing time to discover data by 10x.
During this session, Philippe will dive deep into Amundsen’s use cases, impact, and architecture, which effectively combines a comprehensive knowledge graph based upon Neo4j, centralized metadata and other search ranking optimizations to discover data quickly.
A data science project that applies data mining techniques (n-grams, TF-IDF text analytics, sentiment detection), combined with R and ggplot2 for exploratory data analysis, to predict stock market trends from world news events sourced from Reddit /r/worldnews. Models were built with decision trees and SVMs (support vector machines) in KNIME. All experiments ran on public cloud infrastructure, using Hive queries to prefilter data with HDInsight on Azure.
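The n-gram and TF-IDF featurization this project mentions can be sketched in a few lines of plain Python (the actual work used KNIME, R, and Hive; the documents and term choices below are invented toy data, not from the project):

```python
# Minimal n-gram TF-IDF sketch: rare terms in a corpus get higher weights
# than common ones. Toy documents only; not the project's real pipeline.
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs, n=1):
    """Map each document to {ngram: tf-idf weight} over a small corpus."""
    token_docs = [ngrams(d.lower().split(), n) for d in docs]
    # Document frequency: in how many documents each n-gram appears.
    df = Counter(g for toks in token_docs for g in set(toks))
    total = len(docs)
    out = []
    for toks in token_docs:
        tf = Counter(toks)
        out.append({g: (c / len(toks)) * math.log(total / df[g])
                    for g, c in tf.items()})
    return out

docs = ["markets rally on earnings", "sanctions hit markets", "rate cut expected"]
weights = tfidf(docs, n=1)
# "markets" appears in 2 of 3 docs, so its idf (and weight) is lower
# than that of "rally", which appears in only one.
print(weights[0][("rally",)] > weights[0][("markets",)])
```

In the project itself these weights would be fed as features to the decision tree and SVM learners in KNIME.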
Improving Search in Workday Products using Natural Language Processing (DataWorks Summit)
Workday is a leading provider of cloud-based enterprise software products such as Human Capital Management, Talent, Finance, Student, and Planning. These products produce a wealth of natural language data. However, this data is unstructured and denormalized, and retrieving relevant information from it is a challenging task. Simple index-based search methods can only take us so far. The Data Science team at Workday is determined to apply machine learning and AI to make search better across Workday's products.
In this session, we present how we use word embeddings to normalize the data and add structure to it. We will also talk about using word representations to make search intelligent. The specific use cases we will discuss are synonym detection and entity recommendation.
In this talk, we will focus on the word-embedding techniques explored, the metrics used to evaluate natural language processing models, the tools built, and future work on improving search.
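The synonym-detection use case mentioned above typically works by comparing embedding vectors: words whose vectors point in nearly the same direction are synonym candidates. A toy sketch (the 3-dimensional vectors and words below are invented; the abstract does not name a specific embedding model):

```python
# Embedding-based synonym detection via cosine similarity.
# The vectors here are hand-made toys; real systems would use
# pretrained embeddings with hundreds of dimensions.
import math

embeddings = {
    "salary":       [0.90, 0.10, 0.00],
    "compensation": [0.85, 0.15, 0.05],  # deliberately close to "salary"
    "vacation":     [0.10, 0.90, 0.20],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def synonym_candidates(word, threshold=0.95):
    """Other words whose cosine similarity to `word` exceeds threshold."""
    v = embeddings[word]
    return [w for w, u in embeddings.items()
            if w != word and cosine(u, v) > threshold]

print(synonym_candidates("salary"))
```

A search engine can then expand a query for one term with its high-similarity neighbors, which is one way the "intelligent search" described above can be realized.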
Speaker
Namrata Ghadi, Workday Inc, Software Development Engineer (Data Science)
Adam Baker, Workday Inc, Sr Software Engineer
Deep learning for e-commerce: current status and future prospects (Rakuten Group, Inc.)
Deep learning is the prime avenue for Artificial Intelligence, with spectacular accomplishments in diverse fields such as computer vision, natural language processing, and board games such as Go. Its impact on e-commerce is already significant and will continue to grow in future years. In this talk, we will review some of the successful deep learning algorithms in light of their current and expected impact on e-commerce.
In this webinar, data analytics gurus Sathish Thyagarajan and Steve Sarsfield introduce AnzoGraph™, our graph OLAP database, demonstrate the different types of analyses you can perform with it and how it complements Neo4j, AWS Neptune and other OLTP systems. Finally, they’ll show how you can get it up and running on your laptop in about 5 minutes.
Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int... (Cambridge Semantics)
Knowledge graphs are on the rise at businesses hungry for greater automation and intelligence with use cases spreading across industries, from fraud detection and chatbots, to risk analysis and recommendation engines. In this webinar we dive into key technical and business considerations, use cases and best practices in leveraging knowledge graphs for better knowledge management.
Triplestores and inference; applications in finance and text mining; projects and solutions for financial media and publishers.
Keystone Industrial Panel, ISWC 2014, Riva del Garda, 18 Oct 2014.
Thanks to Atanas Kiryakov for this presentation; I just cut it to size.
Many powerful machine learning algorithms are based on graphs, e.g., PageRank (Pregel), recommendation engines (collaborative filtering), text summarization, and other NLP tasks. The recent developments in Graph Neural Networks connect the worlds of graphs and machine learning even further.
Data pre-processing and feature engineering, both vital tasks in machine learning pipelines, extend this relationship across the entire ecosystem. In this session, we will investigate the entire range of graphs and machine learning with many practical exercises.
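PageRank, the first graph algorithm named above, can be sketched as a short power iteration (the toy 3-node graph below is invented for illustration; Pregel-style systems run the same update in a distributed fashion):

```python
# Minimal PageRank power iteration over a toy directed graph.
# `links[n]` lists the nodes that n links out to.
def pagerank(links, damping=0.85, iters=50):
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            # Rank flowing into n from every node m that links to n,
            # split evenly among m's outgoing links.
            incoming = sum(rank[m] / len(links[m])
                           for m in nodes if n in links[m])
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
# "c" receives links from both "a" and "b", so it ends up ranked highest.
print(max(ranks, key=ranks.get))
```

The same "gather from neighbors, update, repeat" pattern underlies many of the graph ML methods the session covers, including the message passing used by Graph Neural Networks.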
Enterprise Search: How do we get there from here? (Daniel Tunkelang)
Enterprise Search: How Do We Get There From Here?
by Daniel Tunkelang (Head of Query Understanding, LinkedIn)
Keynote at 2013 Enterprise Search Summit
We've been tackling the challenges of enterprise and site search for at least 3 decades. We've succeeded to the point that search is the gateway to many of our information repositories. Nonetheless, users of enterprise search systems are frustrated with these systems' shortcomings. We see this frustration in surveys, but, more importantly, most of us experience it personally in our daily work life. We all dream of a world where searching any information repository is as effective as searching the web—perhaps even more so. A world where we find what we're looking for, or quickly determine that it doesn't exist. Is this Utopia possible? If so, how do we get there from here? Or at least somewhere close? In this talk, Tunkelang reviews the track record of enterprise search. He talks about what's worked and what hasn't, especially as compared to web search. Finally, he proposes some paths to bring us closer to our dream.
--
Daniel Tunkelang is Head of Query Understanding at LinkedIn. Educated at MIT and CMU, he has spent his career working on big data, addressing key challenges in search, data mining, user interfaces, and network analysis. He co-founded enterprise search and business intelligence pioneer Endeca, where he spent a decade as its Chief Scientist. In 2011, Endeca was acquired by Oracle for over $1B. Prior to LinkedIn, he led a team at Google working on local search quality. Daniel has authored fifteen patents, written a textbook on faceted search, and created the annual symposium on human-computer interaction and information retrieval.
Social Data Analytics using IBM Big Data Technologies (Nicolas Morales)
Distilling Insights from Social Media Using Big Data Technologies
Have you ever wondered what your customers are saying about you in Social media, and the impact it might be having on your business? This session will focus on how BigInsights and Big Data technologies can be used to glean useful and actionable insights from social media data.
You'll see how data can be ingested and prepped, and how text analytics can be applied to social data in real time. Using Hadoop, we'll show you how you can store and analyze your large volume of historical social media data and reference data. This talk and demo will provide an introduction to text analytics and how it is used within the IBM Big Data platform for a social media solution.
Results from the Enterprise Search and Findability Survey 2012 (Findwise)
A few preliminary results from the Enterprise Search and Findability Survey. The dataset for the survey is very large, and the analysis of the complete dataset will appear in the report to be published in June.
This presentation is a mash-up of the versions presented at the Enterprise Search Summit (New York, US, 15 May 2012), Enterprise Search Europe (London, 30 May 2012), and the IKS Semantic Enterprise Technologies Workshop (Salzburg, Austria, 12 June 2012). It was also presented at Findability Day 2012 in Stockholm.
Search for the enterprise seems to have hit a wall. Bad search is the top complaint of users interacting with their internal data. Meanwhile, there is a seemingly never-ending flood of products, SaaS offerings and new solutions in the market all claiming and attempting to solve the problem.
In this roundtable, we will define what expectations organizations should really have about their search platforms and discuss what benefits to expect from using techniques like boosting, auto-classification, natural language processing, query expansion, entity extraction and ontologies. We will also explore what will supersede search in the enterprise.
The Enterprise Knowledge Graph is a disruptive platform that combines emerging Big Data and graph technologies to reinvent knowledge management inside organizations. The platform aims to organize and distribute the organization's knowledge, making it centralized and universally accessible to every employee. The Enterprise Knowledge Graph is a central place to structure, simplify and connect the knowledge of an organization. By removing complexity, the knowledge graph brings more transparency, openness and simplicity into organizations. That democratizes communication and empowers individuals to share knowledge and to make decisions based on comprehensive knowledge. This platform can change the way we work, challenge the traditional hierarchical approach to getting work done, and help unleash human potential!
Join Concept Searching and partner C/D/H for this thought-provoking webinar on what intelligent enterprise search should be.
Our solution is unique in the marketplace, and overcomes the limitations of other enterprise search engines. It was originally deployed as an enterprise search solution for engineers and support staff.
This webinar will focus on how one unified view of all unstructured, semi-structured, and structured data assets, including 2D and 3D images, can be integrated into the search interface, with previewers and navigational aids.
Both business and technical professionals will benefit from this session:
• Understand how the technology works, and how it can be set up with a platform and search engine of choice
• See how search returns results, and provides visual and navigational aids for all information retrieved
• Watch how to select an image based on color, size, or shape
• Learn how any business or artificial intelligence applications can benefit from the multi-term metadata created
• Find out why the search framework provides a responsive user interface for any tablet, PC or mobile device
Talk given for the #phpbenelux user group, March 27th in Gent (BE), with the goal of convincing developers who are used to building PHP/MySQL apps to broaden their horizons when adding search to their site. Be sure to also have a look at the notes for the slides; they explain some of the screenshots, etc.
An accompanying blog post about this subject can be found at http://www.jurriaanpersyn.com/archives/2013/11/18/introduction-to-elasticsearch/
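For readers new to dedicated search engines, here is a minimal Python sketch (not from the talk) of the inverted index, the core data structure behind engines like Elasticsearch; the documents and query are illustrative only:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

docs = {
    1: "MySQL is a relational database",
    2: "Elasticsearch is a search engine",
    3: "Add search to your MySQL app",
}
index = build_index(docs)
print(search(index, "mysql search"))  # → {3}
```

A real engine adds tokenization, stemming, and relevance scoring on top of this structure, but term-to-postings lookup is the reason full-text queries stay fast as data grows.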
Data Model for Mainframe in Splunk: The Newest Feature of IronstreamPrecisely
Valuable mainframe data is often the missing piece in a holistic infrastructure view within Splunk. But if you're not a mainframe expert, knowing which data sources, fields and calculations are needed to get results within Splunk can be a challenge. Even those with mainframe knowledge can sometimes struggle.
With Syncsort Ironstream® you can easily capture the elements you need in real-time – and Ironstream's new Mainframe Data Model makes it easier than ever to work with complex mainframe metrics in Splunk.
View this webinar on-demand to learn more about this new feature, as well as how to:
• See categorized mainframe metrics in easily understood terms
• Get results faster – no need to research data sources, fields and calculations
• Broaden access to more team members – without the need for deep mainframe knowledge
• Use built-in Splunk tooling to get up and running quickly
• Realize valuable ROI sooner and eliminate the mainframe blind spot
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...Zaloni
When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma, thought leader and coauthor of Architecting Data Lakes, offers lessons learned from the field to get you started.
Solving the Disconnected Data Problem in Healthcare Using MongoDBMongoDB
The data diversity in healthcare and life sciences is exploding and the market is fundamentally changing as a result of healthcare reform. The result is more and more data, but it is compartmentalized and disconnected. At Zephyr Health, we have developed a data platform that is able to provide connectivity between thousands of healthcare data assets using an ontology-driven approach, storing data in MongoDB. This session will show how we break down this very challenging problem and how some of MongoDB's more recent features have been utilized to do so.
MongoDB Days Germany: Data Processing with MongoDBMongoDB
Presented by Marc Schwering, Senior Solutions Architect, MongoDB
Modern architectures are moving away from "one size fits all" solutions. The best tool needs to be put to the job, and given the many options available today, chances are that you'll end up using MongoDB for your operational workload alongside data processing systems like Apache Flink or Spark for your high-speed data processing needs. When modeling documents or data structures, several key aspects need attention: the distribution of data nodes, streaming capabilities, performance, aggregation and queryability options, and how the different data processing software can benefit from subtle but substantial model changes. This session covers how to enhance your architecture using data processing technologies such as Apache Flink and Spark. It takes the audience through the evolution of an app from simple to complex, along with its architectural requirements. We'll look into similarities and differences of the available technologies, and you will walk away with an understanding of how to use MongoDB to fulfill more advanced tasks such as personalization through clustering algorithms.
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...MongoDB
This session covers how to capture and analyze customer behavior to create more relevant contexts for customers. We will cover how to use your current BI features and, more importantly, how newer technologies approach the challenge. You will walk away with a good idea of how to build and drive even more contextually relevant experiences for customers, leading to even more successful engagements.
CREATE SEARCH DRIVEN BUSINESS INTELLIGENCE APPLICATION USING FAST SEARCH FO...Netwoven Inc.
Understand the Importance of Search Based Applications in today’s enterprise and how to integrate Business Intelligence and Search for business benefit.
Role of Microsoft FAST Search in an enterprise for building Search-based Business Intelligence Applications.
Demonstration of FAST search-based BI applications.
As data sets continue to grow, search remains a key technology for many applications. But what is the current state of the enterprise search market? Which providers are gaining market share, and what are the latest developments and innovations? Based on experience from dozens of recent search projects using a range of technologies, this presentation will summarize market conditions, discuss current best practices for creating great search systems, and suggest some future trends to watch out for.
Has your app taken off? Are you thinking about scaling? MongoDB makes it easy to horizontally scale out with built-in automatic sharding, but did you know that sharding isn't the only way to achieve scale with MongoDB?
In this webinar, we'll review three different ways to achieve scale with MongoDB. We'll cover how you can optimize your application design and configure your storage to achieve scale, as well as the basics of horizontal scaling. You'll walk away with a thorough understanding of options to scale your MongoDB application.
Topics covered include:
- Scaling Vertically
- Hardware Considerations
- Index Optimization
- Schema Design
- Sharding
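As a rough illustration of how horizontal scaling distributes data, here is a hedged Python sketch of hashed shard-key routing in the spirit of MongoDB's hashed sharding; the hash function and shard count below are illustrative, not MongoDB's actual implementation:

```python
import hashlib

def shard_for(key, num_shards):
    """Deterministically route a shard-key value to one of num_shards
    shards by hashing it (illustrative; MongoDB uses its own hash)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# The same key always routes to the same shard, so reads and writes
# for a given document land on a single node, while a good hash
# spreads keys roughly evenly across shards.
keys = [f"user{i}" for i in range(1000)]
placement = {k: shard_for(k, 4) for k in keys}
print({s: sum(1 for v in placement.values() if v == s) for s in range(4)})
```

This is why shard-key choice matters: a key that hashes to only a few values concentrates load on a few shards instead of spreading it.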
The Role of Patterns in the Era of Large Language ModelsYunyao Li
Slides for my keynote at the PAN-DL Workshop (Pattern-based Approaches to NLP in the Age of Deep Learning) at EMNLP'2023 (December 6, 2023).
In this talk, I share our initial learnings from constructing, growing and serving large knowledge graphs
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-LoopYunyao Li
Keynote talk at HILDA'2023 at SIGMOD on June 18, 2023.
Abstract: The ability to build large-scale knowledge bases that capture and extend the implicit knowledge of human experts is the foundation for many AI systems. We use an ontology-driven approach for the building, growing and serving of such knowledge bases. This approach relies on several well-known building blocks: document conversion, natural language processing, entity resolution, data transformation and fusion. In this talk, I will discuss a wide range of real-world challenges related to building these blocks and present our work to address these challenges via better human-machine cooperation.
Meaning Representations for Natural Languages: Design, Models and ApplicationsYunyao Li
EMNLP'2022 Tutorial "Meaning Representations for Natural Languages: Design, Models and Applications"
Instructors: Jeffrey Flanigan, Ishan Jindal, Yunyao Li, Tim O’Gorman, Martha Palmer
Abstract:
We propose a cutting-edge tutorial that reviews the design of common meaning representations, SoTA models for predicting meaning representations, and the applications of meaning representations in a wide range of downstream NLP tasks and real-world applications. Reporting by a diverse team of NLP researchers from academia and industry with extensive experience in designing, building and using meaning representations, our tutorial has three components: (1) an introduction to common meaning representations, including basic concepts and design challenges; (2) a review of SoTA methods on building models for meaning representations; and (3) an overview of applications of meaning representations in downstream NLP tasks and real-world applications. We will also present qualitative comparisons of common meaning representations and a quantitative study on how their differences impact model performance. Finally, we will share best practices in choosing the right meaning representation for downstream tasks.
Invited talk at Document Intelligence workshop at KDD'2021.
Harvesting information from complex documents such as financial reports and scientific publications is critical to building AI applications for business and research. Such documents are often in PDF format, with critical facts and data conveyed in tables and graphs; extracting this information is essential to deriving insights from these documents. In IBM Research, we have a rich agenda in this area that we call Deep Document Understanding. In this talk, I will focus on our research on Deep Table Understanding — extracting and understanding tables from PDF documents. I will introduce key challenges in table extraction and understanding and how we address such challenges, from how to acquire data at scale to enable deep neural network models to how to build, customize and evaluate such models. I will also describe how our work enables real-world use cases in domains such as finance and life science. Finally, I will briefly present TableQA, an important downstream task enabled by Deep Table Understanding.
Explainability for Natural Language ProcessingYunyao Li
Final deck for our popular tutorial on "Explainability for Natural Language Processing" at KDD'2021. See links below for downloadable version (with higher resolution) and recording of the live tutorial.
Title: Explainability for Natural Language Processing
Presenters: Marina Danilevsky, Shipi Dhanorkar, Yunyao Li, Lucian Popa, Kun Qian and Anbang Xu
Website: http://xainlp.github.io/
Recording: https://www.youtube.com/watch?v=PvKOSYGclPk&t=2s
Downloadable version with higher resolution: https://drive.google.com/file/d/1_gt_cS9nP9rcZOn4dcmxc2CErxrHW9CU/view?usp=sharing
@article{kdd2021xaitutorial,
title={Explainability for Natural Language Processing},
author={Danilevsky, Marina and Dhanorkar, Shipi and Li, Yunyao and Popa, Lucian and Qian, Kun and Xu, Anbang},
journal={KDD},
year={2021}
}
Abstract:
This lecture-style tutorial, which mixes in an interactive literature browsing component, is intended for the many researchers and practitioners working with text data and on applications of natural language processing (NLP) in data science and knowledge discovery. The focus of the tutorial is on the issues of transparency and interpretability as they relate to building models for text and their applications to knowledge discovery. As black-box models have gained popularity for a broad range of tasks in recent years, both the research and industry communities have begun developing new techniques to render them more transparent and interpretable. Reporting from an interdisciplinary team of social science, human-computer interaction (HCI), and NLP/knowledge management researchers, our tutorial has two components: an introduction to explainable AI (XAI) in the NLP domain and a review of the state-of-the-art research; and findings from a qualitative interview study of individuals working on real-world NLP projects as they are applied to various knowledge extraction and discovery tasks at a large, multinational technology and consulting corporation. The first component will introduce core concepts related to explainability in NLP. Then, we will discuss explainability for NLP tasks and report on a systematic literature review of the state-of-the-art literature in AI, NLP and HCI conferences. The second component reports on our qualitative interview study, which identifies practical challenges and concerns that arise in real-world development projects that require the modeling and understanding of text data.
Explainability for Natural Language ProcessingYunyao Li
NOTE: Please check out the final version here with small but important updates and links to downloadable version and recording: https://www.slideshare.net/YunyaoLi/explainability-for-natural-language-processing-249992241
Updated version on our popular tutorial on "Explainability for Natural Language Processing" as a tutorial at KDD'2021.
Title: Explainability for Natural Language Processing
@article{kdd2021xaitutorial,
title={Explainability for Natural Language Processing},
author={Danilevsky, Marina and Dhanorkar, Shipi and Li, Yunyao and Popa, Lucian and Qian, Kun and Xu, Anbang},
journal={KDD},
year={2021}
}
Presenters: Marina Danilevsky, Shipi Dhanorkar, Yunyao Li, Lucian Popa, Kun Qian and Anbang Xu
Website: http://xainlp.github.io/
Abstract:
This lecture-style tutorial, which mixes in an interactive literature browsing component, is intended for the many researchers and practitioners working with text data and on applications of natural language processing (NLP) in data science and knowledge discovery. The focus of the tutorial is on the issues of transparency and interpretability as they relate to building models for text and their applications to knowledge discovery. As black-box models have gained popularity for a broad range of tasks in recent years, both the research and industry communities have begun developing new techniques to render them more transparent and interpretable. Reporting from an interdisciplinary team of social science, human-computer interaction (HCI), and NLP/knowledge management researchers, our tutorial has two components: an introduction to explainable AI (XAI) in the NLP domain and a review of the state-of-the-art research; and findings from a qualitative interview study of individuals working on real-world NLP projects as they are applied to various knowledge extraction and discovery tasks at a large, multinational technology and consulting corporation. The first component will introduce core concepts related to explainability in NLP. Then, we will discuss explainability for NLP tasks and report on a systematic literature review of the state-of-the-art literature in AI, NLP and HCI conferences. The second component reports on our qualitative interview study, which identifies practical challenges and concerns that arise in real-world development projects that require the modeling and understanding of text data.
Slides for talk given at Women in Engineering on March 20, 2021.
Abstract:
Natural language understanding is a fundamental task in artificial intelligence. English understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Discovery. However, scaling existing products/services to support additional languages remains an open challenge. In this talk, we will discuss the open challenges in supporting universal natural language understanding. We will share our work from the past few years in addressing these challenges. We will also showcase how universal semantic representation of natural languages can enable cross-lingual information extraction in concrete domains (e.g. compliance) and show ongoing efforts toward seamlessly scaling existing NLP capabilities across languages with minimal effort.
Explainability for Natural Language ProcessingYunyao Li
Tutorial at AACL'2020 (http://www.aacl2020.org/program/tutorials/#t4-explainability-for-natural-language-processing).
More recent version: https://www.slideshare.net/YunyaoLi/explainability-for-natural-language-processing-249912819
Title: Explainability for Natural Language Processing
@article{aacl2020xaitutorial,
title={Explainability for Natural Language Processing},
author= {Dhanorkar, Shipi and Li, Yunyao and Popa, Lucian and Qian, Kun and Wolf, Christine T and Xu, Anbang},
journal={AACL-IJCNLP 2020},
year={2020}
}
Presenters: Shipi Dhanorkar, Christine Wolf, Kun Qian, Anbang Xu, Lucian Popa and Yunyao Li
Video: https://www.youtube.com/watch?v=3tnrGe_JA0s&feature=youtu.be
Abstract:
We propose a cutting-edge tutorial that investigates the issues of transparency and interpretability as they relate to NLP. Both the research community and industry have been developing new techniques to render black-box NLP models more transparent and interpretable. Reporting from an interdisciplinary team of social science, human-computer interaction (HCI), and NLP researchers, our tutorial has two components: an introduction to explainable AI (XAI) and a review of the state-of-the-art for explainability research in NLP; and findings from a qualitative interview study of individuals working on real-world NLP projects at a large, multinational technology and consulting corporation. The first component will introduce core concepts related to explainability in NLP. Then, we will discuss explainability for NLP tasks and report on a systematic literature review of the state-of-the-art literature in AI, NLP, and HCI conferences. The second component reports on our qualitative interview study which identifies practical challenges and concerns that arise in real-world development projects which include NLP.
Towards Universal Language Understanding (2020 version)Yunyao Li
Keynote talk given at the Pacific Asia Conference on Language, Information and Computation (PACLIC 34) on October 24, 2020.
Title: Towards Universal Natural Language Understanding
Abstract:
Understanding the semantics of natural language is a fundamental task in artificial intelligence. English semantic understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Compare and Comply. However, scaling existing products/services to support additional languages remains an open challenge. In this talk, we will discuss the open challenges in supporting universal natural language understanding. We will share our work in addressing these challenges over the past few years to provide the same unified semantic representation across languages. We will also showcase how such universal semantic understanding of natural languages can enable cross-lingual information extraction in concrete domains (e.g. insurance and compliance) and show promise toward seamlessly scaling existing NLP capabilities across languages with minimal effort.
Towards Universal Semantic Understanding of Natural LanguagesYunyao Li
Keynote talk at TextXD 2019(https://www.textxd.org)
Abstract:
Understanding the semantics of natural language is a fundamental task in artificial intelligence. English semantic understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Compare and Comply. However, scaling existing products/services to support additional languages remains an open challenge. In this demo, we will present Polyglot, a multilingual semantic parser capable of semantically parsing sentences in 9 different languages from 4 different language groups into the same unified semantic representation. We will also showcase how such universal semantic understanding of natural languages can enable cross-lingual information extraction in concrete domains (e.g. insurance and compliance) and show promise toward seamlessly scaling existing NLP capabilities across languages with minimal effort.
An In-depth Analysis of the Effect of Text Normalization in Social MediaYunyao Li
Poster corresponding to our NAACL'2015 paper "An In-depth Analysis of the Effect of Text Normalization in Social Media"
Abstract: Recent years have seen increased interest in text normalization in social media, as the informal writing styles found in Twitter and other social media data often cause problems for NLP applications. Unfortunately, most current approaches narrowly regard the normalization task as a “one size fits all” task of replacing non-standard words with their standard counterparts. In this work we build a taxonomy of normalization edits and present a study of normalization to examine its effect on three different downstream applications (dependency parsing, named entity recognition, and text-to-speech synthesis). The results suggest that how the normalization task should be viewed is highly dependent on the targeted application. The results also show that normalization must be thought of as more than word replacement in order to produce results comparable to those seen on clean text.
Paper: https://www.aclweb.org/anthology/N15-1045
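The naive word-replacement view of normalization that the paper critiques can be sketched in a few lines of Python; the lookup table below is hypothetical, whereas real systems learn such mappings from data:

```python
# Hypothetical non-standard -> standard mapping; real normalizers
# learn these pairs from annotated social media corpora.
NORM_DICT = {"u": "you", "gr8": "great", "2morrow": "tomorrow"}

def normalize(text):
    """Naive word-replacement normalization: swap each non-standard
    token for its standard form if one is known, else keep it as-is."""
    return " ".join(NORM_DICT.get(t.lower(), t) for t in text.split())

print(normalize("u will do gr8 2morrow"))  # → "you will do great tomorrow"
```

The paper's finding is precisely that this token-for-token view is insufficient: edits such as insertions, deletions, and reorderings matter differently depending on whether the downstream task is parsing, NER, or text-to-speech.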
Exploiting Structure in Representation of Named Entities using Active LearningYunyao Li
Slides for our COLING'18 paper: http://aclweb.org/anthology/C18-1058
Fundamental to several knowledge-centric applications is the need to identify named entities from their textual mentions. However, entities lack a unique representation and their mentions can differ greatly. These variations arise in complex ways that cannot be captured using textual similarity metrics. However, entities have underlying structures, typically shared by entities of the same entity type, that can help reason over their name variations. Discovering, learning and manipulating these structures typically requires high manual effort in the form of large amounts of labeled training data and handwritten transformation programs. In this work, we propose an active-learning based framework that drastically reduces the labeled data required to learn the structures of entities. We show that programs for mapping entity mentions to their structures can be automatically generated using human-comprehensible labels. Our experiments show that our framework consistently outperforms both handwritten programs and supervised learning models. We also demonstrate the utility of our framework in relation extraction and entity resolution tasks.
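To make the notion of entity "structures" concrete, here is a small sketch with two hand-written regex transformation programs for a PERSON-like entity type; the paper's contribution is learning such programs from a few human-comprehensible labels rather than writing them by hand, and the patterns below are illustrative only:

```python
import re

# Two hand-written structure patterns for person-name mentions.
# The framework in the paper would induce programs like these
# automatically from a small amount of labeled data.
PATTERNS = [
    re.compile(r"^(?P<last>\w+),\s*(?P<first>\w+)$"),  # "Li, Yunyao"
    re.compile(r"^(?P<first>\w+)\s+(?P<last>\w+)$"),   # "Yunyao Li"
]

def to_structure(mention):
    """Map a name mention to a canonical (first, last) structure,
    or None if no known pattern matches."""
    for pattern in PATTERNS:
        m = pattern.match(mention)
        if m:
            return (m.group("first"), m.group("last"))
    return None

# Surface forms differ, but the underlying structure is the same:
print(to_structure("Li, Yunyao") == to_structure("Yunyao Li"))  # → True
```

Reasoning over such canonical structures is what lets two textually dissimilar mentions be recognized as the same entity, which plain string-similarity metrics miss.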
K-SRL: Instance-based Learning for Semantic Role LabelingYunyao Li
Slides for our COLING'16 paper http://aclweb.org/anthology/C/C16/C16-1058.pdf
Abstract:
Semantic role labeling (SRL) is the task of identifying and labeling predicate-argument structures in sentences with semantic frame and role labels. A known challenge in SRL is the large number of low-frequency exceptions in training data, which are highly context-specific and difficult to generalize. To overcome this challenge, we propose the use of instance-based learning that performs no explicit generalization, but rather extrapolates predictions from the most similar instances in the training data. We present a variant of k-nearest neighbors (kNN) classification with composite features to identify nearest neighbors for SRL. We show that high-quality predictions can be derived from a very small number of similar instances. In a comparative evaluation we experimentally demonstrate that our instance-based learning approach significantly outperforms current state-of-the-art systems on both in-domain and out-of-domain data, reaching F1-scores of 89.28% and 79.91%, respectively.
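The instance-based idea can be sketched as a toy kNN over composite features; the feature names, labels, and overlap-based similarity below are illustrative, not the paper's actual feature set or distance function:

```python
from collections import Counter

def knn_predict(train, features, k=3):
    """Instance-based prediction: score each training instance by its
    feature overlap with the query, then majority-vote over the top k.
    No explicit generalization; predictions come straight from data."""
    ranked = sorted(
        train,
        key=lambda inst: -len(set(inst["features"]) & set(features)),
    )
    votes = Counter(inst["label"] for inst in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical SRL training instances: composite features -> role label.
train = [
    {"features": ["lemma=give", "pos=NN", "position=after"], "label": "A2"},
    {"features": ["lemma=give", "pos=NN", "position=before"], "label": "A0"},
    {"features": ["lemma=send", "pos=NN", "position=after"], "label": "A2"},
]
print(knn_predict(train, ["lemma=give", "pos=NN", "position=after"], k=1))  # → "A2"
```

Because the prediction is read off the most similar stored instances, a single memorized low-frequency exception can be reproduced exactly, which is precisely what eager generalizers struggle with.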
Natural Language Data Management and Interfaces: Recent Development and Open ...Yunyao Li
Slides deck for SIGMOD 2017 Tutorial.
ABSTRACT:
The volume of natural language text data has been rapidly increasing over the past two decades, due to factors such as the growth of the Web, the low cost associated with publishing and the progress on the digitization of printed texts. This growth, combined with the proliferation of natural language systems for searching and retrieving information, provides tremendous opportunities for studying some of the areas where database systems and natural language processing systems overlap. This tutorial explores two of the areas of overlap most relevant to the database community: (1) managing natural language text data in a relational database, and (2) developing natural language interfaces to databases. The tutorial presents state-of-the-art methods, related systems, research opportunities and challenges covering both areas.
Polyglot: Multilingual Semantic Role Labeling with Unified LabelsYunyao Li
Poster for our ACL paper "Polyglot: Multilingual Semantic Role Labeling with Unified Labels".
Abstract:
We present POLYGLOT, a semantic role labeling system capable of semantically parsing sentences in 9 different languages from 4 different language groups. A core differentiator is that this system predicts English Proposition Bank labels for all supported languages. This means that, for instance, a Japanese sentence will be tagged with the same labels as an English sentence with similar semantics would be. This is made possible by training the system with target-language data that was automatically labeled with English PropBank labels using an annotation projection approach. We give an overview of our system and the automatically produced training data, and discuss possible applications and limitations of this work. We present a demonstrator that accepts sentences in English, German, French, Spanish, Japanese, Chinese, Arabic, Russian and Hindi and outputs a visualization of its shallow semantics.
Tyler Baldwin, Yunyao Li, Bogdan Alexe, Ioana Roxana Stanoi: Automatic Term Ambiguity Detection. ACL (2) 2013: 804-809
Abstract:
While the resolution of term ambiguity is important for information extraction (IE) systems, the cost of resolving each instance of an entity can be prohibitively expensive on large datasets. To combat this, this work looks at ambiguity detection at the term, rather than the instance, level. By making a judgment about the general ambiguity of a term, a system is able to handle ambiguous and unambiguous cases differently, improving throughput and quality. To address the term ambiguity detection problem, we employ a model that combines data from language models, ontologies, and topic modeling. Results over a dataset of entities from four product domains show that the proposed approach achieves an F-measure of 0.96, significantly above baseline.
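For reference, the F-measure reported above combines precision and recall into a single score; here is a minimal sketch, where the counts are hypothetical and chosen only to illustrate how a 0.96 score arises:

```python
def f_measure(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall computed from
    true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 96 correct ambiguity judgments,
# 4 false positives, 4 false negatives -> precision = recall = 0.96.
print(round(f_measure(96, 4, 4), 2))  # → 0.96
```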
Information Extraction --- A One-Hour SummaryYunyao Li
This is the deck that I made when taking CS767 at the Univ. of Michigan in 2006. While it is a few years old, it is still a useful deck for people who are new to information extraction.
Adaptive Parser-Centric Text NormalizationYunyao Li
Wonderful work done with Congle Zhang (my summer intern in 2012) and my IBM colleagues. Nominated for best paper award and presented at ACL 2013.
Adaptive Parser-Centric Text Normalization
Congle Zhang, Tyler Baldwin, Howard Ho, Benny Kimelfeld, Yunyao Li
Proceedings of ACL, pp. 1159--1168, 2013
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been easier to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He has around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
Enterprise Search in the Big Data Era: Recent Developments and Open Challenges
1. Enterprise Search in the
Big Data Era
Yunyao Li
Ziyang Liu
Huaiyu Zhu
IBM Research - Almaden
NEC Labs
IBM Research - Almaden
2. 1
Enterprise Search
• Providing intuitive access to an organization’s
various digital content
1
Report → Findings
• IDC report [IDC 05]: $5k/person/year in salary wasted due to poor search; 9-10 hr/person/week spent searching; unsuccessful 1/3-1/2 of the time
• Butler Group [Edwards 06]: 10% of salary cost wasted through ineffective search
• Accenture survey [Accenture 07]: middle managers spend 2 hr/day searching; >50% of what they found had no value
• Hawking, Enterprise Search, http://david-hawking.net/pubs/ModernIR2_Hawking_chapter.pdf
• [IDC 05] "The enterprise workplace: How it will change the way we work". IDC Report 32919
• [Edwards 06] www.butlergroup.com/pdf/PressReleases/ESRReportPressRelease.pdf
• [Accenture 07] http://newsroom.accenture.com/article_display.cfm?article_id=4484
3. 2
Search from User's Point of View
[Figure: to the user, search is "magic" — a query box returns a ranked list of results 1, 2, 3, 4, …]
INTRODUCTION SEARCH
4. 3
What Happens Behind the Scenes
Backend
• Collect data
• Analyze data
• Index data
Frontend
• Serve user queries
• Return results
[Diagram: data source → backend → index → frontend]
INTRODUCTION SEARCH
5. 4
How Does a Query Match a Document?
[Diagram: at indexing time, each document is analyzed (Analyze document) and added to the index (Build index); at query time, the query is analyzed (Analyze query), the index is searched (Search index), and results are presented (Present results) as a ranked list Doc 1, Doc 2, …]
INTRODUCTION SEARCH
6. 5
Search Is More Than Keyword Match
• Specific features in documents are important
– Title, url, person name, product, actions, …
• Features combine to form higher level concepts
– In document: Home page + person → personal homepage
– Cross document: URL link analysis, …
• The string representation in document may not match that in
user query
– Person name: Bill Clinton ↔ William Jefferson Clinton
• User queries may be ambiguous
– Multiple interpretations
• Presenting the results to user
– Ranking, grouping, interactive refinement
INTRODUCTION SEARCH
7. 6
Internet vs Enterprise – Web data
[Fagin WWW2003]
Creation of content
• Internet: democratic; appealing to the reader; links ≈ approval
• Enterprise: bureaucratic; conform to mandate; links ≈ internal structure
Relevant query results
• Internet: large number; overlapping information; a reasonable subset suffices; ranking is more universal
• Enterprise: small number; specific function; specific pages required; ranking is relative to the query
Spamming
• Internet: spam-infested; ranking can only be based on external authority
• Enterprise: mostly spam-free; ranking based on content or metadata is reliable
Search engine friendliness
• Internet: web pages designed to be search results; web page = document
• Enterprise: documents not designed to be search results; special treatment needed
INTRODUCTION ENTERPRISE VS INTERNET
8. 7
Internet vs Enterprise – Big Data
Content being searched
• Internet: sources: Web crawl; formats: html, xml, pdf, …
• Enterprise: variety of sources; variety of formats: email, database, application-specific access and formats
Search queries / expected results
• Internet: target: web pages, office documents; expect a list of documents; expect little personalization; return results directly
• Enterprise: target: rows, figures, experts, …; expect customized results; personalization required: geography, access, …; customize results
Related information
• Internet: link ≈ approval; small number of domain-specific knowledge sources; generic analysis
• Enterprise: link ≈ organization structure; large number of dynamic domain-specific knowledge sources; highly specialized analysis
Skill set of search admins
• Internet: large number of admins; search experts; facilitate update of search algorithms
• Enterprise: small number of admins; domain experts; facilitate use of domain knowledge
INTRODUCTION ENTERPRISE VS INTERNET
9. 8
Search Engine Components
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control/improvement
Frontend
Interpret user query
Search index
Present results
Interact with user
[Diagram: data source → backend → index → frontend, overseen by admin]
INTRODUCTION TUTORIAL OVERVIEW COMPONENTS
10. 9
Search Engine Architecture
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control
Frontend
Interpret user query
Search index
Present results
Interact with user
[Diagram: data source → backend → index → frontend]
11. 10
Main Backend Functions
Analysis (Understand)
Information extraction
Analyze and transform data
Indexing (Prepare for search)
Generate terms suitable for matching queries
Index search terms
Document Ingestion (Collect)
Collect all the data to be searched
Transform and store as documents
Local Analysis
(in-document analysis)
Global Analysis
(cross-document analysis)
13. 12
Typical analytics pipeline
[Diagram: DI → LA → GA → Idx; local analysis produces per-document feature sets S1={f11, f12, …}, S2={f21, f22, …}, S3={f31, f32, …}; global analysis produces cross-document features G1={g1, …}, G2={g2, g3, …}]
Data ingestion (DI)
• Collect data
• Transform to uniform document format
• Store in document store
Local analysis (LA)
• Information extraction from each document
Global analysis (GA)
• Cross-document analysis
• Rank, group, merge, and filter documents
Indexing (Idx)
• Generate search terms
• Index documents by search terms
BACKEND OVERVIEW
14. 13
Digression: Classical IR
[Diagram: the same DI → LA → GA → Idx pipeline]
Data ingestion (DI)
• Given set of files
Local analysis (LA)
• Tokenize
• Stop-word removal
• Stemming
• Form n-grams
Global analysis (GA)
• Calculate statistics of terms in documents
Indexing (Idx)
• Generate search terms
• Index by terms with statistics
BACKEND OVERVIEW
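The classical local-analysis chain just listed (tokenize → stop-word removal → stemming → n-grams) can be sketched in a few lines. The tiny stop-word list and the toy suffix stripper below are illustrative placeholders; a real system would use a full stop-word list and a proper stemmer (e.g., Porter's).

```python
import re

STOP_WORDS = {"the", "of", "a", "an", "is", "to", "in"}  # toy list, not exhaustive

def stem(token):
    """Toy suffix stripper standing in for a real stemmer (e.g., Porter's)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def local_analysis(text, n=2):
    """Classical IR local analysis: tokenize, drop stop words, stem, form n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    terms = [stem(t) for t in tokens if t not in STOP_WORDS]
    ngrams = [" ".join(terms[i:i + n]) for i in range(len(terms) - n + 1)]
    return terms + ngrams
```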
15. 14
Digression: Classical Web search
[Diagram: the same DI → LA → GA → Idx pipeline]
Data ingestion (DI)
• Crawl web pages
Local analysis (LA)
• Extract out-links
Global analysis (GA)
• Calculate eigenvalues of the link connection matrix
Indexing (Idx)
• Generate search terms
• Index documents by search terms, with PageRank
BACKEND OVERVIEW
16. 15
Demands of Enterprise Search
[Diagram: the same DI → LA → GA → Idx pipeline]
Data ingestion (DI)
• Handle variety of sources
• Handle variety of formats
• Deal with access policy
• Deal with update policy
Local analysis (LA)
• Incorporate domain knowledge
• Extract rich set of semantics
• Categorize documents
Global analysis (GA)
• Cross-document analysis
• Rank, group, merge, and filter documents
Indexing (Idx)
• Generate search terms
• Index documents by search terms
BACKEND OVERVIEW
17. 16
Desiderata of Backend
• Efficient incremental updates
– Fast turn-around time for updates
• System performance and reliability
– Scaling with data size and resources available
– Fault tolerance
• Ease of administration and quality improvement
– Allow search admins to customize domain-specific configurations
BACKEND OVERVIEW CHALLENGES / OPPORTUNITIES
19. 18
Data Ingestion
BACKEND DATA INGESTION
[Diagram: a variety of sources (Web, DB, App, pdf files) are crawled or pushed; each item is converted to text and then to documents — e.g., an email with a pdf attachment becomes two documents (Docid: 0001 with From/To/Date fields, Docid: 0002 with title/date fields) — which are stored in the document store, supporting update & retention policies]
20. 19
Document-centric View
• Data as a collection of documents
– Document as unit of storage and search result.
– Three major components
• Unique document identifier in the whole system
• Metadata fields: url, date, language, …
• Content field: text to be searched
• Representation of data of different structures
– Web pages → each page is a document
– Relational data → each row is a document
– Hierarchical data → each node is a document
BACKEND DATA INGESTION
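A minimal sketch of this document-centric view in Python; the `Document` class and the `from_row` helper are hypothetical names for illustration, not an API of any system discussed here.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Unit of storage and search result."""
    docid: str                                     # unique in the whole system
    metadata: dict = field(default_factory=dict)   # url, date, language, ...
    content: str = ""                              # text to be searched

def from_row(table, pk, row):
    """A relational row becomes one document: every column is kept as
    metadata, and string columns are concatenated into searchable content."""
    text = " ".join(v for v in row.values() if isinstance(v, str))
    return Document(docid=f"{table}:{pk}", metadata=dict(row), content=text)
```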
21. 20
Push vs Pull
Pull
• Definition: search engine initiates the transfer of data (e.g., a Web crawler)
• Advantages: operated by the search engine; uses standard crawlers
• Disadvantages: difficult to access special data sources; difficult to adjust domain-specific treatment
• Applicability: prevalent for the Internet; also useful for the enterprise
Push
• Definition: content owner initiates the transfer of data (e.g., apps with push notice)
• Advantages: can handle special access methods; easy to adjust refresh rate; easy to handle special formats
• Disadvantages: needs synchronization with the content owner
• Applicability: rare for the Internet; very important for the enterprise
BACKEND DATA INGESTION
22. 21
Transform the Data
• Format conversion
– Convert content to text: pdf, doc, …
• Keep as much structure as possible
• Metadata conversion
– Obtain and transform metadata: HTTP headers,
DB table metadata, …
• Merge/split documents
– One-to-many: zip file, email thread, attachments
– Many-to-one: social tags merged into the original doc
BACKEND DATA INGESTION
23. 22
Storage options
SQL database
• Pro: traditional RDBMS strengths; supports insert, update, delete, fielded query
• Con: too much system overhead
Indexing engine (Lucene)
• Pro: closer to the document-centric view; supports insert, delete, fielded query
• Con: no direct in-document update; needs special treatment for distributed processing
NoSQL databases
• Pro: lightweight; sufficient for simple use
• Con: may lack features needed in the future; transactions?
Issues to consider
• In-document update
• Access/retention policy
• Parallel processing
BACKEND DATA INGESTION
25. 24
Local Analysis
• Annotating pages
– Extract structured elements: title, header, …
– Extract features for people, projects,
communities, …
– Extract features for cross-document analysis.
• Categorizing pages
– Label by standard categories
• Language, geography, date, …
– Label pages by custom categories
• IBM examples: HR, person, IT help, ISSI, sales information,
marketing, corporate standards, legal & IP-law, …
Local analysis is essentially information extraction
BACKEND LOCAL ANALYSIS
26. 25
Rule-based vs. Learning-based IE
Rule-based IE
• Pro: declarative; easy to comprehend; easy to maintain; easy to incorporate domain knowledge; easy to debug
• Con: heuristic; requires tedious manual labor
ML-based IE
• Pro: trainable; adaptable; reduces manual effort
• Con: requires labeled data; requires retraining for domain adaptation; requires ML expertise to use or maintain; opaque (not transparent)
BACKEND LOCAL ANALYSIS INFORMATION EXTRACTION
27. 26
Landscape of Entity Extraction Implementations
NLP papers (2003-2012): 3.5% rule-based, 21% hybrid, 75% machine-learning-based
Commercial vendors (2013), all vendors: 45% rule-based, 22% hybrid, 33% machine-learning-based
Commercial vendors (2013), large vendors: 67% rule-based, 17% hybrid, 17% machine-learning-based
Example industrial systems:
• GATE Information Extraction
• IBM InfoSphere BigInsights
• Microsoft FAST
• SAP HANA
• SAS Text Analytics
• HP Autonomy
• Attensity
• Clarabridge
Source: [CLR2013] Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems!, EMNLP 2013
BACKEND LOCAL ANALYSIS INFORMATION EXTRACTION
28. 27
Local Analysis for Different Features [Zhu et al., WWW'07]
[Diagram: an intranet page flows through several extractors]
• NavPanel extraction: self-link identification → NavPanels
• Title extraction: matching title patterns → titles (e.g., "IBM Global Services Security Home" → "IBM Global Services Security")
• Person name in title: title extraction + dictionary match against the person-name dictionary (= employee directory) → title name (e.g., "G J Chaitin Home Page" → "G J Chaitin")
• URL extraction: matching URL patterns → URL names (e.g., http://w3-03.ibm.com/marketing/ → "marketing"; http://w3-03.ibm.com/isc/index.html → "isc"; http://chis.at.ibm.com/ → "chis")
BACKEND LOCAL ANALYSIS EXAMPLES
29. 28
Consolidation
– Example: Document language consolidation
• HTTP header: Accept-Language: en-us,en;q=0.5
• Meta tags: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
• Document text encoding
• URL: http://enterprise.com/hr/benefits/us/ca/
BACKEND LOCAL ANALYSIS TRANSFORMATIONS
31. 30
Global Analysis
• Deduplication
– Save resources, reduce result clutter
• Identify root of URL hierarchy
– Used for result grouping and ranking
• Anchor text analysis
– Assign external labels to documents
• Social tagging analysis
– Assign tags and their weights to documents
• Identify different versions of the same document
– Due to variations in date, language, …
• Enterprise-specific global analysis
– When certain documents co-exist, do this …
• …
BACKEND GLOBAL ANALYSIS
32. 31
Shingle-based Deduplication
(Leskovec, http://www.mmds.org/)
[Diagram: documents → shingle sets S1={s1, s2, …}, S2={s1, s3, …}, S3={s2, s3, …} → minhash signatures {h1(S1), h2(S1), …}, {h1(S2), h2(S2), …}, {h1(S3), h2(S3), …}]
Shingles:
• Character or token n-grams
• Possibly stemmed
• Possibly treated specially around stop words
Minhash:
• Maps sets to integers
• Based on a permutation of the universal set
Jaccard similarity: |A∩B| / |A∪B|
Theorem: the probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.
Works for a more diverse set of documents. More precise.
BACKEND GLOBAL ANALYSIS DEDUPLICATION
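A minimal sketch of shingling and minhash signatures, assuming the usual `(a*x + b) mod p` family of hash functions to simulate random permutations; parameter choices are illustrative.

```python
import random

def shingles(text, n=3):
    """Token n-grams of a document (character shingles are also common)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity |A∩B| / |A∪B|."""
    return len(a & b) / len(a | b)

def minhash_signature(shingle_set, num_hashes=128, seed=42):
    """One min value per simulated permutation h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # a large prime
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    ids = [hash(s) % p for s in shingle_set]
    return [min((a * x + b) % p for x in ids) for a, b in params]

def estimated_jaccard(sig1, sig2):
    """Fraction of agreeing minhashes ≈ Jaccard similarity (the theorem above)."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```

Near-duplicate candidates are then pairs whose estimated similarity exceeds a threshold; real systems add locality-sensitive hashing on top to avoid comparing all pairs.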
33. 32
Metadata-based Deduplication
(IBM Gumshoe search engine)
[Diagram: documents → metadata signatures S1=[h11, h12, …], S2=[h21, h22, …], S3=[h31, h32, …] → candidate groups G1={S1, …}, G2={S2, S3, …}]
Significant metadata:
• Document title
• Section headers
• Signatures from URL
Ensure that all similar candidates have the same signature.
Group by signature, then run in-group similarity analysis:
• Analyze documents within candidate groups in detail
More customizable for intranet. Less cost.
BACKEND GLOBAL ANALYSIS DEDUPLICATION
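The group-by-signature step can be sketched as below; the metadata field names (`title`, `url_path`) are hypothetical.

```python
from collections import defaultdict

def signature(doc):
    """Coarse signature from significant metadata (title + URL path here);
    all similar candidates must map to the same signature."""
    return (doc.get("title", "").strip().lower(), doc.get("url_path", ""))

def candidate_groups(docs):
    """Group documents by signature; only multi-member groups need the
    (more expensive) in-group similarity analysis."""
    groups = defaultdict(list)
    for d in docs:
        groups[signature(d)].append(d)
    return [g for g in groups.values() if len(g) > 1]
```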
34. 33
URL Root Analysis (Zhu et al., WWW’07)
[Diagram: a URL forest; host1/b/a and host1/b/c are roots, with descendants such as host1/b/a/~user1/, host1/b/a/~user1/pub, host1/b/a/x_index.htm, host1/b/c/d, host1/b/c/home.html, and host1/b/c/d/e/index.html (with ?a=us and ?a=uk variants)]
• Given a set of documents all with the same value V of feature X.
– E.g., at one time all webpages from the IBM Tucson site had the same title.
• Find the roots of the URL forest. These are the preferred results for query X=V.
– E.g., when searching for "Tucson home page", only the IBM Tucson homepage will match.
BACKEND GLOBAL ANALYSIS ROOT ANALYSIS
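Finding the roots of the URL forest reduces to keeping the URLs whose path is not an extension of any other URL in the set; a minimal sketch (query strings and other normalization details omitted).

```python
def path_key(url):
    """Split a host/path URL into (host, segment, segment, ...)."""
    return tuple(s for s in url.split("/") if s)

def url_roots(urls):
    """URLs in the set that are not descendants of any other URL in the set."""
    keys = {u: path_key(u) for u in urls}
    return sorted(u for u, k in keys.items()
                  if not any(k2 != k and k[:len(k2)] == k2
                             for k2 in keys.values()))
```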
35. 34
Label Assignment (Zhu et al., WWW’07)
BACKEND GLOBAL ANALYSIS LABEL ASSIGNMENT
[Diagram: documents A1 and A2 contain the anchor text "X home" linking to document B; bookmarks C1 ("X home"), C2 ("X"), and C3 ("Y home") point to document A2]
Anchor text global analysis:
• Assign label "X" and/or "Y" based on frequency
Social tagging global analysis:
• Assign labels "X home", "X", and "Y home" based on frequency
36. 35
Entity Integration using HIL
[Diagram: various data sources → information extraction (declarative IE, IBM SystemT [Chiticariu et al., ACL 2010]) → raw records → entity resolution → map, fuse, aggregate → unified entities]
Entity population rules:
• Create entities (from raw records, other entities, and links)
• Clean, normalize, aggregate, fuse
Entity resolution rules:
• Create links between raw records or entities
HIL [Hernández et al., EDBT'13] defines entity types (the logical data model of the integration flow) and (SQL-like) rules to specify the integration logic.
Optimizing compiler to a Big Data runtime (Jaql and Hadoop).
BACKEND GLOBAL ANALYSIS ENTITY INTEGRATION
38. 37
Indexing
• Generate and index search terms, to be
matched by terms generated at runtime from
user queries.
• Challenges:
– Extracted terms do not match user query terms
• Morphological changes, synonyms, …
– Importance of a term depends on the query
• Need for bucketing of indexes, …
– Support of incremental indexing
BACKEND INDEXING
39. 38
Term normalization
• Example: Date time normalization
– Given any of these
Wed Aug 27 10:06:11 PDT 2014
27 Aug 2014, 10:06:11
2014-08-27T10:06:11-07:00
27 Aug 2014
1409133971
– Normalize to 2014-08-27T10:06:11-07:00
– Other examples: Person names, product names,
…
BACKEND INDEXING TERM NORMALIZATION
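A sketch of date-time normalization by trying a list of candidate formats; the format list is illustrative, and a production system would need many more patterns (plus a lookup table for timezone abbreviations like PDT, which `strptime` cannot parse portably).

```python
from datetime import datetime, timezone

# Candidate input formats, tried in order (illustrative, not exhaustive).
FORMATS = [
    "%d %b %Y, %H:%M:%S",    # 27 Aug 2014, 10:06:11
    "%Y-%m-%dT%H:%M:%S%z",   # 2014-08-27T10:06:11-07:00
    "%d %b %Y",              # 27 Aug 2014
]

def normalize_datetime(s):
    """Normalize a date/time string to ISO 8601; None if unparseable."""
    s = s.strip()
    if s.isdigit():  # Unix epoch seconds
        return datetime.fromtimestamp(int(s), tz=timezone.utc).isoformat()
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt).isoformat()
        except ValueError:
            pass
    return None
```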
40. 39
Why Generate Variant Terms?
• Extracted feature string ≠ query string
– People names
• Document: "John Doe"; Search: "Doe, John" or "J Doe"
– Acronym expansions
• "gts" → "Global Technology Services"
– N-gram variant generation
• Title: "reimbursement of travel expenses"
• Terms: reimbursement, travel expenses, reimbursement travel, reimbursement of travel, reimbursement expenses
• Normalization alone is not a sufficient solution
– People names
• Document: "John Doe" → "J. Doe"; Search: "Jean Doe" → "J. Doe"
• These are not supposed to match
• Solution:
– Generate variant terms with different levels of approximation.
BACKEND INDEXING VARIANT TERM GENERATION
41. 40
Configurable Term Generation
• Configuration knobs determine the set of outputs
• Given "Mr. John (Jack) M. Doe Jr."
– Configuration 1:
Initial=both, Dot=with, NickName=both, MiddleName=both, NameSuffix=without, Title=without, Comma=both
Generates: John M. Doe / Doe, John M.; John Doe / Doe, John; J. M. Doe / Doe, J. M.; J. Doe / Doe, J.; Jack M. Doe / Doe, Jack M.; Jack Doe / Doe, Jack
– Configuration 2 (normalization):
Initial=without, Dot=without, NickName=without, MiddleName=without, NameSuffix=without, Title=without, Comma=without
Generates: John Doe
BACKEND INDEXING VARIANT TERM GENERATION
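A simplified sketch of configurable name-variant generation; the knobs below (initial, middle initial, nickname, comma form) are only a subset of those on the slide, and the function name is hypothetical.

```python
from itertools import product

def name_variants(first, middle, nick, last):
    """Generate person-name variants at several approximation levels by
    toggling: first name vs. nickname, spelled-out vs. initial, with or
    without the middle initial, and 'Last, Given' vs. 'Given Last'."""
    variants = set()
    for f, ini, mid, com in product((first, nick),
                                    ("with", "without"),   # first-name initial
                                    ("with", "without"),   # middle initial
                                    ("with", "without")):  # comma form
        given = f[0] + "." if ini == "with" else f
        if mid == "with":
            given += " " + middle[0] + "."
        variants.add(f"{last}, {given}" if com == "with" else f"{given} {last}")
    return variants
```

With normalization-style settings (everything "without"), only "John Doe" remains; the looser settings add "Doe, John M.", "J. Doe", "Jack Doe", and so on.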
43. 42
Search Engine Architecture
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control
Frontend
Interpret user query
Search index
Present results
Interact with user
[Diagram: data source → backend → index → frontend]
44. Serving User Queries at Front End (52)
1. Ambiguity (29)
2. Ranking (3)
3. Representation (6)
4. Expert Search (6)
5. Privacy (8)
45. 44
1. Ambiguity
• Optimal keywords may not be used.
– Misspelled
• “datbase”
– Under-specified
• polysemy: “java”
• too general: “database papers”
– Over-specified:
• synonyms, acronyms, abbreviations &
alternative names: “green card” ≡
“permanent residency”
• too specific: “MS Office 2007 for Mac x64
edition”
– Non-quantitative:
• “small laptop”
[Each problem maps to a solution: misspelled → query cleaning, query autocompletion; under-specified → query refinement; over-specified, non-quantitative → query rewriting]
46. 45
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors.
• query refinement
– making queries more specific, returning fewer results.
• query rewriting
– making queries more general / on-topic, returning more
relevant results.
• query forms
– enabling users to specify precise queries
FRONTEND AMBIGUITY
47. 46
Graph-based Spelling Correction
(Bao et al., ACL'11)
• Repartition the query.
– Each partition (token) should be plausible: confidence(correction) > threshold.
– Confidence: a linear combination of multiple scores, with parameters learned by an SVM.
• Domain knowledge is often used in calculating confidence.
• For each partition, generate candidate corrections with high scores.
Example: "enterpricsea rch" can be repartitioned as "enterpricse arch", "enterpric search", "enter pric search", etc.; for the partition "pric", candidate corrections include price: 0.8, prim: 0.6, etc.
FRONTEND AMBIGUITY QUERY CLEANING UNSTRUCTURED DATA
48. 47
Graph-based Spelling Correction
(Bao et al., ACL'11)
• Build a graph that connects candidate corrections.
• Each full path is a candidate query.
– Find the k top-weighted full paths.
Weights:
1. correction score (node weight)
2. merge penalty (node weight)
3. split penalty (edge weight)
[Diagram: for "enterpricsea rch", candidate nodes include enterprise, enter, price, prim, pric (price: 0.8, prim: 0.6, etc.), sea, arc, rich, search; example full paths: enterprise → search; enter → price → sea → rich]
FRONTEND AMBIGUITY QUERY CLEANING UNSTRUCTURED DATA
49. 48
Graph-based Spelling Correction
(Bao et al., ACL'11)
• Path weight doesn't consider term correlations.
• Calculate a score for each path.
– The score includes term correlations, e.g., correlation("enterprise search") > correlation("enterprise arc").
• This ensures the cleaned query has good-quality results.
• Correlations are computed based on the number of co-occurrences.
• Finally, return the paths with high scores.
FRONTEND AMBIGUITY QUERY CLEANING UNSTRUCTURED DATA
50. 49
XClean (Lu et al., ICDE'11)
– Based on the noisy channel model that finds the intended word given the user's input word.
– Results on XML are subtrees rooted at entity nodes.
• A result quality score is calculated for each entity node in T, and then aggregated.
• E.g., if Johnny and Mike work in the same department, then "Johnn, Mike" → "Johnny, Mike" rather than "John, Mike".
– Processes each word individually, i.e., no merge or split.
[Diagram: an XML subtree rooted at department, with children head → Johnny and employees → …]
Related: query cleaning on relational data (Pu, VLDB'08)
FRONTEND AMBIGUITY QUERY CLEANING STRUCTURED DATA
51. 50
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors.
• query refinement
– making queries more specific, returning fewer results.
• query rewriting
– making queries more general / on-topic, returning more
relevant results.
• query forms
– enabling users to specify precise queries
FRONTEND AMBIGUITY
52. 51
Query Autocompletion
Problem Space Dimensions
showing keywords
vs.
showing results
single keyword
vs.
multiple keyword
exact matching
vs.
fuzzy matching
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
53. 52
Error-Tolerating Autocompletion
(Chaudhuri et al., SIGMOD'09)
[Same problem-space dimensions: showing keywords vs. showing results; single vs. multiple keywords; exact vs. fuzzy matching]
Example: "desr" → desert, dessert, deserve
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
54. 53
Error-Tolerating Autocompletion
(Chaudhuri et al., SIGMOD'09)
Data contains "search", "sand" and "text"; max. edit distance = 1.
[Diagram: a trie over the data strings; as the user types s, se, sen, the set of trie nodes within the edit-distance bound of the input is maintained]
Showing results instead of keywords can be achieved by associating inverted lists with trie nodes.
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
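Edit-distance-tolerant prefix matching over a trie can be sketched by maintaining a dynamic-programming row of edit distances per trie node; this is the standard trie/DP technique, not necessarily the paper's exact algorithm.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set at the node ending a word

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def fuzzy_complete(root, typed, max_edits=1):
    """Words having a prefix within `max_edits` edits of the typed string."""
    results = set()

    def collect(node):
        if node.word:
            results.add(node.word)
        for child in node.children.values():
            collect(child)

    def dfs(node, row):  # row = edit distances of this trie prefix vs `typed`
        if row[-1] <= max_edits:   # this trie prefix matches the input
            collect(node)
        if min(row) > max_edits:   # prune: no extension can recover
            return
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for i in range(1, len(typed) + 1):
                cost = 0 if typed[i - 1] == ch else 1
                new_row.append(min(row[i - 1] + cost,
                                   row[i] + 1,
                                   new_row[i - 1] + 1))
            dfs(child, new_row)

    dfs(root, list(range(len(typed) + 1)))
    return results
```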
55. 54
Tastier (Li et al., VLDBJ'11)
[Same problem-space dimensions: showing keywords vs. showing results; single vs. multiple keywords; exact vs. fuzzy matching]
Example: "have a nni" → show results for "have a nice day"
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
56. 55
Tastier (Li et al., VLDBJ'11)
• Trie-based (similar to the previous paper).
– Trie leaf nodes are associated with inverted lists.
• To handle multiple keywords:
– Each record/document is associated with a sorted list of the words in it (forward list).
• So that a binary search can determine whether a string appears in a record/document as a prefix.
• Why not hash? Because we need to match prefixes, not whole words.
– Example forward list: "have a nice day" → "a, day, have, nice"
• Inverted-list intersections are computed incrementally using a cache for improved efficiency.
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
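The forward-list prefix test can be sketched with a binary search: `bisect_left` lands on the first word ≥ the prefix, which is the only position where a prefix match can start.

```python
import bisect

def forward_list(doc_text):
    """Sorted list of the distinct words in a document (the 'forward list')."""
    return sorted(set(doc_text.lower().split()))

def has_prefix(fwd, prefix):
    """Binary search: does any word in the forward list start with `prefix`?
    (This is why a hash set would not do — we match prefixes, not whole words.)"""
    i = bisect.bisect_left(fwd, prefix)
    return i < len(fwd) and fwd[i].startswith(prefix)
```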
57. 56
Phrase Prediction (Nandi et al., VLDB'07)
[Same problem-space dimensions: showing keywords vs. showing results; single vs. multiple keywords; exact vs. fuzzy matching]
Example: "a nice" → suggest "have a nice day"
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
58. 57
Phrase Prediction (Nandi et al., VLDB'07)
• Suggest phrases given the user input phrase.
– Need to find a good length for a suggested phrase.
• Too short: utility is small.
• Too long: low chance of being accepted.
• (Modified) suffix tree-based.
– Each node is a word, rather than a letter.
– Why not use a trie: phrases have no definitive starting point. A phrase may start in the middle of a sentence (i.e., at a suffix of the sentence), hence a suffix tree.
• Significant phrases (e.g., "laptop", "have a nice day").
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
59. 58
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors.
• query refinement
– making queries more specific, returning fewer results.
• query rewriting
– making queries more general / on-topic, returning more
relevant results.
• query forms
– enabling users to specify precise queries
FRONTEND AMBIGUITY
60. 59
Query Refinement
• Motivation
– Some under-specified queries on a large data corpus have too many results.
– Ranking cannot always be perfect.
• Approaches
– Identifying important terms in results
(structured/unstructured)
– Clustering results
(structured/unstructured)
– Faceted search
(structured)
FRONTEND AMBIGUITY QUERY REFINEMENT
61. 60
Using Clustered Results (Liu et al., PVLDB'11)
Example query: "Java" — all suggested queries are about the programming language.
It is desirable to refine an ambiguous query by its distinct meanings.
FRONTEND AMBIGUITY QUERY REFINEMENT
62. 61
• → Input: clustered results
– clustering method is irrelevant.
– e.g., the result of “Java” may have 3 clusters
corresponding to Java language, Java island, and
Java tea.
• ← Output: one refined query for each cluster.
Each refined query:
– maximally retrieves the results in its cluster
(recall)
– minimally retrieves the results not in its cluster
(precision)
Using Clustered Results (Liu et al., PVLDB'11)
FRONTEND AMBIGUITY QUERY REFINEMENT
63. 62
Using Important Terms in Results (Tao et al., EDBT'09)
• For relational data only.
• Given a keyword query, it outputs top-k most
frequent non-keyword terms in the results,
without generating the results.
– Avoiding result generation is possible since the
terms are ranked only by frequency: tradeoff of
quality and efficiency.
Data Clouds (for structured data): Koutrika EDBT 09
(more sophisticated term ranking, but needs to generate query results first.)
related
FRONTEND AMBIGUITY QUERY REFINEMENT
64. 63
Faceted Search
[Diagram: a facet tree — all → location: Sunnyvale, CA / Phoenix, AZ / Amherst, MA; department: data management / machine learning; …]
Challenges:
1. How to select facets and facet conditions at each level, to minimize the user's expected navigation cost?
2. How to rank facets and facet conditions?
(Chakrabarti, SIGMOD'04; Kashyap, CIKM'10)
FRONTEND AMBIGUITY QUERY REFINEMENT
65. 64
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors.
• query refinement
– making queries more specific, returning fewer results.
• query rewriting
– making queries more general / on-topic, returning more
relevant results.
• query forms
– enabling users to specify precise queries
FRONTEND AMBIGUITY
66. 65
Query Rewriting
• Motivation
– Synonyms, alternative names: “green card” vs
“permanent residency”.
– Too specific: “MS Office 2007 for Mac x64 edition”
– Non-quantitative: “small laptop”
• Approaches
– Using query/click logs
– Finding rewriting rules from missing results
• e.g., replace “green card” with “permanent residency”.
– Using “differential queries”
FRONTEND AMBIGUITY QUERY REWRITING
67. 66
Using Query and Click Logs (cheng
icde 10)
The availability of query and click logs
can be used to assess ground truth.
query Q
query log
click log
synonyms
hypernyms
hyponyms
of Q
“query” “search”
synonym
“MySQL” “database”
hypernym
“database” “MySQL”
hyponym
find and return historical queries
whose “ground truth” (via click
log) significantly overlaps with
top-k results of Q.
idea
FRONTEND AMBIGUITY QUERY REWRITING
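The overlap idea above can be sketched in a few lines of Python. This is a minimal illustration with a hypothetical click log and threshold; the cited work additionally distinguishes synonyms, hypernyms, and hyponyms by the direction and degree of overlap, which this sketch omits.

```python
def related_queries(q_topk, click_log, min_overlap=0.5):
    """Return historical queries whose clicked documents (their
    'ground truth') significantly overlap the top-k results of Q."""
    q_topk = set(q_topk)
    related = []
    for hist_q, clicked in click_log.items():
        overlap = len(q_topk & set(clicked)) / len(q_topk)
        if overlap >= min_overlap:
            related.append(hist_q)
    return related

# hypothetical click log: historical query -> documents users clicked
click_log = {
    "search": ["d1", "d2", "d5"],
    "mysql":  ["d7", "d8"],
    "query":  ["d1", "d3"],
}
print(related_queries(["d1", "d2", "d3", "d4"], click_log))
```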
68. 67
Automatic Suggestion of Rewriting
Rules from Missing Results (bao sigir 12)
• Challenges for automatically generating
rewriting rules:
– rules should be semantically natural.
– a new rule designed for one query may eliminate
good results of another query.
FRONTEND AMBIGUITY QUERY REWRITING
“green card”
result d is missing / should
be ranked higher
result d contains phrase
“permanent residency”
rewriting rule:
green card → permanent residency
69. 68
→ Input: query q, missed
desirable results d
← Output: selected
set of rules
Generate candidate
rules L → R.
• L: n-grams in q.
• R: n-grams in high-quality fields of d.
Identify semantically
natural rules by
machine learning.
Greedily select a
subset of rules that
maximizes the
overall query quality.
Automatic Suggestion of Rewriting
Rules from Missing Results (bao sigir 12)
FRONTEND AMBIGUITY QUERY REWRITING
green card → permanent residency
green card → federal government
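The candidate-generation step above can be sketched as follows: pair every n-gram of the query with every n-gram of a high-quality field of the missed result. This produces both good candidates ("green card → permanent residency") and noisy ones ("green card → federal government"); filtering the noisy ones is what the machine-learning and greedy-selection steps are for, and those are not shown here. The n-gram length cap is an assumption of this sketch.

```python
def ngrams(words, n_max=2):
    """All contiguous word n-grams up to length n_max."""
    return {" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)}

def candidate_rules(query, missed_doc_field):
    """Candidate rewriting rules L -> R: L is an n-gram of the query,
    R is an n-gram of a high-quality field of the missed result."""
    return {(l, r) for l in ngrams(query.split())
                   for r in ngrams(missed_doc_field.split()) if l != r}

rules = candidate_rules("green card", "permanent residency application")
print(("green card", "permanent residency") in rules)  # True
```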
70. 69
Keyword++ (Entity Databases)
(xin pvldb 10)
“small IBM laptop”
ID Product Name BrandName Screen Size Description
1 ThinkPad E545 Lenovo 15 The IBM laptop...small
business…
2 ThinkPad X240 Lenovo 12 This notebook...
To “understand” a term, compare two queries that
differ on this term, and analyze the differences of
attribute value distributions in the results.
idea
e.g., to understand term “IBM”, we can compare the results of
“IBM laptop” vs. “laptop”.
FRONTEND AMBIGUITY QUERY REWRITING
71. 70
Suppose: “IBM laptop” → 50 results, 30 having “brand: Lenovo”
“laptop” → 500 results, only 50 having “brand: Lenovo”
The difference on “brand: Lenovo” is significant,
reflecting the meaning of “IBM”.
IBM brand: Lenovo
small order by size ASC
Offline: compute the best mapping for all terms in query log
Online: compute the best segmentation of the query (DP).
“laptop”
“small laptop”
likewise:
Keyword++ (Entity Databases)
(xin pvldb 10)
FRONTEND AMBIGUITY QUERY REWRITING
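The differential-query comparison above can be sketched directly: compute the attribute-value distributions of the two result sets and pick the value whose relative frequency rises the most when the term is added. This shows only the comparison step, not the offline mapping of all logged terms or the online DP segmentation; the small result lists are hypothetical.

```python
def distribution(results, attr):
    """Fraction of results carrying each value of the attribute."""
    counts = {}
    for r in results:
        counts[r[attr]] = counts.get(r[attr], 0) + 1
    return {v: c / len(results) for v, c in counts.items()}

def term_signal(results_with, results_without, attr):
    """The attribute value whose relative frequency rises the most when
    the term is added to the query: the differential-query comparison."""
    d1 = distribution(results_with, attr)
    d0 = distribution(results_without, attr)
    return max(d1, key=lambda v: d1[v] - d0.get(v, 0.0))

# "IBM laptop": 3 of 5 results are Lenovo; "laptop": only 1 of 5.
results_with = [{"brand": "Lenovo"}] * 3 + [{"brand": "HP"}] * 2
results_without = [{"brand": "Lenovo"}] + [{"brand": "HP"}] * 4
print(term_signal(results_with, results_without, "brand"))
```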
72. 71
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors.
• query refinement
– making queries more specific, returning fewer results.
• query rewriting
– making queries more general / on-topic, returning more
relevant results.
• query forms
– enabling users to specify precise queries
FRONTEND AMBIGUITY
73. 72
Offline: how many query forms, and which query
forms, should be generated?
• Too many – hard to find the relevant forms.
• Too few – limiting query expressiveness.
Online: how to identify query forms relevant to
users’ search needs?
Query Forms
Enabling users to issue precise structured queries
without mastering structured query languages.
advantage
challenges
Baid SIGMOD 09 Jayapandian PVLDB 08 Ramesh PVLDB 11 Tang TKDE 13
FRONTEND AMBIGUITY QUERY FORMS
74. Serving User Queries at Front End (52)
1. Ambiguity (29)
2. Ranking (3)
3. Representation (6)
4. Expert Search (6)
5. Privacy (8)
75. 74
2. Ranking
Ranking Method Categories
Unstructured Data
• represents queries and documents using vectors
• each component is a term; the value is its weight
• ranking score = similarity (query vector, result vector)
Structured Data
• a document → a node or a result (subgraph/subtree)
vector space model
proximity based ranking
…
authority based ranking
…
FRONTEND RANKING
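The vector-space scoring above reduces to cosine similarity between term-weight vectors. A minimal sketch, with made-up term weights standing in for TF-IDF values:

```python
import math

def cosine(q, d):
    """Similarity between a query vector and a document vector, where
    each component is a term and its value is the term's weight."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

q  = {"enterprise": 1.0, "search": 1.0}
d1 = {"enterprise": 0.8, "search": 0.5, "engine": 0.3}
d2 = {"laptop": 0.9, "search": 0.2}
print(cosine(q, d1) > cosine(q, d2))  # d1 ranks higher for this query
```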
76. 75
2. Ranking
Ranking Method Categories
Unstructured Data
• proximity of keyword matches in a document can
boost its ranking.
Structured Data
• weighted tree/graph size, total distance from root to
each leaf, semantic distance, etc.
vector space model
…
authority based ranking
…
proximity based ranking
FRONTEND RANKING
77. 76
2. Ranking
Ranking Method Categories
vector space model
…
…
Unstructured Data
• nodes linked by many other important nodes are
important.
Structured Data
• authority may flow in both directions of an edge
• different types of edges in the data (e.g., entity-entity
edge, entity-attribute edge) may be treated differently.
proximity based ranking
authority based ranking
FRONTEND RANKING
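The authority-based idea ("nodes linked by many other important nodes are important") is the PageRank recurrence; a power-iteration sketch on a tiny hypothetical link graph:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power iteration: a node's score is boosted by incoming links
    from other high-scoring nodes."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            if outs:
                share = damping * rank[n] / len(outs)
                for m in outs:          # pass score along out-links
                    new[m] += share
            else:                       # dangling node: spread uniformly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
        rank = new
    return rank

links = {"a": ["c"], "b": ["c"], "c": ["a"]}
r = pagerank(links)
print(max(r, key=r.get))  # "c" is linked by the most pages
```

For structured data, as the slide notes, the same flow can be run over a data graph with per-edge-type weights, and authority may flow in both directions of an edge.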
78. Serving User Queries at Front End (52)
1. Ambiguity (29)
2. Ranking (3)
3. Representation (6)
4. Expert Search (6)
5. Privacy (8)
79. 78
3. Representation
• An enterprise corpus can be much more
heterogeneous than a collection of documents or
web pages.
• Different searches may target different result types:
a document, a figure, a tuple, a subgraph,
an analytical keyword query, etc.
Result diversification
Result summarization
Result differentiation
solutions
FRONTEND REPRESENTATION
80. 79
Result Diversification
• Result diversification is essentially the same
problem as query refinement.
– e.g., Java → Java language, Java tea, Java island.
• Same techniques apply.
FRONTEND REPRESENTATION DIVERSIFICATION
81. 80
Result Summarization
• Unstructured data: lots of work on text
summarization in machine learning, natural
language processing and IR communities.
• Structured data:
– Size-l object summary (Relational)
– Result snippet (XML)
Das, CMU 07 (unpublished)
Nenkova, Mining Text Data 12
surveys
FRONTEND REPRESENTATION SUMMARIZATION
82. 81
Size-l Object Summary (fakas pvldb 11)
……Mike……
first
window
“Mike”
unstructured
Mike
paper paper patent patent…
conference John …
… … …
… …
?
structured
FRONTEND REPRESENTATION SUMMARIZATION
83. 82
Size-l Object Summary (fakas pvldb 11)
• Each tuple has:
– a static importance score.
• similar idea as PageRank
– a run-time relevance score.
• distance to result root
• connectivity properties to result root
• Objective: find a connected snippet of the result,
which consists of l tuples and has the maximum
score.
• Dynamic programming based solution.
Result snippet for XML: Liu TODS 10
related
FRONTEND REPRESENTATION SUMMARIZATION
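The objective above (a connected snippet of l tuples with maximum score, containing the result root) admits a tree dynamic program. A compact sketch of that flavor of DP, with the paper's static-importance and runtime-relevance scores collapsed into a single hypothetical per-tuple score:

```python
def best_summary(tree, scores, root, l):
    """Max total score of a connected snippet of exactly l tuples that
    includes the result root (DP over the result tree)."""
    def dp(node):
        # best[k] = max score of a size-k connected subtree rooted here
        best = [float("-inf")] * (l + 1)
        best[1] = scores[node]
        for child in tree.get(node, []):
            cbest = dp(child)
            merged = best[:]
            for k in range(1, l + 1):          # tuples used so far
                if best[k] == float("-inf"):
                    continue
                for j in range(1, l - k + 1):  # tuples taken from child
                    if cbest[j] > float("-inf"):
                        merged[k + j] = max(merged[k + j], best[k] + cbest[j])
            best = merged
        return best
    return dp(root)[l]

# toy result tree rooted at the "Mike" tuple, with made-up scores
tree = {"Mike": ["paper1", "paper2", "patent1"], "paper1": ["conf", "John"]}
scores = {"Mike": 5, "paper1": 3, "paper2": 2, "patent1": 1, "conf": 2, "John": 4}
print(best_summary(tree, scores, "Mike", 3))  # Mike + paper1 + John = 12
```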
84. 83
Result Differentiation
Result 1 Result 2
event: year 2000 2012
paper: title OLAP
data
mining
cloud
scalability
search
“NEC Labs Open House”
result 1: a large table with many
people / papers / posters
result 2: a large table with many
people / papers / posters
…
results result differentiation
vs. comparing different credit cards on a bank website:
only with pre-defined features.
FRONTEND REPRESENTATION DIFFERENTIATION
85. 84
4. Expert Search
documents in which a candidate and a topic co-occur
topics near a candidate in a document
problem solving / ticket routing history
user’s knowledge on a topic
• expert should be more knowledgeable
social relationship between expert and user
• problem solving is usually more effective if expert has a close
social relationship with user
external corpus
• many employees publish content externally, e.g., papers, blogs.
ways for judging an expert
Find an expert within an enterprise to solve a particular problem.
goal
FRONTEND EXPERT SEARCH
86. 85
Classical Methods
• Builds a feature vector for each expert using various
evidence
• Ranks experts based on query, using traditional
retrieval models
candidate model
• First finds documents related to query, then locates
experts in documents
• Mimics the process a human takes.
document model
Balog CIKM 08
survey
FRONTEND EXPERT SEARCH
87. 86
User-Oriented Model (smirnova ecir 11)
Users prefer experts who:
are more knowledgeable
than themselves.
knowledge gain: p(e|q) – p(u|q)
have a close social relationship
with themselves.
time-to-contact: shortest path
department
head
John
employees
…
e = expert
u = user
FRONTEND EXPERT SEARCH
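The two preferences above can be combined into a single score: knowledge gain p(e|q) - p(u|q), discounted by time-to-contact (shortest path in the organizational graph). A sketch under simplifying assumptions: knowledge levels and the weight lam are made-up stand-ins for the model's estimated probabilities and trade-off parameter.

```python
from collections import deque

def shortest_path(graph, src, dst):
    """BFS hop count in the social/organizational graph (time-to-contact)."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == dst:
            return dist
        for nb in graph.get(node, []):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return float("inf")

def expert_score(knowledge, graph, user, expert, topic, lam=0.1):
    """Knowledge gain p(e|q) - p(u|q), discounted by time-to-contact."""
    gain = knowledge[expert][topic] - knowledge[user][topic]
    return gain - lam * shortest_path(graph, user, expert)

graph = {"john": ["head"], "head": ["john", "alice", "bob"],
         "alice": ["head"], "bob": ["head"]}
knowledge = {"john": {"db2": 0.1}, "alice": {"db2": 0.9}, "bob": {"db2": 0.95}}
# alice and bob are equally far from john; bob knows slightly more
print(max(["alice", "bob"],
          key=lambda e: expert_score(knowledge, graph, "john", e, "db2")))
```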
88. 87
Using Web Search Engine
(santos inf. process. manage. 11)
query q
result from intranet
web query q’ result from internet
formulate web query
search
intranet
corpus combine
candidate’s full name: “Jeff Smisek”
organization’s name: “IBM”
terms in q: “data integration”
excluding results from organization: “-site:ibm.com”
FRONTEND EXPERT SEARCH
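The web-query formulation step shown above (candidate name, organization name, the query terms, minus the organization's own site) is mechanical enough to sketch directly; the exact quoting conventions here are an assumption.

```python
def formulate_web_query(candidate_name, org_name, org_domain, query_terms):
    """Build the external web query for an expert candidate: full name,
    organization, query terms, excluding the organization's own pages."""
    parts = ['"%s"' % candidate_name, '"%s"' % org_name]
    parts += query_terms
    parts.append("-site:%s" % org_domain)
    return " ".join(parts)

print(formulate_web_query("Jeff Smisek", "IBM", "ibm.com",
                          ["data", "integration"]))
```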
89. 88
Ticket Routing (shao kdd 08)
new ticket: DB2 login failure
transferred to group A
transferred to group B
transferred to group C
resolved
How to find the best group and
reduce problem solving time?
Markov chain model
Using only previous routing
history (not ticket content)
FRONTEND EXPERT SEARCH
90. 89
Ticket Routing (shao kdd 08)
Pr(g|S)
probability of routing a ticket to
group g given previous groups S
Pr(g|S) includes the probability that:
• g can solve the ticket
• g can correctly re-route the ticket.
Train the Markov chain model from ticket routing history.
FRONTEND EXPERT SEARCH
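Training such a routing model from history can be sketched as follows. For brevity this sketch conditions only on the current group (an order-1 Markov chain), whereas the model above conditions Pr(g|S) on the whole set of previous groups; the ticket histories are hypothetical.

```python
from collections import defaultdict

def train_routing_model(histories):
    """Estimate Pr(next group | current group) from past ticket
    routing sequences (content-blind, order-1 simplification)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in histories:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for cur, nxts in counts.items()}

def next_group(model, current):
    """Route the ticket to the most probable next group."""
    return max(model[current], key=model[current].get)

histories = [["A", "B", "C"], ["A", "C"], ["A", "B"], ["B", "C"]]
model = train_routing_model(histories)
print(next_group(model, "A"))  # "B": A->B occurred twice, A->C once
```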
91. Serving User Queries at Front End (52)
1. Ambiguity (29)
2. Ranking (3)
3. Representation (6)
4. Expert Search (6)
5. Privacy (8)
92. 91
5. Privacy
It is sometimes desirable that the search engine doesn’t
know which documents a user wants to retrieve.
• For users: privacy
• For enterprises: avoiding liability
user privacy
While a search engine answers individual keyword
searches, there are methods that perform multiple
searches and, from the answers, piece together
aggregate information about the underlying corpus.
• Enterprises may not want to disclose such information to all
users.
data privacy
93. 92
User Privacy
Private Information Retrieval (PIR)
• old topic, tons of theoretical papers
Modifying the search engine, e.g.,
• forcing it to forget user activities
• embellishing queries with decoy terms (Pang PVLDB 10)
Using ghost queries to obfuscate user intention (Pang ICDE 12)
• no change to search engine
• light-weight
solutions
It is sometimes desirable that the search engine doesn’t
know which documents a user wants to retrieve.
• For users: privacy
• For enterprises: avoiding liability
user privacy
94. 93
Private Information Retrieval (PIR)
• Idea: retrieve more documents than needed.
• Naïve: retrieve the entire corpus.
• How to minimize the number of retrieved &
unneeded documents?
• Tons of theoretical papers on different variations
of the problem, e.g.,
– different computation power of the search engine
– different numbers of non-communicating corpus
replicas.
Gasarch EATCS Bulletin 2004
survey
95. 94
Ghost Queries (pang icde 12)
• Challenges
– Generate ghost queries on topics different from user’s
topics of interest, and make it difficult for the search
engine to infer user’s topics.
– Ghost queries need to be meaningful/realistic, so that
they cannot be easily identified.
generate
ghost queries
ghost queries
discard ghost
query results
results
submit to
search engine
user query
96. 95
Ghost Queries (pang icde 12)
• (e1, e2) privacy model
– Given a user query, if the probability of a topic
increases more than e1, it should be reduced to
below e2 by the ghost queries.
• Topics are predefined.
• A ghost query must be coherent: all words in
the ghost query should describe common or
related topics.
• Randomized algorithm based solution.
97. 96
Data Privacy
While a search engine answers individual keyword searches, there
are methods that perform multiple searches and, from the answers,
piece together aggregate information about the underlying corpus.
• Enterprises may not want to disclose such information to all users.
data privacy
inserting dummy tuples OR randomly generating attribute values
• only applicable to structured data
disallowing certain queries OR return snippets
• search quality loss
altering a small number of results: adding dummy results;
modifying results, hiding some results (Zhang SIGMOD 12)
solutions
FRONTEND PRIVACY
98. 97
Aggregate Suppression (zhang sigmod 12)
• Example: consider corpus A and B.
– A: n documents
– B: 2n documents
– A ⊂ B
• Goal: suppress COUNT(*), i.e., adversary cannot tell which
corpus is larger.
• Naïve approach 1: deterministically remove n documents from B.
– achieves the goal, but with search utility loss: those n documents can
never be retrieved.
• Naïve approach 2: randomly drop half of the results at run time.
– no search utility loss, but fails to achieve the goal: a clever adversary
can still get the information.
FRONTEND PRIVACY
99. 98
Aggregate Suppression (zhang sigmod 12)
• Algorithm ideas
– carefully adjusting query degree (number of
documents matched by a query) and document
degree (number of queries matching a
document) by document hiding at run-time.
– decline a query if its result can be covered by a
small number of previous queries. Return
previous query results instead.
FRONTEND PRIVACY
100. 99
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control/improvement
Admin
System performance
Search quality control/improvement
Frontend
Interpret user query
Search index
Present results
Interact with user
index
Data
source
Tutorial Outline
101. 100
Enterprise Search Administrators
• Main responsibilities
– Care and feeding of an enterprise search solution
• Monitor intranet help inboxes and respond to requests.
• Assist in troubleshooting intranet issues for content contributors
• Core skills required
– Understand general corporate business processes
– Experience in coordinating activities and managing
relationships
• with employees, content administrators, stakeholders, IT teams and
external agencies
Search Admin
Search administrators ≠ IR experts
Key Observation
Admin Overview
102. 101
What Does a Search Administrator Need?
Bad results
for query …
I’m missing the
golden URL…
Result 22 should
be ranked much
higher!
Enterprise Users
Query Logs
Query “global
campus” seems
unsatisfying
• Understand overall search
quality
• Overall trend
• YOY change
• By segmentation
• Understand individual search
results
• Why a certain result is or
isn’t brought back
• Its ranking
• Maintain search quality
• Underlying data evolves
• Terminology changes
• Policy/Business Process
changes
• Organization changes
• Hot topics
Search Admin
Admin Overview
105. 104
What Does a Search Administrator Need?
Bad results
for query …
I’m missing the
golden URL…
Result 22 should
be ranked much
higher!
Enterprise Users
Query Logs
Query “global
campus” seems
unsatisfying
• Understand overall search
quality
• Overall trend
• YOY change
• By segmentation
• Understand individual search
results
• Why a certain result is or
isn’t brought back
• Its ranking
• Maintain search quality
• Underlying data evolves
• Terminology changes
• Policy/Business Process
changes
• Organization changes
• Hot topics
Search Admin
Admin Examples
115. 114
Experience at IBM Internal Search
• IBM deployed a commercially available search engine
– Implementing standard IR techniques
• Search quality went down over time to the point that
Search results were unacceptable!
Success (≥ 1 relevant result): 14% on top-1, 23% on
top-5, 34% on top-50! [Zhu et al., WWW’07]
So, they implemented various solutions…
To the administrators managing the engine, exposed
control knobs were insufficient
Case Study Background
116. 115
Attempts to Improve Search
• Enhanced link analysis by
incorporating links to/from the
external WWW
• Creative hacks: added fake terms
to documents & queries
– # terms per document determined by
“popularity”: how much TF increase is
required for the needed rank boost?
• Hard-coded custom results for the
top 1200+ queries
Didn’t help…
Quality went down!
Maintenance nightmare:
Heuristic needs to be updated
upon each nontrivial change in
term stats./ranking parameters
Even bigger nightmare!
How to deal with continuously
changing terminology?
Case Study Background
117. 116
Goals of Gumshoe
Network Station Manager search
Thin Client Manager
Product names change:
Continually changing terminology
Domain-specific meaning
Paula Summa search
bring Paula Summa from
employee directories
per diem search
Domain-specific repetitions
popcorn search
conference call!
• Result 1: IBM Travel: Per Diem
• Result 2: IBM Travel: Per Diem Rates
• Result 3: IBM Travel: National perdiems
• Result 25: IBM Travel: Per Diem Policy
…
Gumshoe:
• Generic search solution, customizable & maintainable in many domains
– Simple customization with reasonable effort
– Ongoing search-quality management
• Philosophy: programmable search
Case Study Background
118. 117
Programmable Search: Main Idea
• Goals:
– Transparency
• Know “precisely” why every result item is being brought back
• Understand how changes in content/intents affect search
– Maintainability and “Debugability”
• Ranking logic is guided by explicit rules
• Properly react to changes in content/intents
• Building blocks:
– Deep analytics on documents
– Domain-specific analysis of queries
– Transparent customizable rule-driven ranking
runtime rules
backend
analytics
interpretations
Case Study Background
119. 118
Distributed Analytics Platform (IBM InfoSphere BigInsights)
Crawling, information extraction, token generation (TG), indexing
Search runtime
Index
Index and rule
update services
backend
analytics
runtime rules
interpretations
backend
frontend
Implementation Architecture
Case Study Background
120. 119
Backend Analytics: 3 Parts
Local Analysis
(per-page analysis)
Global Analysis
(cross-page analysis)
Token Generation
(TG)
index
Case Study Background
121. 120
Local Analysis
• Categorizing pages
– Label pages by custom categories
• IBM examples: HR, person, IT help, ISSI, sales information,
marketing, corporate standards, legal & IP-law, …
– Geo classification
• Associate documents with the relevant countries & regions
• Annotating pages
– Identify HomePage annotation for people, projects,
communities, …
Simply knowing where a page is physically hosted is not enough
(example: Czech Republic hosts all pages for IBM in Europe)
Case Study Backend Local Analysis
122. 121
• Declarative approach
– Define an operator for each basic operation
• Input tuple of annotations
• Output tuples of annotations
– Compose operators to build complex extractors
• Algebraic expression
• One document at a time → trivial parallelism.
• Benefits of declarative approach:
– Expressivity: Richer, cleaner rule semantics
– Performance: Better performance through optimization
Declarative IE System
Case Study Backend Local Analysis
123. 122
InfoSphere
Streams
Cost-based
optimization
...
SystemT – Overview
InfoSphere
BigInsights
SystemT Runtime
Input
Documents
Extracted
Objects
SystemT
IBM Engines
UIMA
SystemT
Highly embeddable runtime
AQL Extractors
Embedded machine
learning model
AQL Rules
create view SentimentForCompany as
select T.entity, T.polarity
from classifyPolarity(SentimentFeatures) T;
create view Company as
select ...
from ...
where ...
create view SentimentFeatures as
select ...
from ...;
Case Study Backend Local Analysis
124. 123
G J Chaitin Home Page
Homepage Identification
Title Extraction
Matching title patterns
Titles
Dictionary
Match
Home Page for
G J Chaitin
• http://w3.ibm.com/hr/idp/
• http://w3-03.ibm.com/isc/index.html
• http://chis.at.ibm.com/
URL Extraction
URLs
Matching URL patterns
Homepage for: idp isc chis
Employee
directory
… many more …
Intranet
page
[Zhu et al., WWW’07]
Case Study Backend Local Analysis
125. 124
Among the 38 pages with the exact same title,
which is the best for “Paula Summa”?
Role of Global Analysis
Case Study Backend Global Analysis
126. 125
Person
Title
Token Generation (TG)
Annotated values Index content
Ching-Tien T. (Howard) Ho
Ho Ching-Tien Tien Ho Ho, Tien
Howard Ho Ching-Tien H. ...
Global Technology Services
TG
Howard Ho Ching Tien ...
gts Global Technology Services
Global Technology Technology
Services Global Technology ...
GlobalTechnologyServices
nGramTG
spaceTG
……
…
…
…
Case Study Backend Token Generation
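The token generators above (spaceTG, nGramTG, acronymTG) can be sketched in Python. These sketches reproduce the tokens shown on the slide for "Global Technology Services"; the exact generation rules in Gumshoe may differ.

```python
def space_tg(value):
    """spaceTG: the individual words plus the concatenated form."""
    words = value.split()
    return set(words) | {"".join(words)}

def ngram_tg(value, n=2):
    """nGramTG: contiguous word n-grams of the annotated value."""
    words = value.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def acronym_tg(value):
    """acronymTG: first letters of the words, lowercased (e.g., 'gts')."""
    return {"".join(w[0] for w in value.split()).lower()}

title = "Global Technology Services"
print(acronym_tg(title))                              # {'gts'}
print("GlobalTechnologyServices" in space_tg(title))  # True
print(sorted(ngram_tg(title)))
```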
127. 126
3 Phases of Runtime Flow
Search Query
Phase 1:
Query
Semantics
• Rewrite rules
• Query interpretation
Phase 2:
Relevance
Ranking
By relevance buckets +
conventional IR
Phase 3:
Result
Construction
• Grouping rules
• Re-ranking rules
Case Study Frontend
128. 127
Phase 3: Result Construction
Phase 2: Relevance Ranking
Phase 1: Query Semantics
query search rewrite rules
queries
interpretations
partially ordered interpretations
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results final results
re-ranking rules
Runtime Flow in More Details
Case Study Frontend
129. 128
Runtime Rules: Pattern-Action Language
(Fagin 2012)
Query Pattern Queries Matching Possible Action
EQUALS
[r=ibm|information|info]
[d=COUNTRY]
• ibm germany
• info india
Rewrite into “[country] hr”
(e.g., germany hr)
ENDS_WITH installation
• acrobat installation
• db2 on aix installation
Replace installation with ISSI
(e.g., acrobat ISSI)
CONTAINS directions to
[d=SITE]
• driving directions to almaden
• directions to watson from jfk
Pages of “siteserv” category
should be ranked higher
STARTS_WITH
[d=PERSON]
• john kelly biography
• steve mills announcement
Group together pages that
represent blog entries
Pattern expression,
matched against the
keyword query
Perform when match
Query pattern → Action
• Similar to the query-template rules of Agarwal et al. [WWW 2010]
Case Study Frontend Query Semantics
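A minimal pattern-action runtime of the kind the table illustrates can be sketched as follows. This is an assumption-laden simplification: it matches only plain string patterns, whereas the real rule language also binds dictionary variables such as [d=COUNTRY] and [d=PERSON]; the two sample rules echo rows of the table.

```python
def matches(pattern, kind, query):
    """Evaluate one query pattern against a keyword query."""
    q = query.lower()
    if kind == "EQUALS":
        return q == pattern
    if kind == "STARTS_WITH":
        return q.startswith(pattern)
    if kind == "ENDS_WITH":
        return q.endswith(pattern)
    if kind == "CONTAINS":
        return pattern in q
    raise ValueError("unknown pattern kind: %s" % kind)

def apply_rules(rules, query):
    """Fire the action of every rule whose pattern matches the query."""
    return [action(query) for pattern, kind, action in rules
            if matches(pattern, kind, query)]

rules = [
    ("installation", "ENDS_WITH",
     lambda q: q.replace("installation", "ISSI")),   # rewrite action
    ("directions to", "CONTAINS",
     lambda q: "boost category: siteserv"),          # ranking action
]
print(apply_rules(rules, "acrobat installation"))  # ['acrobat ISSI']
```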
131. 130
The most important IBM page for benefits
changes over time: currently it is netbenefits
What’s Best for Benefits?
Case Study Frontend Query Semantics
136. 135
Complex Rules
java jim and not in person category
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results final results
re-ranking rules
interpretations
partially ordered interpretations
rewrite rules
queries
java search
Case Study Frontend Query Semantics
137. 136
Interpretations
Scenario: An IBM employee wants
to download Lotus Symphony 1.3
Runtime interpretation:
download symphony 1.3 → category=issi software=symphony 1.3
Case Study Frontend Query Semantics
139. 138
3 Phases of Runtime Flow
Search Query
Phase 1:
Query
Semantics
• Rewrite rules
• Query interpretation
Phase 2:
Relevance
Ranking
By relevance buckets +
conventional IR
Phase 3:
Result
Construction
• Grouping rules
• Re-ranking rules
Case Study Frontend Relevance Ranking
140. 139
Person
Title
Recall: Token Generation (TG)
Annotated values Index content
Ching-Tien T. (Howard) Ho
Global Technology Services
TG
Howard Ho Ching Tien ...
gts Global Technology Services
Global Technology Technology
Services Global Technology ...
GlobalTechnologyServices
nGramTG
spaceTG
……
…
…
…
Ho Ching-Tien Tien Ho Ho, Tien
Howard Ho Ching-Tien H. ...
Person + personNameTG
Person + nGramTG
Title + acronymTG
Title + spaceTG
Title + nGramTG
Case Study Frontend Relevance Ranking
141. 140
Annotation + TG → Relevance Bucket
Howard Ho Ching Tien ...
GlobalTechnologyServices
……
Person + personNameTG
Person + nGramTG
Title + acronymTG
Title + spaceTG
Title + nGramTG
query search Relevance buckets
• Buckets are ranked
– Based on annotation type
– Based on TG quality
• A page can belong to
multiple buckets
• Within each bucket,
ranking is by
conventional IR
……
Case Study Frontend Relevance Ranking
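The bucket-then-IR ordering described above can be sketched as a two-level sort key: first the rank of a result's best bucket (buckets ordered by annotation type and TG quality), then the conventional IR score within the bucket. The bucket names, scores, and URLs below are made up for illustration.

```python
def bucket_rank(results, bucket_order):
    """Sort results first by the rank of their best (highest-priority)
    bucket, then by conventional IR score within the bucket."""
    def key(r):
        best = min(bucket_order.index(b) for b in r["buckets"])
        return (best, -r["ir_score"])   # lower bucket index wins; then IR
    return sorted(results, key=key)

bucket_order = ["Person + personNameTG", "Title + acronymTG", "Title + nGramTG"]
results = [
    {"url": "p1", "buckets": ["Title + nGramTG"],       "ir_score": 0.9},
    {"url": "p2", "buckets": ["Person + personNameTG"], "ir_score": 0.4},
    {"url": "p3", "buckets": ["Title + acronymTG",
                              "Title + nGramTG"],       "ir_score": 0.7},
]
print([r["url"] for r in bucket_rank(results, bucket_order)])
```

Note how p2 wins despite the lowest IR score: bucket priority dominates, and a page in multiple buckets (p3) is ranked by its best one.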
142. 141
Ranking by Relevance Buckets
grouping rules
ordered & grouped results final results
re-ranking rules
interpretations
partially ordered interpretations
rewrite rules
queries
interpretations execution
partially ordered results
result aggregation
ordered results
employment verification search
Case Study Frontend Relevance Ranking
143. 142
3 Phases of Runtime Flow
Search Query
Phase 1:
Query
Semantics
• Rewrite rules
• Query interpretation
Phase 2:
Relevance
Ranking
By relevance buckets +
conventional IR
Phase 3:
Result
Construction
• Grouping rules
• Re-ranking rules
Case Study Frontend Result Construction
144. 143
Grouping Rules
• Grouping rules define how search results should be
grouped together
• Search administrators can improve the diversity of
search results (on the 1st page)
– Based on their familiarity with the data sources
Group pages of the same category
per diem travel, you-and-ibm
ANY ISSI, IT Help Central, Forum,
Bluepedia, Media Library, …
Query pattern
Case Study Frontend Result Construction
145. 144
Need first page diversity
Flooding with Similar Pages
Case Study Frontend Result Construction
146. 145
per diem travel, you-and-ibm
Grouping Rule to the Rescue
Case Study Frontend Result Construction
147. 146
per diem travel, you-and-ibm
final results
re-ranking rules
interpretations
partially ordered interpretations
rewrite rules
queries
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results
per diem search
Grouping Rule to the Rescue
Case Study Frontend Result Construction
148. 147
Re-ranking Rules
• Re-ranking rules adjust ranking of
search results based on categories
• Example: search administrator specifies the
important sources of “hot/current topics”
Hot topics Rank these categories higher
Bluepedia, News, About-IBM
smarter planet, cloud
computing, centennial, …
Case Study Frontend Result Construction
149. 148
Bluepedia
Technical News
Homepages of
“About IBM”
Hot topics Rank these categories higher
Bluepedia, News, About-IBM
smarter planet, cloud
computing, centennial, …
Re-ranking Rule for Hot Topics
Case Study Frontend Result Construction
150. 149
Re-ranking Rules for Person Queries
[d=PERSON]
executive_corner, media_library,
organization_chart, files
Case Study Frontend Result Construction
152. 151
3 Phases of Runtime Flow
Search Query
Phase 1:
Query
Semantics
• Rewrite rules
• Query interpretation
Phase 2:
Relevance
Ranking
By relevance buckets +
conventional IR
Phase 3:
Result
Construction
• Grouping rules
• Re-ranking rules
Case Study Frontend
153. 152
What Administrators Need…
• Search administrators have major problems
with an opaque search engine
• Programmable search provides
– Customization to the specific domain
– Ongoing search-quality management
Allows the building of a search quality toolkit.
Recap:
Case Study Admin
157. 156
The Proof of the Pudding Is in the Eating
• Immediate positive impact within the first 3 months
– Improved natural clickthrough rate by 100%+
– Top 5 results: selected about 90% of the time
• Sustained search quality improvements over the 4 years since
going live
• Stable natural search clickthrough rate
Gumshoe (Aug. 2011– Oct. 2011)
Old Intranet Search (Aug. 2010– Aug. 2011)
Natural
clickthrough
rate
Case Study Results
158. 157
Summary
Programmable search:
Simple & flexible customization
Search quality management
Backend Analytics
Local analysis
(per-page analysis)
Global Analysis
(cross-page analysis)
Token Generation
(TG)
[Fagin et al.,
PODS’10,
PODS’11]
Tooling
• Search provenance
• Rule suggestion
• Utilization of relevance buckets
[Li et al.,
SIGIR’06,
Zhu et al.,
WWW’07]
Phase 1:
Query Semantics
• Rewrite rules
• Query interpretation
Phase 2:
Relevance
Ranking
By relevance buckets +
conventional IR
Phase 3:
Result
Construction
• Grouping rules
• Re-ranking rules
[ Bao et al.,
ACL’2010,
SIGIR’2012
CIKM’2012]
Case Study Summary
160. 159
Search Engine Components
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control/improvement
Frontend
Interpret user query
Search index
Present results
Interact with user
index
Data
source
161. 160
Future Directions
Data Heterogeneity
A rich variety of data types need to be searched in
enterprises.
• docs, databases, images, videos, social graphs, etc.
observations
How to automatically identify relevant data types, and
search and rank across different data types?
• e.g., for image search, should image recognition techniques
be incorporated in enterprise search engines? If so, how?
questions
162. 161
Future Directions
Data Freshness
New data is continuously collected and published in
enterprises, often at a very fast rate.
Web search engines are not required to index new websites
quickly, but in enterprises, new content may need to be
searchable as soon as possible.
observations
How to build efficient real-time indexes to ensure data
freshness in enterprise search?
questions
163. 162
Future Directions
Search Context
Enterprise search users have richer profiles than web users.
• activities, bio, position, projects, experiences, etc.
observations
How to utilize users’ contexts to provide customized results?
Is it possible to predict the information a user may want, and
push it to the user?
questions
164. 163
Future Directions
User Preference
Different users in an enterprise have different expertise, and
may prefer different ways to express queries.
• e.g., some users prefer pure keyword search, while
others may want lightly-structured queries.
observations
How to effectively support different users’ preferred ways of
expressing queries?
questions
165. 164
Future Directions
Question Answering
The purpose of many enterprise searches is to find
answers to questions.
• e.g., what is the previous name of a product, and when
did we change to the current name?
observations
Is it possible to effectively use natural language processing
techniques and domain knowledge to automatically answer
natural language questions?
questions
166. 165
Future Directions
Transactional Search
Over 1/3 of enterprise search queries are transactional. It would
be desirable if enterprise search engines could recommend
business processes to accomplish a certain task given a
transactional search.
• E.g., given a customer’s lengthy complaint letter, how to find
out the departments relevant to the complaints.
observations
How to better support transactional search? How to initiate
a business process based on the results of a search?
questions
167. 166
Future Directions
Big Data Analytics
Rich information and knowledge lie in big data. Many
employees (not just data analysts) may benefit from the
ability to perform analytics on the company’s big data.
observations
How to build a low-cost, interactive platform that allows a
large number of employees to issue analytical queries?
How to give employees the capabilities to analyze big data,
if they have little knowledge of SQL or MapReduce
programming?
questions
168. 167
Future Directions
Tooling for Search Quality Maintenance
Most enterprise search engines have to be manually
evaluated and tuned by a search administrator with domain
knowledge, in an ad-hoc fashion.
observations
Can we automate this process, or at least minimize manual
involvement?
Can we fully utilize explicit user feedback?
• Explicit user feedback is easier to obtain in enterprise
search, and there is less spam.
questions
169. Thanks.
Acknowledgement:
IBM Research: Sriram Raghavan, Fred Reiss, Shiv Vaithyanathan, Ron Fagin
IBM CIO’s Office: Nicole Dri, Brian C. Meyer
LogicBlox: Benny Kimelfeld*
TripAdvisor: Adriano Crestani Campos*
Facebook: Zhuowei Bao*
NJIT: Yi Chen
UNSW: Wei Wang
* work done while at IBM