Architecture of Search Systems and Measuring the Search Effectiveness

SEARCH SYSTEM ARCHITECTURE & MEASURING SEARCH EFFECTIVENESS
Introduction to text mining – Warsaw University of Technology

Plan

Findwise – who we are, what we do.
General architecture of search engines
Data sources
Content processing
Search index
Query and result processing
Security in search engines
Applications based on search
Leading search technologies
The concept of Findability
Differences in online and enterprise search
Measuring search effectiveness
Questions and answers

Findwise – Search Driven Solutions

• Founded in 2005

• Offices in Sweden, Denmark, Norway and Poland

• 75+ employees

Our objective is to be a leading provider of Findability solutions utilising
the full potential of search technology to create customer business value.

• Paweł Wróblewski – search enthusiast

General architecture of search engines

Important terms

Latency
Feeding Indexing Searching

Data sources

Anything that holds information is a good source!
We need a connector to feed the data into a search system:
Take the content
Take the metadata
Take the security information
Different strategies to feed the data:
Push – an external application invokes the search system connector’s API
to feed the content (e.g. transactional systems)
Pull – a connector periodically scans the source and takes the data
(e.g. web crawler, file system)
Hybrid – an external system dumps the data, which is then pulled by a
connector
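
The push and pull strategies can be sketched in a few lines of Python. This is only an illustration: `FeedItem`, `SearchFeed` and `PullConnector` are hypothetical names, not the API of any search product, and change detection by version number is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class FeedItem:
    """What a connector hands over for one document."""
    doc_id: str
    content: bytes
    metadata: dict = field(default_factory=dict)
    allowed_groups: set = field(default_factory=set)  # security information


class SearchFeed:
    """Stand-in for the search system's feeding endpoint (hypothetical)."""

    def __init__(self):
        self.received = []

    def push(self, item: FeedItem) -> None:
        # Push strategy: the source application calls this directly.
        self.received.append(item)


class PullConnector:
    """Pull strategy: scan a source and feed only what is new or changed."""

    def __init__(self, feed: SearchFeed):
        self.feed = feed
        self.seen_versions = {}

    def scan(self, source: dict) -> int:
        """`source` maps doc_id -> (version, FeedItem); returns items fed."""
        fed = 0
        for doc_id, (version, item) in source.items():
            if self.seen_versions.get(doc_id) != version:
                self.feed.push(item)
                self.seen_versions[doc_id] = version
                fed += 1
        return fed


feed = SearchFeed()
# Push: a transactional system feeds a document as soon as it exists.
feed.push(FeedItem("order-17", b"new order", {"type": "order"}, {"sales"}))

# Pull: a connector scans a file-system-like source periodically.
connector = PullConnector(feed)
files = {"readme.txt": (1, FeedItem("readme.txt", b"hello", {"size": 5}, {"staff"}))}
first_scan = connector.scan(files)   # the new file is fed
second_scan = connector.scan(files)  # nothing changed, nothing is fed
```

A hybrid setup would combine the two: the external system writes its dump to a location that a `PullConnector`-like component scans.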

Content Processing – the idea

Processing stages (from the diagram):
Format Conversion
Language Detection
Spell Checking
Lemmas (tenses, forms)
Synonyms
Document Vectorizer
Taxonomy Classification
Custom PLUG-IN
Entities: Geography, Companies, People
Scopifier
index

PARIS (Reuters) - Venus Williams raced into the second round of
the $11.25 million French Open Monday, brushing aside
Bianka Lamade, 6-3, 6-3, in 65 minutes.

The Wimbledon and U.S. Open champion, seeded second, breezed
past the German on a blustery center court to become the
first seed to advance at Roland Garros. "I love being here, I
love the French Open and more than anything I'd love to do
well here," the American said.
Input: byte stream
Output: structured document ready to be indexed
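
A minimal sketch of such a pipeline, turning a byte stream into a structured document. The three stages are toy stand-ins: the stop-word language check and the small gazetteers are assumptions made for illustration, not how a production stage works.

```python
import re


def convert_format(doc: dict) -> dict:
    # Format conversion: decode the raw byte stream into text.
    doc["text"] = doc.pop("raw").decode("utf-8", errors="replace")
    return doc


def detect_language(doc: dict) -> dict:
    # Toy language detection: look for a few English stop words.
    words = set(re.findall(r"[a-z']+", doc["text"].lower()))
    is_english = len(words & {"the", "of", "and", "in"}) >= 2
    doc["language"] = "en" if is_english else "unknown"
    return doc


def extract_entities(doc: dict) -> dict:
    # Toy entity extraction against small, hypothetical gazetteers.
    gazetteers = {
        "geography": ["PARIS", "Roland Garros"],
        "people": ["Venus Williams", "Bianka Lamade"],
    }
    doc["entities"] = {
        kind: [name for name in names if name in doc["text"]]
        for kind, names in gazetteers.items()
    }
    return doc


PIPELINE = [convert_format, detect_language, extract_entities]


def process(raw: bytes) -> dict:
    """Run the raw bytes through every stage, in order."""
    doc = {"raw": raw}
    for stage in PIPELINE:
        doc = stage(doc)
    return doc


raw = (b"PARIS (Reuters) - Venus Williams raced into the second round of "
       b"the $11.25 million French Open Monday, brushing aside Bianka Lamade.")
document = process(raw)
```

Each stage takes a document and returns it enriched, so stages can be added, removed or reordered without touching the others.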

Content Processing – the implementation

Hydra is used to refine content before it hits the index. Every
document fetched from a source runs through a targeted pipeline
that consists of a number of stages. A stage can be thought of as an
“app” in the App Store or the Android Market. Findwise has created
a large number of such stages, each with a small, focused purpose:
enhancing the content of the item. It is possible to create
additional stages to serve customer-specific functionality.

Hydra - example

Select the stages to use in the pipeline: the left column corresponds to the
“market”, and the right column lists the stages in use.

Hydra - example

Modify the date format to include only the year.

Hydra - example

The new year metadata can be used as a facet.
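
As a rough illustration of this stage and the resulting facet (plain Python, not Hydra's actual stage API; the `YYYY-MM-DD` input format and the field names `date` and `year` are assumptions):

```python
from collections import Counter


def year_stage(doc: dict) -> dict:
    """Add a 'year' field taken from a date assumed to look like YYYY-MM-DD."""
    date = doc.get("date", "")
    if len(date) >= 4 and date[:4].isdigit():
        doc["year"] = date[:4]
    return doc


docs = [
    {"id": 1, "date": "2011-05-30"},
    {"id": 2, "date": "2011-11-02"},
    {"id": 3, "date": "2012-01-15"},
    {"id": 4},  # no date: the stage leaves the document untouched
]
processed = [year_stage(d) for d in docs]

# Facet: number of documents per year, as shown next to a result list.
year_facet = Counter(d["year"] for d in processed if "year" in d)
```

Faceting on the full date would give almost one bucket per document; reducing it to the year gives a few buckets that are useful for filtering.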

Hydra - example

Map every author field to a metadata field called author.
Pipeline A

Pipeline B
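
A sketch of such a mapping stage, assuming two sources that name the author field differently. The alias names below are hypothetical, and the function is plain Python rather than Hydra's stage interface.

```python
# Field names under which different sources might store the author (assumed).
AUTHOR_ALIASES = ("creator", "dc:creator", "written_by", "Author")


def map_author(doc: dict) -> dict:
    """Copy the first non-empty author-like field into 'author'."""
    if not doc.get("author"):
        for name in AUTHOR_ALIASES:
            if doc.get(name):
                doc["author"] = doc[name]
                break
    return doc


doc_from_pipeline_a = map_author({"title": "Report", "creator": "Anna"})
doc_from_pipeline_b = map_author({"title": "Memo", "written_by": "Jan"})
```

After the stage has run, both pipelines deliver documents with the same `author` field, so one facet or one query field covers all sources.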

Hydra - example

In the search result…

Search index – the problem

Input: structured document (content + metadata)
Output: binary representation of an inverted index optimised for speed
and accuracy
The search index has a flat structure – no internal relations
Changes to the index structure usually require an index rebuild
(re-indexing)
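
To make the input and output concrete, here is a minimal in-memory inverted index. A real engine stores it in a compact binary form with ranking data; the field names `title` and `content` are assumptions for this sketch.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    return re.findall(r"\w+", text.lower())


def build_index(docs: dict) -> dict:
    """Map each term to the sorted ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, doc in docs.items():
        for field_name in ("title", "content"):
            for term in tokenize(doc.get(field_name, "")):
                postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index: dict, query: str) -> list:
    """AND query: documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


docs = {
    1: {"title": "French Open", "content": "Venus Williams wins in Paris"},
    2: {"title": "Wimbledon", "content": "Williams advances"},
}
index = build_index(docs)
one_term = search(index, "williams")
two_terms = search(index, "williams paris")
```

The structure is flat: terms point at document ids and nothing else. Changing which fields are indexed means running `build_index` again over all documents, which is the re-indexing mentioned above.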

Search index – the problem

Inverted index – theory in previous lectures
How to achieve:
Petabytes of indexed data
Thousands of queries per second
Thousands of index updates per second
Index split
Index mirror
[Figure: FAST Enterprise Search Platform – search cluster example. A grid of Indexing / Search nodes, Node 00 … Node NM, with "Index split" along one axis and "Index mirror" along the other.]
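
The idea of combining index split and index mirror can be sketched as a toy cluster. The `Cluster` class, hash-based routing and round-robin mirror selection are assumptions for illustration, not a description of how FAST ESP distributes work.

```python
import zlib
from itertools import count


class Cluster:
    def __init__(self, n_splits: int, n_mirrors: int):
        # nodes[split][mirror] is one node's local index: doc_id -> text
        self.nodes = [[{} for _ in range(n_mirrors)] for _ in range(n_splits)]
        self._round = count()

    def _split_for(self, doc_id: str) -> int:
        # Stable hash, so a document always lands in the same split.
        return zlib.crc32(doc_id.encode("utf-8")) % len(self.nodes)

    def index(self, doc_id: str, text: str) -> None:
        # Every mirror of the chosen split receives the update.
        for node in self.nodes[self._split_for(doc_id)]:
            node[doc_id] = text

    def search(self, term: str) -> list:
        # Pick one mirror (round robin), query it in every split, merge hits.
        mirror = next(self._round) % len(self.nodes[0])
        hits = []
        for split in self.nodes:
            # Toy matching: case-insensitive substring test.
            hits += [d for d, text in split[mirror].items() if term in text.lower()]
        return sorted(hits)


cluster = Cluster(n_splits=3, n_mirrors=2)
for doc_id, text in {"d1": "LCD monitor", "d2": "Plasma TV", "d3": "TFT monitor"}.items():
    cluster.index(doc_id, text)
hits = cluster.search("monitor")
```

Splitting lets the data volume grow beyond one node, since each node holds only part of the index. Mirroring lets query throughput grow, since consecutive queries can be answered by different copies.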

Search index – the implementation

To perform updates (index rebuilds) efficiently, several index
partitions are produced

[Figure: index partitions]

A small partition rebuilds quickly, unlike a big one
Rebuilding a larger partition involves merging in the index from the
smaller one(s)
Rebuilds can be triggered by: number of rebuild operations, number
of documents, percentage of total volume
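
A small simulation of this scheme, using only the "number of documents" trigger. The class name and the size limits are made up for the example.

```python
class PartitionedIndex:
    def __init__(self, limits=(2, 4)):
        # limits[i] = max documents in partition i; the last one is unbounded
        self.limits = limits
        self.partitions = [dict() for _ in range(len(limits) + 1)]
        self.rebuilds = [0] * (len(limits) + 1)

    def add(self, doc_id: str, terms: list) -> None:
        # New documents always go to the smallest partition.
        self.partitions[0][doc_id] = terms
        self.rebuilds[0] += 1  # cheap: the partition is small
        for level, limit in enumerate(self.limits):
            if len(self.partitions[level]) > limit:
                # Rebuild the larger partition by merging in the smaller one.
                self.partitions[level + 1].update(self.partitions[level])
                self.partitions[level].clear()
                self.rebuilds[level + 1] += 1

    def size(self) -> int:
        return sum(len(p) for p in self.partitions)


idx = PartitionedIndex(limits=(2, 4))
for i in range(7):
    idx.add(f"d{i}", ["term"])
partition_sizes = [len(p) for p in idx.partitions]
```

After seven additions `idx.rebuilds` shows that the smallest partition was rebuilt on every update while the largest one was rebuilt only once, which is the point of partitioning.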

Query processing

Query: “Do you have an LCD monitor under $900?”

Query processing stages (from the diagram):
Tokenizer, Spell-checking, Phrasing, Anti-phrasing, Normalization, Lemmas, Synonyms, NLQ, Thesaurus, PLUG-IN, Geography, Adaptive Evaluation

Examples shown in the diagram:
Anti-phrasing: “Do you have a”
NLQ: “Under $900?” becomes price < 900
Thesaurus: LCD monitors, TFT monitors; Flat TV, Plasma TV
YES! X = LCD monitor, BUY( X )

Modified query:
Use “Product” collection
Rank profile = “Profit margin”
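
Three of these steps (anti-phrasing, the natural-language price filter and thesaurus expansion) can be sketched as follows. The phrase list, the thesaurus entries, the `price_lt` filter name and the regular expression are all assumptions for illustration.

```python
import re

ANTI_PHRASES = ["do you have an", "do you have a"]             # hypothetical list
THESAURUS = {"lcd monitor": ["lcd monitors", "tft monitors"]}  # hypothetical entries


def process_query(raw: str) -> dict:
    # Normalization: lower-case and drop the trailing question mark.
    text = raw.lower().strip().rstrip("?").strip()
    filters = {}

    # NLQ: turn "under $900" into a structured price filter.
    match = re.search(r"under \$(\d+)", text)
    if match:
        filters["price_lt"] = int(match.group(1))
        text = (text[:match.start()] + text[match.end():]).strip()

    # Anti-phrasing: drop a leading phrase that carries no search intent.
    for phrase in ANTI_PHRASES:
        if text.startswith(phrase + " "):
            text = text[len(phrase):].strip()
            break

    # Thesaurus: expand what is left with related terms.
    terms = [text] + THESAURUS.get(text, [])
    return {"terms": terms, "filters": filters}


modified = process_query("Do you have an LCD monitor under $900?")
```

The modified query separates what to match (`terms`) from how to restrict the result set (`filters`). Choosing a collection or a rank profile, as on the slide, would add further keys to the same structure.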

Result processing

The following issues might apply to results processing:
Ranking generation
Factors that can be considered: number of hits, proximity of hits,
freshness (date), web measures (e.g. page rank), business and context
factors (boosting or blocking)
Search federation
Integration of results from multiple search engines: round robin,
normalized ranks, searchlets (multiple result lists presented in
different ways).
Security trimming
Filtering out the results that do not match the user’s credentials
Last-second check
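
Round-robin federation and security trimming can be sketched as two small functions. The hit structure (`id`, `acl`) and the group names are assumptions for this example.

```python
from itertools import zip_longest


def round_robin(*result_lists):
    """Interleave result lists from several engines, dropping duplicates."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):
        for hit in tier:
            if hit is not None and hit["id"] not in seen:
                seen.add(hit["id"])
                merged.append(hit)
    return merged


def security_trim(hits, user_groups):
    """Keep only hits whose ACL shares at least one group with the user."""
    return [h for h in hits if h["acl"] & user_groups]


engine_a = [{"id": "a1", "acl": {"staff"}}, {"id": "a2", "acl": {"hr"}}]
engine_b = [
    {"id": "b1", "acl": {"staff", "hr"}},
    {"id": "a1", "acl": {"staff"}},
    {"id": "b2", "acl": {"board"}},
]

merged = round_robin(engine_a, engine_b)
visible = security_trim(merged, {"staff"})
visible_ids = [h["id"] for h in visible]
```

Here trimming uses the ACL stored with each hit at indexing time. A last-second check would additionally ask the source system, at query time, whether the user may still see each remaining hit.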

Security in search solution

Search Application Security
Content-level Security

Secure Server Environment

Search Based Applications

Search Driven Solutions = Customisation of search system
components

Catalogue of Search Based Applications

Corporate Search
• Intranets/portals
• Information gateways
• Expertise location
• ECM repositories
• Collaboration
• Knowledge Management
• Enterprise apps

Intelligence System
• Market intelligence
• Customer intelligence
• Surveillance
• IP protection
• Fraud detection
• eDiscovery
• Quality Management
• Information risk management

Database Offloading
• Data warehouse
• Data transformation
• Data caches

Commerce Systems
• Search merchandising
• Customer analytics
• Campaign management
• Call centre enablement
• Customer self-service

Media Systems
• Public news syndication
• Multimedia search
• Proprietary research and publications
• Libraries

Search subsystem

Data connectors – out of the box, custom made

Repositories – Web, Databases, Files, Enterprise systems

Leading search engine technologies

• HP / Autonomy IDOL
• Microsoft (SharePoint and FAST Search products)
• Google Search Appliance (GSA)
• IBM Content Analytics/OmniFind
• Oracle Secure Enterprise Search/Endeca
• Apache Lucene/Solr (Open source)
• Exalead CloudView

• and more…

Comparison of different technology vendors
• What is the goal of Enterprise Findability (EF)?
• How should EF improve business?
• What user groups are targeted?
• What do the users want and need?
• What information is available and where is it stored?
• How should EF be rolled out and governed?
• What costs are involved?
• Are there any IT strategy considerations?
• Vendor mapping provides an answer to which EF platform best matches
the overall requirements in the short and long term

Comparison dimensions (from the diagram): core search technology, usability,
vendor capabilities, total cost of ownership, connectivity and security

Findability – what is it?

Findability – a holistic approach to leverage business value with search
technology

Dimensions:
Business (needs & goals)
Users (needs & capabilities)
Search Technology
Information (quality & structure)
Organisation (ownership & governance)

[Figure: business value gained from search technology (Negligible to High) plotted against use of search technology/platform (Basic to Advanced)]

Online vs. Enterprise Search

According to Stephen E. Arnold, “The New Landscape of Enterprise
Search”, Pandia, July 2011

Measuring the search effectiveness

Enterprise case
Relevance of search results is highly subjective
Search must be tightly bound to the business, otherwise it is not worth
considering
Increase income or reduce costs
Take into consideration all the dimensions of Findability:
Business: Needs & Goals
Users: Needs & Capabilities
Information: Quality & Structure
Organisation: Ownership & Governance
Search Technology: correctness of implementation
Tools: reviews, workshops, presentations, strategy drafting, audits,
etc.


Online case
Relevance of search results is highly subjective
Search must be tightly bound to the business, otherwise it is not worth
considering
Increase conversion rate
Verification of search functions and their impact on conversion rate
Run isolated tests for each identified feature
Create a score based on a weighted average
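
The weighted average can be computed as below. The test names and the weights (1 to 5) are taken from the audit example in this deck; the per-test scores are made up purely to show the arithmetic.

```python
# Weights on a 1-5 scale, as assigned per test in the audit example.
WEIGHTS = {
    "Keyword match": 5,
    "Wildcard expansion": 2,
    "Query operators": 1,
    "Lemmatization": 5,
    "Spellchecking": 4,
}

# Hypothetical per-test results on a 1-10 scale (not real audit data).
scores = {
    "Keyword match": 9,
    "Wildcard expansion": 4,
    "Query operators": 1,
    "Lemmatization": 8,
    "Spellchecking": 6,
}


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted average: sum(weight * score) / sum(weight)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("at least one test must have a non-zero weight")
    return sum(weights[name] * score for name, score in scores.items()) / total_weight


overall = weighted_score(scores, WEIGHTS)
ideal = weighted_score({name: 10 for name in WEIGHTS}, WEIGHTS)
```

Because the result is divided by the sum of the weights, it stays on the same scale as the individual scores: a system that scores 10 on every test gets an overall 10, and a weak result on a low-weight test such as query operators barely moves the total.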

Audit – the Final Report

The results reported for each single test are composed of the following two elements:
Overall benchmark
Cumulated results for test groups

Test categories designed for the purpose of the audit are generally applicable to any kind of search
service or solution. Nevertheless, some of them are less and some are more important in a specific
application such as an online Yellow Pages (YP) catalogue. That is why a weight is assigned to each test,
representing its importance and influence on the whole YP solution. The defined weights are described
in the following table.

Test | Name | Weight [1-5] | Remarks
I.a | Keyword match | 5 | This is a basic feature of any full-text search system and it most strongly influences the overall precision of search.
I.b | Wildcard expansion | 2 | Users of YP solutions rarely use such features.
I.c | Accuracy of result categories | 4 | The importance of properly assigned categories to registered entries is high, since it influences usability and relevance of categories.
I.d | Query operators | 1 | Users of YP solutions hardly ever use such features.
I.e | Exact phrases | 3 | It might be important to catch an exact phrase in a search, preventing any background processing.
II.a | Lemmatization | 5 | This is a must-have for any kind of search, especially for the Polish language.
II.b | Synonym expansion | 3 | It is useful for improving the recall of search, thus preventing zero results.
II.c | Spellchecking | 4 | Very useful feature, as people tend to make simple spelling mistakes while typing on a keyboard.
II.d | Anti-phrasing | 3 | It is useful not to search for irrelevant and meaningless terms.
II.e | Name and phrase recognition | 3 | It is useful to capture some multi-word expressions or names as a whole, with a single meaning.
II.f | Natural Language Processing | 2 | Very advanced yet hard to implement feature.
III.a | Navigation | 4 | Very useful feature enabling easy to use and intuitive …

Remaining rows (only partly legible):
… | … | … actively find and filter items in a map.
…g by | 3 | It is a useful feature that aids in finding items closest to a selected position.
… | … | Useful feature enabling mining the neighborhood of a selected item.
… | 4 | … as first impression and encouraging users to interact with the service.
…esult page | 3 | It is important not to miss any … to offer another kind of search, content or advertisements.
… | 5 | Extremely important factor in online search solutions.

The overall benchmark score is presented in the following chart.
[Chart: Overall weighted scores for iFind, PKT and PF]

… calculation the overall benchmark can be expressed as a cumulative weighted score … 1-10. The ideal hypothetical search system should achieve a score of 10.
… are as follows for the conducted tests: …

Paweł Wróblewski
pawel.wroblewski@findwise.com
