Architecture of Search Systems and Measuring the Search Effectiveness



Lecture given on 19 April 2012 at the Warsaw University of Technology. This is the 9th lecture in the regular master's-level course "Introduction to text mining".

Published in: Technology


  1. 1. 23 April 2012
  2. 2. SEARCH SYSTEM ARCHITECTURE & MEASURING SEARCH EFFECTIVENESS Introduction to text mining – Warsaw University of Technology
  3. 3. Plan: Findwise – who we are, what we do; General architecture of search engines; Data sources; Content processing; Search index; Query and result processing; Security in search engines; Applications based on search; Leading search technologies; The concept of Findability; Differences between online and enterprise search; Measuring search effectiveness; Questions and answers
  4. 4. Findwise – Search Driven Solutions • Founded in 2005 • Offices in Sweden, Denmark, Norway and Poland • 75+ employees Our objective is to be a leading provider of Findability solutions utilising the full potential of search technology to create customer business value. • Paweł Wróblewski – search enthusiast
  5. 5. General architecture of search engines
  6. 6. Important terms: Latency, Feeding, Indexing, Searching
  7. 7. Data sources. Everything that holds information is a good source! We need a connector to feed the data into a search system: take the content, take the metadata, take the security information. Different strategies to feed the data: Push – external applications invoke the search system connector’s API to feed the content (e.g. transactional systems); Pull – the connector periodically scans the source and takes the data (e.g. web crawler, file system); Hybrid – external systems dump the data, which is then pulled by a connector.
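The feeding strategies on this slide can be sketched in a few lines of Python. This is only an illustration: `Document`, `SearchSystem`, `PushConnector` and `PullConnector` are hypothetical names, not the API of any search product, and the "source" is a plain dictionary standing in for a file share or dump directory.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    content: str
    metadata: dict = field(default_factory=dict)
    allowed_groups: set = field(default_factory=set)  # security information


class SearchSystem:
    """Stand-in for the feeding API of a search system (hypothetical)."""

    def __init__(self):
        self.received = {}

    def feed(self, doc):
        self.received[doc.doc_id] = doc


class PushConnector:
    """Push: the external application calls the connector when data changes."""

    def __init__(self, search):
        self.search = search

    def on_change(self, doc):
        self.search.feed(doc)


class PullConnector:
    """Pull: the connector scans the source and feeds new or changed items."""

    def __init__(self, search, source):
        self.search = search
        self.source = source  # any mapping doc_id -> Document
        self.seen = {}        # doc_id -> content seen at the last scan

    def scan(self):
        fed = 0
        for doc_id, doc in self.source.items():
            if self.seen.get(doc_id) != doc.content:
                self.search.feed(doc)
                self.seen[doc_id] = doc.content
                fed += 1
        return fed


search = SearchSystem()

# Push: a transactional system reports a new order as soon as it exists.
PushConnector(search).on_change(
    Document("order-1", "new order", {"type": "order"}, {"sales"})
)

# Hybrid: an external system dumps data, a pull connector picks it up.
dump = {"file-1": Document("file-1", "quarterly report")}
puller = PullConnector(search, dump)
first_scan = puller.scan()   # feeds the dumped file
second_scan = puller.scan()  # nothing changed since the last scan
```

In a real deployment `scan` would run on a schedule, which is where the latency between a change in the source and its visibility in search comes from.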
  8. 8. Content Processing – the idea. Input: byte stream. Output: structured document ready to be indexed. Stages shown in the diagram: Format Conversion, Language Detection, Spell Checking, Lemmas (tenses, forms), Synonyms, Document Vectorizer, Geography Entities, Taxonomy Classification, Custom PLUG-IN, Companies, People, Scopifier → index. Sample document: PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes. The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said.
  9. 9. Content Processing – the implementation. Hydra is used to refine content before it hits the index. Every document fetched from a source runs through a targeted pipeline, which consists of a number of stages. A stage can be thought of as an “app” in the App Store or Android Market. Findwise has created a large number of such stages, each with a small purpose: to enhance the content of the item. It is possible to create additional stages to provide customer-specific functionality.
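The idea of a pipeline built from small, single-purpose stages can be sketched as follows. This is not Hydra's actual API: the stage functions, the dictionary-based document and the toy heuristics are all assumptions made for illustration.

```python
import re


def detect_language(doc):
    # Toy heuristic; a real stage would use a language-identification model.
    padded = f" {doc['content'].lower()} "
    doc["language"] = "en" if " the " in padded else "unknown"
    return doc


def extract_dateline(doc):
    # Picks up a leading "CITY (Agency) -" dateline, as in newswire text.
    match = re.match(r"([A-Z]+) \((\w+)\) -", doc["content"])
    if match:
        doc["location"] = match.group(1).title()
        doc["agency"] = match.group(2)
    return doc


def run_pipeline(doc, stages):
    # Each stage takes a document and returns the enriched document.
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline(
    {"content": "PARIS (Reuters) - Venus Williams raced into the second round of the French Open."},
    [detect_language, extract_dateline],
)
```

Because every stage has the same shape (document in, document out), adding a customer-specific stage means writing one more function and appending it to the list.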
  10. 10. Hydra - example Select the stages to use in the pipeline: the left column corresponds to the “market”, and the right column lists the stages in use.
  11. 11. Hydra - example Modify the format of the date so that it includes only the year.
  12. 12. Hydra - example The new year metadata can be used as a facet.
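A small sketch of what such a date stage and the resulting facet might look like. The field names (`date`, `year`) and the date format are assumptions, and the stage adds a separate `year` field rather than overwriting the date.

```python
from collections import Counter
from datetime import datetime


def year_stage(doc, date_field="date", fmt="%Y-%m-%d"):
    """Add a 'year' metadata field derived from the full date."""
    doc = dict(doc)  # leave the input document untouched
    doc["year"] = datetime.strptime(doc[date_field], fmt).year
    return doc


docs = [
    {"title": "A", "date": "2011-03-14"},
    {"title": "B", "date": "2012-04-19"},
    {"title": "C", "date": "2012-04-23"},
]
processed = [year_stage(d) for d in docs]

# A facet is a count of documents per distinct value of a field.
year_facet = Counter(d["year"] for d in processed)
```

A full date makes a poor facet because almost every document has its own value; reducing it to the year gives a handful of useful buckets.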
  13. 13. Hydra - example Map every author field to a metadata field called author. (Diagram labels: Pipeline A, Pipeline B.)
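The field mapping can be sketched as a single stage. The alias names below are invented examples of how different sources might label the author; they are not taken from the slides.

```python
# Assumed source-specific field names that all mean "author".
AUTHOR_ALIASES = ("creator", "dc.creator", "written_by", "Author")


def map_author(doc, aliases=AUTHOR_ALIASES, target="author"):
    """Move any known author alias into one common metadata field."""
    doc = dict(doc)
    for name in aliases:
        if name in doc:
            value = doc.pop(name)
            doc.setdefault(target, value)  # keep an existing author value
    return doc


from_cms = map_author({"creator": "A. Kowalski", "title": "Report"})
from_wiki = map_author({"written_by": "B. Nowak", "title": "Notes"})
```

After the mapping, documents from different sources can share one author facet and one author filter.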
  14. 14. Hydra - example In the search result…
  15. 15. Search index – the problem. Input: structured document (content + metadata). Output: binary representation of an inverted index optimised for speed and accuracy. The search index has a flat structure – no internal relations. Changes to the index structure usually require an index rebuild (re-indexing).
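As a reminder of what the flat structure means, here is a minimal in-memory inverted index. It is a teaching sketch with whitespace tokenisation, not the binary format a real engine writes to disk.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Return the ids of documents that contain all query terms."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "LCD monitor on sale",
    2: "Plasma TV on sale",
    3: "TFT monitor review",
}
index = build_inverted_index(docs)
hits = search_and(index, "monitor sale")
```

The index is just term to postings, with no relations between documents. Changing how terms are produced (for example, adding lemmatisation) changes the keys themselves, which is why such changes usually force a full rebuild.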
  16. 16. Search index – the problem. Inverted index: theory in previous lectures. How to achieve… petabytes of indexed data, thousands of queries per second, thousands of index updates per second? Index split and index mirror (M). Example: FAST Enterprise Search Platform search cluster, a grid of Indexing / Search Nodes 00, 10, …, N0; 01, 11, …; 0M, 1M, …, NM.
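The grid of nodes can be sketched as a two-dimensional array: splits partition the documents, mirrors replicate each split. The class below is a hypothetical toy, not the FAST ESP design; it assumes hash-based placement and round-robin selection of the mirror row.

```python
import itertools
import zlib


class SearchCluster:
    """Toy grid: nodes[split][mirror] holds a copy of that split's documents."""

    def __init__(self, n_splits, n_mirrors):
        self.nodes = [[{} for _ in range(n_mirrors)] for _ in range(n_splits)]
        self._next_mirror = itertools.cycle(range(n_mirrors))

    def index(self, doc_id, text):
        # A stable hash decides which split owns the document.
        split = zlib.crc32(doc_id.encode()) % len(self.nodes)
        for mirror in self.nodes[split]:  # every mirror receives the update
            mirror[doc_id] = text
        return split

    def search(self, term):
        row = next(self._next_mirror)     # one mirror row serves the query
        hits = []
        for split in self.nodes:          # fan out over all splits
            hits += [d for d, text in split[row].items()
                     if term in text.lower().split()]
        return sorted(hits), row


cluster = SearchCluster(n_splits=3, n_mirrors=2)
for i, text in enumerate(["lcd monitor", "plasma tv", "tft monitor", "flat tv"]):
    cluster.index(f"doc{i}", text)

first = cluster.search("monitor")
second = cluster.search("monitor")
```

More splits let the cluster hold more data and keep each node's index small; more mirrors let it answer more queries per second and survive node failures.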
  17. 17. Search index – the implementation. To perform updates (index rebuilds) efficiently, several index partitions are produced. A small partition rebuilds quickly, unlike a big one. Rebuilding a larger partition involves merging in the index from the smaller one(s). Rebuilds can be triggered by: number of rebuild operations, number of documents, percentage of total volume.
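A rough sketch of tiered partitions, assuming the simplest trigger from the slide (number of documents). The class name, the limits and the rebuild counters are illustrative only.

```python
class PartitionedIndex:
    """Toy tiered index: small partitions absorb updates, big ones rarely rebuild."""

    def __init__(self, limits=(2, 4)):
        # limits[i] is the number of documents partition i may hold before
        # it is merged into partition i + 1; the last partition is unbounded.
        self.limits = limits
        self.partitions = [dict() for _ in range(len(limits) + 1)]
        self.rebuilds = [0] * (len(limits) + 1)

    def add(self, doc_id, text):
        self.partitions[0][doc_id] = text
        self.rebuilds[0] += 1  # the smallest partition rebuilds on every update
        for level, limit in enumerate(self.limits):
            if len(self.partitions[level]) > limit:
                # Rebuilding the larger partition merges in the smaller one.
                self.partitions[level + 1].update(self.partitions[level])
                self.partitions[level].clear()
                self.rebuilds[level + 1] += 1


idx = PartitionedIndex(limits=(2, 4))
for i in range(7):
    idx.add(f"d{i}", f"text {i}")

sizes = [len(p) for p in idx.partitions]
```

After seven updates the smallest partition has been rebuilt seven times, the middle one twice and the largest once, which is the point of the design: the expensive rebuilds happen rarely.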
  18. 18. Query processing. Query: “Do you have an LCD monitor under $900?” Stages shown in the diagram: Tokenizer, Spell-checking, Anti-phrasing, Phrasing, Normalization, Lemmas, Synonyms, Thesaurus, NLQ, Geography, PLUG-IN, Adaptive Evaluation. Annotations in the diagram: “Do you have a” (anti-phrasing); “Under $900?” → price < 900; YES!; X = LCD monitor; LCD monitors, TFT monitors; Flat TV, Plasma TV; BUY(X); Use “Product” collection; Rank profile = “Profit margin”. Output: modified query.
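The example query from the slide can be pushed through a very small query pipeline. The anti-phrase list, the synonym table and the `price_lt` filter name are assumptions for this sketch; a production system would use configurable dictionaries for each stage.

```python
import re

# Assumed dictionaries; longer anti-phrases are listed first.
ANTI_PHRASES = ("do you have an", "do you have a")
SYNONYMS = {"lcd monitor": ["lcd monitors", "tft monitors"]}


def process_query(raw):
    # Normalization: lower-case and drop the trailing question mark.
    q = raw.lower().strip().rstrip("?").strip()

    # Anti-phrasing: remove a leading phrase that carries no search intent.
    for phrase in ANTI_PHRASES:
        if q.startswith(phrase + " "):
            q = q[len(phrase) + 1:]
            break

    # Natural-language price constraint: "under $900" becomes price < 900.
    filters = {}
    match = re.search(r"\bunder \$(\d+)", q)
    if match:
        filters["price_lt"] = int(match.group(1))
        q = (q[:match.start()] + q[match.end():]).strip()

    # Synonym expansion of the remaining phrase.
    terms = [q] + SYNONYMS.get(q, [])
    return {"terms": terms, "filters": filters}


modified = process_query("Do you have an LCD monitor under $900?")
```

The result is a structured query (terms plus a numeric filter) rather than the original string, which is what the slide calls the modified query.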
  19. 19. Result processing. The following issues might apply to result processing. Ranking generation: factors that can be considered include number of hits, proximity of hits, freshness (date), web measures (e.g. page rank), business and context factors (boosting or blocking). Search federation: integration of results from multiple search engines, e.g. round robin, normalized ranks, searchlets (multiple result lists presented in different ways). Security trimming: filtering out the results that do not match the user’s credentials; last-second check.
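Two of these steps, round-robin federation and security trimming, are easy to sketch. The hit structure (`id`, `acl`) and the group names are assumptions made for this example.

```python
from itertools import zip_longest


def round_robin(*result_lists):
    """Interleave ranked lists from several engines, dropping duplicates."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):
        for hit in tier:
            if hit is not None and hit["id"] not in seen:
                seen.add(hit["id"])
                merged.append(hit)
    return merged


def security_trim(hits, user_groups):
    """Keep only hits whose access list overlaps the user's groups."""
    return [hit for hit in hits if hit["acl"] & user_groups]


intranet = [{"id": "a", "acl": {"staff"}}, {"id": "b", "acl": {"hr"}}]
files = [
    {"id": "c", "acl": {"staff"}},
    {"id": "a", "acl": {"staff"}},
    {"id": "d", "acl": {"board"}},
]

merged = round_robin(intranet, files)
visible = security_trim(merged, user_groups={"staff"})
```

Round robin ignores the engines' own scores, which makes it simple but crude; normalized ranks need comparable scores from every engine.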
  20. 20. Security in search solution: Search Application Security; Content-level Security; Secure Server Environment
  21. 21. Search Based Applications. Search Driven Solutions = Customisation of search system components
  22. 22. Catalogue of Search Based Applications
      Corporate Search: Intranets/portals; Information gateways; Expertise location; ECM repositories; Collaboration; Knowledge Management; Enterprise apps
      Intelligence System: Market intelligence; Customer intelligence; Surveillance; IP protection; Fraud detection; eDiscovery; Quality Management; Information risk management
      Database Offloading: Data warehouse; Data transformation; Data caches
      Commerce Systems: Search merchandising; Customer analytics; Campaign management; Call centre enablement; Customer self-service
      Media Systems: Public news syndication; Multimedia search; Proprietary research and publications; Libraries
      Search subsystem. Data connectors – out of the box, custom made. Repositories – Web, Databases, Files, Enterprise systems
  23. 23. Leading search engine technologies • HP / Autonomy IDOL • Microsoft (SharePoint and FAST Search products) • Google Search Appliance (GSA) • IBM Content Analytics/OmniFind • Oracle Secure Enterprise Search/Endeca • Apache Lucene/Solr (Open source) • Exalead CloudView • and more…
  24. 24. Comparison of different technology vendors. Questions to answer: What is the goal of Enterprise Findability (EF)? How should EF improve business? What user groups are targeted? What do the users want and need? What information is available and where is it stored? How should EF be rolled out and governed? What costs are involved? Are there any IT strategy considerations? Comparison dimensions shown in the diagram: Core search technology; Usability; Vendor capabilities; Total cost of ownership; Connectivity and security. Vendor mapping provides an answer to which EF platform best matches the overall requirements in the short and long term.
  25. 25. Findability – what is it? A holistic approach to leverage business value with search technology. Diagram axes: business value gained from search technology (Negligible → High); use of search technology/platform (Basic → Advanced). Dimensions around SEARCH: Business (needs & goals); Users (needs & capabilities); Search Technology; Information (quality & structure); Organisation (ownership & governance).
  26. 26. Online vs. Enterprise Search According to Stephen E. Arnold, “The New Landscape of Enterprise Search”, Pandia, July 2011
  27. 27. Online vs. Enterprise Search According to Stephen E. Arnold, “The New Landscape of Enterprise Search”, Pandia, July 2011
  28. 28. Measuring the search effectiveness – Enterprise case. Relevance of search results is highly subjective. Search is tightly bound to the business, otherwise it is not important to consider: it should increase income or reduce costs. Take into consideration all the dimensions of Findability: Business (needs & goals); Users (needs & capabilities); Information (quality & structure); Organization (ownership & governance); Search Technology (correctness of implementation). Tools: reviews, workshops, presentations, strategy drafting, audits, etc.
  29. 29. Measuring the search effectiveness – Online case. Relevance of search results is highly subjective. Search is tightly bound to the business, otherwise it is not important to consider: it should increase the conversion rate. Verify search functions and their impact on the conversion rate. Make isolated tests for each identified feature. Create a score based on a weighted average.
  30. 30. Measuring the search effectiveness – Online case – Example: Audit – the Final Report
      The results reported for each single test are composed of the following two elements: the overall benchmark and the cumulated results for test groups.
      Overall benchmark: Test categories designed for the purpose of the audit are generally applicable to any kind of search service or solution. Nevertheless, some of them are less and some are more important in a specific application such as an online Yellow Pages (YP) catalogue. That is why a weight is assigned to each test, representing its importance and influence on the whole YP solution. The defined weights are described in the following table.
      Test | Name | Weight [1-5] | Remarks
      I.a | Keyword match | 5 | This is a basic feature of any full-text search system and mostly influences the overall precision of search.
      I.b | Wildcard expansion | 2 | Users of YP solutions rarely use such features.
      I.c | Accuracy of result categories | 4 | The importance of properly assigned categories to registered entries is high since it influences usability and relevance of categories.
      I.d | Query operators | 1 | Users of YP solutions hardly ever use such features.
      I.e | Exact phrases | 3 | It might be important to catch an exact phrase in a search, preventing any background processing.
      II.a | Lemmatization | 5 | This is a must-have for any kind of search, especially for the Polish language.
      II.b | Synonym expansion | 3 | It is useful to improve recall of search, thus preventing zero results.
      II.c | Spellchecking | 4 | Very useful feature, as people tend to make simple spelling mistakes while typing at a keyboard.
      II.d | Anti-phrasing | 3 | It is useful not to search for irrelevant and meaningless terms.
      II.e | Name and phrase recognition | 3 | It is useful to capture some multi-word expressions names as a whole – in single …
      II.f | Natural Language Processing | 2 | Very advanced yet hard to implement feature.
      III.a | Navigation | 4 | Very useful feature enabling easy to use and intuitive …
      Fragments of a second, partly clipped report page (test names not recoverable): … actively find and filter items in a map; weight 3 – a useful feature that aids in finding items closest to a selected position; weight 3 – useful feature enabling mining the neighborhood of a selected item; weight 4 – as important as the first impression and encouraging users to interact with the service; weight 3 – it is important not to miss any category, to offer opportunity … another kind of search, content or advertisements; weight 5 – extremely important factor in online search solutions.
      … the overall benchmark can be expressed as a cumulative weighted score … scores 1-10. The ideal hypothetical search system should achieve score 10.
      [Chart: Overall weighted scores; labels visible in the chart area: IPMS, iFind, PKT, PF; axis 0–6]
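To make the weighting concrete, here is a small sketch of a weighted average over the first group of tests. The weights are the ones listed for tests I.a to I.e; the per-test scores are invented for illustration and assume each test is scored on a 0–10 scale.

```python
# Weights from the audit table (tests I.a to I.e).
WEIGHTS = {"I.a": 5, "I.b": 2, "I.c": 4, "I.d": 1, "I.e": 3}

# Hypothetical per-test scores on a 0-10 scale (not from the audit).
scores = {"I.a": 8, "I.b": 0, "I.c": 6, "I.d": 10, "I.e": 5}


def weighted_score(scores, weights):
    """Weighted average: important tests pull the result harder."""
    total_weight = sum(weights[test] for test in scores)
    return sum(scores[test] * weights[test] for test in scores) / total_weight


overall = weighted_score(scores, WEIGHTS)
plain_mean = sum(scores.values()) / len(scores)
```

With these invented numbers the weighted score (about 5.93) ends up slightly above the plain mean (5.8), because the two lowest-weighted tests, the failed wildcard test and the perfect query-operator test, barely move the result.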
  31. 31. QUESTIONS?
  32. 32. Paweł Wró