Architecture of Search Systems and Measuring the Search Effectiveness

Lecture given on 19 April 2012 at the Warsaw University of Technology. It is the 9th lecture in the regular master's-level course "Introduction to text mining".

Transcript

  • 1. 23 April 2012
  • 2. SEARCH SYSTEM ARCHITECTURE & MEASURING SEARCH EFFECTIVENESS Introduction to text mining – Warsaw University of Technology
  • 3. Plan: Findwise (who we are, what we do); General architecture of search engines; Data sources; Content processing; Search index; Query and result processing; Security in search engines; Applications based on search; Leading search technologies; The concept of Findability; Differences in online and enterprise search; Measuring search effectiveness; Questions and answers
  • 4. Findwise – Search Driven Solutions • Founded in 2005 • Offices in Sweden, Denmark, Norway and Poland • 75+ employees Our objective is to be a leading provider of Findability solutions utilising the full potential of search technology to create customer business value. • Paweł Wróblewski – search enthusiast
  • 5. General architecture of search engines
  • 6. Important terms: Latency, Feeding, Indexing, Searching
  • 7. Data sources. Everything that holds information is a good source! We need a connector to feed the data into a search system: take the content, take the metadata, take the security information. Different strategies to feed the data: Push – external applications invoke the search system connector's API to feed the content (e.g. transactional systems); Pull – the connector periodically scans the source and takes the data (e.g. web crawler, file system); Hybrid – external systems dump the data, which is then pulled by a connector.
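The push and pull feeding strategies from this slide can be sketched as follows. This is a minimal illustration only: `Document`, `SearchSystem` and `PullConnector` are hypothetical names made up for the example, not the API of any product mentioned in the lecture.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """What a connector hands over: content, metadata, security information."""
    doc_id: str
    content: str
    metadata: dict = field(default_factory=dict)
    allowed_groups: set = field(default_factory=set)


class SearchSystem:
    """Hypothetical stand-in for the feeding API of a search system."""

    def __init__(self):
        self.received = {}

    def feed(self, doc):
        self.received[doc.doc_id] = doc


class PullConnector:
    """Pull strategy: the connector scans the source and feeds what changed."""

    def __init__(self, source, search_system):
        self.source = source              # mapping of doc_id -> Document
        self.search_system = search_system
        self.seen = {}                    # doc_id -> content already fed

    def scan(self):
        fed = 0
        for doc_id, doc in self.source.items():
            if self.seen.get(doc_id) != doc.content:
                self.search_system.feed(doc)
                self.seen[doc_id] = doc.content
                fed += 1
        return fed


search = SearchSystem()

# Push: the external application calls the feeding API itself.
search.feed(Document("order-1", "Order for 3 LCD monitors", {"type": "order"}, {"sales"}))

# Pull: the connector would run periodically (here it is called twice by hand).
files = {"a.txt": Document("a.txt", "first version", {"type": "file"}, {"everyone"})}
connector = PullConnector(files, search)
first_scan = connector.scan()    # feeds the new file
second_scan = connector.scan()   # nothing changed, nothing is fed
```

A hybrid setup would combine the two: the external system writes into `files` (the dump) and the pull connector picks the data up on its next scan.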
  • 8. Content Processing – the idea. Input: byte stream. Output: structured document ready to be indexed. Pipeline stages (from the diagram): Format Conversion, Language Detection, Spell Checking, Lemmas (tenses, forms), Synonyms, Document Vectorizer, Entities (Geography, Companies, People), Taxonomy Classification, Custom PLUG-IN, Scopifier → index. Sample document: PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes. The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said.
  • 9. Content Processing – the implementation. Hydra is used to refine content before it hits the index. Every document fetched from a source runs through a targeted pipeline, which consists of a number of stages. A stage can be thought of as an “app” in the App Store or the Android Market. Findwise has created a large number of such stages, each with a small, specific purpose: to enhance the content of the item. It is possible to create additional stages to serve customer-specific functionality.
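The idea of a pipeline built from small, single-purpose stages can be sketched like this. It is a generic illustration and does not reproduce Hydra's actual API; the stage names and the toy language heuristic are assumptions made for the example.

```python
def normalize_whitespace(doc):
    """Stage: collapse runs of whitespace in the content."""
    doc["content"] = " ".join(doc["content"].split())
    return doc


def detect_language(doc):
    """Stage: toy language detection (illustrative heuristic only)."""
    padded = f" {doc['content'].lower()} "
    doc["language"] = "en" if " the " in padded else "unknown"
    return doc


def run_pipeline(doc, stages):
    """Run the document through every stage, in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


# The pipeline is just an ordered selection of available stages.
pipeline = [normalize_whitespace, detect_language]
result = run_pipeline({"content": "Venus  Williams raced into the   second round"}, pipeline)
```

Because each stage only reads and writes fields of the document, a new customer-specific stage can be appended to `pipeline` without touching the others.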
  • 10. Hydra - example Select the stages to use in the pipeline: the left column corresponds to the “market”, and the right column lists the stages in use.
  • 11. Hydra - example Modify the format of the date to only include year.
  • 12. Hydra - example The new year meta-data can be used as a facet
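The two preceding examples (reducing a date to its year, then using the year as a facet) can be sketched as follows. The field names `date` and `year` and the ISO date format are assumptions for illustration; this is not Hydra code.

```python
from collections import Counter
from datetime import datetime


def year_stage(doc, date_field="date", year_field="year"):
    """Stage: derive a year-only metadata field from a full date."""
    parsed = datetime.strptime(doc[date_field], "%Y-%m-%d")  # assumed input format
    return {**doc, year_field: parsed.year}


docs = [
    {"title": "A", "date": "2011-07-01"},
    {"title": "B", "date": "2012-04-19"},
    {"title": "C", "date": "2012-04-23"},
]
processed = [year_stage(d) for d in docs]

# A facet is a count of documents per distinct metadata value.
year_facet = Counter(d["year"] for d in processed)
```

With full dates almost every document would get its own facet value; reducing to the year gives a small set of buckets that is useful for navigation.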
  • 13. Hydra - example Map every author field to a metadata field called author. Pipeline A Pipeline B
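Mapping differently named author fields onto one `author` field might look like the sketch below. The source field names in `AUTHOR_ALIASES` are hypothetical; the slide does not say which fields the two pipelines actually receive.

```python
# Hypothetical source field names that all mean "author".
AUTHOR_ALIASES = ("creator", "dc_author", "written_by")


def map_author(doc, aliases=AUTHOR_ALIASES, target="author"):
    """Stage: move any known author-like field into a single target field."""
    mapped = dict(doc)
    for name in aliases:
        if name in mapped:
            value = mapped.pop(name)
            mapped.setdefault(target, value)  # keep an existing author value
    return mapped


# Documents arriving from two different sources (pipelines A and B).
pipeline_a_doc = map_author({"title": "Report", "creator": "Anna"})
pipeline_b_doc = map_author({"title": "Memo", "written_by": "Jan"})
```

After this stage a single `author` facet or filter works across both sources.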
  • 14. Hydra - example In the search result…
  • 15. Search index – the problem. Input: structured document (content + metadata). Output: binary representation of an inverted index optimised for speed and accuracy. The search index has a flat structure, with no internal relations. Changes to the index structure usually require an index rebuild (re-indexing).
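As a reminder of the data structure this slide refers to, here is a minimal in-memory inverted index. It is a teaching sketch under simple assumptions (whitespace tokenization, lowercase terms, AND semantics); a production index is a compact binary structure, as the slide notes.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Return the ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {1: "LCD monitor under 900", 2: "plasma TV", 3: "TFT monitor"}
index = build_inverted_index(docs)
monitor_hits = search_and(index, "monitor")
lcd_hits = search_and(index, "lcd monitor")
```

The structure is flat: terms point to documents and nothing else. This is why adding a new kind of field typically means re-processing the documents and rebuilding the index.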
  • 16. Search index – the problem. Inverted index: theory in previous lectures. How to achieve: petabytes of indexed data, thousands of queries per second, thousands of index updates per second? Diagram: FAST Enterprise Search Platform search cluster example, a grid of Indexing / Search Nodes 00 … NM that combines index split with index mirror (M).
  • 17. Search index – the implementation. In order to perform effective updates (index rebuilds), several index partitions are produced. A small partition rebuilds quickly, unlike a big one. Rebuilding a larger partition involves merging in the index from the smaller one(s). Rebuilds can be triggered by: number of rebuild operations, number of documents, percent of total volume.
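The partition scheme can be sketched as a tiered structure in which new documents go to the smallest partition and a partition is merged into the next larger one when it grows past a limit. The sketch uses only the "number of documents" trigger, and the limits `(2, 4)` are arbitrary values chosen for illustration.

```python
class PartitionedIndex:
    """Toy tiered index: partition 0 is the smallest, the last is the largest."""

    def __init__(self, max_docs_per_level=(2, 4)):
        self.limits = max_docs_per_level
        self.partitions = [dict() for _ in range(len(max_docs_per_level) + 1)]
        self.rebuilds = [0] * len(self.partitions)

    def add(self, doc_id, text):
        # Every update rebuilds only the small partition.
        self.partitions[0][doc_id] = text
        self.rebuilds[0] += 1
        # Merge upwards whenever a partition exceeds its document limit.
        for level, limit in enumerate(self.limits):
            if len(self.partitions[level]) > limit:
                self.partitions[level + 1].update(self.partitions[level])
                self.partitions[level].clear()
                self.rebuilds[level + 1] += 1


idx = PartitionedIndex()
for n in range(7):
    idx.add(f"d{n}", f"document number {n}")

sizes = [len(p) for p in idx.partitions]
```

After seven updates the largest partition has been rebuilt once, while the smallest one absorbed every update. That is the point of the scheme: frequent rebuilds stay cheap.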
  • 18. Query processing. Query: “Do you have an LCD monitor under $900?” Stages (from the diagram): Tokenizer, Spell-checking, Anti-phrasing, Phrasing, Normalization, Lemmas, NLQ, Thesaurus, Synonyms, Geography, Adaptive Evaluation, custom PLUG-IN. Annotations in the diagram: “Do you have a”; “Under $900?” → price < 900; X = LCD monitor; BUY( X ); LCD monitors, TFT monitors, Flat TV, Plasma TV; YES!; Use “Product” collection; Rank profile = “Profit margin”. Output: modified query.
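A few of the query-processing steps on this slide (normalization, anti-phrasing, turning "under $900" into a price filter, synonym expansion) can be sketched as below. The anti-phrase list, the thesaurus entries and the output field names are assumptions for illustration; real systems use much richer linguistic resources.

```python
import re

# Hypothetical resources; the longer anti-phrase is listed first on purpose.
ANTI_PHRASES = ("do you have an", "do you have a")
SYNONYMS = {"lcd monitor": ["lcd monitors", "tft monitors"]}


def process_query(raw):
    """Turn a natural-language question into a structured, modified query."""
    text = raw.lower().strip().rstrip("?")           # normalization
    filters = {}

    match = re.search(r"under \$(\d+)", text)         # "under $900" -> price < 900
    if match:
        filters["price_lt"] = int(match.group(1))
        text = text[:match.start()] + text[match.end():]

    for phrase in ANTI_PHRASES:                        # anti-phrasing
        if text.startswith(phrase + " "):
            text = text[len(phrase):]
            break

    phrase = " ".join(text.split())                    # remaining search phrase
    return {
        "phrase": phrase,
        "expansions": SYNONYMS.get(phrase, []),        # synonym expansion
        "filters": filters,
    }


modified = process_query("Do you have an LCD monitor under $900?")
```

Here `modified["phrase"]` is `"lcd monitor"` and the price constraint ends up in `modified["filters"]`, so the engine can apply it as a structured filter instead of matching the words "under" and "$900" in the text.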
  • 19. Result processing. The following issues might apply to result processing. Ranking generation: factors that can be considered include number of hits, proximity of hits, freshness (date), web measures (e.g. page rank), business and context factors (boosting or blocking). Search federation: integration of results from multiple search engines, e.g. round robin, normalized ranks, searchlets (multiple result lists presented in different ways). Security trimming: filtering out the results that do not match the user's credentials; last-second check.
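Two of the result-processing steps, round-robin federation and security trimming, can be sketched as follows. The hit structure (`id`, `acl`) and the group names are hypothetical, chosen only to keep the example self-contained.

```python
from itertools import zip_longest


def round_robin(*result_lists):
    """Federation: interleave result lists, one hit from each engine per round."""
    merged, seen = [], set()
    for group in zip_longest(*result_lists):
        for hit in group:
            if hit is not None and hit["id"] not in seen:
                seen.add(hit["id"])       # drop duplicates across engines
                merged.append(hit)
    return merged


def security_trim(hits, user_groups):
    """Keep only hits whose access list shares a group with the user."""
    return [hit for hit in hits if hit["acl"] & user_groups]


engine_a = [{"id": "a1", "acl": {"everyone"}}, {"id": "a2", "acl": {"hr"}}]
engine_b = [
    {"id": "b1", "acl": {"sales"}},
    {"id": "a1", "acl": {"everyone"}},
    {"id": "b2", "acl": {"everyone"}},
]

merged = round_robin(engine_a, engine_b)
visible = security_trim(merged, {"everyone", "sales"})
visible_ids = [hit["id"] for hit in visible]
```

Trimming is applied after merging so that a hit the user may not see never reaches the result page, whichever engine returned it.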
  • 20. Security in search solution: Search Application Security, Content-level Security, Secure Server Environment
  • 21. Search Based Applications. Search Driven Solutions = Customisation of search system components
  • 22. Catalogue of Search Based Applications. Corporate Search: Intranets/portals, Information gateways, Expertise location, ECM repositories, Collaboration, Knowledge Management, Enterprise apps. Intelligence Systems: Market intelligence, Customer intelligence, Surveillance, IP protection, Fraud detection, eDiscovery, Quality Management, Information risk management. Database System Offloading: Data warehouse, Data transformation, Data caches. Commerce Systems: Search merchandising, Customer analytics, Campaign management, Call centre enablement, Customer self-service. Media: Public news syndication, Multimedia search, Proprietary research and publications, Libraries. Underlying layers: Search subsystem; Data connectors – out of the box, custom made; Repositories – Web, Databases, Files, Enterprise systems
  • 23. Leading search engine technologies • HP / Autonomy IDOL • Microsoft (SharePoint and FAST Search products) • Google Search Appliance (GSA) • IBM Content Analytics/OmniFind • Oracle Secure Enterprise Search/Endeca • Apache Lucene/Solr (open source) • Exalead CloudView • and more…
  • 24. Comparison of different technology vendors. Evaluation areas (from the diagram): Core search technology, Usability, Vendor capabilities, Total cost of ownership, Connectivity and security. Questions: What is the goal of Enterprise Findability (EF)? How should EF improve business? What user groups are targeted? What do the users want and need? What information is available and where is it stored? How should EF be rolled out and governed? What costs are involved? Are there any IT strategy considerations? Vendor mapping provides an answer to which EF platform best matches the overall requirements in the short and long term.
  • 25. Findability – what is it? A holistic approach to leverage business value with search technology. Diagram: business value gained from search technology (Negligible to High) plotted against use of search technology/platform (Basic to Advanced). Dimensions surrounding SEARCH: Business (needs & goals), Users (needs & capabilities), Search Technology, Information (quality & structure), Organisation (ownership & governance).
  • 26. Online vs. Enterprise Search According to Stephen E. Arnold, „The New Landscape of Enterprise Search”, Pandia, July 2011
  • 27. Online vs. Enterprise Search According to Stephen E. Arnold, „The New Landscape of Enterprise Search”, Pandia, July 2011
  • 28. Measuring the search effectiveness: Enterprise case. Relevance of search results is highly subjective. Search is tightly bound to the business; otherwise it is not important to consider. Increase income or reduce costs. Take into consideration all the dimensions of Findability: Business (needs & goals), Users (needs & capabilities), Information (quality & structure), Organization (ownership & governance), Search Technology (correctness of implementation). Tools: reviews, workshops, presentations, strategy drafting, audits etc.
  • 29. Measuring the search effectiveness: Online case. Relevance of search results is highly subjective. Search is tightly bound to the business; otherwise it is not important to consider. Increase conversion rate. Verification of search functions and their impact on conversion rate. Make isolated tests for each identified feature. Create a score based on a weighted average.
  • 30. Measuring the search effectiveness: Online case. Example: Audit – the Final Report. The results reported for each single test are composed of the following two elements: overall benchmark; cumulated results for test groups.
    Test categories designed for the purpose of the audit are generally applicable to any kind of search service or solution. Nevertheless, some of them are less and some are more important in a specific application such as an online Yellow Pages (YP) catalogue. That is why a weight is assigned to each test, representing its importance and influence on the whole YP solution. The defined weights are described in the following table.
    Test name (weight [1-5]): remarks
    I.a Keyword match (5): This is a basic feature of any full-text search system and mostly influences the overall precision of search.
    I.b Wildcard expansion (2): Users of YP solutions rarely use such features.
    I.c Accuracy of result categories (4): The importance of properly assigned categories to registered entries is high since it influences usability and relevance of categories.
    I.d Query operators (1): Users of YP solutions hardly ever use such features.
    I.e Exact phrases (3): It might be important to catch an exact phrase in a search, preventing any background processing.
    II.a Lemmatization (5): This is a must for any kind of search, especially for the Polish language.
    II.b Synonym expansion (3): It is useful to improve recall of search, thus preventing zero results.
    II.c Spellchecking (4): Very useful feature, as people tend to make simple spelling mistakes while typing at a keyboard.
    II.d Anti-phrasing (3): It is useful not to search for irrelevant and meaningless terms.
    II.e Name and phrase recognition (3): It is useful to capture some multi-word expressions and names as a whole, in a single meaning.
    II.f Natural Language Processing (2): Very advanced yet hard to implement feature.
    III.a Navigation (4): Very useful feature enabling easy to use and intuitive […]
    Rows whose test names are cut off in the transcript:
    […] actively find and filter items in a map.
    […] (3): It is a useful feature that aids in finding items closest to a selected position.
    […] (3): Useful feature enabling mining the neighborhood of a selected item.
    […] (4): As important as first impression and encouraging users to interact with the service.
    […]esult page (3): It is important not to miss any category to offer opportunity […] another kind of search, content or advertisements.
    […]mance (5): Extremely important factor in online search solutions.
    […]mark score is presented in the following chart. Chart: Overall weighted scores (series: iFind, PKT, PF; axis 0 to 6).
    […]alculation the overall benchmark can be expressed as a cumulative weighted score […]ores 1-10. The ideal hypothetical search system should achieve score 10. […]re as follows for the conducted tests: […]
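The weighted benchmark described on this slide can be sketched as a weighted average of per-test scores. The weights below are taken from the table; the per-test scores are made-up values on a 0 to 10 scale, since the transcript does not preserve the measured results, and the exact formula used in the audit is not shown, so this is only one plausible reading.

```python
# Weights [1-5] as listed in the table on this slide.
WEIGHTS = {
    "I.a Keyword match": 5,
    "I.b Wildcard expansion": 2,
    "II.a Lemmatization": 5,
    "II.c Spellchecking": 4,
}


def weighted_score(scores, weights):
    """Weighted average of test scores; stays on the same scale as the scores."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("at least one test with a non-zero weight is required")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical results of the isolated tests for one search service.
scores = {
    "I.a Keyword match": 8,
    "I.b Wildcard expansion": 3,
    "II.a Lemmatization": 10,
    "II.c Spellchecking": 5,
}
overall = weighted_score(scores, WEIGHTS)
```

A poor result on a low-weight test such as wildcard expansion barely moves the overall score, while the same result on keyword match or lemmatization would pull it down noticeably.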
  • 31. QUESTIONS?
  • 32. Paweł Wróblewski, pawel.wroblewski@findwise.com