
Concept Searching Portal Solutions Search Engine Face Off


The presentation from a recent webinar.

  1. 1. Search Engine Face-Off: Keyword Search versus Metadata Search
      Don Miller, VP of Business Development, 1 (408) 828-3400
      Val Orekhov, VP of Business Development, 1 (240) 450-2166 x 103
  2. 2. Concept Searching: Don Miller
      Don Miller is a senior executive at Concept Searching with over 20 years of experience in knowledge management. He is a frequent speaker on Records Management and Information Architecture problems and solutions. Don has been a guest speaker at Taxonomy Bootcamp, Management Electronic Records and numerous SharePoint events about information organization and records management.
      Don Miller, VP of Business Development, 1 (408) 828-3400
      Portal Solutions: Val Orekhov
      Val Orekhov, Chief Architect for Portal Solutions, is deeply skilled in enterprise application development, web development, portals, relational databases and data access, and modeling, and is versed in a number of programming languages and technologies. He has been with Portal Solutions for almost five years and drives the technical team to excel year over year. He holds a Master of Science in Computer Science from Kyrgyz Technical University in Bishkek, Kyrgyzstan.
      Val Orekhov, Chief Technical Architect, 1 (240) 450-2166 x 103
  3. 3. Agenda
      • Concept Searching:
        • Keyword vs. Metadata
        • Keyword vs. Metadata Costs
        • Google vs. SharePoint vs. FAST
        • What’s wrong with a manual metadata approach
        • Automated approaches
        • USAF Case Study
      • Portal Solutions:
        • Enterprise Search – Google vs. FAST in SharePoint
        • Indexing Options
        • Approach to Security Trimming
        • Ranking Algorithms & Sorting Options
        • Metadata & Search Refinements
      • Questions and Answers
      • Demo of product if time permits
      Concept Searching • Don Miller • (408) 828-3400
  4. 4. Concept Searching, Inc.
      • Company founded in 2002
      • Product launched in 2003
      • Focus on management of structured and unstructured information
      • Technology
        • All technologies based on our ‘open conceptualTagging framework’
        • Automatic concept identification, content tagging, auto-classification, taxonomy management
        • Only statistical vendor that can extract conceptual metadata
      • 2009 and 2010 ‘100 Companies that Matter in KM’ (KM World Magazine)
      • KMWorld ‘Trend Setting Product’ of 2009 and 2010
      • Locations: US, UK, & South Africa
      • Client base: Fortune 500/1000 organizations
      • Microsoft Enterprise Search ISV, FAST Partner
      • Product Suite: conceptSearch, conceptTaxonomyManager, conceptClassifier, conceptClassifier for SharePoint, contentTypeUpdater for SharePoint
  5. 5. What Type of Search or Information Architecture Do You Need?
      • Keyword Search = ~66%+ of results (Recall)
        • Simple
        • No administration
        • Good enough
      • Metadata Search = 100% of results (Recall)
        • Guided Navigation
        • Records Management
        • Sensitive Information Removal
        • Collaboration
        • Improved Precision and Recall
        • Evolution of Keyword Search
      Recall (information retrieval): a statistical measure (contrasted with precision), the fraction of all relevant material that is returned by a search query.
      Precision (information retrieval): the percentage of documents returned that are relevant.
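The two definitions on this slide can be made concrete with a small calculation. The following is a minimal sketch, not taken from the presentation: the document IDs and the sizes of the result sets are made-up assumptions chosen so that a keyword query finds four of six relevant documents.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single query.

    precision = relevant documents returned / all documents returned
    recall    = relevant documents returned / all relevant documents
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical corpus: six documents are truly relevant to the query.
relevant_docs = {"d1", "d2", "d3", "d4", "d5", "d6"}

# Hypothetical keyword query: finds four relevant documents plus two irrelevant ones.
keyword_hits = {"d1", "d2", "d3", "d4", "x1", "x2"}

precision, recall = precision_recall(keyword_hits, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case both values are 4/6: a third of the relevant documents are never returned (recall), and a third of what is returned is noise (precision).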
  6. 6. What Is Keyword vs. Metadata Costing You?
      • Pre Migration
        • Problem: 60% of stored documents are obsolete; 50% of documents are duplicates; requires resources to identify what should/should not be migrated
        • Solution: Eliminate duplicate documents; identify privacy data exposures; identify and declare records that were not previously identified; identify high value content; migrate required content to a structure based on the taxonomies
        • Benefit: Reduces migration costs; ensures compliance and protection of content assets
      • Search
        • Problem: “It’s not about better search”; less than 50% of content is correctly indexed, meta tagged or efficiently searchable; 85% of relevant documents are never retrieved in search
        • Solution: Eliminate manual tagging & replace with automatic identification of multi-word concepts; provide guided navigation via the taxonomy structure (i.e. concepts); go beyond dynamic clustering with conceptual clustering
        • Benefit: Taxonomy navigation is 36% - 48% faster; savings of 2.5 hours per user per day
      • Records Management
        • Problem: 67% of data loss in Records Management is due to end user error; it costs an organization $180 per document to recreate it when it is not tagged correctly and cannot be found
        • Solution: Eliminate inconsistent end user tagging; automatically declare documents of record based on vocabulary and retention codes; automatically change the Content Type and route to the Records Management repository
        • Benefit: Savings of $4.00 - $7.04 per record by eliminating manual tagging; ensures compliance and reduces potential litigation exposures
      • Data Privacy Protection
        • Problem: Average cost per exposed record is $197 and ranges from $90-$305 per record; 70% of breaches are due to a mistake or malicious intent by an organization’s own staff
        • Solution: Identify any type of organizationally defined privacy data; combines pattern matching with associated vocabulary; automatic Content Type updating enabling workflows and rights management
        • Benefit: Average cost runs from $225K to $35M
  7. 7. Metadata Search vs. Keyword and Guided Navigation
      [Figure: Results (“Recall”) versus Cost (Time, Money and Complex)]
      • Keyword Search / Entity Extraction: 33% (20-33% of results)
      • Key + Synonym Search: 66%
      • Metadata Search: 100% of results, also known as “Recall”
      • Terms shown in the figure: “Proposal”, “Proposals”, “Contract”, “Software License”, “SLA”, “Licensee”, “Addendum”, “License Agreement”, “License”, “Documents of Record”
      Entity extraction without complex rules is ineffective. It is just keyword match, which is what keyword search is, which is 33% effective.
  8. 8. Similar Features Against Total Number of Documents Returned
      • Index
        • Google: 500 M+
        • SharePoint: 100 M
        • FAST: 500 M+
      • Key Word (33% of results)
        • Google: Yes
        • SharePoint: Yes – Good as Google or FAST
        • FAST: Yes
      • Synonyms (up to 50-66%+ of results)
        • Google: Yes
        • SharePoint: Yes
        • FAST: Yes
      • Apply metadata automatically for 100% of results
        • Google: No
        • SharePoint: No
        • FAST: Key Word only, which equals 33% of results
      • Ranking Algorithm + Best Bets (does not improve number of results, only how presented)
        • Google: Non tunable
        • SharePoint: Tunable
        • FAST: Very Tunable
  9. 9. What Is Missing To Get to 100% of Relevant Results in Every Search? Metadata
      • Auto Classification
        • Google: No – Missing 33-50% of results on any particular topic
        • SharePoint: No – Missing 33-50% of results on any particular topic
        • FAST: Entity extraction, which is the same as keyword search 33% results. No RECALL results improvement with this approach
      • Taxonomy Management
        • Google: No
        • SharePoint: Yes, but can’t do anything with it in this release. Security issues for managing Term Store.
        • FAST: No
  10. 10. Miscellaneous Items to Review
      • SharePoint Refiners and Navigators with counts
        • Google: Hard
        • SharePoint: Yes – Easy to use for standard search. No counts on results.
        • FAST: Medium – Initial release, does not leverage Term Store yet. XML – RECALL Powershell based
      • Customization
        • Google: Difficult
        • SharePoint: Difficult
        • FAST: Extendable
  11. 11. Summary
      • Google – Best for no administration: install and walk away. Usually missing 33%-50% of results on any given topic because of missing metadata. Not easy to integrate refiners or navigators into the SharePoint UI.
      • SharePoint Search – Cost effective, comes free with SharePoint. Search algorithm is as good as FAST or Google. Also very easy to install and walk away. Limited extensibility. Easy integration for refiners and navigators (no counts). Also missing 50% of results on any topic.
      • FAST – Extremely customizable, but requires training or professional services to customize. Most likely Microsoft’s long-term platform for search. Very scalable and can provide refiner counts. Still missing 33-50% of results from any given search because of metadata inconsistency.
      • However, they are all missing a true metadata strategy, which is the only way to ensure 100% of results.
  12. 12. A Manual Metadata Approach Will Fail 95%+ Of The Time
      Issue: Organizational Impact
      • Inconsistent: Less than 50% of content is correctly indexed, meta-tagged or efficiently searchable, rendering it unusable to the organization (IDC)
      • Subjective: Highly trained Information Specialists will agree on meta tags between 33% - 50% of the time. (C. Cleverdon)
      • Cumbersome - Expensive: Average cost of manually tagging one item runs from $4 - $7 per document and does not factor in the accuracy of the meta tags nor the repercussions from mis-tagged content (Hoovers)
      • Malicious Compliance: End users select first value in list (Perspectives on Metadata, Sarah Courier)
      • No perceived value for end user: What’s in it for me? End user creates document, does not see value for organization nor risks associated with litigation and non conformance to policies.
      • What have you seen: Metadata will continue to be a problem due to inconsistent human behavior
      The answer to consistent metadata is an automated approach that can extract the meaning from content, eliminating manual metadata generation yet still providing the ability to manage knowledge assets in alignment with the unique corporate knowledge infrastructure.
  13. 13. conceptClassifier’s TaxonomyManager Automated Metadata Approach Drives Business Value
      [Figure: cycle of six stages: 1. Model and Validate, 2. Automate Tagging, 3. Findability, 4. Business Processes, 5. Records Management and PII, 6. Life Cycle Management]
      • Create enterprise automated metadata framework/model
        • Average return on investment minimum of 38% and runs as high as 600% (IDC)
      • Apply consistent meaningful metadata to enterprise content
        • Incorrect meta tags cost an organization $2,500 per user per year – in addition to potential costs for non-compliance (IDC)
      • Guide users to relevant content with taxonomy navigation
        • Savings of $8,965 per year per user based on an $80K salary (Chen & Dumais)
        • 100% “Recall” of content, 35% faster access to content “Precision”
      • Use automatic conceptual metadata generation to improve Records Management
        • Eliminate inconsistent end user tagging at $4-$7 per record (Hoovers)
        • Improve compliance processes, eliminate potential privacy exposures
  14. 14. USAF Human Performance Clearinghouse
      GOAL: Leverage Existing USAF, AFDW, and AFMS License Agreements to Enable IM, RM, & Privacy & Security Compliance Requirements
      • DoDD 8320 (Data Sharing in a Net-Centric DoD)
      • DoDD 5015 (Records Management)
      • USAF Privacy Act Program & HIPAA
      • Freedom of Information Act (FOIA)
      [Figure labels: Data Privacy, Migration, Records Management, Search, eDiscovery & FOIA]
      Tel: 703.246.9360 | Fax: 240.465.1182
      Distribution Statement A: Approved for public release; distribution is unlimited. 311 ABG/PA No. 09-488, 16 Oct 2009
  15. 15. Taxonomy Improves “Precision” with Guided Refiners for “Proposals”
      • After 100% of Results are returned, leverage metadata for guided navigation and refiners
      • Use taxonomy/metadata structures before query and after query to guide users to the right document
      • Accelerate document finding [PRECISION] by a minimum of 35%
      Example: I want all proposals in two specific regions. I could then have a guided refiner for vertical, amount, etc.
  16. 16. Dynamic Clustering Is Not Guided Navigation for “Proposals”
      • Brings back clusters
      • They are best guesses
      • They might help, they might make it worse
      • Better than nothing, but not a long term strategy or evolution of key word search
      Dynamic navigation (CLUSTERING) is ineffective. How does an information worker know when it is a good topic or not? This is NOT PRECISION!
  17. 17. Enterprise Search Comparison for SharePoint: Google vs FAST
      Why Enterprise Search needs Metadata and Taxonomy Management
      • Recall – Ensures you bring back 100% of Results
      • Enhance Precision – Fastest way to filter to the right results so that you are looking at the documents that matter the most
      MUST HAVES:
      • Heterogeneous content sources:
        • HTML, Documents and LOBs records
        • Located on Portals, File Systems and in Databases
      • Required Security Trimming:
        • Integrate with Identity Providers (AD, LDAP, SQL)
        • Implement authorization decision logic
      • Able to take advantage of metadata stored with documents and LOBs
  18. 18. Google Search Appliance 6.8 vs. FAST Search Server for SharePoint 2010 For metadata-driven search scenarios in a SharePoint environment
  19. 19. Portal Solutions Corporate Overview - Vitals • Founded in 2002 • SharePoint 2010 Microsoft Gold Certified partner • Over 100 SharePoint deployments • 30+ certified engineers/developers • Member of Microsoft SharePoint Early Adopter Program • A recognized best place to work by Washingtonian magazine • A growing IT consulting organization comprised of talented and certified staff
  20. 20. Corporate Overview - Solutions • Employee Portals and Intranets • Public facing web sites • Knowledge Management solutions • Document and Records Management • Performance And Risk Management/BI • Customer Extranets • Enterprise Search solutions • Business Process Automation
  21. 21. Introducing the Contenders
      • Google Search Appliance (GSA)
        • Search Appliance, in a box
        • Hardware & Software Solution
        • Pre-packaged functionality ready to work
        • “Black box” approach to search results
      • FAST Search Server for SharePoint 2010
        • Spin off of the earlier FAST ESP
        • Software-only solution
        • Allows customization of many aspects of the engine functionality, down to relevancy tuning algorithms
        • Platform rather than a product
  22. 22. Comparing FS4SP and GSA • Indexing Options • Approach to Security Trimming • Ranking Algorithms & Sorting Options • Metadata & Search Refinements
  23. 23. Content Crawl Options
      • Content Pull
        • GSA: HTTP Crawler
        • FAST: SharePoint Crawler, Enterprise Crawler
        • SharePoint: SharePoint Crawler
      • Content Push
        • GSA: XML Feed API
        • FAST: Feed API
        • SharePoint: -
      • Indexing LOBs (Pull)
        • GSA: Onboard Database Connector
        • FAST: Databases & Web Services via SharePoint BCS
        • SharePoint: Databases & Web Services via SharePoint BCS
      • Connectors
        • GSA: SharePoint, Documentum, LiveLink, FileNet, File System, LDAP
        • FAST: OTB: File System, Exchange Public Folders; Custom: Documentum, Lotus Notes
        • SharePoint: OTB: File System, Exchange Public Folders
      • External Metadata
        • GSA: Push through XML Feed API
        • FAST: Custom Stages in the processing pipeline
        • SharePoint: -
      • Cloud Connectivity
        • GSA: Google Apps & Sites; Twitter
        • FAST: Custom connectors
        • SharePoint: -
  24. 24. Comparing FS4SP and GSA • Indexing Options • Approach to Security Trimming • Ranking Algorithms & Sorting Options • Metadata & Search Refinements
  25. 25. Security Trimming
      • Answers the “Who Am I” and “What Results Can I See” questions
      • Required with most Enterprise Search scenarios
      • Approaches include Late & Early Authorization/Binding
      Authorization Approach: Access Rights (ACLs), Pros, Cons
      • Late
        • Access Rights (ACLs): Checked at run time against system of record
        • Pros: Up-to-date presentation
        • Cons: Slow on large result sets
      • Early
        • Access Rights (ACLs): Information stored in the index at item level
        • Pros: Fast; facilitates metadata clustering
        • Cons: Duplicates info; potential for outdated results
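The trade-off between the two approaches can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not how GSA, FAST or SharePoint implement trimming: the `IndexedDoc` class, the `can_read` callback and all document and group names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDoc:
    doc_id: str
    acl: frozenset  # principals allowed to read, copied into the index at crawl time


def early_trim(results, user_principals):
    """Early binding: filter hits using the ACLs stored in the index."""
    return [d.doc_id for d in results if d.acl & user_principals]


def late_trim(results, user, can_read):
    """Late binding: ask the system of record about every hit at query time."""
    return [d.doc_id for d in results if can_read(user, d.doc_id)]


# Hypothetical system of record and the index built from it at crawl time.
source_acls = {"plan.docx": {"hr"}, "memo.docx": {"hr", "sales"}}
index = [IndexedDoc(doc_id, frozenset(acl)) for doc_id, acl in source_acls.items()]

# After the crawl, the sales group loses access to memo.docx in the source system.
source_acls["memo.docx"] = {"hr"}

user_groups = {"alice": frozenset({"sales"})}


def can_read(user, doc_id):
    """Hypothetical live permission check against the system of record."""
    return bool(source_acls[doc_id] & user_groups[user])


early = early_trim(index, user_groups["alice"])  # uses the stale ACL copy
late = late_trim(index, "alice", can_read)       # one live check per hit
print("early:", early, "late:", late)
```

The early result still lists memo.docx until the next crawl refreshes the stored ACL, which is the "potential for outdated results" noted on the slide; the late result is current but costs one permission check per hit, which is why it slows down on large result sets.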
  26. 26. Security Trimming Options Support
      • Late Authorization
        • GSA: “Default” option in many scenarios; via Kerberos, SAML Bridge or Connector
        • FAST: -
        • SharePoint 2010: Custom
      • Early Authorization
        • GSA: Rel. 6.0 – High level Policy ACLs configured by admins or through a remote API *; Rel. 6.8 – Item-level ACLs **
        • FAST: Item-level ACLs for Windows and SharePoint security principals supported natively; allows setting up multiple user property stores and mapping user principals
        • SharePoint 2010: Native support for Item-level ACLs with Windows and SharePoint security principals
      * Best applied to enterprises with a manageable number of high level policies, or able to invest in custom ACL sync tools
      ** SharePoint Connector Rel. 2.6.4 sends SharePoint Site Groups with the feed but the Groups are not expanded properly by GSA
  27. 27. Comparing FS4SP and GSA • Indexing Options • Approach to Security Trimming • Ranking Algorithms & Sorting Options • Metadata & Search Refinements
  28. 28. Search Engine Internals
  29. 29. Result Set Ranking
      • Fidelity of keyword matches (All Engines)
        • Proximity
        • Frequency
        • Completeness
      • Hyper Text Matching (GSA only)
        • Analyzes keyword location on a rendered page and related pages
      • Hub and Spoke Algorithm (All engines)
        • Driven by linkages between web pages
        • Pages receiving or providing most links have higher rankings
        • GSA – PageRank; FAST – Document authority
      • Static rank biasing, document importance
        • Document, Site, Metadata-based promotion / demotion (All engines)
        • User-tagged documents receive higher importance (FAST, SharePoint search)
      • Adaptive ranking
        • User clicks in search results (FAST, SharePoint search)
      • Custom Ranking
        • Build custom ranking models w/ FAST
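The idea that pages receiving the most links rank higher can be sketched with a textbook power-iteration PageRank. This is an illustrative assumption, not the actual ranking code of GSA or FAST: the damping factor, iteration count and the four-page link graph are all made up for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a dict of page -> outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical intranet: three pages link to "policies", nothing links to "archive".
links = {
    "home": ["policies", "news"],
    "news": ["policies"],
    "policies": ["home"],
    "archive": ["policies"],
}

rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(f"{page}: {rank[page]:.3f}")
```

In this graph the most linked-to page, "policies", ends up with the highest score and the unlinked "archive" with the lowest, independent of any keyword match.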
  30. 30. Result Set Sorting
      • GSA
        • Date/Time only (Document Modification Date, or a date extracted from Title, Metadata or Body of a document)
      • FAST
        • Any property marked as Sortable
        • Supported data types: String, Number, Date/Time
  31. 31. Comparing FS4SP and GSA • Indexing Options • Approach to Security Trimming • Ranking Algorithms & Sorting Options • Metadata & Search Refinements
  32. 32. Index Schema Management
      • GSA (All-inclusive)
        • All discovered metadata (Crawled Properties) are stored in the index by default
        • Metadata from MS Office documents stored in the index results. (GSA Feature Request ID# 1371024)
        • All string-type metadata is associated with FTI by default, matches on metadata controlled through query time (allintext:, allintitle: keyword filters)
        • Metadata in results limited to 1,500 chars per field (Rel. 6.8; prev. releases – 320 chars)
      • FAST (Opt-in)
        • Crawled properties have to be associated with Managed Properties (MPs) to be stored in the index
        • MPs represent a level of abstraction from Content Sources
        • MPs can be configured to be used as:
          • Stored in the index (Queryable)
          • Associated with FTI (Searchable)
          • Sortable
          • Refiner-enabled
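The opt-in model described for FAST, where only crawled properties mapped to a Managed Property reach the index, can be sketched as follows. This is a simplified illustration, not the FAST or SharePoint schema API: the `ManagedProperty` class, the `map_document` helper and the crawled property names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagedProperty:
    name: str
    crawled_sources: tuple    # crawled property names mapped to this MP, in priority order
    queryable: bool = True    # stored in the index
    searchable: bool = False  # associated with the full-text index (FTI)
    sortable: bool = False
    refinable: bool = False


def map_document(crawled, schema):
    """Keep only crawled values that are mapped to a managed property (opt-in)."""
    indexed = {}
    for mp in schema:
        for source in mp.crawled_sources:
            if source in crawled:
                indexed[mp.name] = crawled[source]
                break  # first mapped source that is present wins
    return indexed


# Hypothetical schema: two managed properties abstract over source-specific names.
schema = [
    ManagedProperty("Author", ("ows_Author", "dc:creator"), searchable=True, refinable=True),
    ManagedProperty("Modified", ("ows_Modified",), sortable=True),
]

# Hypothetical crawled properties for one document; "ows_Internal" is not mapped.
crawled = {"dc:creator": "D. Miller", "ows_Modified": "2010-11-02", "ows_Internal": "x"}

indexed = map_document(crawled, schema)
sort_fields = [mp.name for mp in schema if mp.sortable]
print(indexed, sort_fields)
```

The unmapped property is dropped, and the same "Author" property is filled from whichever source name the content system happens to use, which is the level of abstraction from content sources that the slide mentions.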
  33. 33. Search Refinement with Metadata
      Approach: Completeness, Pros, Cons
      • Run-time clustering
        • Completeness: Smaller sample of much larger set; top 50-100 query results
        • Pros: Smaller index size
        • Cons: Degraded performance w/ larger samples; no cluster counts
      • Index-based clustering
        • Completeness: Entire result set stored in the index
        • Pros: Fast; allows for precise cluster counts
        • Cons: Increases index size
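The difference in completeness between the two approaches shows up clearly on a small example. This is a minimal sketch with made-up data, not vendor code: the function names, the `doc_type` field and the 200-document result set are assumptions for illustration.

```python
from collections import Counter


def runtime_refiners(ranked_results, field, sample_size=50):
    """Run-time clustering: inspect only the top-ranked sample, no reliable counts."""
    sample = ranked_results[:sample_size]
    return sorted({doc[field] for doc in sample})


def index_refiners(ranked_results, field):
    """Index-based clustering: exact counts over the entire result set."""
    return Counter(doc[field] for doc in ranked_results)


# Hypothetical ranked result set: 60 proposals, then 100 contracts, then 40 SLAs.
results = (
    [{"doc_type": "Proposal"}] * 60
    + [{"doc_type": "Contract"}] * 100
    + [{"doc_type": "SLA"}] * 40
)

sampled = runtime_refiners(results, "doc_type", sample_size=50)
exact = index_refiners(results, "doc_type")
print("run-time:", sampled)
print("index-based:", dict(exact))
```

With a top-50 sample only the "Proposal" refiner appears, even though contracts are the largest group in the full result set; the index-based variant reports every value with its exact count.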
  34. 34. Search Refinement with Metadata
      • Run-time clustering
        • GSA: The only option prior to Rel. 6.8 (Custom)
        • FAST: OTB
        • SharePoint 2010: OTB
      • Index-based clustering
        • GSA: “Preview” status in Rel. 6.8 (OTB)
        • FAST: OTB for MPs marked as Refinable; Inverted Index and Metadata Property Store combined into a high performance OLAP cube
        • SharePoint 2010: Not available
  35. 35. Conclusions*
      • FAST
        • SharePoint intranet as a hub + document libraries, LOBs
        • Search results served from the SharePoint portal
        • Active Directory-tied systems w/ content security policies applied
        • Fine level of control over index schema and document processing
        • Custom search results ranking / relevancy models
        • High complexity metadata-based search scenarios
        • Full & Mini Search-driven applications
      • GSA
        • Heterogeneous content sources dominated by web pages
        • Search UI served by GSA
        • Predominantly Keyword-driven search experience
        • Custom run-time search refiners for broadly protected content; OTB “Dynamic Navigation” for LOB / public data
        • Result biasing via URL patterns, metadata values
        • Medium complexity metadata-based search scenarios
      * Usage scenarios best aligned with OTB functionality, minimum possible customizations.
  36. 36. Special Offer
      The first ten attendees to sign up will receive a two-hour evaluation of their current or planned enterprise search strategy. For more information contact: Val Orekhov
  37. 37. Questions