Concept Searching Overview Google Vs Fast


Slides from our webinar: Search Engine Face-Off Keyword Search Versus Metadata Search

  1. 1. Search Engine Face-Off: Keyword Search versus Metadata Search. Don Miller, VP of Business Development, (408) 828-3400; Val Orekhov, Chief Architect, (240) 450-2166
  2. 2. Agenda
     • Introductions
     • Concept Searching:
       • What is Metadata
       • Keyword vs. Metadata Search
       • Keyword vs. Metadata Costs
       • Google vs. SharePoint vs. FAST
     • Portal Solutions:
       • Enterprise Search – Google vs. FAST in SharePoint 2010
       • Indexing Options
       • Approach to Security Trimming
       • Ranking Algorithms & Sorting Options
       • Metadata & Search Refinements
     • Concept Searching - How do I apply metadata:
       • Microsoft’s approach to applying metadata
       • How to automate the Microsoft approach with conceptClassifier for SharePoint 2010
     • Demo
  3. 3. Concept Searching, Inc.
     • Company founded in 2002
       • Product launched in 2003
       • Focus on management of structured and unstructured information
     • Technology
       • Automatic concept identification, content tagging, auto-classification, taxonomy management
       • Only statistical vendor that can extract conceptual metadata
     • 2009 and 2010 “100 Companies that Matter in KM” (KM World Magazine)
     • KMWorld “Trend Setting Product” of 2009 and 2010
     • Locations: US, UK, & South Africa
     • Client base: Fortune 500/1000 organizations
     • Managed Partner under Microsoft global ISV Program - “go to partner” for Microsoft for auto-classification and taxonomy management
     • Microsoft Enterprise Search ISV, FAST Partner
     • Product Suite: conceptSearch, conceptTaxonomyManager, conceptClassifier, conceptClassifier for SharePoint, contentTypeUpdater for SharePoint
  4. 4. What is metadata
     • Metadata is a means to apply structure to unstructured or structured content or information. Metadata describes what the document is about.
     • Metadata makes it easier to find information.
     • There are usually multiple metadata terms per item or document.
     • Metadata can also be used for rights management, governance, retention code policies, sensitive information removal and, of course, improved findability.
  5. 5. What Is Keyword vs. Metadata Costing You?
     • Pre Migration
       • Problem: 60% of stored documents are obsolete. 50% of documents are duplicates. Requires resources to identify what should and should not be migrated.
       • Solution: Eliminate duplicate documents. Identify privacy data exposures. Identify and declare records that were not previously identified. Identify high value content. Migrate required content to a structure based on the taxonomies.
       • Benefit: Reduces migration costs. Ensures compliance and protection of content assets.
     • Search
       • Problem: “It’s not about better search.” Less than 50% of content is correctly indexed, meta tagged or efficiently searchable. 85% of relevant documents are never retrieved in search.
       • Solution: Eliminate manual tagging and replace it with automatic identification of multi-word concepts. Provide guided navigation via the taxonomy structure (i.e. concepts). Go beyond dynamic clustering with conceptual clustering.
       • Benefit: Taxonomy navigation is 36% - 48% faster. Savings of 2.5 hours per user per day.
     • Records Management
       • Problem: 67% of data loss in Records Management is due to end user error. It costs an organization $180 per document to recreate it when it is not tagged correctly and cannot be found.
       • Solution: Eliminate inconsistent end user tagging. Automatically declare documents of record based on vocabulary and retention codes. Automatically change the Content Type and route to the Records Management repository.
       • Benefit: Savings of $4.00 - $7.04 per record by eliminating manual tagging. Ensures compliance and reduces potential litigation exposures.
     • Data Privacy Protection
       • Problem: Average cost per exposed record is $197 and ranges from $90 to $305 per record. 70% of breaches are due to a mistake or malicious intent by an organization’s own staff.
       • Solution: Identify any type of organizationally defined privacy data. Combines pattern matching with associated vocabulary. Automatic Content Type updating enabling workflows and rights management.
       • Benefit: Average cost runs from $225K to $35M.
  6. 6. USAF Human Performance Clearinghouse
     • Goal: Leverage existing USAF, AFDW, and AFMS license agreements to enable IM, RM, and privacy & security compliance
     • Requirements:
       • DoDD 8320 (Data Sharing in a Net-Centric DoD)
       • DoDD 5015 (Records Management)
       • USAF Privacy Act Program & HIPAA
       • Freedom of Information Act (FOIA)
     • [Diagram labels: Data Privacy, Migration, Records Management, Search, eDiscovery & FOIA]
     • Tel: 703.246.9360 | Fax: 240.465.1182
     • Distribution Statement A: Approved for public release; distribution is unlimited. 311 ABG/PA No. 09-488, 16 Oct 2009
  7. 7. What Type of Search or Information Architecture Do You Need?
     • Keyword Search = ~66%+ of results (Recall)
       • Simple
       • No administration
       • Good enough
     • Metadata Search = 100% of results (Recall)
       • Guided Navigation
       • Records Management
       • Sensitive Information Removal
       • Collaboration
       • Improved Precision and Recall
       • Evolution of Keyword Search
     • Recall (information retrieval): a statistical measure (contrasted with precision), the fraction of all relevant material that is returned by a search query
     • Precision (information retrieval): the percentage of documents returned that are relevant
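The recall and precision definitions on this slide can be made concrete with a small calculation. The following is a minimal Python sketch, not anything from the webinar: the function name `precision_recall` and the document IDs are hypothetical, and the sample data is invented purely to show how the two ratios are computed.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents returned / all documents returned
    recall    = relevant documents returned / all relevant documents
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: three documents are truly relevant to the query.
relevant_docs = {"d1", "d2", "d3"}
# A keyword search finds two of them plus one irrelevant document.
keyword_results = ["d1", "d2", "d9"]

p, r = precision_recall(keyword_results, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this invented example the search misses one of three relevant documents, so recall is two thirds, and one of three returned documents is irrelevant, so precision is also two thirds.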
  8. 8. Metadata Search vs. Keyword and Guided Navigation
     • [Chart: share of results returned (also known as “Recall”) against cost (time, money and complexity)]
     • Metadata Search: 100% of results
     • Keyword + Synonym Search: 66%
     • Keyword Search: 33%
     • Entity Extraction: 20-33% of results. Entity extraction without complex rules is ineffective. It is just keyword matching, which is what keyword search is, and that is 33% effective.
     • Example terms shown on the chart: “Proposal”, “Proposals”, “Contract”, “Software License”, “SLA”, “Licensee”, “Addendum”, “License Agreement”, “License”, “Documents of Record”
  9. 9. Similar Features Against Total Number of Documents Returned
     • Index: Google 500 M+; SharePoint 100 M; FAST 500 M+
     • Keyword (33% of results): Google Yes; SharePoint Yes, as good as Google or FAST; FAST Yes
     • Synonyms (up to 50-66%+ of results for a topic): Google Yes; SharePoint Yes; FAST Yes
     • Ranking Algorithm + Best Bets (does not improve the number of results, only how they are presented): Google Somewhat Tunable; SharePoint Somewhat Tunable; FAST Very Tunable
  10. 10. What Is Missing To Get to 100% of Relevant Results in Every Search? (Metadata)
     • Auto Classification
       • Google: No. Missing 33-50% of results on any particular topic.
       • SharePoint: No. Missing 33-50% of results on any particular topic.
       • FAST: Entity extraction, which is the same as keyword search (33% of results). Provides some refinement capabilities.
     • Taxonomy Management
       • Google: No
       • SharePoint: Yes, but not used for auto classification in this release.
       • FAST: Same as SharePoint
  11. 11. Miscellaneous Items to Review
     • SharePoint Refiners and Navigators with counts
       • Google: Hard
       • SharePoint: Yes. Easy to use for standard search. No counts on results.
       • FAST: Medium. Initial release, does not leverage Term Store yet. XML and PowerShell based.
     • Customization
       • Google: Limited
       • SharePoint: Limited
       • FAST: Extendable
  12. 12. Summary
     • Google – Best for no administration: install and walk away. However, the keyword approach usually misses 33%-50% of results on any given topic because of missing metadata. Not easy to integrate refiners or navigators into the SharePoint UI.
     • SharePoint Search – Cost effective, comes free with SharePoint. Also very easy to install. Search algorithm is as good as FAST or Google. Limited extensibility. Easy integration for refiners and navigators (no counts). However, the keyword approach still misses 50% of results on any topic.
     • FAST – Extremely customizable, but requires training or professional services to customize. Most likely Microsoft’s long-term platform for search. Very scalable and can provide refiner counts. However, the keyword approach still misses 33-50% of results from any given search because of metadata inconsistency.
     • However, they are all missing a true metadata strategy, which is the only way to ensure 100% of results (Recall).
  13. 13. Take it away Val
  14. 14. Google Search Appliance 6.8 vs. FAST Search Server for SharePoint 2010: For metadata-driven search scenarios in a SharePoint environment. Val Orekhov, Chief Architect, Portal Solutions. Email: val@portalsolutions.net. Phone: (240) 450-2166 x
  15. 15. Agenda
     • Enterprise Search Technologies
       • Google Search Appliance 6.8 and FAST for SharePoint
       • Content Indexing Options
       • Approach to Security Trimming
       • Ranking Algorithms and Searching Options
       • Index Schema Management, Metadata & Search Refinements
     • Conclusions
     • Q&A
  16. 16. Enterprise Search Technologies
     • Heterogeneous content sources:
       • HTML, documents and LOB records
       • Located on portals, file systems and in databases
     • Required security trimming:
       • Integrate with identity providers (AD, LDAP, SQL)
       • Implement authorization decision logic
     • Able to take advantage of metadata stored with documents and LOB records
  17. 17. Introducing the Contenders
     • Google Search Appliance (GSA)
       • Search appliance, in a box
       • Hardware & software solution
       • Pre-packaged functionality ready to work
       • “Black box” approach to search results
     • FAST Search Server for SharePoint 2010
       • Spin-off of the earlier FAST ESP
       • Software-only solution
       • Allows customization of many aspects of the engine functionality, down to relevancy tuning algorithms
       • Platform rather than a product
  18. 18. Comparing FS4SP and GSA
     • Indexing Options
     • Approach to Security Trimming
     • Ranking Algorithms & Sorting Options
     • Metadata & Search Refinements
  19. 19. Content Crawl Options
     • Content Pull
       • GSA: HTTP Crawler
       • FAST: SharePoint Crawler, Enterprise Crawler
       • SharePoint: SharePoint Crawler
     • Content Push
       • GSA: XML Feed API
       • FAST: Feed API
       • SharePoint: -
     • Indexing LOBs (Pull)
       • GSA: Onboard Database Connector
       • FAST: Databases & Web Services via SharePoint BCS
       • SharePoint: Databases & Web Services via SharePoint BCS
     • Connectors
       • GSA: SharePoint, Documentum, LiveLink, FileNet, File System, LDAP
       • FAST: OTB: File System, Exchange Public Folders. Custom: Documentum, Lotus Notes
       • SharePoint: OTB: File System, Exchange Public Folders
     • External Metadata
       • GSA: Push through XML Feed API
       • FAST: Custom stages in the processing pipeline
       • SharePoint: -
     • Cloud Connectivity
       • GSA: Google Apps & Sites; Twitter
       • FAST: Custom connectors
       • SharePoint: -
  20. 20. Comparing FS4SP and GSA
     • Indexing Options
     • Approach to Security Trimming
     • Ranking Algorithms & Sorting Options
     • Metadata & Search Refinements
  21. 21. Security Trimming
     • Answers the “Who Am I” and “What Results Can I See” questions
     • Required with most Enterprise Search scenarios
     • Approaches include Late & Early Authorization/Binding
     • Late authorization
       • Access rights (ACLs): checked at run time against the system of record
       • Pros: up-to-date presentation
       • Cons: slow on larger sections of result sets
     • Early authorization
       • Access rights (ACLs): information stored in the index at item level
       • Pros: fast; facilitates metadata clustering
       • Cons: duplicates info; potential for outdated results
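The trade-off between late and early authorization can be illustrated with a toy model. This is a minimal Python sketch under stated assumptions, not how GSA, FAST or SharePoint implement trimming: `IndexedDoc`, `early_trim`, `late_trim`, `check_access` and all document and group names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDoc:
    doc_id: str
    acl: frozenset  # principals copied into the index at crawl time


def early_trim(hits, user_principals):
    """Early binding: filter using only the ACL copy stored in the index."""
    return [d.doc_id for d in hits if d.acl & user_principals]


def late_trim(hits, user, check_access):
    """Late binding: ask the system of record about every hit at query time."""
    return [d.doc_id for d in hits if check_access(user, d.doc_id)]


# Hypothetical index contents. The second ACL is stale: "staff" was removed
# at the source after the last crawl.
hits = [
    IndexedDoc("plan.docx", frozenset({"staff"})),
    IndexedDoc("salaries.xlsx", frozenset({"hr", "staff"})),
]
source_acls = {"plan.docx": {"staff"}, "salaries.xlsx": {"hr"}}
user_groups = {"alice": {"staff"}}


def check_access(user, doc_id):
    # Stand-in for a round trip to the content source.
    return bool(source_acls[doc_id] & user_groups[user])


early = early_trim(hits, user_groups["alice"])
late = late_trim(hits, "alice", check_access)
print(early)
print(late)
```

In this toy data the early-bound result still includes the document whose ACL changed after indexing (the "potential for outdated results" con), while the late-bound result is current at the cost of one access check per hit.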
  22. 22. Security Trimming Options Support
     • Late Authorization
       • GSA: “Default” option in many scenarios; via Kerberos, SAML Bridge or Connector
       • FAST: ?
       • SharePoint 2010: Custom
     • Early Authorization
       • GSA: Rel. 6.0 – High level Policy ACLs configured by admins or through a remote API (*). Rel. 6.8 – Item-level ACLs (**)
       • FAST: Item-level ACLs for Windows and SharePoint security principals supported natively. Allows setting up multiple user property stores and mapping user principals.
       • SharePoint 2010: Native support for item-level ACLs with Windows and SharePoint security principals
     • (*) Best applied to enterprises with a manageable number of high level policies, or able to invest in custom ACL sync tools
     • (**) SharePoint Connector Rel. 2.6.4 sends SharePoint Site Groups with the feed, but the Groups are not expanded properly by GSA
  23. 23. Comparing FS4SP and GSA
     • Indexing Options
     • Approach to Security Trimming
     • Ranking Algorithms & Sorting Options
     • Metadata & Search Refinements
  24. 24. Search Engine Internals
  25. 25. Result Set Ranking
     • Fidelity of keyword matches (all engines)
       • Proximity
       • Frequency
       • Completeness
     • Hyper Text Matching (GSA only)
       • Analyzes keyword location on a rendered page and related pages
     • Hub and Spoke Algorithm (all engines)
       • Driven by linkages between web pages
       • Pages receiving or providing the most links have higher rankings
       • GSA – PageRank; FAST – Document authority
     • Static rank biasing, document importance
       • Document, site and metadata-based promotion / demotion (all engines)
       • User-tagged documents receive higher importance (FAST, SharePoint search)
     • Adaptive ranking
       • User clicks in search results (FAST, SharePoint search)
     • Custom ranking
       • Build custom ranking models w/ FAST
  26. 26. Result Set Sorting
     • GSA
       • Date/Time only (document modification date, or a date extracted from the title, metadata or body of a document)
     • FAST
       • Any property marked as Sortable
       • Supported data types: String, Number, Date/Time
  27. 27. Comparing FS4SP and GSA
     • Indexing Options
     • Approach to Security Trimming
     • Ranking Algorithms & Sorting Options
     • Index Schema Management, Metadata & Search Refinements
  28. 28. Index Schema Management
     • GSA (all-inclusive)
       • All discovered metadata (Crawled Properties) are stored in the index by default
       • Metadata from MS Office documents stored in the index results. (GSA Feature Request ID# 1371024)
       • All string-type metadata is associated with the FTI by default; matches on metadata are controlled at query time (allintext:, allintitle: keyword filters)
       • Metadata in results limited to 1,500 chars per field (Rel. 6.8; prev. releases – 320 chars)
     • FAST (opt-in)
       • Crawled properties have to be associated with Managed Properties (MPs) to be stored in the index
       • MPs represent a level of abstraction from Content Sources
       • MPs can be configured to be used as:
         • Stored in the index (Queryable)
         • Associated with FTI (Searchable)
         • Sortable
         • Refiner-enabled
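The opt-in model described for FAST can be pictured as a mapping layer between crawled and managed properties. The following Python sketch is only an illustration under assumptions: `ManagedProperty`, `build_index_record`, `sort_hits` and every property name are hypothetical and are not the FAST or SharePoint API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagedProperty:
    name: str
    crawled_sources: tuple  # crawled property names mapped to this MP
    searchable: bool = False
    sortable: bool = False
    refinable: bool = False


def build_index_record(crawled, schema):
    """Opt-in: keep only crawled values that are mapped to a managed property."""
    record = {}
    for mp in schema:
        for source in mp.crawled_sources:
            if source in crawled:
                record[mp.name] = crawled[source]
                break  # first mapped source wins
    return record


def sort_hits(hits, prop, schema):
    """Sort only by properties that the schema marks as sortable."""
    mp = next((m for m in schema if m.name == prop), None)
    if mp is None or not mp.sortable:
        raise ValueError(f"{prop!r} is not a sortable managed property")
    return sorted(hits, key=lambda hit: hit[prop])


# Hypothetical schema and crawled item.
schema = [
    ManagedProperty("Author", ("doc_author", "creator"), searchable=True, sortable=True),
    ManagedProperty("DocType", ("content_type",), refinable=True),
]
crawled = {"creator": "Val", "content_type": "Contract", "unmapped_field": "x"}

record = build_index_record(crawled, schema)
print(record)
```

The unmapped crawled property is dropped, which is the opposite of the all-inclusive behavior the slide attributes to GSA, and the managed property hides which content source supplied the value.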
  29. 29. Search Refinement with Metadata
     • Run-time clustering / Shallow refiners
       • Completeness: smaller sample of a much larger set; top 50-100 query results
       • Pros: smaller index size
       • Cons: degraded performance w/ larger samples; no cluster counts
     • Index-based clustering / Deep refiners
       • Completeness: entire result set, stored in the index
       • Pros: fast; allows for precise cluster counts
       • Cons: increases index size
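The completeness difference between shallow and deep refiners can be shown with a small count. This Python sketch is an assumption-laden illustration: the function names and data are hypothetical, and the deep variant simply scans the whole result set at query time, whereas the engines described here would serve such counts from index structures.

```python
from collections import Counter


def shallow_refiners(hits, prop, sample_size=50):
    """Run-time clustering: look only at the top `sample_size` hits."""
    return Counter(hit[prop] for hit in hits[:sample_size] if prop in hit)


def deep_refiners(hits, prop):
    """Deep refiners: counts reflect the entire result set."""
    return Counter(hit[prop] for hit in hits if prop in hit)


# Hypothetical ranked result set: 60 contracts followed by 40 proposals.
hits = [{"type": "Contract"}] * 60 + [{"type": "Proposal"}] * 40

shallow = shallow_refiners(hits, "type")
deep = deep_refiners(hits, "type")
print(shallow)
print(deep)
```

With a sample of 50 the shallow pass never sees a proposal, so that refinement value would not be offered at all, while the deep pass reports exact counts for both values.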
  30. 30. Search Refinement with Metadata
     • Run-time clustering / Shallow refiners
       • GSA: The only option prior to Rel. 6.8 (Custom)
       • FAST: OTB
       • SharePoint 2010: OTB
     • Index-based clustering / Deep refiners
       • GSA: “Preview” status in Rel. 6.8 (OTB)
       • FAST: OTB for MPs marked as Refinable. Inverted Index and Metadata Property Store combined into a high performance OLAP cube.
       • SharePoint 2010: Not available
  31. 31. Conclusions*
     • GSA
       • Heterogeneous content sources dominated by web pages
       • Search UI served by GSA
       • Predominantly keyword-driven search experience
       • Custom run-time search refiners for broadly protected content; OTB “Dynamic Navigation” for LOB / public data
       • Result biasing via URL patterns, metadata values
       • Medium complexity metadata-based search scenarios
     • FAST
       • SharePoint intranet as a hub + document libraries, LOBs
       • Search results served from the SharePoint portal
       • Active Directory-tied systems w/ content security policies applied
       • Fine level of control over index schema and document processing
       • Custom search results ranking / relevancy models
       • High complexity metadata-based search scenarios
       • Full & mini search-driven applications
     • * Usage scenarios best aligned with OTB functionality, minimum possible customizations.
  32. 32. Questions
  33. 33. Back to Don
  34. 34. In Summary: Enterprise Search Comparison for SharePoint vs. Google vs. FAST
     • Why Enterprise Search needs Metadata and Taxonomy Management
       – Recall – Ensures you bring back 100% of results
       – Enhances Precision – Fastest way to filter to the right results so that you are looking at the documents that matter the most
       – Boosts the relevancy of documents
       – Drives Records Management, Sensitive Information Removal, Retention Code Policies
     • MUST HAVES:
       – Heterogeneous content sources:
         • HTML, documents and LOB records
         • Located on portals, file systems and in databases
       – Required security trimming:
         • Integrate with identity providers (AD, LDAP, SQL)
         • Implement authorization decision logic
       – Able to take advantage of metadata stored in documents and LOB records
  35. 35. How do I apply metadata to content?
  36. 36. Microsoft’s approach to solving the metadata problem for Records Management, Governance Policies, Sensitive Information Removal and Findability: Content Types, The Term Store and Enterprise Managed Metadata Services
  37. 37. What is a content type
     • A Content Type is a means to apply structure to unstructured or structured content within SharePoint. Content Types inherit from their parent content types.
     • This is usually a combination of a term or terms from a single or multiple term sets.
     • Terms are metadata, and metadata is information about information.
     • Terms can also include governance and retention code policies, and can also exist for the sole purpose of improved findability.
     • However, it is best to align Content Types with business goals and business use cases.
  38. 38. Introducing EMM, The Term Store and Term Store Management Definitions
     • [Diagram. SharePoint 2010 labels: SharePoint 2010 Farm, Enterprise Managed Metadata Service, Subscription Service, Content Type Hub, Term Store, Site Collection, Records Library. conceptClassifier for SharePoint 2010 labels: Term Store Management, Auto Classification, Content Type Updating]
  39. 39. The Managed Metadata Service
     • Manages Enterprise Content Types via the Content Type Hub
     • Manages the Term Store
     • Term Sets (taxonomies) and terms can be shared across multiple SharePoint site collections
     • Multiple managed metadata services can be created
     • Enables search filtering
     • Two types of terms:
       • Managed terms – pre-defined by an enterprise administrator and may be hierarchical. Surfaced in the "managed metadata" column type
       • Managed keywords – non-hierarchical words or phrases that have been added to SharePoint 2010 items by users (folksonomy)
     • [Diagram: Enterprise Managed Metadata Service. 30,000 Terms per Term Set (1 Taxonomy); 1,000 Term Sets; tested to 1,000,000 Preferred Terms]
  40. 40. conceptClassifier for SharePoint is the only native Term Store Management tool for 2010
     • [Screenshot hierarchy: Term Set > Parent Term > Child Term > Grand Child Term]
     • Build term sets/taxonomies here in SharePoint 2010 EMM. Plan for 30,000 values.
     • A content type can contain one or many taxonomies based on specific business user requirements. The values can be shown as columns or can be hidden from users for administrative or governance purposes only.
  41. 41. Traditional manual approach is subjective, cumbersome and overwhelming. End user must select values from multiple term sets. Up to 30,000 values per term set and 1,000 term sets per term store. Manual approach is impractical.
  42. 42. conceptClassifier for SharePoint 2010: An automated solution for applying metadata and providing term store management to enhance SharePoint 2010 capabilities for Records Management, Governance Policies, Rights Management, Sensitive Information Removal and Findability.
  43. 43. A Manual Metadata Approach Will Fail 95%+ Of The Time (Issue: Organizational Impact)
     • Inconsistent: Less than 50% of content is correctly indexed, meta-tagged or efficiently searchable, rendering it unusable to the organization (IDC)
     • Subjective: Highly trained Information Specialists will agree on meta tags between 33% and 50% of the time (C. Cleverdon)
     • Cumbersome - Expensive: Average cost of manually tagging one item runs from $4 to $7 per document and does not factor in the accuracy of the meta tags nor the repercussions from mis-tagged content (Hoovers)
     • Malicious Compliance: End users select the first value in the list (Perspectives on Metadata, Sarah Courier)
     • No perceived value for end user: What’s in it for me? End user creates a document and does not see the value for the organization nor the risks associated with litigation and non-conformance to policies.
     • What have you seen: Metadata will continue to be a problem due to inconsistent human behavior
     • The answer to consistent metadata is an automated approach that can extract the meaning from content, eliminating manual metadata generation yet still providing the ability to manage knowledge assets in alignment with the unique corporate knowledge infrastructure.
  44. 44. conceptClassifier for SharePoint 2010 provides an automated metadata approach for an immediate ROI and to drive business value
     • Create enterprise automated metadata framework/model
       • Average return on investment minimum of 38% and runs as high as 600% (IDC)
     • Apply consistent meaningful metadata to enterprise content
       • Incorrect meta tags cost an organization $2,500 per user per year, in addition to potential costs for non-compliance (IDC)
     • Guide users to relevant content with taxonomy navigation
       • Savings of $8,965 per year per user based on an $80K salary (Chen & Dumais)
       • 100% “Recall” of content, 35% faster access to content (“Precision”)
     • Use automatic conceptual metadata generation to improve Records Management
       • Eliminate inconsistent end user tagging at $4-$7 per record (Hoovers)
       • Improve compliance processes, eliminate potential privacy exposures
     • [Cycle diagram: 1. Model and Validate; 2. Automate Tagging; 3. Findability; 4. Business Processes; 5. Records Management and PII; 6. Life Cycle Management]
  45. 45. conceptClassifier provides a native integration into Term Store
     • Native integration into Term Store: No Service Pack updates, no custom code. conceptClassifier is a native integration.
     • No custom property types: Every item is synchronized with the term store and is a part of the managed metadata service. All search features work natively as they should. No custom search property values which require custom code updates and additional custom search controls. conceptClassifier is a native integration.
     • Why do we work with the term store natively: Because it is the natural place to store metadata if you are driving economies of scale by leveraging the Microsoft stack. That is Microsoft’s road map for metadata management.
     • Easy Upgrade: If you want to go back to a pure manual application, there is no code rewrite. conceptClassifier is a native integration. You just unplug and you are back to native.
  46. 46. Automated Multi Word Term Suggestions for Term Store
     • Concept Searching’s unique statistical concept identification underpins all technologies. Multi word suggestion is explicitly more valuable than single term suggestion algorithms.
     • Concept Searching provides Automatic Concept Term Extraction
     • [Diagram: “Triple Heart Bypass” as one concept versus the single keywords Triple (Baseball, Three), Heart (Organ, Center) and Bypass (Highway, Avoid)]
     • conceptClassifier will generate conceptual metadata by extracting multi-word terms, identifying “triple heart bypass” as a concept as opposed to single keywords. Metadata can be used by any search engine index or any application/process that uses metadata.
  47. 47. conceptClassifier for SharePoint 2010 drives immediate value for end users for Search, Records Management and Sensitive Information Removal
     • conceptClassifier for SharePoint 2010:
       • Automatically applies Metadata
       • Automatically applies Content Types
       • Automatically applies Retention Code Policies
       • Automatically applies Windows Rights Management Policies
       • Automatic Term Boosting for FAST
       • Pulls hierarchy directly from Term Store, therefore updates are immediate and accurate for guided taxonomy navigation in FAST
  48. 48. Enterprise Taxonomy Management and Auto-classification
     • Multi User Distributed Branch and Term Support for Enterprise
     • Native Term Store Integration for SharePoint 2010
     • Accelerate building out taxonomies by 75% with automatic Term/Clue Suggestion
     • Enables information architects to build, model and validate
     • Automatic Term Boosting for FAST/Search Platforms
     • Pragmatic Ontology Features for subject matter experts (you don’t need to be a librarian)
       • Broad to Narrow
       • Preferred Term
       • Non preferred terms
       • Poly hierarchies – Not supported in Term Store
       • Relations – Not supported in Term Store
  49. 49. conceptClassifier for FAST Search
     • Improves search outcomes by placing conceptual metadata in the FAST Search index to increase relevancy of search results
     • Enables import of FAST Entities into the conceptClassifier taxonomy manager to fine-tune them with metadata generated from your own content and nomenclature
     • Runs natively as a FAST Pipeline Stage, eliminating integration and customization issues
     • Eliminates vocabulary normalization issues across global boundaries through controlled vocabularies
     • Improves faceted search results, as facets are based on concepts aligned with the taxonomy
     • Provides taxonomy browse capabilities based on the nodes within the corporate taxonomy(s)
     • Provides accurate metadata filters such as numeric range searching and wildcard alphanumeric matching
     • Removes documents that are confidential/sensitive from search results through automatic Content Type updating and routing to a secure server
     • Automatically tags content with both vocabulary and retention codes and respects SharePoint security that could prevent access to the document once it has been declared a record
  50. 50. Product Screen Shots
  51. 51. Traditional manual approach is subjective, cumbersome and ineffective. End user must select values from multiple term sets. Up to 30,000 values per term set and 1,000 term sets per term store. Manual approach is impractical.
  52. 52. An automated approach ensures accurate Records Management, Sensitive Information Removal and improved Search/Findability. Metadata is automatically applied to content by conceptClassifier via TaxonomyManager. contentTypeUpdater can take it a step further and modify the content type to redirect a document/object to a different content type, or migrate it to another site collection or document library. In this example the documents are being changed from the document content type to the PII or Records Center Content Type.
  53. 53. Term Store Management is provided by Taxonomy Manager and conceptClassifier
     • TaxonomyManager is an intuitive and elegant tool to manage how and when term sets are applied within SharePoint 2010 and what new terms to add to the term store.
     • Deep capabilities to build out rules classification approaches including: standard term, phonetics, metadata, class ID, language, case sensitive, regular expression and boosting.
  54. 54. An automated approach ensures accurate Records Management, Sensitive Information Removal and improved Search/Findability. The documents with 10 in front of them have had their content types updated. In this example the documents are being changed from the document content type to the PII or Records Center Content Type. They could have also been moved to a different folder if that was the desired outcome.
  55. 55. conceptClassifier for FAST and SharePoint 2010 Search. conceptClassifier for 2010 Product Suite provides intuitive guided navigation for FAST. Multi value select within a term set is the single fastest approach you can provide for end users to get access to the correct content. It is just like picking values when you are on Best Buy or Amazon, but it is with your personalized corporate term set vocabulary.
  56. 56. Demo – How to automate the process of applying metadata in a SharePoint 2010 native term store environment to improve Findability and Records Management
  57. 57. QA
  58. 58. Thank You. Don Miller, VP of Business Development, (408) 828-3400; Val Orekhov, Chief Architect, (240) 450-2166