
Adding structure to unstructured content for enhanced findability, Håkan Tylén



Presentation by Håkan Tylén, Microsoft, for the SharePoint Round Table, Imtech ICT, June 2011

Published in: Business, Technology, Education


  1. 1. Do not reinvent: Findability and Knowledge Management. Håkan Tylén, Western Europe Business
  2. 2. Agenda outline
  3. 3. Customer/Employee Service, in the self-service channel: How can I help YOU?
  4. 4. Metadata basics: What is it? Where is it stored? Metadata is the set of properties that characterize a document.
  5. 5. Poor metadata impairs the search experience. Degraded findability leads to the erosion of users’ trust in search. • Inconsistent, incorrect or missing metadata is commonplace within most organizations today. This impairs findability in the context of enterprise search • Few options to navigate or refine a large result list other than trying to reformulate the query • Hard to scan or navigate results • Unchanged template metadata make results look like duplicates • Meaningless metadata confuses users as they scan the search results • Multiple variations or spellings • Documents returned may be incomplete or not current • Missing metadata raises questions about result set completeness • No confidence in authority and correctness of information • Difficult to locate relevant experts • Even with refinement tools, users do not rely on them • Hit counts do not add up • User reactions: “I’m not confident I will find what I need here…” and “This is a waste of time!”
  6. 6. ROI scenarios: 1. Time wasted searching; 2. Cost of reworking information; 3. Opportunity costs to the enterprise. (Slide footer: SharePoint Server 2010 for Internet Sites. Microsoft confidential.)
  7. 7. Scenario 1: Time wasted. • Salary of €3.000/month plus social costs: about €50.000/year per employee • 10 minutes/day × 220 days: about €1.000 per employee per year • 1000 employees = €1.000.000/year of “released time”
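
The arithmetic behind Scenario 1 can be sketched as follows. This is a minimal illustration, not the presenter's own calculation: the 8-hour working day is an assumption that the slide does not state, and the function and parameter names are hypothetical. With these inputs the result is slightly above the slide's rounded figures of €1.000 per employee and €1.000.000 for the workforce.

```python
def annual_cost_of_search_time(
    annual_cost_per_employee: float,
    minutes_lost_per_day: float,
    working_days_per_year: int,
    employees: int,
    hours_per_day: float = 8.0,  # assumption: not stated on the slide
) -> tuple[float, float]:
    """Return (cost per employee per year, cost for the whole workforce)."""
    hours_worked = working_days_per_year * hours_per_day
    hourly_cost = annual_cost_per_employee / hours_worked
    hours_lost = minutes_lost_per_day / 60 * working_days_per_year
    per_employee = hourly_cost * hours_lost
    return per_employee, per_employee * employees


# Figures from the slide: EUR 50,000/year, 10 minutes/day, 220 days, 1000 employees.
per_employee, total = annual_cost_of_search_time(50_000, 10, 220, 1000)
print(f"per employee: EUR {per_employee:,.0f}")
print(f"workforce:    EUR {total:,.0f}")
```

Under the 8-hour assumption, 10 minutes is 1/48 of a working day, so the yearly loss per employee is simply 1/48 of the annual employment cost, whatever the number of working days.
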
  8. 8. Creating quality metadata is a real challenge. Few organizations have good quality metadata on internal content. Challenge: • Ineffective information governance across the enterprise • Multiple content silos and search interfaces • Manually entered metadata is inconsistent, incorrect or missing • No automated tools for content classification • Impossible to keep up with ever growing content volumes. Solution: • Assist users in tagging content with automated metadata suggestions or enrichment tools • FAST Search for SharePoint (FS4SP) delivers business value out-of-the-box • Sophisticated content processing optimizes findability across multiple silos of unstructured and structured content • In addition, property extraction overcomes poor metadata by generating it and normalizing it on-the-fly
  9. 9. Agenda outline
  10. 10. Content Processing Pipeline: what is it? Enhance your content for optimal search experience and findability. The pipeline is a sequentially arranged set of discrete processing stages that break down and enrich content for indexing: • Convert documents to plain text (support for 400+ file formats) • Detect document languages and encoding (support for 80+ languages) • Apply linguistic normalization to optimize content for search • Identify and leverage existing metadata where applicable • Parse content to extract or generate additional metadata • Map content and associated metadata (crawled properties) to the index schema (managed properties) for searching • Custom stages can be created and added to the pipeline. Diagram of pipeline stages: Format Conversion; Language and Encoding Detection; Tokenization; Lemmatization; Date and Time Normalization; Property Extraction; Vectorization; Properties Mapper; Custom Stage. Examples given in the diagram include date normalization (14-Mar-10 is equivalent to March 14, 2010) and matching inflected forms (singular/plural, masculine/feminine, etc.).
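
The idea of a sequentially arranged set of discrete stages can be sketched in a few lines. This is a toy illustration only: the stage functions, the dictionary-based document and the `run_pipeline` helper are all hypothetical and are not FS4SP APIs. It shows format conversion, tokenization and date normalization applied in order, using the slide's own 14-Mar-10 example.

```python
import re
from datetime import datetime
from typing import Callable

# Hypothetical in-memory document: a plain dict that each stage enriches.
Document = dict
Stage = Callable[[Document], Document]

DATE_PATTERN = re.compile(r"\b\d{1,2}-[A-Za-z]{3}-\d{2}\b")


def to_plain_text(doc: Document) -> Document:
    """Format conversion (toy version): strip HTML-like tags."""
    doc["text"] = re.sub(r"<[^>]+>", " ", doc["raw"])
    return doc


def tokenize(doc: Document) -> Document:
    """Tokenization (toy version): lowercase word tokens."""
    doc["tokens"] = re.findall(r"\w+", doc["text"].lower())
    return doc


def normalize_dates(doc: Document) -> Document:
    """Date normalization: rewrite dd-Mon-yy dates in ISO form."""
    doc["dates"] = [
        datetime.strptime(m.group(0), "%d-%b-%y").strftime("%Y-%m-%d")
        for m in DATE_PATTERN.finditer(doc["text"])
    ]
    return doc


def run_pipeline(doc: Document, stages: list[Stage]) -> Document:
    """Apply each stage in order; later stages see earlier stages' output."""
    for stage in stages:
        doc = stage(doc)
    return doc


doc = run_pipeline(
    {"raw": "<p>Report dated 14-Mar-10</p>"},
    [to_plain_text, tokenize, normalize_dates],
)
print(doc["tokens"])
print(doc["dates"])
```

Because every stage has the same signature, a custom stage is just one more function appended to the list, which mirrors the slide's point that custom stages can be added to the pipeline.
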
  11. 11. Property Extraction: Create metadata on-the-fly, adding structure to unstructured content. In a nutshell, property extraction is the ability to: • Process unstructured content (e.g. a document’s body) • Recognize entities mentioned in the text (e.g. people, companies, locations, concepts, etc.) • Optionally, normalize variations to a single, canonical form • Expose these extracted entities as crawled properties in the pipeline • Map them to managed properties for filtering and searching. Crawled properties (examples): Companies: Microsoft, Contoso, Woodgrove, … Locations: London, San Francisco, Moscow, … People: Bill Gates, Barack Obama, José Caires, ... Index schema (managed properties), columns: Type, Doc ID, Title, Author, Date, Size, Keywords, Companies, Locations, People, ..., Body Text. Example row: Doc ID xxx; Title “Sales For…”; Author John Doe; Date 2010-04-15; Size 386 KB; Keywords “sales; pipe…”; Companies “Microsoft; …”; Locations “London; …”; People “Bill Gates; …”; Body Text “The mark…”. Further rows (yyy, zzz) are elided.
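
A dictionary lookup is the simplest way to picture the steps listed on this slide. The sketch below is an assumption-laden illustration rather than how FS4SP implements extraction: the `DICTIONARIES` structure and the `extract_properties` function are hypothetical, and the entries are the example entities from the slide.

```python
import re

# Hypothetical dictionaries, seeded with the slide's example entities.
DICTIONARIES = {
    "companies": ["Microsoft", "Contoso", "Woodgrove"],
    "locations": ["London", "San Francisco", "Moscow"],
    "people": ["Bill Gates", "Barack Obama"],
}


def extract_properties(body: str) -> dict[str, list[str]]:
    """Return the dictionary entries mentioned in the body, grouped by property."""
    found: dict[str, list[str]] = {}
    for prop, entries in DICTIONARIES.items():
        hits = [
            entry
            for entry in entries
            # Whole-word, case-insensitive match of the dictionary entry.
            if re.search(rf"\b{re.escape(entry)}\b", body, re.IGNORECASE)
        ]
        if hits:
            found[prop] = hits
    return found


properties = extract_properties("Bill Gates spoke about Microsoft in London.")
print(properties)
```

The returned mapping plays the role of the crawled properties; in a real index each key would then be mapped to a managed property so that it can be filtered and searched.
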
  12. 12. Good metadata greatly improves findability. Property extraction enables consistent metadata across all content. • Metadata quality is critical to the search experience • Metadata is also used for relevancy tuning, multi-level sorting and advanced search • User reaction: “This is really great! Now I can navigate through this large information universe without feeling lost…” • FS4SP leverages metadata, i.e. managed properties, to present deep refiners: offer at-a-glance overview; organize free-text search results into multiple facets; make search conversational; guide users toward possible refinement choices; prevent users drilling down into a “0 results” dead end • Precise hit counts in deep refiners are computed across the whole result set • Example refiners shown: File Formats, Companies, Products, Concepts • Additional uses for managed properties in FS4SP: relevancy tuning & ranking; multi-level sorting; advanced (or fielded) search; and many more…
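
To make the point about precise hit counts concrete, the following sketch counts refiner values over every document in a result set and then applies one refinement. It is a simplified model with hypothetical names (`refiner_counts`, `refine`) and made-up sample results, not the FS4SP query API.

```python
from collections import Counter


def refiner_counts(results: list[dict], prop: str) -> Counter:
    """Count, across the whole result set, how many documents carry each value."""
    counts: Counter = Counter()
    for doc in results:
        # set() so a value repeated inside one document counts once.
        counts.update(set(doc.get(prop, [])))
    return counts


def refine(results: list[dict], prop: str, value: str) -> list[dict]:
    """Keep only the documents whose property contains the chosen value."""
    return [doc for doc in results if value in doc.get(prop, [])]


# Made-up search results with an extracted "companies" property.
results = [
    {"title": "Sales forecast", "companies": ["Microsoft", "Contoso"]},
    {"title": "Market review", "companies": ["Microsoft"]},
    {"title": "Travel policy", "companies": []},
]

counts = refiner_counts(results, "companies")
refined = refine(results, "companies", "Microsoft")
print(counts.most_common())
print([doc["title"] for doc in refined])
```

Since only values that occur in the result set appear in `counts`, every refinement offered leads to at least one document, and the displayed count equals the size of the refined list.
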
  13. 13. The Microsoft IT Intranet Environment (as of September 2010). Total: 29.89 TB (31,346,042 MB), 223,595 sites, 545,387 sub-sites; grows by 1.5 TB per quarter. Data centers: Seattle (Americas), Dublin (Europe, Middle East, Africa), Singapore (Asia Pacific). Per-region figures shown: 19.4 TB, 127,986 sites, 345,935 sub-sites (65%); 6.4 TB, 49,731 sites, 117,324 sub-sites (22%); 4.1 TB, 45,878 sites, 82,128 sub-sites (13%).
  14. 14. Knowledge Transfer: MSW
  15. 15. Property extraction and refiners in FS4SP: What’s available out-of-the-box? • FS4SP automatically detects 80+ languages in content • Property extraction dictionaries are included for 11 languages* and 3 types of entities: Locations, Companies, Persons • The metadata is exposed to users as refiners, and drives relevancy and other features to improve findability • This delivers real business value to organizations struggling with issues such as: poor document metadata; large content volumes; lack of result refinement options; low user adoption of search. * Arabic, Dutch, English, French, German, Italian, Japanese, Norwegian, Portuguese, Russian, Spanish
  16. 16. Extending property extraction in FS4SP (1/2): Make search speak the language of your business using dictionaries. • Property extraction in FS4SP is customizable using a dictionary, i.e. a list of keywords and phrases • Matching variations can be normalized to a single entry • Several dictionaries may co-exist to address needs of the business: Projects; Products; Customers; Competitors; Employees; Business-specific concepts • Create custom search refiners to fit your own business needs • The necessary data may be readily available within the organization or from external sources: SharePoint lists & Term Store; LOB applications, databases & XML
  17. 17. Extending property extraction in FS4SP (2/2): Use existing text mining or classification tools to go even further. • Another approach is to invoke external tools during content processing in FS4SP • This leverages the standard pipeline extensibility mechanism • Such tools typically address problems like: text mining for entity, fact or relationship extraction; taxonomy classification • Moreover, these tools may be already deployed for other purposes in the enterprise: home-grown solutions; 3rd party, specialized vendors; industry sectors or verticals; scientific or technology domains. Diagram: the original document from the repository enters the content pipeline, which calls an external text mining/classification tool (local software or web service) that analyzes the text content and returns metadata tags; the enriched document is then sent to the index.
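
Normalizing matching variations to a single entry can be pictured as a lookup from each variant to its canonical form. The sketch below is illustrative only: the dictionary layout, the variant spellings (for example “SPS 2010”) and the function names are assumptions, and FS4SP's own dictionary format is not shown here.

```python
import re

# Hypothetical custom dictionary: canonical entry -> accepted variations.
PRODUCT_DICTIONARY = {
    "SharePoint Server 2010": [
        "SharePoint Server 2010",
        "SharePoint 2010",
        "SPS 2010",
    ],
    "FAST Search for SharePoint": [
        "FAST Search for SharePoint",
        "FAST Search Server 2010 for SharePoint",
        "FS4SP",
    ],
}


def build_lookup(dictionary: dict[str, list[str]]) -> dict[str, str]:
    """Map every lowercased variation to its canonical entry."""
    return {
        variant.lower(): canonical
        for canonical, variants in dictionary.items()
        for variant in variants
    }


def extract_canonical(text: str, lookup: dict[str, str]) -> list[str]:
    """Return canonical entries mentioned in the text, in order of first mention."""
    # Longest variations first so that longer phrases win over their substrings.
    alternatives = "|".join(
        re.escape(v) for v in sorted(lookup, key=len, reverse=True)
    )
    found: list[str] = []
    for match in re.finditer(rf"\b(?:{alternatives})\b", text, re.IGNORECASE):
        canonical = lookup[match.group(0).lower()]
        if canonical not in found:
            found.append(canonical)
    return found


lookup = build_lookup(PRODUCT_DICTIONARY)
products = extract_canonical("We piloted FS4SP on SharePoint 2010.", lookup)
print(products)
```

Storing the canonical entry rather than the matched text is what lets a refiner show one “SharePoint Server 2010” value instead of several spellings with split hit counts.
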
  18. 18. Agenda outline
  19. 19. Best practice #1: Deepen your understanding of your audiences and your content. Audiences around the enterprise content: Marketing, Sales, Consulting, Procurement, Production, Research, IT Support, HR / Legal. Before you start deploying enterprise search: understand your content, your users and what they need to get their jobs done effectively.
  20. 20. Best practice #2: Use existing language resources inside and outside your enterprise. Four sources: Internal assets, Internet resources, Content providers, Specialized vendors. Internal assets: • Thesauri, controlled vocabularies • Taxonomies, ontologies • Master databases • Enterprise systems • Line-of-business applications • Subject matter experts • Examples*: SharePoint (Lists, Term Store); Employees (AD, HR); Customers (CRM); Suppliers (ERP); Products (PLM); Processes (BPM); Projects (EPM). Internet resources: • Government agencies • Industry bodies • Research institutions • Academia • Virtual communities • Examples: WordNet, from Princeton University; Medical Subject Headings (MeSH). * AD: Active Directory; CRM: Customer Relationship Mgmt.; ERP: Enterprise Resource Planning; PLM: Product Lifecycle Mgmt.; BPM: Business Process Modeling; EPM: Enterprise Project Mgmt.
  21. 21. Best practice #3: Keep the index synchronized with content sources and dictionaries. • The language of the business will change over time: external environment; enterprise content; users’ needs • Ensure that property extraction dictionaries and search index are systematically updated to respond to these changes • Where possible, automate dictionary upkeep as part of standard business workflows: taxonomies and thesauri; enterprise project management; product lifecycle management • Schedule regular analysis and review checkpoints to handle exceptional cases. Diagram labels: Data Sources; Property Extraction Dictionaries; Enterprise Content Sources; Search Index; dictionary and search synchronized with changes over time.
  22. 22. Best practice #4: Distinguish search management from systems management. • As the language of your business and users’ needs evolve, so should your search solution • If not, the search experience and findability inevitably degrade over time, and users’ trust will plunge too • Search management is not an IT responsibility, it’s for the business • Job profile: skillset of a SharePoint administrator (not a programmer or systems engineer); business perspective and focus; good ability with languages; attention to detail • Sample tasks: monitor search reports (daily/weekly); run user polls and/or focus groups (quarterly); process users’ feedback/questions; update dictionaries and manage keywords (as required); support search-related projects • Staffing depends on scale: one person part-time, or a geographically distributed team. Chart labels: Original implementation of the search solution; Actual search experience, if left unattended...
  23. 23. Agenda outline
  24. 24. Case study #1: General Mills (Research & Development). Business Problem: • Researchers forced to search each internal and external content source separately • Low relevancy in existing search applications • High effort in information discovery tasks • Growing difficulty in establishing connections with experts as the company grew worldwide. Approach & Solution: • FAST Search for SharePoint indexes all internal sources and federates external industry services • Property extraction dictionaries extended to recognize product names cited in documents • Deep refiners are used on extracted properties to drill down by products, companies and people. Benefits & Value: • Improved employee productivity with more relevant search results in a unified interface • Greater information sharing and reuse across product areas & geographies • Integrated people search eases social networking • Proof point for wider search roll-out in the enterprise. Quote: “By using FAST Search Server 2010 for SharePoint, our researchers can refine their searches and find exactly what they are looking for. They spend more time innovating than looking for information.” (Michelle Check, R&D Systems Leader, General Mills)
  25. 25. Case study #2: Mississippi Department of Transportation (MDOT). Business Problem: • Poor access to a large, active collection of paper-based contracts and project documents • Metadata managed in a separate DMS (database) • Information silos stifle sharing of data and collaboration • Requirements to provide internal and public access. Approach & Solution: • FAST Search for SharePoint indexes images with iFilter-based OCR technology • Pipeline extended with custom .NET code to merge metadata from the database with indexed documents • Custom refiners reflect language used in the business for navigating search results. Benefits & Value: • Unified self-service interface to locate information • Ability to slice & dice results according to specific needs (dates, project, folder, route, district, etc.) • Information search times cut from several hours or days to mere seconds or minutes • Users have more time to focus on higher value tasks. Quote: “We are literally reducing decision cycles from days to minutes for hundreds of overlapping decisions a day. With SharePoint Server 2010, we can make better spending decisions and enhance program performance without a very large investment.” (John Michael Simpson, CTO, MDOT)
  26. 26. Ingredients for great enterprise search: The business value of FAST Search Server 2010 for SharePoint. The challenges: • Explosive content growth puts information management and governance under pressure • Multiple content silos with different search interfaces • Poor metadata: missing, inconsistent, incorrect. The solution: • Content processing optimizes findability across disparate sources • Property extraction generates metadata while indexing content • Deep refiners expose metadata in search results, helping users quickly zoom to the right information. The benefits: • Reduced costs through enterprise search consolidation and automated metadata enrichment • Enhanced findability helps employees to get their job done faster • Increased user adoption across the enterprise drives ROI
  27. 27. Enterprise Search. © 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.