Adding structure to unstructured content for enhanced findability hakan tylen

Do not reinvent
Findability and Knowledge Management

Håkan Tylén
Western Europe Business Development
+46703091665
hatylen@microsoft.com

Customer/Employee Service,
in the Self-service channel

How can I help YOU?

Metadata basics
What is it? Where is it stored?

Metadata is the set of properties that characterize a document.

Poor metadata impairs the search experience
Degraded findability leads to the erosion of users’ trust in search

“I’m not confident I will find what I need here…”
“This is a waste of time!”

Inconsistent, incorrect or missing metadata is commonplace within most organizations today
This impairs findability in the context of enterprise search
• Hard to scan or navigate results
• Documents returned may be incomplete or not current
• No confidence in authority and correctness of information
• Difficult to locate relevant experts
• Even with refinement tools, users do not rely on them

• Few options to navigate or refine a large result list other than trying to reformulate the query
• Unchanged template metadata make results look like duplicates
• Meaningless metadata confuses users as they scan the search results
• Multiple variations or spellings
• Missing metadata raises questions about result set completeness
• Hit counts do not add up

ROI - Scenarios

1. Time Wasted Searching
2. Cost of Reworking Information
3. Opportunity Costs to the Enterprise

6 | SharePoint Server 2010 for Internet Sites Microsoft confidential.

Scenario 1: Time wasted

Cost per employee: €3.000/month + social costs ≈ €50.000/year
Time spent searching: 10 minutes/day × 220 working days/year
Value of that time: ≈ €1.000/employee/year

1000 employees = €1.000.000/year of “released time”
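
The arithmetic behind this scenario can be checked with a short sketch. The annual cost, the 10 minutes per day, the 220 days and the 1000 employees come from the slide; the 8-hour working day is an assumption added here, and the function name is illustrative only.

```python
def released_time_value(annual_cost_eur, minutes_per_day, working_days_per_year,
                        hours_per_day, employees):
    """Return (value per employee, value for the whole workforce) of time spent searching."""
    minutes_searching = minutes_per_day * working_days_per_year
    minutes_worked = working_days_per_year * hours_per_day * 60
    per_employee = annual_cost_eur * minutes_searching / minutes_worked
    return per_employee, per_employee * employees


per_employee, total = released_time_value(
    annual_cost_eur=50_000,      # from the slide: salary plus social costs
    minutes_per_day=10,          # from the slide
    working_days_per_year=220,   # from the slide
    hours_per_day=8,             # assumption, not stated on the slide
    employees=1000,              # from the slide
)
print(f"per employee: EUR {per_employee:,.0f}, total: EUR {total:,.0f}")
```

Under that assumption the result lands slightly above the rounded €1.000 per employee quoted on the slide.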


Creating quality metadata is a real challenge
Few organizations have good quality metadata on internal content

Challenge
• Ineffective information governance across the enterprise
• Multiple content silos and search interfaces
• Manually entered metadata is inconsistent, incorrect or missing
• No automated tools for content classification
• Impossible to keep up with ever growing content volumes

Assist users in tagging content with automated metadata suggestions or enrichment tools

Solution
• FAST Search for SharePoint (FS4SP) delivers business value out-of-the-box
• Sophisticated content processing optimizes findability across multiple silos of unstructured and structured content
• In addition, property extraction overcomes poor metadata by generating it and normalizing it on-the-fly

Content Processing Pipeline – what is it?
Enhance your content for optimal search experience and findability

The pipeline is a sequentially arranged set of discrete processing
stages that break down and enrich content for indexing
Convert documents to plain text (support for 400+ file formats)
Detect document languages and encoding (support for 80+ languages)
Apply linguistic normalization to optimize content for search
Identify and leverage existing metadata where applicable
Parse content to extract or generate additional metadata
Map content and associated metadata (crawled properties) to the index
schema (managed properties) for searching
Custom stages can be created and added to the pipeline
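
As a rough illustration of the idea, a pipeline can be modelled as an ordered list of functions, each taking a document and returning an enriched one. This is a minimal sketch, not the FS4SP implementation: the stage functions, the dictionary-based document and all names are assumptions made for the example.

```python
import re
from typing import Callable, Dict, List

Document = Dict[str, object]
Stage = Callable[[Document], Document]


def convert_to_text(doc: Document) -> Document:
    # Stand-in for format conversion: replace anything that looks like an HTML tag.
    doc["text"] = re.sub(r"<[^>]+>", " ", str(doc["raw"]))
    return doc


def normalize(doc: Document) -> Document:
    # Stand-in for linguistic normalization: lowercase and collapse whitespace.
    doc["normalized"] = " ".join(str(doc["text"]).lower().split())
    return doc


def tokenize(doc: Document) -> Document:
    doc["tokens"] = re.findall(r"\w+", str(doc["normalized"]))
    return doc


def count_tokens(doc: Document) -> Document:
    # Example of a custom stage appended after the standard ones.
    doc["token_count"] = len(doc["tokens"])
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    # Each stage receives the output of the previous one, in order.
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline({"raw": "<p>Sales Forecast for <b>Contoso</b></p>"},
                      [convert_to_text, normalize, tokenize, count_tokens])
print(result["tokens"])
```

Adding a custom stage is then just a matter of inserting another function into the list at the right position.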

Stages in the pipeline include
• Format Conversion: extracts plain text and metadata from multiple content formats (e.g. Microsoft Office, PDF, HTML, etc.)
• Language and Encoding Detection: identifies the languages and encoding used in the content
• Tokenization: breaks text into tokens using language-specific rules
• Lemmatization: enables queries to match inflected forms of words (singular/plural, masculine/feminine, etc.)
• Date and Time Normalization: converts dates and times to a canonical representation; for example, 14-Mar-10 is equivalent to March 14, 2010
• Property Extraction: recognizes predefined entities mentioned in the text (People, Companies, Locations); this can be extended to other business categories
• Vectorization: creates vectors (phrase/weight pairs) reflecting important terms and their frequency of occurrence, to enable “find similar” functionality
• Properties Mapper: maps the relevant pieces of content and metadata discovered in the pipeline to the index schema for search
• Custom Stage: custom stages (home-grown or 3rd party software) can be applied as part of the pipeline to extend the content processing to your own needs
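
Date and time normalization is easy to picture with the slide's own example, where 14-Mar-10 and March 14, 2010 denote the same day. The sketch below is an assumption about how such a stage could work, not FS4SP code; the list of formats and the function name are illustrative, and month names are parsed according to the process locale (English by default).

```python
from datetime import datetime
from typing import Optional

# Candidate input formats; a real stage would cover many more locales.
KNOWN_FORMATS = ["%d-%b-%y", "%B %d, %Y", "%Y-%m-%d"]


def normalize_date(text: str) -> Optional[str]:
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


print(normalize_date("14-Mar-10"))
print(normalize_date("March 14, 2010"))
```

Both calls yield the same canonical string, which is what allows a date refiner or a date-range query to treat the two spellings as one value.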

Property Extraction
Create metadata on-the-fly, adding structure to unstructured content
In a nutshell, property extraction is the ability to
• Process unstructured content (e.g. a document’s body)
• Recognize entities mentioned in the text (e.g. people, companies, locations, concepts, etc.)
• Optionally, normalize variations to a single, canonical form
• Expose these extracted entities as crawled properties in pipeline
• Map them to managed properties for filtering and searching

Crawled Properties
Companies: Microsoft, Contoso, Woodgrove, …
Locations: London, San Francisco, Moscow, …
People: Bill Gates, Barack Obama, José Caires, ...

Index Schema: Managed Properties
Doc ID | Title | Author | Date | Size | Keywords | Companies | Locations | People | ... | Body Text
xxx | Sales For… | John Doe | 2010-04-15 | 386 KB | sales; pipe… | Microsoft; … | London; … | Bill Gates; … | … | The mark…
yyy | … | … | … | … | … | … | … | … | … | …
zzz | … | … | … | … | … | … | … | … | … | …
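
The steps of property extraction can be sketched in a few lines: match dictionary entries in the body text, collapse variations to a canonical form, and map the resulting crawled properties to managed properties. Everything here is a hypothetical stand-in (dictionaries, property names, the semicolon-separated output); it shows the principle, not how FS4SP implements it.

```python
import re
from typing import Dict, List

# Hypothetical dictionaries: surface form (lowercase) -> canonical form.
DICTIONARIES: Dict[str, Dict[str, str]] = {
    "companies": {"microsoft": "Microsoft", "microsoft corp.": "Microsoft",
                  "contoso": "Contoso"},
    "locations": {"london": "London", "san francisco": "San Francisco"},
    "people": {"bill gates": "Bill Gates"},
}

# Hypothetical mapping from crawled property names to managed property names.
CRAWLED_TO_MANAGED = {"companies": "Companies", "locations": "Locations", "people": "People"}


def extract_properties(body: str) -> Dict[str, List[str]]:
    """Recognize dictionary entries in the text and return crawled properties."""
    crawled: Dict[str, List[str]] = {}
    for name, entries in DICTIONARIES.items():
        found = set()
        for surface, canonical in entries.items():
            # Match whole words only, ignoring case.
            pattern = r"(?<!\w)" + re.escape(surface) + r"(?!\w)"
            if re.search(pattern, body, flags=re.IGNORECASE):
                found.add(canonical)  # variations collapse to one canonical form
        if found:
            crawled[name] = sorted(found)
    return crawled


def map_to_managed(crawled: Dict[str, List[str]]) -> Dict[str, str]:
    """Map crawled properties to managed properties as ';'-separated values."""
    return {CRAWLED_TO_MANAGED[name]: "; ".join(values) for name, values in crawled.items()}


body = "Bill Gates spoke in London about Microsoft Corp. and Contoso."
print(map_to_managed(extract_properties(body)))
```

Note how "Microsoft Corp." and "Microsoft" end up as a single value, which is what keeps refiner entries from fragmenting across spellings.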

Good metadata greatly improves findability
Property extraction enables consistent metadata across all content

“This is really great! Now I can navigate through this large information universe without feeling lost…”

Metadata quality is critical to the search experience
FS4SP leverages metadata, i.e. managed properties, to present deep refiners
• Offer at-a-glance overview
• Organize free-text search results into multiple facets
• Make search conversational
• Guide users toward possible refinement choices
• Prevent users drilling down into a “0 results” dead end
Precise hit counts in deep refiners are computed across the whole result set.

Additional uses for managed properties in FS4SP
• Relevancy tuning & ranking
• Multi-level sorting
• Advanced (or fielded) search
• And many more…

Refiners shown in the example: File Formats, Companies, Products, Concepts
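
To make the notion of a deep refiner concrete, the sketch below counts managed property values over every hit in a result set and then narrows the set by one chosen value. The result set, property names and functions are hypothetical; the only point taken from the slide is that counts cover the whole result set, so a refiner never leads to zero results.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical result set: each hit carries multi-valued managed properties.
results: List[Dict[str, List[str]]] = [
    {"Companies": ["Microsoft", "Contoso"], "FileFormat": ["Word"]},
    {"Companies": ["Microsoft"], "FileFormat": ["PDF"]},
    {"Companies": ["Woodgrove"], "FileFormat": ["PDF"]},
]


def compute_refiners(hits, properties):
    """Count, per managed property, how many hits carry each value."""
    refiners = {prop: Counter() for prop in properties}
    for hit in hits:  # every hit is counted, not just the first page
        for prop in properties:
            refiners[prop].update(set(hit.get(prop, [])))
    return refiners


def refine(hits, prop, value):
    """Narrow the result set to hits that carry the chosen refiner value."""
    return [hit for hit in hits if value in hit.get(prop, [])]


refiners = compute_refiners(results, ["Companies", "FileFormat"])
print(refiners["Companies"].most_common())
print(len(refine(results, "Companies", "Microsoft")))
```

Because each count is derived from the hits themselves, the number shown next to a refiner value equals the size of the result set after clicking it.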

The Microsoft IT Intranet Environment
As of September 2010

Total: 29.89 TB (31,346,042 MB), 223,595 Sites, 545,387 Sub-sites
Grows with 1.5 TB per quarter

Seattle (Americas): 19.4 TB (65%), 127,986 Sites, 345,935 Sub-sites
Dublin (Europe, Middle East, Africa): 6.4 TB (22%), 49,731 Sites, 117,324 Sub-sites
Singapore (Asia Pacific): 4.1 TB (13%), 45,878 Sites, 82,128 Sub-sites

Property extraction and refiners in FS4SP
What’s available out-of-the-box?
FS4SP automatically detects 80+
languages in content
Property extraction dictionaries are
included for 11 languages* and 3
types of entities
Locations
Companies
Persons
The metadata is exposed to users as
refiners, drives relevancy and other
features to improve findability
This delivers real business value to
organizations struggling with issues
such as
Poor document metadata
Large content volumes
Lack of result refinement options
Low user adoption of search
* Arabic, Dutch, English, French, German, Italian, Japanese, Norwegian, Portuguese, Russian, Spanish

Extending property extraction in FS4SP (1/2)
Make search speak the language of your business using dictionaries

Property extraction in FS4SP is customizable using a dictionary, i.e. list of keywords and phrases
• Matching variations can be normalized to a single entry
Several dictionaries may co-exist to address needs of the business
• Projects
• Products
• Customers
• Competitors
• Employees
• Business-specific concepts
The necessary data may be readily available within the organization or from external sources
• SharePoint lists & Term Store
• LOB applications, Databases & XML

Create custom search refiners to fit your own business needs
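
One way to picture building such a dictionary from data the business already has is sketched below. The CSV export, its column names and the product names are all made up for illustration, and the in-memory lookup is not the FS4SP dictionary format.

```python
import csv
import io
from typing import Dict

# Hypothetical export from a line-of-business system: one row per known spelling.
PRODUCT_EXPORT = """canonical,variation
Contoso Widget 3000,CW3000
Contoso Widget 3000,Widget 3000
Contoso Gizmo,Gizmo
"""


def build_dictionary(csv_text: str) -> Dict[str, str]:
    """Return a lookup from each lowercase variation to its canonical entry."""
    lookup: Dict[str, str] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        canonical = row["canonical"].strip()
        lookup[canonical.lower()] = canonical  # the canonical form matches itself
        lookup[row["variation"].strip().lower()] = canonical
    return lookup


# Several dictionaries can sit side by side, one per business concept.
dictionaries = {"Products": build_dictionary(PRODUCT_EXPORT)}
print(dictionaries["Products"]["cw3000"])
```

Keeping one lookup per concept (Products, Customers, Projects and so on) mirrors the idea that each dictionary feeds its own refiner.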

Extending property extraction in FS4SP (2/2)
Use existing text mining or classification tools to go even further

Another approach is to invoke external tools during content processing in FS4SP
• This leverages the standard pipeline extensibility mechanism
Such tools typically address problems like
• Text mining for entity, fact or relationship extraction
• Taxonomy classification
Moreover, these tools may be already deployed for other purposes in the enterprise
• Home-grown solutions
• 3rd party, specialized vendors
• Industry sectors or verticals
• Scientific or technology domains

Diagram: the original document from the repository enters the content pipeline, which passes it to an external text mining/classification tool (local software or Web service). The tool analyzes the text content and returns metadata tags, and the enriched document is sent on for indexing.
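
The round trip to an external tool can be sketched as a stage that hands the text to a classifier and attaches whatever tags come back. The keyword classifier below is only a stand-in for a real text mining product, and the function names, document shape and taxonomy are assumptions, not the FS4SP extensibility interface.

```python
from typing import Callable, Dict, List

Classifier = Callable[[str], List[str]]


def keyword_classifier(text: str) -> List[str]:
    # Stand-in for an external text mining or taxonomy classification tool.
    taxonomy = {"Finance": ["invoice", "budget"], "Legal": ["contract", "clause"]}
    lowered = text.lower()
    return [label for label, cues in taxonomy.items() if any(cue in lowered for cue in cues)]


def enrichment_stage(document: Dict[str, object], classify: Classifier) -> Dict[str, object]:
    """Send the document text to the external tool and attach the returned tags."""
    enriched = dict(document)  # leave the original document untouched
    try:
        enriched["tags"] = classify(str(document.get("body", "")))
    except Exception:
        # If the external tool fails, index the document without extra tags.
        enriched["tags"] = []
    return enriched


original = {"id": "xxx", "body": "The contract sets the budget for 2010."}
print(enrichment_stage(original, keyword_classifier)["tags"])
```

Falling back to an empty tag list is one possible design choice: it keeps a failing external service from blocking indexing, at the price of some documents missing tags until they are reprocessed.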

Best practice #1
Deepen your understanding of your audiences and your content

Marketing Sales Consulting Procurement Production Research IT Support HR / Legal
Enterprise
content

Before you start deploying enterprise search:
understand your content, your users and what
they need to get their jobs done effectively.

Best practice #2
Use existing language resources inside and outside your enterprise

Internal assets
• Thesauri, controlled vocabularies
• Taxonomies, ontologies
• Master databases
• Enterprise systems
• Line-of-business applications
• Subject matter experts
• Examples*
  • SharePoint (Lists, Term Store)
  • Employees (AD, HR)
  • Customers (CRM)
  • Suppliers (ERP)
  • Products (PLM)
  • Processes (BPM)
  • Projects (EPM)

Internet resources
• Government agencies
• Industry bodies
• Research institutions
• Academia
• Virtual communities
• Examples
  • Wikipedia.org
  • DBpedia.org
  • WordNet, from Princeton University
  • Medical Subject Headings (MeSH)

Content providers

Specialized vendors

* AD – Active Directory; CRM – Customer Relationship Mgmt.; ERP – Enterprise Resource Planning; PLM – Product Lifecycle Mgmt.; BPM – Business Process Modeling; EPM – Enterprise Project Mgmt.

Best practice #3
Keep the index synchronized with content sources and dictionaries

The language of the business will change over time
• External environment
• Enterprise content
• Users’ needs
Ensure that property extraction dictionaries and search index are systematically updated to respond to these changes

Where possible, automate dictionary upkeep as part of standard business workflows
• Taxonomies and thesauri
• Enterprise project management
• Product lifecycle management
Schedule regular analysis and review checkpoints to handle exceptional cases

Diagram: Dictionary Data Sources feed the Property Extraction Dictionaries; these and the Enterprise Content Sources feed the Search Index, keeping search synchronized with changes over time.
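
One possible way to automate part of this upkeep is to compare two versions of a dictionary and flag the documents that mention a changed term, so that only those are processed again. The sketch below is an assumption about how that could be done; the dictionaries, the tiny index and both function names are hypothetical.

```python
from typing import Dict, Set


def changed_terms(old: Dict[str, str], new: Dict[str, str]) -> Set[str]:
    """Return variations that were added, removed or re-pointed to another entry."""
    return {term for term in old.keys() | new.keys() if old.get(term) != new.get(term)}


def documents_to_reprocess(index: Dict[str, str], terms: Set[str]) -> Set[str]:
    """Return ids of documents whose text mentions any changed term."""
    return {doc_id for doc_id, text in index.items()
            if any(term in text.lower() for term in terms)}


old = {"gizmo": "Contoso Gizmo"}
new = {"gizmo": "Contoso Gizmo", "cw3000": "Contoso Widget 3000"}
index = {"doc1": "Release notes for CW3000", "doc2": "Gizmo manual"}
print(documents_to_reprocess(index, changed_terms(old, new)))
```

Here only the document that mentions the newly added term is flagged; the one covered by the unchanged entry is left alone.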

Best practice #4
Distinguish search management from systems management

As the language of your business and users’ needs evolve, so should your search solution
If not, the search experience and findability inevitably degrade over time, and users’ trust will plunge too
Chart labels: Original implementation of the search solution; Actual search experience, if left unattended...

Search management is not an IT responsibility, it’s for the business

Job profile
• Skillset of a SharePoint administrator (not a programmer or systems engineer)
• Business perspective and focus
• Good ability with languages
• Attention to detail

Sample tasks
• Monitor search reports (daily/weekly)
• Run user polls and/or focus groups (quarterly)
• Process users’ feedback/questions
• Update dictionaries and manage keywords (as required)
• Support search-related projects

Staffing depends on scale
• One person part-time, or
• A geographically distributed team

Case study #1
General Mills (Research & Development)
Business Problem
• Researchers forced to search each internal and
external content source separately
• Low relevancy in existing search applications
• High effort in information discovery tasks
• Growing difficulty in establishing connections with
experts as company grew worldwide

Approach & Solution
• FAST Search for SharePoint indexes all internal
sources and federates external industry services
• Property extraction dictionaries extended to
recognize product names cited in documents
• Deep refiners are used on extracted properties to
drill down by products, companies and people

Benefits & Value
• Improved employee productivity with more relevant search results in a unified interface
• Greater information sharing and reuse across product areas & geographies
• Integrated people search eases social networking
• Proof point for wider search roll-out in enterprise

“By using FAST Search Server 2010 for SharePoint, our researchers can refine their searches and find exactly what they are looking for. They spend more time innovating than looking for information.”
– Michelle Check, R&D Systems Leader, General Mills

Case study #2
Mississippi Department of Transportation (MDOT)
Business Problem
• Poor access to a large, active collection of paper-
based contracts and project documents
• Metadata managed in a separate DMS (database)
• Information silos stifle sharing of data and collaboration
• Requirements to provide internal and public access

Approach & Solution
• FAST Search for SharePoint indexes images with
iFilter-based OCR technology
• Pipeline extended with custom .NET code to merge
metadata from database with indexed documents
• Custom refiners reflect language used in the
business for navigating search results
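
The case study used custom .NET code for the merge step; the idea itself can be shown with a short Python sketch using an in-memory SQLite table as a stand-in for the DMS database. The table, its columns, the sample row and the function name are all hypothetical.

```python
import sqlite3
from typing import Dict

# Hypothetical DMS table keyed by document id.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE dms (doc_id TEXT PRIMARY KEY, project TEXT, district TEXT)")
db.execute("INSERT INTO dms VALUES ('c-001', 'Bridge repair', 'District 2')")
db.commit()


def merge_metadata(document: Dict[str, str], connection: sqlite3.Connection) -> Dict[str, str]:
    """Return the document with any matching DMS metadata added as properties."""
    row = connection.execute(
        "SELECT project, district FROM dms WHERE doc_id = ?", (document["doc_id"],)
    ).fetchone()
    merged = dict(document)  # copy so the scanned document stays unchanged
    if row is not None:
        merged["project"], merged["district"] = row
    return merged


scanned = {"doc_id": "c-001", "body": "text recovered by OCR"}
print(merge_metadata(scanned, db))
```

Once merged, fields such as project and district can be mapped to managed properties and offered as refiners alongside the OCR text.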

Benefits & Value
• Unified self-service interface to locate information
• Ability to slice & dice results according to specific needs (dates, project, folder, route, district, etc.)
• Information search times cut from several hours or days to mere seconds or minutes
• Users have more time to focus on higher value tasks

“We are literally reducing decision cycles from days to minutes for hundreds of overlapping decisions a day. With SharePoint Server 2010, we can make better spending decisions and enhance program performance without a very large investment.”
– John Michael Simpson, CTO, MDOT

Ingredients for great enterprise search
The business value of FAST Search Server 2010 for SharePoint
The challenges
• Explosive content growth puts information management and
governance under pressure
• Multiple content silos with different search interfaces
• Poor metadata – missing, inconsistent, incorrect

The solution
• Content processing optimizes findability across disparate sources
• Property extraction generates metadata while indexing content
• Deep refiners expose metadata in search results helping users
quickly zoom to the right information

The benefits
• Reduced costs through enterprise search consolidation and
automated metadata enrichment
• Enhanced findability helps employees to get their job done faster
• Increased user adoption across the enterprise drives ROI

microsoft.com / Enterprise Search

© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market
conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
