Enterprise Search SharePoint 2009 Best Practices Final
Good afternoon, and many thanks for attending the last session on the last
day of this conference. The focus of this presentation is the many excellent
features contained in MOSS 2007 search. My goal is to show you why these
features are excellent so that you will make use of them. Because, if you do,
you will be able to walk the halls of your organization with your heads held
high and fear no “search sucks” cracks as you do.
I am a pointy-head and not a propeller-head. While there are technical
references in this presentation, the orientation will be more behavioral and
less technical. There are terrific technical resources contained in the
Resources section and the occasional snippet of code did make its way into
the main section.
UC Berkeley Study on How Much Information: http://www2.sims.berkeley.edu/research/projects/how-much-
Print, film, magnetic, and optical storage media produced about 5 exabytes of new information in 2002.
Ninety-two percent of the new information was stored on magnetic media, mostly in hard disks.
How big is five exabytes? If digitized with full formatting, the seventeen million books in the Library
of Congress contain about 136 terabytes of information; five exabytes of information is equivalent in
size to the information contained in 37,000 new libraries the size of the Library of Congress book collections.
Hard disks store most new information. Ninety-two percent of new information is stored on magnetic
media, primarily hard disks. Film represents 7% of the total, paper 0.01%, and optical media 0.002%.
The United States produces about 40% of the world's new stored information, including 33% of the
world's new printed information, 30% of the world's new film titles, 40% of the world's information
stored on optical media, and about 50% of the information stored on magnetic media.
How much new information per person? According to the Population Reference Bureau, the world
population is 6.3 billion, thus almost 800 MB of recorded information is produced per person each
year. It would take about 30 feet of books to store the equivalent of 800 MB of information on paper.
We estimate that the amount of new information stored on paper, film, magnetic, and optical media has about
doubled in the last three years.
Information explosion? We estimate that new stored information grew about 30% a year between
1999 and 2002.
Paperless society? The amount of information printed on paper is still increasing, but the vast
majority of original information on paper is produced by individuals in office documents and postal
mail, not in formally published titles such as books, newspapers and journals.
Hosted websites [UC Berkeley How Much Information Project]
•July 1993: 1,776,000
•July 2005: 353,084,187
Size of the Web [Indexable Web: Guilli & Signorini 2005]
•1997: 200 million Web pages
•2005: 11.5 billion pages
Information R/evolution: Michael Wesch, Kansas State University
All of his work is very good
And how we manage information is different because searchers are squishy – some just want to find “it,” others want it to find
them, and others want to change it, create it, manipulate it, share it…
•They are searching because they don’t know
•Language and perception are different
•Some people think women put their stuff in a purse, others a pocketbook, and others a handbag.
•“Animal” is a mammal, a Sesame Street character, and an uncouth person
•Enterprise information is individualized.
•Gates Foundation has different issues than PACCAR
•Providence Healthcare has different types of content than King County Library
•Codeplex has a different user type [or a more standard one] than Microsoft Virtual Earth
Search engines use bots to crawl pages and send compressed data back to the index, applying grammatical processing such as
stemming [reducing a word to its most basic root] and removing stop words [common articles and other terms stipulated by the
company]. This index is then inverted so that lookup is done on the basis of record contents rather than document ID, which is a
completely different method of data storage and retrieval from relational databases. A complete copy of the Web page
may be stored in the search engine’s cache. With brute-force calculation, the system pulls each record from the inverted index
[a mapping of words to where they appear in document text]. This is recall, or all documents in the corpus with text instances that
match the term(s).
Search engine indexes are not like relational databases. There is no such thing as normalization, no unique identifiers and the
loosest of structures.
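The crawl-and-invert flow above can be sketched in a few lines of Python. This is a toy illustration – real engines compress postings and track term positions – and all names and documents here are made up:

```python
# Minimal inverted-index sketch: map each term to the set of document
# IDs that contain it, then answer recall queries by set intersection.
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "and"}

def build_inverted_index(docs):
    """Map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return index

def recall(index, *terms):
    """All documents in the corpus containing every query term."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

docs = {
    1: "the quarterly sales report",
    2: "sales pipeline review",
    3: "engineering design report",
}
index = build_inverted_index(docs)
print(recall(index, "report"))           # {1, 3}
print(recall(index, "sales", "report"))  # {1}
```

Note that the stop word "the" never makes it into the index at all – exactly the behavior described above.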
The “secret sauce” for each search engine is the set of algorithms that sort the recall results in a meaningful order. This is precision,
or the number of documents from recall that are relevant to your query term(s). All search engines use a common set of values to
refine precision. If the search term is used in the title of the document, in heading text, formatted in any way, or used in link text,
the document is considered to be more relevant to the query. If the query term(s) are used frequently throughout the document,
the document is considered to be more relevant.
Another example is Term Frequency - Inverse Document Frequency [TF-IDF] weighting. Here the raw term frequency (TF) of a
term in a document is multiplied by the term's inverse document frequency (IDF) weight [based on the number of documents in the
entire corpus divided by the number of documents containing the term]. [Caveat
emptor: high-level, low-level, level-playing-field math are not my strong suits].
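For the math-averse (myself included), TF-IDF is less scary in code. Here is a minimal sketch using the common log(N/df) formulation – an illustration of the general idea, not MOSS’s actual implementation:

```python
# Toy TF-IDF: raw term frequency in the document, weighted by
# log(total docs / docs containing the term). Rare terms score higher.
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    tf = Counter(doc_tokens)[term]                  # term frequency
    df = sum(1 for d in corpus if term in d)        # document frequency
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

corpus = [
    ["search", "engine", "relevance"],
    ["search", "query", "terms"],
    ["database", "normalization"],
]
# "engine" appears in 1 of 3 docs, so it carries more weight than
# "search", which appears in 2 of 3.
print(tf_idf("engine", corpus[0], corpus))  # ≈ 1.0986
print(tf_idf("search", corpus[0], corpus))  # ≈ 0.4055
```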
There is a fundamental difference between Web search and Enterprise search.
•Web search is generic search. One size fits all. Features serve the technology to better enable it to serve the masses.
•Search technology has to work for the broadest document set, those 11 billion plus pages
•Keys off strong linking [the # and the structure]
•Links are “editorial” – endorsement of destination content through “vote”
•Millions of publishers that are not required to adhere to any specific standards
•Site structure is not often tied to content or context
•Search engines are constantly fighting attempts to game their technology in the Web search space. Black hat techniques
like cloaking, link farms, spamming, keyword stuffing, Sybil attacks and the like are a blight. They manipulate the results and
reduce user confidence in the system
•The technology is constantly changing and refining its operation to rely on both internal [document level] and external [site level] data.
Examples of this would be: IBM’s narrative distiller, MSN link text analysis, Google Scout that finds related hyperlinks, and
Yahoo!’s document segmentation
Important to note: The PageRank algorithm is a pre-query calculation. It is a value that is assigned as a result of the
search engine’s indexing of the entire Web and the associated value has no relationship to the user’s information
need. There have been a number of additions and enhancements to lend some contextual credence to the
relevance ranking of the results.
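To make the “pre-query” point concrete, here is a toy power-iteration PageRank in Python. Notice that nothing about the user’s query appears anywhere in it – the ranks depend only on the link graph. The graph and page names are invented for illustration:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Pre-query PageRank via power iteration over a link graph.
    `links` maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:  # dangling page: spread its rank evenly
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
# "c" collects the most link "votes", so it ranks highest --
# regardless of what any particular searcher actually needs.
assert max(ranks, key=ranks.get) == "c"
```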
•Bounded corpus of content
•Produced and maintained by a limited set of authors
•No strong linking strategy – links mostly for navigation [not editorial]
•Information related in ways that key outside of document content
•Hierarchical structure intended – part of corporate culture
•Publishing guidelines can be established to enforce metadata standards to tune a search appliance and improve relevance
through enforced semantic relationships.
In the early days of search engines, Advanced Search was a means for those who could phrase their
queries in Boolean or SQL language to do so for more refined results. As search engines became
more sophisticated, the need for such coding ability diminished.
Usability studies show that most customers avoid Advanced Search because they assume that it is
too advanced for them. A better method is to offer means for the searcher to refine their own
search using facets based on document type, subject or location.
From MOSS 2007 search Under the Hood PPT by Adir Ron
Search Query Execution:
•The query engine passes the query through a language-specific wordbreaker.
•After wordbreaking, the resulting words are passed through a stemmer to generate language-
specific inflected forms of a given word.
•When the query engine executes a property value query, the index is checked first to get a list of matching documents.
•If the user does not have permission to a matching document, the query engine filters that
document out of the list that is returned.
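The steps above can be sketched as a toy pipeline – wordbreaking, stemming, index lookup, then security trimming. The functions, index, and ACLs here are hypothetical stand-ins, not the MOSS query engine:

```python
# Toy query path: wordbreak -> stem -> look up -> security-trim.
def wordbreak(query):
    return query.lower().split()

def stem(word):
    # toy suffix-stripping stemmer; MOSS uses language-specific stemmers
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def execute_query(query, index, acl, user):
    terms = [stem(w) for w in wordbreak(query)]
    matches = set.intersection(*(index.get(t, set()) for t in terms))
    # filter out documents the user lacks permission to read
    return {doc for doc in matches if user in acl.get(doc, set())}

index = {"report": {1, 2}, "sale": {1, 2, 3}}
acl = {1: {"alice", "bob"}, 2: {"alice"}, 3: {"bob"}}
# "sales reports" stems to "sale" + "report"; doc 2 matches but is
# trimmed away because bob has no permission to it.
print(execute_query("sales reports", index, acl, "bob"))  # {1}
```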
• Index Engine: Processes the chunks of text and properties filtered from content sources, storing
them in the content index and property store.
• Query Engine: Executes keyword and SQL syntax queries against the content index and property store.
• Protocol Handlers: Open content sources in their native protocols and expose documents and
other items to be filtered.
• IFilters: Open documents and other content source items in their native formats and filter them into
chunks of text and properties.
• Property Store: Stores a table of properties and associated values.
• Wordbreakers: Used by the query and index engines to break compound words and phrases into
individual words or tokens.
SPS 2003 was SQL search – a different database structure, closer to a classic relational data model.
MOSS 2007 is indexed search – an inverted index based on words, not records – plus scopes, structured business
data search, and people search.
•Click Distance: Browsing distance from authoritative sites: shorter tends to be more relevant
•Anchor Text: Hyperlinks act as annotations on their target
•URL Depth: URLs higher in the hierarchy tend to be more relevant
•URL Matching: Direct matches on text in URLs
•Metadata Extraction: Automatically extract titles and authors from document text
•Automatic Language Detection: Helps bias toward results in your language
•File Type Biasing: For example, PPT docs tend to be more relevant than XLS
•Text Analysis: Traditional text ranking based on matching terms, term frequencies, and word variants.
•Collection frequency: The number of documents a term appears in compared to total number of
documents. Search terms that occur in only a few documents are likely to be more useful than
terms that occur in many documents.
•Term frequency: The number of occurrences of the search term in a document. The more
frequently a search term appears in a document, the more important it is likely to be for
ranking that document.
•Document length: The length of the searched document. A term that occurs the same number of
times in a short document as in a long one is likely to be more important to the short document.
•Term Position: The position of a word within a document, for example, presence of a term in the
document’s title. A term that appears in a particular component of the document, such as the title,
is more likely to be important for ranking that document.
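As a rough illustration of how these four factors might combine into a single score, here is an invented formula for demonstration purposes only – it is not the actual MOSS ranking function, and all documents and numbers are made up:

```python
import math

def score(term, doc, avg_len, corpus_size, doc_freq):
    tf = doc["body"].count(term)                          # term frequency
    length_norm = tf / (1 + len(doc["body"]) / avg_len)   # document length
    rarity = math.log(corpus_size / doc_freq)             # collection frequency
    title_boost = 2.0 if term in doc["title"] else 1.0    # term position
    return length_norm * rarity * title_boost

short_doc = {"title": ["budget"], "body": ["budget", "summary"]}
long_doc = {"title": ["notes"], "body": ["budget"] + ["filler"] * 19}
avg_len = 11  # average body length across the two docs

# Same single occurrence of "budget" in each, but the short document
# with the term in its title scores higher.
assert score("budget", short_doc, avg_len, 100, 5) > \
       score("budget", long_doc, avg_len, 100, 5)
```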
Here is where you manage the components that control search performance and search relevance.
Because search is a shared service, you only have to configure in one location
MOSS 2007 enables testing the configuration to ensure performance
Where you put the content is not necessarily where your customers will look for it
Better management and control
Better resource management, both hardware and personnel
Agile index changes
Text Analysis [internal]: Traditional text ranking based on such factors as matching terms, term frequencies, and word variants.
Dynamic and Static ranking: Like other search technologies, MOSS 2007 Search incorporates both internal [text on
the page, term frequency, page layout and formatting, etc.] and external metadata to more closely match the user’s
request. However, MOSS 2007 Search incorporates cutting-edge technology from Microsoft Search to push beyond
the 1 link = 1 vote for quality/relevance of the PageRank model.
•Click Distance [external]: Browsing distance from authoritative sites (shorter distances tend to be more
relevant).
•Anchor Text [external]: Hyperlinks act as annotations on their target. In addition, they tend to be highly
relevant.
•URL Depth [external]: URLs higher in the hierarchy tend to be more relevant.
•URL Matching [external]: Direct matches on text that's in URLs.
•Metadata Extraction [internal]: Automatically extracts titles and authors from document text if they are
missing.
•Automatic Language Detection [internal]: Helps create preference for results in your language.
•File Type Biasing [internal]: Certain file types tend to be more relevant (for example, PPT files are often
more relevant than XLS files).
Project Description from Codeplex http://www.codeplex.com/FacetedSearch
MOSS Faceted Search is a set of web parts that provide an intuitive way to refine search results by facets.
The facets are implemented using the SharePoint API and stored within the native SharePoint metadata
store. The solution demonstrates the following key features:
Grouping search results by facet
Displaying a total number of hits per facet value
Refining search results by facet value
Update of the facet menu based on refined search criteria
Displaying of the search criteria in a Bread Crumbs
Ability to exclude the chosen facet from the search criteria
Flexibility of the Faceted search configuration and its consistency with MOSS administration
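The grouping-and-counting behavior behind a facet menu is simple to sketch. The documents and field names below are hypothetical – this is not the Codeplex implementation:

```python
# Toy faceting: count hits per facet value, then refine by one value.
from collections import Counter

def facet_counts(results, facet):
    """Total number of hits per facet value, as shown in a facet menu."""
    return Counter(doc[facet] for doc in results)

def refine(results, facet, value):
    """Narrow the result set to one facet value."""
    return [doc for doc in results if doc[facet] == value]

results = [
    {"title": "Q1 report", "type": "xls", "location": "Finance"},
    {"title": "Q2 report", "type": "xls", "location": "Finance"},
    {"title": "Org chart", "type": "ppt", "location": "HR"},
]
print(facet_counts(results, "type"))   # Counter({'xls': 2, 'ppt': 1})

# Refining by one facet updates the counts on the others.
refined = refine(results, "location", "Finance")
print(facet_counts(refined, "type"))   # Counter({'xls': 2})
```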
Estimated dev time to create your own FLD [federated location definition] file is 3 days (from MS internal)
Best to pass the query through and have the destination do relevance ranking (saves bandwidth) rather than
to access the destination index (though you lose the proprietary relevance ranking)
Day Software Delivers Standardized Connectivity for Open Text Livelink
Using SharePoint 2007 to Index Lotus Notes
Microsoft Knowledge Network: Stored on a separate server
Version 1.0 is an add-on product for the Enterprise version of stand-alone Search and for both versions
of the Full Product
Initial results are presented with identity masked – the KN server takes the user request and sends it to the
person, who can accept or reject the request through the KN server without identity ever being
revealed
The Business Data Catalog (BDC) crawls and integrates data from other applications [email servers, line-of-
business applications, external databases, customer relationship management apps] and puts it into a cache for
crawl by the search server.
Accesses these repositories with a connector http://msdn.microsoft.com/en-us/library/ms563661.aspx
Available in MOSS 2007 Search Enterprise edition and both versions of MOSS 2007 Full Product
Short term: FAST will remain an independent entity that Microsoft will continue to support on non-
Windows platforms, with a connector for MOSS 2007. The next release will see 2 versions of FAST ESP: a stand-
alone successor and a SharePoint edition that will incorporate the connector and add new features
Relevance by using the underlying semantic relationships
•unity (federation of results from outside resources)
•admomentum (search driven monetization with ad serving)
•recommendations (recommendation engine similar to Amazon/Netflix - based on behavior of user
base - cookie based, item to item, people to items)
•featured content (search driven content merchandizing)
•fast unity (search driven portal experiences)
•phrasing and anti-phrasing: strips out the extraneous terms
•clustering: comprehension through association
•can be taxonomy based or based on the Open Directory Project
•flexible relevancy model: boost block search results - dynamic on per query basis
•whole equalizer with whole set of knobs - reissues query with different weights based on choices -
ranking more than filtering - does not change the # of results, changes the order of display
•can work in conjunction with faceted search
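The boost/block behavior – reordering without changing the result count – can be sketched like this. Toy data and function names, not the FAST ESP API:

```python
def boost_block(results, boosts=(), blocks=()):
    """Reorder results without changing how many there are:
    boosted items float up, blocked items sink down."""
    def key(doc):
        if doc in blocks:
            return 2   # sink
        if doc in boosts:
            return 0   # float
        return 1       # leave in place (sort is stable)
    return sorted(results, key=key)

results = ["press-release", "old-faq", "product-page", "blog-post"]
reordered = boost_block(results, boosts={"product-page"}, blocks={"old-faq"})
print(reordered)  # ['product-page', 'press-release', 'blog-post', 'old-faq']

# Ranking, not filtering: the number of results is unchanged.
assert len(reordered) == len(results)
```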
Represent a collection of documents mapped to a single element [e.g. authored by, specific directory, file type,
metadata type]; no longer tied to an index crawl – effective immediately.
By default, the scope plug-in will create scopes for the following:
•Site (domain, sub-domain, host-name)
•All content (used to include all content)
•Global query exclusions (used to exclude content)
Results collapsing can group duplicated or similar results together, so that they are displayed as one entry in the
search result set. This entry includes a link to display the expanded results for that collapsed result set entry. Search
administrators can collapse results for the following content item groups:
•Duplicates and derivatives of documents
•Windows SharePoint Services discussion messages for the same topic
•Microsoft Exchange Server public folder messages for the same conversation topic
•Current versions of the same document
•Different language versions of the same document
•Content from the same site
By default, results collapsing is turned on in Enterprise Search. The search administrator can configure it, however,
either through the Search Administration UI or the Search Administration object model.
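Conceptually, results collapsing is a group-by over a duplicate key. The content hash used here is a made-up stand-in – the real feature has its own duplicate and derivative detection:

```python
# Toy results collapsing: group duplicates, show one entry per group
# plus a count for the "expand duplicates" link.
from itertools import groupby

def collapse(results, key):
    ordered = sorted(results, key=key)   # groupby needs sorted input
    collapsed = []
    for _, group in groupby(ordered, key=key):
        items = list(group)
        collapsed.append({"shown": items[0], "hidden": len(items) - 1})
    return collapsed

results = [
    {"title": "Policy v3", "hash": "abc"},
    {"title": "Policy v3 (copy)", "hash": "abc"},
    {"title": "Travel guide", "hash": "def"},
]
for entry in collapse(results, key=lambda d: d["hash"]):
    print(entry["shown"]["title"], "+", entry["hidden"], "duplicates")
# Policy v3 + 1 duplicates
# Travel guide + 0 duplicates
```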
Security Trimmed Results: they don’t see what they are not allowed to see
Best Bets: editorially programmed results or what you want them to want to see
•Dashboard-style data presentation
•Keys off a document library of reports
•Can import KPIs
KPIs are a central way of presenting business intelligence for an organization. High level goals for
organization or site
KPIs increase the speed and efficiency of evaluating progress against key business goals. Reduces
the amount of data for analysis
KPIs connect to business data from various sources. Consolidates data against KPI, not repository.
Each KPI gets a single value from a data source, either from a single property or by calculating
averages across the selected data, and then compares that value against a pre-selected value. Data sources include:
•Excel workbooks: The data comes from an Excel workbook.
•SQL Server 2005 Analysis Services: The data comes from database stores known as cubes,
using connections in a data connection library.
•Manually entered information: The data is from a static list rather than based on
underlying data sources. This is used less frequently, for test purposes prior to deployment
or on occasions when regular data sources are unavailable but you still want to provide the KPI.
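The consolidate-then-compare behavior of a KPI can be sketched as follows. The thresholds, status labels, and sample figures are all invented for illustration – this is not the MOSS KPI list schema:

```python
def kpi_status(values, target, warning):
    """Each KPI rolls its source data into one value (here, the
    average) and compares it against a pre-selected target."""
    actual = sum(values) / len(values)
    if actual >= target:
        return actual, "on track"
    if actual >= warning:
        return actual, "at risk"
    return actual, "off track"

# e.g. monthly sales figures pulled from an Excel workbook
print(kpi_status([90, 110, 130], target=100, warning=80))  # (110.0, 'on track')
print(kpi_status([60, 70, 80], target=100, warning=80))    # (70.0, 'off track')
```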
Sometimes configuring search can seem like that big ticking box from Acme…
Frank Lloyd Wright said something along the lines of it being easier to take an eraser to the drafting
table than a sledgehammer to the construction site.
Don’t boil the ocean.
A smaller segment of your content is satisfying a significant portion of your customer searches.
Search logs, customer feedback, and server logs will reveal this portion.
Performed on a small subset of the corpus that best represents the nature of the whole.
Ranked according to the number of non-affiliated “experts” that point to it – i.e. not in the same site or directory.
Affiliation is transitive [if A=B and B=C then A=C].
The beauty of Hilltop is that, unlike PageRank, it is query-specific and reinforces the relationship between the authority and the user’s
query. You don’t have to be big or have a thousand links from auto parts sites to be an “authority”.
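The transitive-affiliation rule maps naturally onto a union-find structure: affiliated hosts collapse into one group, and each group gets at most one "expert" vote. A sketch, with hosts invented for illustration (not the actual Hilltop affiliation heuristics):

```python
class Affiliation:
    """Union-find over hosts: affiliation is transitive, so if A=B
    and B=C then A, B, and C all land in one group."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def expert_votes(experts, aff):
    """Count at most one vote per affiliated group of experts."""
    return len({aff.find(e) for e in experts})

aff = Affiliation()
aff.union("a.example.com", "b.example.com")  # A=B
aff.union("b.example.com", "c.example.com")  # B=C, so A=C too
# Two affiliated experts plus one independent one = 2 votes, not 3.
print(expert_votes(["a.example.com", "c.example.com", "other.org"], aff))  # 2
```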
Segmentation of corpus into broad topics
Subset that is then extrapolated to Web as a whole
Selection of authority sources within these topic areas
Authorities have lots of non-related pages on the same subject pointing to them
Quality of links more important than quantity of links
Determination of HUBS (pages that point to many authority sources)
Pre query calculations applied at query time
TOPIC SENSITIVE PR
•Consolidation of Hyperlink-Induced Topic Search [HITS] and PageRank
•Pre-query calculation of factors based on subset of corpus: context of term use in document, context of term use in history of
queries and context of term use by user submitting query
•Computes PR based on a set of representational topics [augments PR with content analysis]
•Topics derived from the Open Directory Project
•Uses a set of ranking vectors: Pre-query selection of topics + at-query comparison of the similarity of query to topics
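The at-query blend of ranking vectors can be sketched as a similarity-weighted average of per-topic PageRank scores. All numbers and site names below are invented for illustration:

```python
def topic_sensitive_rank(page, query_topic_sim, topic_pr):
    """At query time, blend pre-computed per-topic PageRank vectors,
    weighting each by the query's similarity to that topic."""
    total_sim = sum(query_topic_sim.values())
    return sum(
        sim / total_sim * topic_pr[topic][page]
        for topic, sim in query_topic_sim.items()
    )

# Pre-query: one PageRank vector per directory-style topic (made up)
topic_pr = {
    "sports": {"espn.com": 0.30, "nih.gov": 0.05},
    "health": {"espn.com": 0.04, "nih.gov": 0.35},
}
# At query time: a query like "knee injury" looks mostly health-related
sim = {"sports": 0.2, "health": 0.8}

# The health-leaning query pushes the health authority to the top.
assert topic_sensitive_rank("nih.gov", sim, topic_pr) > \
       topic_sensitive_rank("espn.com", sim, topic_pr)
```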
During the age of early explorers, map makers would insert this phrase when
they reached the edge of their known world.
The “dragons” on the following slides are known issues that Ascentium
developers have discovered in working with MOSS 2007 search, or that I found
through my own research. Few diamonds are flawless. I find it best to
address the shortcomings upfront and have solutions in hand to
mitigate customer pain.
•Advanced auto-classification, taxonomy management and compound term metadata tagging
•The only statistical metadata generation, auto-classification, and taxonomy management vendor in the
world that uses concept extraction and compound term processing
•Proven to deliver the highest precision without the loss of recall
•The only tagging and classification solution fully integrated with MOSS, Microsoft Office, Exchange and
Microsoft Enterprise Search
•Automatically classifies content at the time of creation or ingestion
•Generates compound term metadata (concepts) and stores in SharePoint properties
•Automatic classification within MS Office applications, metadata stored in the document
•Taxonomy Manager -Supports multiple taxonomies
•Priced by server - $95K per production server, $47.5K per staging/test server
•Vertical applications (Legal, Finance, eDiscovery, Services, Oil & Gas, Manufacturing, Government,
Education, Life Sciences & Healthcare, Energy & Utilities)
•Horizontal applications (ECM, Document Management, Compliance & Risk Management, Records
Management, Enterprise Search, Portals, Intranets & Information Rich Web Sites)
•The weights used in the product were carefully tested. Changes to the weights may
have a negative effect on relevance.
•After you set property.weight you must call the property.Update() method to save the change.