Enterprise Search Share Point2009 Best Practices Final


This presentation examines features and benefits in Microsoft Office SharePoint Server (MOSS) 2007 enterprise search. It contains configuration guidance, code snippets, and tips and tricks.



  1. 1. Good afternoon and many thanks for attending the last session on the last day of this conference. The focus of this presentation is the many excellent features contained in MOSS 2007 search. My goal is to show you why these features are excellent so that you will make use of them. Because if you do, you will be able to walk the halls of your organization with your head held high and fear no “search sucks” cracks as you do. 1
  2. 2. I am a pointy-head and not a propeller-head. While there are technical references in this presentation, the orientation will be more behavioral and less technical. There are terrific technical resources in the Resources section, and the occasional snippet of code did make its way into the main section. 2
  3. 3. 3
  4. 4. UC Berkeley Study on How Much Information: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/
Print, film, magnetic, and optical storage media produced about 5 exabytes of new information in 2002. Ninety-two percent of the new information was stored on magnetic media, mostly in hard disks.
How big is five exabytes? If digitized with full formatting, the seventeen million books in the Library of Congress contain about 136 terabytes of information; five exabytes of information is equivalent in size to the information contained in 37,000 new libraries the size of the Library of Congress book collections.
Hard disks store most new information. Ninety-two percent of new information is stored on magnetic media, primarily hard disks. Film represents 7% of the total, paper 0.01%, and optical media 0.002%.
The United States produces about 40% of the world's new stored information, including 33% of the world's new printed information, 30% of the world's new film titles, 40% of the world's information stored on optical media, and about 50% of the information stored on magnetic media.
How much new information per person? According to the Population Reference Bureau, the world population is 6.3 billion, thus almost 800 MB of recorded information is produced per person each year. It would take about 30 feet of books to store the equivalent of 800 MB of information on paper. We estimate that the amount of new information stored on paper, film, magnetic, and optical media has about doubled in the last three years.
Information explosion? We estimate that new stored information grew about 30% a year between 1999 and 2002.
Paperless society? The amount of information printed on paper is still increasing, but the vast majority of original information on paper is produced by individuals in office documents and postal mail, not in formally published titles such as books, newspapers, and journals. 
Hosted websites [UC Berkeley How Much Information Project] •July 1993: 1,776,000 •July 2005: 353,084,187 Size of the Web [Indexable Web: Gulli & Signorini 2005] •1997: 200 million Web pages •2005: 11.5 billion pages 4
  5. 5. Information Re/volution: Michael Wesch, Kansas State University http://www.youtube.com/user/mwesch All of his work is very good. And how we manage information is different because searchers are squishy – some just want to find “it”, others want it to find them, and others want to change it, create it, manipulate it, share it… •They are searching because they don’t know •Language and perception are different •Some people think women put their stuff in a purse, others a pocketbook, and others a handbag. •“Animal” is a mammal, a Sesame Street character, and an uncouth person •Enterprise information is individualized. •The Gates Foundation has different issues than PACCAR •Providence Healthcare has different types of content than King County Library •Codeplex has a different user type [or a more standard one] than Microsoft Virtual Earth 5
  6. 6. Search engines use bots to crawl pages and send compressed data back to the index, applying grammatical rules such as stemming [reducing a word to its most basic root] and stop words [common articles and other terms stipulated by the company]. This index is then inverted so that lookup is done on the basis of record contents rather than document ID, which is a completely different method of data storage and retrieval from relational database storage. A complete copy of the Web page may be stored in the search engine’s cache. With brute-force calculation, the system pulls each record from the inverted index [a mapping of words to where they appear in document text]. This is recall: all documents in the corpus with text instances that match your term(s). Search engine indexes are not like relational databases. There is no such thing as normalization, there are no unique identifiers, and the structure is loose at best. The “secret sauce” for each search engine is the set of algorithms that sort the recall results into a meaningful order. This is precision: the number of documents from recall that are relevant to your query term(s). All search engines use a common set of values to refine precision. If the search term is used in the title of the document, in heading text, formatted in any way, or used in link text, the document is considered to be more relevant to the query. If the query term(s) are used frequently throughout the document, the document is considered to be more relevant. Another example is Term Frequency - Inverse Document Frequency [TF-IDF] weighting, in which the raw term frequency (TF) of a term in a document is multiplied by the term's inverse document frequency (IDF) weight [the logarithm of the number of documents in the entire corpus divided by the number of documents containing the term]. [caveat emptor: high-level, low-level, level-playing-field math are not my strong suits]. 6
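The inverted index, recall, and TF-IDF ideas above can be sketched in a few lines of Python. This is a toy illustration, not how any particular search engine implements it; the three-document corpus and the tiny stop list are invented for the example.

```python
import math
from collections import defaultdict

# Hypothetical stop list; real engines use much larger, language-specific lists.
STOP_WORDS = {"the", "a", "of", "and"}

def tokenize(text):
    # Crude stand-in for wordbreaking + stop-word removal.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_inverted_index(docs):
    # Map each term to the set of document IDs containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def tf_idf(term, doc_id, docs, index):
    # TF: raw count of the term in this document.
    tf = tokenize(docs[doc_id]).count(term)
    # IDF: log of (corpus size / number of documents containing the term).
    idf = math.log(len(docs) / len(index[term]))
    return tf * idf

docs = {
    1: "the search engine builds an inverted index",
    2: "the index maps terms to documents",
    3: "relational databases store records not terms",
}
index = build_inverted_index(docs)
print(index["index"])                       # recall: docs 1 and 2 match "index"
print(tf_idf("inverted", 1, docs, index))   # "inverted" appears only in doc 1
```

Looking up `index["index"]` is the recall step; sorting the matching documents by a score such as `tf_idf` is the precision step.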
  7. 7. There is a fundamental difference between Web search and Enterprise search. Web Search: •Web search is generic search. One size fits all. Features serve the technology to better enable it to serve the masses. •Search technology has to work for the broadest document set, those 11 billion-plus pages •Keys off strong linking [both the number of links and the structure] •Links are “editorial” – an endorsement of the destination content through a “vote” •Millions of publishers that are not required to adhere to any specific standards •Site structure is not often tied to content or context •Search engines are constantly fighting attempts to game their technology in the Web search space. Black-hat techniques like cloaking, link farms, spamming, keyword stuffing, Sybil attacks and the like are a blight. They manipulate the results and reduce user confidence in the system •The technology is constantly changing and refining its operation to rely on both internal [document-level] and external [site-level] data. Examples of this would be: IBM’s narrative distiller, MSN link text analysis, Google Scout that finds related hyperlinks, and Yahoo!’s document segmentation Important to note: The PageRank algorithm is a pre-query calculation. It is a value assigned as a result of the search engine’s indexing of the entire Web, and that value has no relationship to the user’s information need. There have been a number of additions and enhancements to lend some contextual credence to the relevance ranking of the results. Enterprise Search: •Bounded corpus of content •Produced and maintained by a limited set of authors •No strong linking strategy – links mostly for navigation [not editorial] •Information related in ways that key outside of document content •Hierarchical structure intended – part of corporate culture •Publishing guidelines can be established to enforce metadata standards to tune a search appliance and improve relevance through enforced semantic relationships. 7
  8. 8. In the early days of search engines, Advanced Search was a means for those who could phrase their queries in Boolean or SQL language to do so for more refined results. As search engines became more sophisticated, the need for such coding ability diminished. Usability studies show that most customers avoid Advanced Search because they assume that it is too advanced for them. A better method is to offer means for the searcher to refine their own search using facets based on document type, subject, or location. 8
  9. 9. From the MOSS 2007 Search Under the Hood PPT by Adir Ron. Search Query Execution: •The query engine passes the query through a language-specific wordbreaker. •After wordbreaking, the resulting words are passed through a stemmer to generate language-specific inflected forms of a given word. •When the query engine executes a property value query, the index is checked first to get a list of possible matches. •If the user does not have permission to a matching document, the query engine filters that document out of the list that is returned. Search Architecture http://www.sharepointblogs.com/heliosa/archive/2007/03/07/enterprise-search-architecture-in-sharepoint-technologies-2007.aspx • Index Engine: Processes the chunks of text and properties filtered from content sources, storing them in the content index and property store. • Query Engine: Executes keyword and SQL syntax queries against the content index and search configuration data. • Protocol Handlers: Open content sources in their native protocols and expose documents and other items to be filtered. • IFilters: Open documents and other content source items in their native formats and filter them into chunks of text and properties. • Property Store: Stores a table of properties and associated values. • Wordbreakers: Used by the query and index engines to break compound words and phrases into individual words or tokens. 9
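The query path described above (wordbreak, stem, index lookup, security trim) can be sketched end to end. Everything here is a hypothetical stand-in: the stemmer table, the tiny inverted index, and the per-document ACLs are invented for illustration and bear no relation to the real MOSS components.

```python
# Hypothetical stemmer: maps a word to its inflected forms.
STEMS = {"running": ["run", "running", "runs"],
         "run": ["run", "running", "runs"]}

# Hypothetical inverted index: term -> document IDs.
INDEX = {"run": {"doc1"}, "running": {"doc2"}, "runs": {"doc3"}}

# Hypothetical ACLs: document -> users allowed to read it.
ACLS = {"doc1": {"alice"}, "doc2": {"alice", "bob"}, "doc3": {"bob"}}

def wordbreak(query):
    # Stand-in for the language-specific wordbreaker.
    return query.lower().split()

def execute_query(query, user):
    results = set()
    for word in wordbreak(query):
        # Expand each word to its inflected forms, as the stemmer does.
        for form in STEMS.get(word, [word]):
            results |= INDEX.get(form, set())
    # Security trimming: drop documents the user may not read.
    return {doc for doc in results if user in ACLS.get(doc, set())}

print(execute_query("running", "alice"))  # doc1 and doc2; doc3 is trimmed
```

The key point the sketch shows is ordering: stemming expands recall before the index lookup, and security trimming filters the result list before it is returned to the user.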
  10. 10. SPS 2003 was SQL search – a different db structure, closer to a classic relational data model. MOSS 2007 is indexed search – an inverted index based on words, not records – with scopes, structured business data search, and people search. MOSS 2007 •Click Distance: Browsing distance from authoritative sites: shorter tends to be more relevant •Anchor Text: Hyperlinks act as annotations on their target •URL Depth: URLs higher in the hierarchy tend to be more relevant •URL Matching: Direct matches on text in URLs •Metadata Extraction: Automatically extract titles and authors from document text •Automatic Language Detection: Helps bias toward results in your language •File Type Biasing: For example, PPT docs tend to be more relevant than XLS •Text Analysis: Traditional text ranking based on matching terms, term frequencies, word variants, etc. SPS 2003 •Collection frequency: The number of documents a term appears in compared to the total number of documents. Search terms that occur in only a few documents are likely to be more useful than terms that occur in many documents. •Term frequency: The number of occurrences of the search term in a document. The more frequently a search term appears in a document, the more important it is likely to be for ranking that document. •Document length: The length of the searched document. A term that occurs the same number of times in a short document as in a long one is likely to be more important to the short document. •Term Position: The position of a word within a document, for example, presence of a term in the document’s title. A term that appears in a particular component of the document, such as the title, is more likely to be important for ranking that document. 10
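Click distance, the first MOSS 2007 factor listed above, is just a breadth-first traversal of the link graph from an authoritative page: the fewer clicks away a page is, the more relevant it is presumed to be. The link graph below is hypothetical, and this is only a sketch of the idea, not the product's actual computation.

```python
from collections import deque

# Hypothetical intranet link graph: page -> pages it links to.
LINKS = {
    "portal": ["hr", "projects"],
    "hr": ["benefits"],
    "projects": ["project-x"],
    "benefits": [],
    "project-x": ["archive"],
    "archive": [],
}

def click_distances(authoritative_page):
    # Breadth-first search: distance = minimum number of clicks
    # needed to browse from the authoritative page to each page.
    distances = {authoritative_page: 0}
    queue = deque([authoritative_page])
    while queue:
        page = queue.popleft()
        for target in LINKS.get(page, []):
            if target not in distances:
                distances[target] = distances[page] + 1
                queue.append(target)
    return distances

dist = click_distances("portal")
print(dist["benefits"])  # two clicks from the authoritative site
print(dist["archive"])   # three clicks, so ranked lower, all else equal
```

Because this is a pre-query calculation over a bounded corpus, it is cheap to maintain in an enterprise setting where the administrator can simply designate the authoritative sites.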
  11. 11. Here is where you manage the components that manage search performance and search experience Because search is a shared service, you only have to configure in one location MOSS 2007 enables testing the configuration to ensure performance Where you put the content is not necessarily where your customers will look for it 11
  12. 12. Better management and control Better resource management, both hardware and personnel Agile index changes 12
  13. 13. Text Analysis [internal]: Traditional text ranking based on such factors as matching terms, term frequencies, and word variants. Dynamic and Static ranking: Like other search technology, MOSS 2007 Search incorporates both internal [text on the page, term frequency, page layout and formatting, etc.] and external metadata to more closely match the user’s request. However, MOSS 2007 Search incorporates cutting-edge technology from Microsoft Search to push beyond the 1 link = 1 vote for quality/relevance of the PageRank model. •Click Distance [external]: Browsing distance from authoritative sites (shorter distances tend to be more relevant). •Anchor Text [external]: Hyperlinks act as annotations on their target. In addition, they tend to be highly descriptive. •URL Depth [external]: URLs higher in the hierarchy tend to be more relevant. •URL Matching [external]: Direct matches on text that's in URLs. •Metadata Extraction [internal]: Automatically extracts titles and authors from document text if they are missing. •Automatic Language Detection [internal]: Helps create preference for results in your language. •File Type Biasing [internal]: Certain file types tend to be more relevant (for example, PPT files are often more relevant than XLS files). 13
  14. 14. You must turn on stemming and PDF indexing 14
  15. 15. Project Description from Codeplex http://www.codeplex.com/FacetedSearch MOSS Faceted Search is a set of web parts that provide an intuitive way to refine search results by category (facet). The facets are implemented using the SharePoint API and stored within the native SharePoint metadata store. The solution demonstrates the following key features: •Grouping search results by facet •Displaying the total number of hits per facet value •Refining search results by facet value •Updating the facet menu based on refined search criteria •Displaying the search criteria in breadcrumbs •Ability to exclude a chosen facet from the search criteria •Flexibility of the faceted search configuration and its consistency with MOSS administration 15
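The grouping, counting, and refinement behaviors in that feature list reduce to a few lines. This is a generic sketch of faceting, not the Codeplex project's implementation; the result set and facet names are invented.

```python
from collections import Counter

# Hypothetical search result set with two facetable properties.
results = [
    {"title": "Q1 report", "filetype": "ppt", "site": "sales"},
    {"title": "Budget",    "filetype": "xls", "site": "finance"},
    {"title": "Kickoff",   "filetype": "ppt", "site": "sales"},
    {"title": "Policy",    "filetype": "doc", "site": "hr"},
]

def facet_counts(results, facet):
    # Total number of hits per facet value, for the facet menu.
    return Counter(r[facet] for r in results)

def refine(results, facet, value):
    # Narrow the result set to one facet value (the breadcrumb step).
    return [r for r in results if r[facet] == value]

print(facet_counts(results, "filetype"))        # ppt: 2, xls: 1, doc: 1
refined = refine(results, "filetype", "ppt")
print(facet_counts(refined, "site"))            # menu recomputed after refining
```

Recomputing `facet_counts` over the refined list is exactly the "update of the facet menu based on refined search criteria" behavior; removing the filter again gives the breadcrumb-exclusion behavior.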
  16. 16. 3/23/2009 Estimated dev time to create your own FLD file is 3 days (from MS internal). It is better to pass the query through and have the destination do the relevance ranking (saves bandwidth) than to access the destination index directly (though you lose the proprietary relevance ranking). Day Software Delivers Standardized Connectivity for Open Text Livelink http://www.econtentmag.com/Articles/ArticleReader.aspx?ArticleID=19280 Using SharePoint 2007 to Index Lotus Notes http://meiyinglim.blogspot.com/2007/01/using-sharepoint-2007-to-index-lotus.html 16
  17. 17. Microsoft Knowledge Network: Stored on a separate server. Version 1.0 is an add-on product for the Enterprise version of stand-alone Search and for both versions of the Full Product. Refinement/scoping is available. Initial results are presented with identity masked – the KN server takes the user's request and sends it to the person, who can accept or reject the request through the KN server without identity ever being revealed. 17
  18. 18. The Business Data Catalog (BDC) crawls and integrates data from other applications [email servers, line-of-business applications, external databases, customer relationship management apps] and puts it into a cache for crawl by the search server. It accesses these repositories with a connector. http://msdn.microsoft.com/en-us/library/ms563661.aspx Available in the MOSS 2007 Search Enterprise edition and both versions of the MOSS 2007 Full Product. 19
  19. 19. Short term: FAST will remain an independent entity that Microsoft will continue to support on the non-Windows platforms, with a connector for MOSS 2007. The next release will see 2 versions of FAST ESP: a stand-alone successor and a SharePoint edition that will incorporate the connector and add new features that require less customization. Relevance by using the underlying semantic relationships •Categorization •Transformation (lemmatization) •Presentation FAST Platform •unity (federation of results from outside resources) •admomentum (search-driven monetization with ad serving) •recommendations (recommendation engine similar to Amazon/Netflix - based on behavior of the user base - cookie based, item to item, people to items) •featured content (search-driven content merchandizing) •fast unity (search-driven portal experiences) Core Capabilities •phrasing and anti-phrasing: strips out the extraneous terms •clustering: comprehension through association •can be taxonomy based or based on the Open Directory •flexible relevancy model: boost/block search results - dynamic on a per-query basis •a whole equalizer with a whole set of knobs - reissues the query with different weights based on choices - ranking more than filtering - does not change the # of results, changes the order of display •can work in conjunction with faceted search 20
  20. 20. Search Scopes Represent a collection of documents mapped to a single element [e.g. authored by, specific directory, file type, metadata type], no longer tied to an index crawl – changes are effective immediately. By default, the scope plug-in will create scopes for the following: •Display URL •Site (domain, sub-domain, host-name) •Author •All content (used to include all content) •Global query exclusions (used to exclude content) Results Collapsing Results collapsing can group duplicated or similar results together, so that they are displayed as one entry in the search result set. This entry includes a link to display the expanded results for that collapsed result set entry. Search administrators can collapse results for the following content item groups: •Duplicates and derivatives of documents •Windows SharePoint Services discussion messages for the same topic •Microsoft Exchange Server public folder messages for the same conversation topic •Current versions of the same document •Different language versions of the same document •Content from the same site By default, results collapsing is turned on in Enterprise Search. The search administrator can configure it, however, either through the Search Administration UI or the Search Administration object model. Security-Trimmed Results: users don’t see what they are not allowed to see. Best Bets: editorially programmed results, or what you want them to want to see. 21
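The results-collapsing behavior above can be sketched as grouping results under a duplicate fingerprint, keeping one representative entry, and attaching the rest behind an "expand" list. The fingerprint here (normalized whitespace and case) is a deliberately crude stand-in for whatever similarity detection the product actually uses, and the result set is hypothetical.

```python
from collections import defaultdict

def collapse(results):
    groups = defaultdict(list)
    for r in results:
        # Crude duplicate fingerprint: normalize case and whitespace.
        key = " ".join(r["content"].lower().split())
        groups[key].append(r)
    collapsed = []
    for group in groups.values():
        entry = dict(group[0])           # the representative result
        entry["duplicates"] = group[1:]  # shown via the "expand" link
        collapsed.append(entry)
    return collapsed

results = [
    {"url": "http://a/doc1",  "content": "Annual Report 2007"},
    {"url": "http://b/copy",  "content": "annual  report 2007"},
    {"url": "http://c/other", "content": "Project plan"},
]
collapsed = collapse(results)
print(len(collapsed))  # 2 entries: the two near-identical results merged
```

The same pattern, keyed on conversation topic, document version, or site instead of content, covers the other collapse groups in the list above.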
  21. 21. 22
  22. 22. 23
  23. 23. Report Center •Dashboard-style data presentation •Keys off a document library of reports •Can import KPIs KPIs are a central way of presenting business intelligence for an organization: high-level goals for the organization or site. KPIs increase the speed and efficiency of evaluating progress against key business goals and reduce the amount of data for analysis. KPIs connect to business data from various sources, consolidating data against the KPI, not the repository. Each KPI gets a single value from a data source, either from a single property or by calculating averages across the selected data, and then compares that value against a pre-selected value. Data sources include: •Excel workbooks: The data comes from an Excel workbook. •SQL Server 2005 Analysis Services: The data comes from database stores known as cubes, via connections in a data connection library. •Manually entered information: The data is from a static list, rather than based on underlying data sources. This is used less frequently, for test purposes prior to deployment or on occasions when regular data sources are unavailable but you still want to provide performance indicators. 24
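The KPI mechanics described above (reduce a data source to a single value, then compare it against pre-selected targets) amount to very little code. The threshold names, sales figures, and traffic-light statuses below are invented for illustration; they are not SharePoint's KPI object model.

```python
def kpi_status(value, goal, warning):
    # Green if the goal is met, yellow above the warning level, else red.
    if value >= goal:
        return "green"
    if value >= warning:
        return "yellow"
    return "red"

# Manually entered data source: hypothetical quarterly sales figures.
sales = [92, 88, 110, 95]
average = sum(sales) / len(sales)  # the KPI's single value

print(average)                                     # 96.25
print(kpi_status(average, goal=100, warning=90))   # yellow
```

Swapping the static list for a cell pulled from an Excel workbook or a measure from an Analysis Services cube changes only where `average` comes from; the comparison step is the same.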
  24. 24. Sometimes configuring search can seem like that big ticking box from Acme… 25
  25. 25. Frank Lloyd Wright said something along the lines of it being easier to take an eraser to the drafting table than a sledgehammer to the construction site. 26
  26. 26. Don’t boil the ocean. A small segment of your content satisfies a significant portion of your customer searches. Search logs, customer feedback, and server logs will reveal which segment that is. 27
  27. 27. 28
  28. 28. HILLTOP Performed on a small subset of the corpus that best represents the nature of the whole. Ranked according to the number of non-affiliated “experts” that point to it – i.e. not in the same site or directory. Affiliation is transitive [if A=B and B=C then A=C]. The beauty of Hilltop is that, unlike PageRank, it is query-specific and reinforces the relationship between the authority and the user’s query. You don’t have to be big or have a thousand links from auto parts sites to be an “authority.” Segmentation of the corpus into broad topics A subset that is then extrapolated to the Web as a whole Selection of authority sources within these topic areas Authorities have lots of non-related pages on the same subject pointing to them Quality of links more important than quantity of links Determination of HUBS (pages that point to many authority sources) Pre-query calculations applied at query time TOPIC-SENSITIVE PR •Consolidation of Hypertext Induced Topic Selection [HITS] and PageRank •Pre-query calculation of factors based on a subset of the corpus: the context of term use in the document, the context of term use in the history of queries, and the context of term use by the user submitting the query •Computes PR based on a set of representational topics [augments PR with content analysis] •Topics derived from the Open Directory •Uses a set of ranking vectors: pre-query selection of topics + at-query comparison of the similarity of the query to topics 29
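The topic-sensitive idea above can be made concrete: run ordinary PageRank, but let the random-jump (teleport) mass land only on pages in a chosen topic set, which biases scores toward that topic. This is a rough sketch of the concept over an invented four-page link graph, not the published algorithm in full.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def topic_pagerank(topic_pages, damping=0.85, iterations=50):
    pages = list(LINKS)
    rank = {p: 1 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            # The bias: teleport mass goes only to the topic set.
            teleport = (1 - damping) / len(topic_pages) if p in topic_pages else 0.0
            # Standard PageRank inbound contribution.
            inbound = sum(rank[q] / len(LINKS[q]) for q in pages if p in LINKS[q])
            new[p] = teleport + damping * inbound
        rank = new
    return rank

biased = topic_pagerank({"d"})          # ranking vector for one topic
uniform = topic_pagerank(set(LINKS))    # uniform teleport = classic PageRank
print(biased["d"], uniform["d"])        # "d" scores higher under its own topic
```

One such ranking vector is precomputed per topic before any query arrives; at query time the engine compares the query to the topics and blends the matching vectors, which is what makes the final ranking query-sensitive where classic PageRank is not.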
  29. 29. 30
  30. 30. 31
  31. 31. 32
  32. 32. 33
  33. 33. 34
  34. 34. 35
  35. 35. During the age of early explorers, map makers would insert this phrase when they reached the edge of their known world. The “dragons” on the following slides are known issues that Ascentium developers have discovered in working with MOSS 2007 search or that I have found through my own research. Few diamonds are flawless. I find it best to address the shortcomings upfront and have solutions in hand to mitigate customer pain. 36
  36. 36. 37
  37. 37. 38
  38. 38. 39
  39. 39. 40
  40. 40. 41
  41. 41. 42
  42. 42. 43
  43. 43. •Advanced auto-classification, taxonomy management and compound term metadata tagging technology •The only statistical metadata generation, auto-classification and taxonomy management vendor in the world that uses concept extraction and compound term processing •Proven to deliver the highest precision without the loss of recall •The only tagging and classification solution fully integrated with MOSS, Microsoft Office, Exchange and Microsoft Enterprise Search •Automatically classifies content at the time of creation or ingestion •Generates compound term metadata (concepts) and stores it in SharePoint properties •Automatic classification within MS Office applications, metadata stored in the document •Taxonomy Manager - supports multiple taxonomies •Priced by server - $95K per production server, $47.5K per staging/test server •Highly scalable •Vertical applications (Legal, Finance, eDiscovery, Services, Oil & Gas, Manufacturing, Government, Education, Life Sciences & Healthcare, Energy & Utilities) •Horizontal applications (ECM, Document Management, Compliance & Risk Management, Records Management, Enterprise Search, Portals, Intranets & Information-Rich Web Sites) 44
  44. 44. Notes: •The weights used in the product were carefully tested. Changes to the weights may also have a negative effect on relevance. •After you set property.weight you must call the property.Update() method to save the change. 45
  45. 45. 46
  46. 46. 47
  47. 47. 48
  48. 48. Used in custom Web parts to execute queries against the enterprise search service http://msdn.microsoft.com/en-us/library/ms544561.aspx 49