Using the Fast Search for SharePoint Pipeline to Improve Search

Miles Kehoe, Enterprise Search Guru, presents FAST Search for SharePoint and ways to improve search in your organization by utilizing the document processing pipeline in FAST Search for SharePoint.

Published in: Technology, Business
  • By default, two pipelines are defined: Attachments and Office14
  • http://fs4sp.blogspot.com/2011/05/manipulating-crawled-properties-in-fast.html
  • Using the Fast Search for SharePoint Pipeline to Improve Search

    1. Improving search using the pipeline in FAST Search for SharePoint
       Miles Kehoe, author of Professional Microsoft Search
       Miles.kehoe@ideaeng.com, www.enterprisesearchblog.com, @miles_kehoe, mileskehoe
       ideaeng.com, SurfRay.com
    2. Agenda
       • Introductions
       • When FS4SP makes sense
       • What is the FS4SP indexing pipeline?
       • Why is it important to you?
       • How do you use it?
       • Wrap Up
    3. About Me
       • Founder of New Idea Engineering Inc.
       • Working with enterprise search since 1989
       • Co-author of Professional Microsoft Search (Wrox)
       • Author of several blogs:
           - Enterprisesearchblog.com
           - SearchComponentsOnline.com
       • Search nerd
    4. When to use FS4SP
       Large datasets
       • SP Search indexes up to 100M documents
       • FS4SP is virtually unlimited (650M in tests)
       • Rows and Columns concept
       Need to fine-tune index & search
       • Pipeline
       • Need custom relevance profiles
       • Need to fine-tune queries for relevance
    5. What is the FS4SP indexing pipeline?
       Standard sequence of ‘stages’ from crawl to index
       • Format conversion & language detection
       • Lemmatization / Stemming
       • Entity extraction
       • Map crawled properties to managed properties
       Unique to FAST: the ability to insert custom processing
       • ‘Must’ be just before the mapper
       • C# supported, but any code using STDIN/STDOUT is OK
       • Time critical!
       A great way to fix up messy data!
    6. Pipeline Architecture
       [Diagram. Index flow labels: Data Sources, Crawler, Content Processor, Indexer. Query side labels: User Queries, Query Processor. FS4SP pipeline stages shown: Format Conversion, Language Detection, Lemmatization, Entity Extraction, …, Custom Extensibility, Mapper.]
    7. Why is the pipeline important to you?
       Sometimes content IS messy:
       • URLs with abbreviations
       • Additional metadata is in external sources
       • Geo-tag documents
       Diagnose problems in the indexing process:
       • Identify bad or missing metadata
    8. Examples where the pipeline can save you
       Cryptic URLs
       • With URLs like www.myco.com/mkt/prodmgmt/products.aspx
       • I can add specific metadata to the document: ‘marketing’ (because of ‘mkt’) and ‘product management’ (because of ‘prodmgmt’)
       Adding valuable metadata
       • When I find a user name in a document, I can look up and return the phone number and email
       • When I find a city name, I can geo-tag with latitude and longitude
       Debugging the indexing process
       • When things are not as they seem, I can diagnose problems in the indexing process
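The cryptic-URL example on this slide can be sketched in a few lines. This is only an illustration of the idea, not the presenter's code: the `SEGMENT_TAGS` table and the `tags_for_url` function are hypothetical names, and the two mappings are the ones given on the slide.

```python
from urllib.parse import urlsplit

# Hypothetical abbreviation table; the two entries come from the slide's example.
SEGMENT_TAGS = {
    "mkt": "marketing",
    "prodmgmt": "product management",
}


def tags_for_url(url):
    """Return a metadata tag for each known abbreviation in the URL path."""
    # urlsplit only separates host from path when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    segments = urlsplit(url).path.lower().split("/")
    return [SEGMENT_TAGS[s] for s in segments if s in SEGMENT_TAGS]


print(tags_for_url("www.myco.com/mkt/prodmgmt/products.aspx"))
# prints ['marketing', 'product management']
```

A lookup table keeps the mapping easy to extend, and a dictionary lookup per path segment is cheap, which matters given how time-critical a pipeline stage is.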
    9. How do you use the pipeline?
       Pipeline configuration files in FASTSearch\etc
       • PipelineConfig.xml
       • PipelineExtensibility.xml
       For each Document Processor node:
       • Create an entry for a new ‘processor’
       • Add your new processor name to the <pipelines> node
       • Restart the ‘FAST processor server’ from CMD: psctrl reset
       • Submit a single known test document
       • Check your results
    10. Config Files
    11. Adding a Processor Stage
        On each FAST document processor node:
        • Edit %FASTSEARCH%\etc\pipelineconfig.xml
            <processor name="Spy1" type="general" hidden="0">
              <load module="processors.Spy" class="Spy"/>
              <config>
                <param name="SpyDumpFile" value="var/log/spy.txt" type="str"/>
                <param name="FileStringCutOffLen" value="32768" type="int"/>
              </config>
              <inputs>
              </inputs>
            </processor>
        • In the ‘Document Conversion’ section, add the new pipeline stage to run (in the Office14 pipeline)
            <processor name="Spy1" />
        • Reset (each) document processor node: psctrl reset
    12. FS4SP Pipeline Extensibility
    13. How do you create a custom stage?
        Edit file %FASTSEARCH%\etc\pipelineconfig.xml as above
        Edit file %FASTSEARCH%\etc\PipelineExtensibility.xml
            <PipelineExtensibility>
              <Run command="YourCode.EXE %(input)s %(output)s">
                <Input>
                  <CrawledProperty propertyName="author" propertySet="GUID" varType="31"/>
                </Input>
                <Output>
                  <CrawledProperty propertyName="mytags" propertySet="GUID" varType="31"/>
                  <CrawledProperty propertyName="phone" propertySet="GUID" varType="31"/>
                </Output>
              </Run>
            </PipelineExtensibility>
        Restart the document processors from the command prompt: psctrl reset
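To make the `YourCode.EXE %(input)s %(output)s` contract more concrete, here is a minimal sketch of what such a stage might do. Several things are assumptions rather than facts from the slides: that the input and output files are XML with a `<Document>` root holding `<CrawledProperty>` elements, the hypothetical `PHONE_BOOK` directory, and the function names. Python is used only for brevity; the deck notes that Microsoft supports .NET languages only, so a production stage would more likely be a C# executable with the same shape.

```python
import sys
import xml.etree.ElementTree as ET

# Placeholder copied from the slide; a real stage needs the actual property set GUID.
PROPERTY_SET = "GUID"

# Hypothetical directory standing in for a real lookup source.
PHONE_BOOK = {"Jane Doe": "555-0100"}


def transform(input_xml):
    """Return output XML holding a 'phone' property when the author is known."""
    root = ET.fromstring(input_xml)
    author = ""
    for prop in root.iter("CrawledProperty"):
        if prop.get("propertyName") == "author":
            author = (prop.text or "").strip()

    out = ET.Element("Document")
    phone = PHONE_BOOK.get(author)
    if phone is not None:
        attrs = {"propertySet": PROPERTY_SET, "varType": "31", "propertyName": "phone"}
        ET.SubElement(out, "CrawledProperty", attrs).text = phone
    return ET.tostring(out, encoding="unicode")


def main(input_path, output_path):
    """Read the input file named on the command line and write the output file."""
    with open(input_path, encoding="utf-8") as src:
        result = transform(src.read())
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write(result)


# Only run as a stage when both file paths were passed on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Keeping the lookup in a pure `transform` function makes it easy to exercise with the single known test document the deck recommends, before wiring it into the pipeline.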
    14. Pipeline is performance-critical
        Pipeline runs in a ‘sandbox’ environment
        • NOT the same type of ‘sandbox’ as in O365
        • File I/O only allowed in C:\Users\<fast service user>\AppData\LocalLow
        • Maximum of 10 seconds to live
        • Permissions restricted regardless of FAST Service user permissions
        • Each Document Processor (DP) is an individual instance
        • Only one item passes through a DP at a time
        • If each document takes 1 second, then 10 DPs can process at best 10 docs/sec
        • At that rate, 1 sec for each of 100K docs across 10 DPs is ~ 3 hours!
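The throughput figures on this slide are simple arithmetic, and a small helper makes it easy to try other numbers. The function name is hypothetical, and the estimate assumes perfectly parallel document processors with no other overhead, so it is a best case. The "~ 3 hours" figure follows when the 10 DPs from the preceding bullet are assumed.

```python
def indexing_hours(doc_count, seconds_per_doc, processors):
    """Best-case wall-clock hours when each DP handles one document at a time."""
    total_seconds = doc_count * seconds_per_doc / processors
    return total_seconds / 3600


# 100K documents, 1 second each, spread over 10 document processors.
print(round(indexing_hours(100_000, 1.0, 10), 2))  # prints 2.78
```

With a single document processor the same workload would take roughly 28 hours, which is why per-document time in a custom stage deserves attention.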
    15. Pipeline Hints
        MS only supports:
        • Single custom stage (in PipelineConfig.xml)
        • .NET languages (C#, etc.)
        But:
        • A custom stage can appear in multiple places in PipelineConfig.xml, even with different parameters
        • Theoretically any executable that handles STDIN/STDOUT will do
        • VC#/VC++/VBScript/CMD files seem to work
        • Web service calls are supported
    16. Using web services in Sandbox
        [Diagram. Labels: Web Service, Stage, XML, Config.]
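Because a stage has at most 10 seconds to live, a web service call from inside it probably needs a tight timeout and a fallback. The sketch below shows one way to do that; the 3-second budget, the function names, and the choice to return a default value on failure are all assumptions, not something the deck prescribes.

```python
import urllib.request

# Assumption: leave generous headroom under the sandbox's 10-second limit.
TIME_BUDGET_SECONDS = 3.0


def http_get(url, timeout):
    """Fetch a URL and return the body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def lookup_or_default(url, default="", fetch=http_get, timeout=TIME_BUDGET_SECONDS):
    """Call the web service, falling back to a default if it fails or times out."""
    try:
        return fetch(url, timeout)
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return default
```

Passing the fetch function as a parameter lets the fallback path be tested without a network, and returning a default keeps one slow lookup from stalling a document.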
    17. Ontolica FAST Management
        Ontolica FAST Management provides clear and easy-to-use configuration directly from within the SharePoint admin GUI. Forget XML configuration files, manual file deployments, and tricky PowerShell configuration with easy management consoles.
        Key Features:
        • Backup, Manage, & Deploy Configurations
        • Manage FAST Relevance Profiles
        • Upload & Manage Pipeline Extensions
        • Create & Manage JDBC Connections
        • FAST Webcrawler Configuration
        • Manage FAST Server Processes from Central Admin
    18. Additional Resources
        • This slide deck live at http://slidesha.re/sCGAaP
        • SP2010 ES/FS4SP Blog (Eric Belisle) - http://fs4sp.blogspot.com/
        • Enterprise Search Blog (NIE) - http://www.enterprisesearchblog.com/
        • Search Unleashed (Len Ocsouza) - http://searchunleashed.wordpress.com/
        • ESW Blog - http://www.enterprisesearchwiki.com/wp/
        • TechNet/MSDN/Microsoft
        • And of course: SurfRay.com (Robert Piddocke & Josh Noble)
    19. Q/A & Contact Details
        Miles Kehoe, author of Professional Microsoft Search
        Miles.kehoe@ideaeng.com, www.enterprisesearchblog.com, @miles_kehoe, mileskehoe
        Robert Piddocke, author of Pro SharePoint 2010 Search
        rcp@surfray.com, @rpiddocke, R Piddocke
        ideaeng.com, SurfRay.com
