SharePoint 2013 Search Architecture with Russ Houberg
Presentation Transcript

SharePoint 2010                  SharePoint 2013
Managed Property                 (Multiple) Search Schemas
Best Bets                        Promoted Results (Query Rule)
Scope and Federated Location     Result Source
Content By Query                 Content By Search
Incremental Crawl                Continuous Crawl
MCM                              MCSM

Continuous Crawl Benefits
• No more waiting for index merge
• Does not wait for other crawls to complete
• Can have multiple continuous crawls running simultaneously
• Continuous crawls ignore errors

Continuous Crawl Facts
• Runs every 15 minutes by default
• Default interval can be changed with PowerShell
• Should be used instead of incremental crawls for SharePoint content sources
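
The scheduling difference described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how SharePoint schedules crawls internally: the function name, its parameters, and the rule that a non-overlapping crawl simply skips a scheduled tick while the previous crawl is still running are all hypothetical simplifications.

```python
def crawl_start_times(interval_min, crawl_duration_min, horizon_min, allow_overlap):
    """Return the minutes at which a crawl starts within the horizon.

    Toy model (assumption): a crawl is attempted every interval_min minutes.
    If allow_overlap is False, a tick is skipped while the previous crawl
    is still running. If allow_overlap is True, every tick starts a crawl,
    mimicking "does not wait for other crawls to complete".
    """
    starts = []
    busy_until = 0
    tick = 0
    while tick < horizon_min:
        if allow_overlap or tick >= busy_until:
            starts.append(tick)
            busy_until = tick + crawl_duration_min
        tick += interval_min
    return starts


# A 40-minute crawl attempted every 15 minutes over two hours.
waiting = crawl_start_times(15, 40, 120, allow_overlap=False)
overlapping = crawl_start_times(15, 40, 120, allow_overlap=True)
print(waiting)      # [0, 45, 90]
print(overlapping)  # [0, 15, 30, 45, 60, 75, 90, 105]
```

In this model the overlapping schedule picks up changes at every 15-minute tick, while the waiting schedule leaves gaps as long as the crawl itself takes.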

[Diagram: SharePoint 2013 search architecture. Labels: content sources (HTTP, File Share, User Profile, SharePoint, Other Sources); End User Query or Content Process Initiated Query; Crawl Component; Content Processing Component; Index Component with Index Partition(s); Query Processing Component; Analytics Processing Component; Crawl Database(s); Link Database; Event Store; Analytics Database.]

Crawl Component: What it Does
• Crawls content sources to populate index
• Delivers crawl items (binary) and metadata to content processor
• Invokes connectors or protocol handlers to interact with content sources to retrieve data
• Uses one or more crawl databases to store info about crawl items and crawl history

Crawl Component: Important Facts
• We can have multiple crawl components
• MS Recommends: 2 Crawl Components per Search Service Application
• MS Recommends: 8(4vm) CPU / 8GB RAM per Crawl Component

Content Processing Component: What it Does
• Processes crawl items and feeds to index component
• Transforms crawl items into artifacts that can be included in search index (performs document parsing and property mapping)
• Writes information about links and URLs in link database (which are analyzed by analytics to calculate relevance and currency; results are written back to the search index by the content processing component)
• Generates phonetic name variations to improve people search

Content Processing Component: Important Facts
• We must only have one (1) crawl processing component per server; more will hurt, not help, crawl performance
• Max of 2 per search service application
• Feeding Sessions are scaled based on CPU cores using a default coefficient of 3
    8 (cores) * 3 = 24 feeding sessions
    4 (cores) * 3 = 12 feeding sessions
• MS Recommends: 8(4vm) CPU / 8GB RAM per Content Processing Component
• Feeding sessions require RAM. More RAM is necessary when more cores are present; monitoring required
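
The feeding-session arithmetic on this slide can be written as a one-line function. This is a minimal sketch: the function and constant names are made up for illustration, and only the coefficient of 3 and the two worked examples come from the slide.

```python
DEFAULT_FEEDING_COEFFICIENT = 3  # default coefficient quoted on the slide


def feeding_sessions(cpu_cores, coefficient=DEFAULT_FEEDING_COEFFICIENT):
    """Feeding sessions for a server, scaled by CPU cores (cores * coefficient)."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return cpu_cores * coefficient


for cores in (8, 4):
    print(f"{cores} cores -> {feeding_sessions(cores)} feeding sessions")
```

Because each feeding session needs RAM, the result is also a rough indicator of how memory demand grows as cores are added.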

Analytics Processing Component: What it Does
• Runs analytics jobs that analyze crawl items and user interaction with search results to perform both search analytics and usage analytics
• Analyzes: Link & Anchor text, Click distance, Search Clicks, Deep Links, Social Tags, Social Distance, Search Reports, Recommendations, Usage Counts, Activity Ranking
• Improves search relevance and creates search results
• Output included in search index by content processor

Analytics Processing Component: Important Facts
• Maximum of 6 per search service application
• Add more Analytics Processing Components to improve analytics performance
• MS Recommends: 8(4vm) CPU / 8GB RAM / 300GB disk space per Analytics Processing Component
• Interacts with Analytics Reporting to store statistical information
• Interacts with Link database to store information about searches and crawled documents

Index Component: What it Does
• Receives processed items from content processing component and writes the items to the index file
• Receives queries from the query processing component and returns result sets
• Redistributes content among index partitions when index architecture is changed by Search Administration Component

Index Component: Important Facts
• Maximum of 60 index components (20 index partitions x 3 index replicas) per search service application
• Must provision one Index Component for each index replica
• MS Recommends: 8(4vm) CPU / 16GB RAM / 500GB disk space per Index Component
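
The 20 x 3 limit and the one-component-per-replica rule above give a simple count of index components to provision. The sketch below assumes the limits exactly as quoted on the slide; the function and constant names are hypothetical.

```python
MAX_INDEX_PARTITIONS = 20       # per search service application, per the slide
MAX_REPLICAS_PER_PARTITION = 3  # per the slide


def index_components_needed(partitions, replicas_per_partition):
    """One index component must be provisioned for each index replica."""
    if not 1 <= partitions <= MAX_INDEX_PARTITIONS:
        raise ValueError(f"partitions must be between 1 and {MAX_INDEX_PARTITIONS}")
    if not 1 <= replicas_per_partition <= MAX_REPLICAS_PER_PARTITION:
        raise ValueError(
            f"replicas_per_partition must be between 1 and {MAX_REPLICAS_PER_PARTITION}"
        )
    return partitions * replicas_per_partition


print(index_components_needed(20, 3))  # 60, the quoted maximum
print(index_components_needed(4, 2))   # 8
```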

Index Architecture
• An index partition is a logical portion of the entire search index (same as before)
• An index partition is served by one or more index components
• Index components can be a primary "replica" or a secondary "replica"
• The Primary Replica is contacted by the content processing component to write new data in the index
• A Secondary Replica is a read-only copy that gets updated with the data
• Adding replicas improves query performance under load
• Add partitions to handle an increased content corpus
• A partition can't be removed after it has been added
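
The primary/secondary split can be pictured with a small in-memory model. This is a sketch under assumptions, not the product's implementation: the class names are invented, replication is shown as an immediate copy, and queries are spread round-robin across all replicas, which is one plausible reading of "adding replicas improves query performance under load".

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Replica:
    name: str
    items: dict = field(default_factory=dict)


class IndexPartition:
    """Toy partition: writes go to the primary, reads rotate over replicas."""

    def __init__(self, replica_names):
        if not replica_names:
            raise ValueError("a partition needs at least one replica")
        self.replicas = [Replica(name) for name in replica_names]
        self.primary = self.replicas[0]
        self._readers = cycle(self.replicas)

    def write(self, doc_id, text):
        # Content processing contacts only the primary replica...
        self.primary.items[doc_id] = text
        # ...and the secondaries are then updated with the same data
        # (shown here as an immediate copy, which is a simplification).
        for secondary in self.replicas[1:]:
            secondary.items[doc_id] = text

    def query(self, term):
        # Assumption: any replica can answer a query, chosen round-robin.
        replica = next(self._readers)
        hits = sorted(d for d, text in replica.items.items() if term in text)
        return replica.name, hits


partition = IndexPartition(["host-a", "host-b"])
partition.write("doc1", "sharepoint search architecture")
print(partition.query("search"))  # ('host-a', ['doc1'])
print(partition.query("search"))  # ('host-b', ['doc1'])
```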

Query Processing Component: What it Does
• Analyzes and processes queries and results
• After receiving a query, it analyzes and processes the query to optimize precision, recall and relevance
• Submits processed queries to the index component
• Processes the result set returned by the index component before returning to the querying entity

Query Processing Component: Important Facts
• Maximum of 1 per server
• MS Recommends: 8(4vm) CPU / 8GB RAM per Query Processing Component
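
The per-component RAM and disk recommendations quoted on the component slides can be added up for a proposed topology. This is a rough sketch: the dictionary keys and function name are made up, the CPU figure is left out because "8(4vm)" is ambiguous, and a plain sum ignores any sharing when several components run on one server.

```python
# RAM and disk figures as quoted on the component slides.
# disk_gb is 0 where the slide gives no disk recommendation.
RECOMMENDED = {
    "crawl": {"ram_gb": 8, "disk_gb": 0},
    "content_processing": {"ram_gb": 8, "disk_gb": 0},
    "analytics_processing": {"ram_gb": 8, "disk_gb": 300},
    "index": {"ram_gb": 16, "disk_gb": 500},
    "query_processing": {"ram_gb": 8, "disk_gb": 0},
}


def total_resources(component_counts):
    """Sum recommended RAM and disk for a mapping of component -> count."""
    totals = {"ram_gb": 0, "disk_gb": 0}
    for component, count in component_counts.items():
        if component not in RECOMMENDED:
            raise ValueError(f"unknown component: {component}")
        for resource, amount in RECOMMENDED[component].items():
            totals[resource] += amount * count
    return totals


# Example: two index components and two query processing components.
print(total_resources({"index": 2, "query_processing": 2}))
# {'ram_gb': 48, 'disk_gb': 1000}
```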

[Diagram: example farm topology with six hosts (Host 1 to Host 6). Labels: web servers; application servers; Office Web Apps servers; application servers running Query Processing and a Replica of Index partition 0; application servers running Crawl, Admin, Analytics, and Content processing; database hosts holding all SharePoint databases (Search admin db, Link db, Crawl db, Analytics db, SharePoint Config db, all other SharePoint databases). Redundant copies of all databases using SQL clustering, mirroring, or SQL Server 2012 AlwaysOn.]

[Diagram: example farm topology with eight hosts (Host A to Host H). Labels: application servers running Query Processing; Index partitions 0 to 3, each with two replicas; application servers running Analytics, Content processing, Admin, and Crawl; database hosts holding the SharePoint databases (Crawl db, Search admin db, Link db, Analytics db). Redundant copies of all databases using SQL clustering, mirroring, or SQL Server 2012 AlwaysOn.]

[Diagram: example farm topology with eighteen hosts (Host A to Host R). Labels: application servers running Query Processing; Index partitions 0 to 9, each with two replicas; application servers running Analytics, Content processing, Crawl, and Admin; database hosts holding the SharePoint databases (Search admin db, Link db, multiple Crawl dbs, multiple Analytics dbs). Redundant copies of all databases using SQL clustering, mirroring, or SQL Server 2012 AlwaysOn.]

• Schema can be managed by site admins, reducing the load on the search administrator
• Schema can be configured to allow more granularity (query, retrieve, refine, sort, etc.). This affects content index size
• Remote result sources can be crawled locally and then queried by remote farms. Huge impact on geo-distributed search… KL may be able to help!
• Individual items can be re-crawled easily
• Automatic URL balancing in crawl databases minimizes host name restrictions for large archive repositories
• Scalability limit changes will have a big impact on farm design for large archive content repositories in the near future