ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!
Presentation from the European SharePoint Conference 2014 in Barcelona: how we built a solution for indexing 3,000 file shares using self-service solutions and automated crawl management.

Usage Rights

© All Rights Reserved

ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013! Presentation Transcript

  • 1. So you think you can crawl? Stretching the Boundaries of SharePoint 2013! Petter Skodvin-Hvammen AD-Gruppen, Norway
  • 2. Who am I? Petter Skodvin-Hvammen • Solutions Architect • SharePoint Consultant • Search Enthusiast • Community Lead • @pettersh • psh@adgruppen.no • www.adgruppen.no • Image caption: Oseberg ship, discovered 1904 in Tønsberg, Norway; buried by Vikings in 834 AD
  • 3. Enterprise Search • Index thousands of sources • Automate index management • Infrastructure sizing • Challenges and solutions • Not included: code/scripts, user experience, relevancy, governance • www.sharepointeurope.com
  • 4. The Mission… Enterprise Search using SharePoint Server 2013 • 30,000 users • 85 locations in 30 countries • 15,000 daily searches • 100,000,000 documents(?) • 60 core systems, 2,000 applications
  • 5. What do we index? • 100,000,000 documents • 3,000 file shares • 500 servers
  • 6. Where is the data? • Datacenters • Time zones • Bandwidth
  • 7. How can we get it? • Limit bandwidth usage for specific server locations • Limit crawler impact within local business hours • Grant read access to crawler per file share • Avoid token bloat issues with more than 1,015* groups per account • * http://blogs.technet.com/b/shanecothran/archive/2010/07/16/maxtokensize-and-kerberos-token-bloat.aspx
  • 8. How do we operate it? • File shares are created, changed, and deleted every day using a custom self-service solution • File shares are moved between servers every day by automation rules • Indexing and crawling of each file share must be managed with minimum manual effort
  • 9. What can SharePoint do? • Max 50 content sources per service application – Max 500 with October 2013 CU installed • Max 100 start addresses per content source – Max 500 with October 2013 CU installed • Max 20 concurrent crawls per service application – Limitation has been removed • Source: http://technet.microsoft.com/en-us/library/cc262787(v=office.15).aspx#Search
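The limits on slide 9 translate into simple arithmetic. The sketch below is illustrative only: it assumes one start address per file share (the naive layout, before any grouping) and uses hypothetical function names to check how many content sources that would need.

```python
import math

def content_sources_needed(start_addresses, max_per_source):
    """Smallest number of content sources that can hold all start addresses."""
    return math.ceil(start_addresses / max_per_source)

def fits(start_addresses, max_sources, max_per_source):
    """True if the start addresses fit within both limits."""
    return content_sources_needed(start_addresses, max_per_source) <= max_sources

# Assumption: one start address per file share, 3,000 shares (slide 5).
needed_before_cu = content_sources_needed(3000, 100)  # limits without the CU
needed_with_cu = content_sources_needed(3000, 500)    # limits with October 2013 CU
```

Under that assumption the raw count fits in both cases (30 of 50 content sources, or 6 of 500), but every share change would still mean editing a start address by hand, which is the maintenance problem the following slides address.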
  • 10. It’s complicated • More data than we have space for • It’s located all over the place • Everything changes all of the time • There are limitations in SharePoint • Someone’s gotta maintain this • It has to be secure and relevant
  • 11. What did we do? Fewer content sources • Created logical groups of file shares • Used symbolic linking • Diagram: shares \\file01\share01, \\file02\share03, \\file03\share03 are linked as \\file00\share\sym01, \\file00\share\sym02, \\file00\share\sym03 under the start address \\file00\share
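A minimal sketch of the symbolic-linking idea on slide 11, assuming directory symlinks are created under one folder that serves as the crawl start address. The function name and arguments are hypothetical, and the talk does not say which tool created the links; `os.symlink` is used here only for illustration (on Windows it may require elevated privileges).

```python
import os
from typing import Dict, List

def link_shares(start_folder: str, targets: Dict[str, str]) -> List[str]:
    """Create one directory symlink per target under start_folder.

    targets maps a link name (e.g. "sym01") to the real share path.
    Existing links are left untouched so the function can be re-run.
    """
    os.makedirs(start_folder, exist_ok=True)
    created = []
    for link_name, target in sorted(targets.items()):
        link_path = os.path.join(start_folder, link_name)
        if not os.path.islink(link_path):
            os.symlink(target, link_path, target_is_directory=True)
        created.append(link_path)
    return created
```

The crawler then needs only the start folder as a start address, and shares can be added or removed by creating or deleting links rather than by editing the content source.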
  • 12. What did we do? Crawling based on time zones • Grouped file shares based on region • One content source per region • Incremental crawls every night
  • 13. What did we do? Reduced crawler impact • Created a DNS alias per impact rule in etc/hosts on the crawl servers
  • 14. What did we do? Managed pool of crawl accounts • Granted file share access to the account included in the fewest groups • Monitored group memberships • Grouped file shares by crawl account • Crawl rules matched the folder structure • Crawl rules: file://.*/spcrwl01/.* (Include, account SP\spcrwl01), file://.*/spcrwl02/.* (Include, account SP\spcrwl02)
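The account-selection rule on slide 14 can be sketched as follows. This is an assumption-laden illustration, not the presenter's code: the function names are hypothetical, group counts are passed in as a plain dictionary, and granting access to a share is assumed to add one group membership to the chosen account.

```python
MAX_GROUPS = 1015  # group limit per account quoted on slide 7

def pick_crawl_account(group_counts, max_groups=MAX_GROUPS):
    """Return the pool account with the fewest groups, or None if all are full.

    group_counts maps account name -> current number of group memberships.
    An account is eligible only if one more group keeps it within the limit.
    """
    candidates = {a: n for a, n in group_counts.items() if n < max_groups}
    if not candidates:
        return None
    # Ties are broken by account name so the result is deterministic.
    return min(candidates, key=lambda a: (candidates[a], a))

def crawl_rule(account):
    """Include rule matching the account's folder, in the form shown on the slide."""
    return "file://.*/{}/.*".format(account)
```

For example, with `{"spcrwl01": 900, "spcrwl02": 450}` the share would be assigned to `spcrwl02`; a `None` result signals that the pool needs another account.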
  • 15. The bigger picture • Folder structure: <content source>/<crawler impact>/<crawl account>/<symbolic link> • Start addresses: file://<crawler impact>/<content source>/<crawler impact>
    Source | Start address | Folder | Crawl rule | Impact rule
    Europe | file://default/europe/default | europe/default/spcrwl01 | file://.*/spcrwl01/.* | Default
    Europe | file://default/europe/default | europe/default/spcrwl02 | file://.*/spcrwl02/.* | Default
    Europe | file://wait-60/europe/wait-60 | europe/wait-60/spcrwl01 | file://.*/spcrwl01/.* | Wait-60
    Europe | file://wait-60/europe/wait-60 | europe/wait-60/spcrwl02 | file://.*/spcrwl02/.* | Wait-60
    Asia | file://default/asia/default | asia/default/spcrwl01 | file://.*/spcrwl01/.* | Default
    Asia | file://default/asia/default | asia/default/spcrwl02 | file://.*/spcrwl02/.* | Default
    Asia | file://wait-60/asia/wait-60 | asia/wait-60/spcrwl01 | file://.*/spcrwl01/.* | Wait-60
    Asia | file://wait-60/asia/wait-60 | asia/wait-60/spcrwl02 | file://.*/spcrwl02/.* | Wait-60
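The naming scheme on slide 15 is regular enough to generate. The sketch below rebuilds the slide's rows from the two patterns (folder structure and start address); the helper names are hypothetical and the code only formats strings, it does not talk to SharePoint.

```python
from itertools import product

def start_address(source, impact):
    """file://<crawler impact>/<content source>/<crawler impact>"""
    return "file://{impact}/{source}/{impact}".format(impact=impact, source=source)

def folder(source, impact, account, link=None):
    """<content source>/<crawler impact>/<crawl account>[/<symbolic link>]"""
    parts = [source, impact, account]
    if link:
        parts.append(link)
    return "/".join(parts)

def crawl_rule(account):
    """Include rule that matches everything below the account's folder."""
    return "file://.*/{}/.*".format(account)

def rows(sources, impacts, accounts):
    """One (source, start address, folder, crawl rule, impact) row per combination."""
    return [
        (s, start_address(s, i), folder(s, i, a), crawl_rule(a), i)
        for s, i, a in product(sources, impacts, accounts)
    ]

TABLE = rows(["europe", "asia"], ["default", "wait-60"], ["spcrwl01", "spcrwl02"])
```

Two regions, two impact rules and two crawl accounts give the eight rows on the slide, with only four distinct start addresses and two crawl rules to maintain.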
  • 16. How did we manage this? • Self-service portal for enabling indexing of file shares • Custom web service integration in the self-service portal • Custom solution for granting access to crawl accounts • Custom timer job to get the list of file shares to crawl from the self-service portal • Custom timer job for creating and removing symbolic links • Custom lists for mapping server to content source, schedule and impact; shares to crawl accounts and metadata; UNC to symlink • Content enrichment service for replacing symlinks in paths with actual file paths
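The last component on slide 16, replacing symlinks in paths with actual file paths, comes down to a prefix lookup. The sketch below shows only that mapping logic under stated assumptions: the mapping dictionary stands in for the UNC-to-symlink custom list, the example paths assume the backslash layout `\\default\europe\default\spcrwl01\e5dc12a41d` for the symlink shown on slide 17, and the function name is hypothetical. It is not the enrichment service itself.

```python
def resolve_symlink(path, mapping):
    """Replace the longest matching symlink prefix in path with its real UNC path.

    mapping: symlink UNC prefix -> real UNC prefix (no trailing backslash).
    Matching is case-insensitive; unmatched paths are returned unchanged.
    """
    lowered = path.lower()
    for prefix in sorted(mapping, key=len, reverse=True):
        p = prefix.lower()
        # Match the prefix itself or the prefix followed by a path separator.
        if lowered == p or lowered.startswith(p + "\\"):
            return mapping[prefix] + path[len(prefix):]
    return path

# Hypothetical mapping row, modelled on the custom list example on slide 17.
SYMLINKS = {r"\\default\europe\default\spcrwl01\e5dc12a41d": r"\\file01\share01"}
```

With this mapping, `\\default\europe\default\spcrwl01\e5dc12a41d\Projects\plan.docx` resolves to `\\file01\share01\Projects\plan.docx`, so search results can show the path users actually know.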
  • 17. Example: Self Service Portal • Title: European SharePoint Conference • Owner: Petter Skodvin-Hvammen • Business Area: Consulting • Classification: Internal • Type: Project • UNC Path: Assigned automatically • Crawl Account: Assigned automatically • Buttons: Cancel, Save
    Example: Custom Lists • Title: European SharePoint Conference • Owner: Petter Skodvin-Hvammen • Business Area: Consulting • Classification: Internal • Type: Project • UNC Path: \\file01\share01 • Crawl Account: SP\spcrwl01 • Symlink: \\default\europe\default\spcrwl01\e5dc12a41d • Location: europe (server file01 is located in Oslo DC) • Bandwidth: 5 Mbps
  • 18. [Farm topology diagram] Roles shown: Query/WFE, Crawling, Central Admin, Admin, Analytics, Caching, multiple Doc Proc and Enrichment components, and index partitions Index-0 to Index-3 (each shown twice) • Sized for 40 million documents and 10 queries/second • SQL Server: Admin DB, Analytics DB, Crawl DB, Link DB, other SP DBs
  • 19. Capacity testing • Purpose: – Crawling of symbolic links – Scaling of virtual machines – Sizing of disk space – Verify Microsoft’s advice • Approach: – 4-server farm with 2 partitions – 8 vCPU, 16 GB RAM, 850 GB – Crawl 10 file shares (3.7M files) – Replay top 300 queries – Apache JMeter
  • 20. Capacity testing – findings • Crawl rate declined 1% per million items indexed • Query latency increased exponentially from 12 million items indexed per partition • Database latency was insignificant during crawling • Successfully crawled file shares via symbolic directory links • Disk space usage was significantly lower than expected – Reduced data volume from 850 GB to 450 GB – 40+ servers => huge cost savings
  • 21. Infrastructure – VM sizing • Dedicated ESX cluster • 14 x VM for SharePoint 2013 – 4 physical machines – 4 x 32 = 128 CPUs – 4 x 256 = 1024 GB memory • HA max utilization = ¾ – 3 x 32 = 96 CPUs – 3 x 256 = 768 GB memory • CPU and memory can be over-committed • CPU over-committed by a factor of 1.34 (1.78 if one physical host fails) • VMs must wait for physical CPUs: wait time for 8 CPUs = 2 x wait time for 4 CPUs • Mitigation: a) reduce allocated virtual CPUs, or b) increase physical CPUs • Memory factor 0.44 (0.59) • Reserved and locked memory prevents HA failover
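The sizing figures on slide 21 can be checked with a few lines of arithmetic. In the sketch below, 256 GB per host follows from 1024 GB across 4 hosts; the allocated totals (171 vCPUs and 450 GB) are not stated on the slide and are assumptions back-calculated from the quoted factors, included only to show that the numbers are mutually consistent.

```python
CPUS_PER_HOST = 32
MEM_GB_PER_HOST = 256  # 1024 GB / 4 hosts

def capacity(hosts):
    """Physical (CPUs, memory in GB) available on the given number of hosts."""
    return hosts * CPUS_PER_HOST, hosts * MEM_GB_PER_HOST

def commit_factor(allocated, physical):
    """Allocated virtual resources divided by physical resources, 2 decimals."""
    return round(allocated / physical, 2)

full_cpus, full_mem = capacity(4)  # all four hosts running
ha_cpus, ha_mem = capacity(3)      # one host lost (HA scenario)

# Assumed allocations, inferred from the slide's factors (not given in the source).
ALLOCATED_VCPUS = 171
ALLOCATED_MEM_GB = 450

cpu_factors = (commit_factor(ALLOCATED_VCPUS, full_cpus),
               commit_factor(ALLOCATED_VCPUS, ha_cpus))
mem_factors = (commit_factor(ALLOCATED_MEM_GB, full_mem),
               commit_factor(ALLOCATED_MEM_GB, ha_mem))
```

A factor above 1 means more virtual CPUs are allocated than physical ones exist, which is why the slide lists reducing vCPUs or adding physical CPUs as mitigations; the memory factors stay well below 1 in both scenarios.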
  • 22. Infrastructure – VM tuning (peak and average CPU usage calculated over 30 days)
    DC | Role | vCPU | Peak | Average | Calculated | Recommended | Change
    A | Web, Query, Admin | 8 | 187.55 | 37.03 | 2 | 4 | -4
    B | Web, Query, Admin | 8 | 621.88 | 92.69 | 8 | 8 | 0
    A | Crawl, Analytics, Content, CEWS, Central Admin | 8 | 724.35 | 210.59 | 8 | 8 | 0
    B | Crawl, Analytics, Content, CEWS, Symbolic Links | 8 | 724.56 | 198.44 | 8 | 8 | 0
    A | Index 0, Content, CEWS | 8 | 486.18 | 62.55 | 6 | 6 | -2
    B | Index 0, Content, CEWS | 8 | 520.63 | 63.98 | 6 | 6 | -2
    A | Index 1, Content, CEWS | 8 | 547.08 | 69.3 | 6 | 6 | -2
    B | Index 1, Content, CEWS | 8 | 546.44 | 91.74 | 6 | 6 | -2
    A | Index 2, Content, CEWS | 8 | 491.38 | 65.6 | 6 | 6 | -2
    B | Index 2, Content, CEWS | 8 | 532.01 | 77.83 | 6 | 6 | -2
    A | Index 3, Content, CEWS | 8 | 540.45 | 78.72 | 6 | 6 | -2
    B | Index 3, Content, CEWS | 8 | 621.88 | 92.69 | 8 | 8 | 0
    A | Distributed Cache | 4 | 91.71 | 5.99 | 2 | 2 | -2
    B | Distributed Cache* (added later) | - | - | - | - | - | -
    Total | | 100 | | | 78 | 80 | -20
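The slide does not say how the Calculated column was derived. One rule that happens to reproduce every row is sketched below: treat Peak as a percentage of a single CPU, round up to whole CPUs, then round up to an even number. Both the rule and the reading of the Peak unit are assumptions inferred from the figures, not something the presentation states.

```python
import math

def calculated_vcpu(peak_percent):
    """Whole CPUs needed for the peak load, rounded up to an even count.

    Assumes peak_percent is a percentage of one CPU (e.g. 621.88 ~ 6.2 CPUs).
    """
    cpus = math.ceil(peak_percent / 100)
    return cpus + (cpus % 2)

# (peak, calculated) pairs from the table on slide 22, in row order.
ROWS = [
    (187.55, 2), (621.88, 8), (724.35, 8), (724.56, 8),
    (486.18, 6), (520.63, 6), (547.08, 6), (546.44, 6),
    (491.38, 6), (532.01, 6), (540.45, 6), (621.88, 8),
    (91.71, 2),
]

matches = all(calculated_vcpu(peak) == calc for peak, calc in ROWS)
total_calculated = sum(calc for _, calc in ROWS)
```

The Recommended column differs from Calculated only for the first web server (4 instead of 2), which suggests a manually applied minimum rather than a different formula.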
  • 23. Summary 1. Indexing thousands of content sources 2. Automation for rapidly changing index requirements 3. Sizing the infrastructure for performance and HA
  • 24. Questions? • petter.skodvin-hvammen@adgruppen.no • http://linkedin.com/in/petterskodvin • @pettersh