Integrating Google Search Appliance with Mura CMS
Ajay Sathuluri @sathuluri
About Me
∗ Ajay Sathuluri
∗ Sr. Architect at ICF International
∗ Using ColdFusion since ’98
∗ Server Tuning, Administration, Load Testing
∗ I like spending time with my kids and wife.
What are we covering?
∗ Google Search Appliance
  ∗ Configuring a Crawl
  ∗ Control Access to Content
  ∗ Configuring Database Crawl
  ∗ Collections / Front Ends
  ∗ Crawl Diagnostics
∗ Configuring GSA with Mura CMS Plugin (FW/1)
∗ Search
∗ Search Results
Configuring a Crawl
∗ Before starting a crawl, configure the crawl path so that it includes only the content you want to make available in search results.
∗ Use the Crawl and Index > Crawl URLs page in the Admin Console to enter URLs.
∗ URLs are case-sensitive.
∗ Configure your network to disallow search appliance connectivity outside of your intranet.
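Because crawl URL patterns are case-sensitive, matching behaves like an exact, case-preserving prefix test. A minimal Python sketch of that idea (the pattern lists and URLs here are hypothetical, and this is an illustration of the matching behavior, not the appliance's actual code):

```python
def should_crawl(url, follow_patterns, exclude_patterns):
    """Case-sensitive prefix matching, loosely modeled on GSA
    follow/do-not-crawl URL patterns: exclusions win, then a URL
    must match at least one follow pattern to be crawled."""
    if any(url.startswith(p) for p in exclude_patterns):
        return False
    return any(url.startswith(p) for p in follow_patterns)

# Hypothetical intranet patterns for illustration
follow = ["http://intranet.example.com/"]
exclude = ["http://intranet.example.com/admin/"]

print(should_crawl("http://intranet.example.com/docs/faq.html", follow, exclude))  # True
print(should_crawl("http://intranet.example.com/admin/users", follow, exclude))    # False
print(should_crawl("http://Intranet.Example.com/docs/", follow, exclude))          # False: case matters
```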
Control Access to Content
∗ robots.txt
∗ meta tag
∗ no-crawl directories
Control Access to Content (2)
robots.txt
∗ The Google Search Appliance always obeys the rules in robots.txt; it is not possible to override this behavior.
∗ A robots.txt file is not mandatory.
∗ It is located in the Web server's root directory.
∗ For the search appliance to be able to access the robots.txt file, the file must be public.
∗ It includes one or more Disallow: or Allow: rules, for example:

  User-agent: gsa-crawler
  Disallow: /personal_records/
  Disallow: /admin/
  Allow: /
  Allow: /personal_records/mypersonal.doc
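Rules like those above can be sanity-checked with Python's standard-library robots.txt parser before the appliance crawls. One caveat: Python's parser applies rules in file order, while Google-style crawlers resolve Allow/Disallow precedence differently, so treat this as a quick check only (the single-file Allow exception from the slide is omitted here for that reason):

```python
from urllib.robotparser import RobotFileParser

# A subset of the slide's example rules for the gsa-crawler user agent
robots_txt = """\
User-agent: gsa-crawler
Disallow: /personal_records/
Disallow: /admin/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("gsa-crawler", "/docs/faq.html"))  # True
print(rp.can_fetch("gsa-crawler", "/admin/users"))    # False
```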
Control Access to Content (3)
meta tag
∗ Prevents the search appliance crawler (as well as other crawlers) from indexing a specific HTML page or following its links.
∗ Embed a robots meta tag in the head of the HTML page.
∗ The search appliance crawler obeys the index, noindex, follow, and nofollow values in meta tags.

  <meta name="robots" content="index, nofollow">
  <meta name="robots" content="noindex, nofollow">
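To see how a crawler reads these tags, here is a small sketch that extracts robots directives from a page using Python's standard-library HTML parser (the class name and sample page are made up for illustration):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directive values of <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives += [d.strip().lower()
                                for d in a.get("content", "").split(",")]

page = ('<html><head>'
        '<meta name="robots" content="noindex, nofollow">'
        '</head><body>hello</body></html>')
p = RobotsMetaParser()
p.feed(page)
print(p.directives)  # ['noindex', 'nofollow']
```

A crawler that honors the tag would skip indexing when "noindex" appears, and skip link extraction when "nofollow" appears.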
Control Access to Content (4)
no-crawl directories
∗ The Google Search Appliance does not crawl any directory named "no_crawl."
∗ To prevent the search appliance from crawling files and directories:
  ∗ Create a directory called "no_crawl."
  ∗ Put the files and subdirectories you do not want crawled under the no_crawl directory.
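The effect of the no_crawl convention can be sketched as a directory walk that prunes any directory with that name, so nothing beneath it is ever visited (the sample tree below is hypothetical):

```python
import os
import tempfile

def crawlable_files(root):
    """Yield file paths under root, skipping any directory named
    no_crawl (and everything below it), mirroring the GSA convention."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != "no_crawl"]
        for name in filenames:
            yield os.path.join(dirpath, name)

# Demo: build a tiny tree and walk it
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "docs"))
os.makedirs(os.path.join(base, "no_crawl", "private"))
open(os.path.join(base, "docs", "faq.html"), "w").close()
open(os.path.join(base, "no_crawl", "private", "secret.txt"), "w").close()

found = list(crawlable_files(base))
print(any("faq.html" in f for f in found))    # True
print(any("secret.txt" in f for f in found))  # False
```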
Configuring Database Crawl
∗ Database data source information enables the search appliance to access content stored in a database.
∗ To configure a database crawl, provide the database data source information on the Crawl and Index > Databases page in the Admin Console.
∗ After you create a new database data source, click the Sync link to start a database crawl.
Collections
∗ A collection lets you search over a specific part of the index.
∗ For example, you may want a products collection or a faq collection that supports searches limited to the products or faqs part of your index.
∗ The maximum number of collections for a search appliance is 200.
∗ Use the Crawl and Index > Collections page: in the Collection Name text box, type a name for the new collection.
∗ Manage collections by:
  ∗ Editing a Collection
  ∗ Exporting and Importing a Collection Configuration
  ∗ Deleting a Collection
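At query time, a collection is selected with the GSA search protocol's site parameter. A small sketch of building such a request URL (the host, query, and collection names are hypothetical):

```python
from urllib.parse import urlencode

def gsa_search_url(host, query, collection, front_end="default_frontend"):
    """Build a GSA search URL; 'site' restricts the search to one
    collection and 'client' names the front end serving the results."""
    params = {
        "q": query,
        "site": collection,
        "client": front_end,
        "output": "xml_no_dtd",  # ask for raw XML results
    }
    return "http://%s/search?%s" % (host, urlencode(params))

url = gsa_search_url("gsa.example.com", "pricing", "products")
print(url)
```

A Mura CMS plugin would typically build a URL like this server-side, fetch the XML, and render the results in the site's own layout.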
Front Ends
∗ A front end enables you to change the look and feel of the search and search result pages your users access.
∗ You can customize these pages to display your organization's colors, fonts, and design.
∗ If you have multiple collections, you can make each front end appear in a different format, with its own configuration options.
∗ Use the Serving > Front Ends page: in the Front End Name field, enter a name for the new front end.
∗ Manage front ends by:
  ∗ Editing a Front End
  ∗ Deleting a Front End
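The front end is selected per request with the client parameter, and the proxystylesheet parameter asks the appliance to render results with that front end's stylesheet. A hedged sketch (the host, query, and front-end name are hypothetical):

```python
from urllib.parse import urlencode

# 'client' picks the front end; 'proxystylesheet' tells the appliance
# to format the results with that front end's XSLT instead of raw XML.
params = {
    "q": "pricing",
    "site": "products",
    "client": "intranet_frontend",
    "proxystylesheet": "intranet_frontend",
}
url = "http://gsa.example.com/search?" + urlencode(params)
print(url)
```

Pointing different collections at different front ends this way is how each one gets its own look and feel.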