7.1 Search and Lucene.Net
Lucene.Net was the obvious choice of technology for Search in 7.1. Lucene is a general-purpose search engine, and integrating it with the intricacies of DNN wasn't trivial. Ash was instrumental in the design and development of the new Search in 7.1. Join Ash to hear all about DNN Search and Lucene.Net and what the future looks like.
1. 7.1 Search and Lucene.Net
Ash Prasad
Don’t forget to include #DNNCon in your tweets!
@DNNCon
2. Agenda
• History and New Objectives
• Architecture
• Lucene / Lucene.Net
• Crawlers, Entities, Controllers
• Ranking, Synonyms, Ignore Words, Stemming
• Security Trimming
• Module Integration, New Crawler
3. History of Search
• Platform Edition
  • SQL Server
  • ISearchable
  • [Diagram: Scheduler, Modules (ISearchable), SQL]
• Commercial Edition
  • Lucene 2.9.2
  • URL and Files
  • [Diagram: Scheduler, Lucene]
4. Objectives of New Search
• Handle diverse Content (CMS, Social, Localized, 3rd Party Modules)
• Consistent User Experience
• Simple for Module Developers
• Uniform Architecture
• Feature-based differentiation
6. What’s Lucene
• Java-based indexing and search technology
• Managed by Apache
• NOSQL database
• Near real-time, Spellchecking, Highlighting, Ranking, Synonyms
• Many companies use Lucene directly or customize
• Facebook’s Graph search uses similar ‘Inverted Index’
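
To make the "Inverted Index" idea concrete, here is a minimal sketch in Python. It is not Lucene or Lucene.Net code; the function names and sample documents are made up for illustration. It maps each term to the set of documents that contain it, so a query only touches the entries for its own terms.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term of the query (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample content, keyed by document id.
docs = {
    1: "DNN search uses Lucene",
    2: "Lucene is a search library",
    3: "DNN modules",
}
index = build_inverted_index(docs)
```

For example, `search(index, "lucene search")` intersects the posting sets of the two terms instead of scanning every document.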
7. What’s Lucene.Net
• Line-by-line port from Java to C#
• Maintains high-performance requirements
• A bit behind Java releases
• Who Uses Lucene.Net
  • Products - RavenDB, Orchard, Umbraco, SubText
  • Commercial Sites – BBC UK Top Gear, AutoDesk, Koders.Com
8. Lucene – A Document Store
• Flexible Schema
• Consists of Documents
  • Each Document is a collection of Fields
• Documents can have different sets of Fields
  • Field("ID", "xxx-yyy-999"), Field("Title", "My best doc")
  • Field("Owner", "Ash"), Field("Locale", "en-US")
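
The flexible schema can be pictured with plain dictionaries. This is a Python sketch, not the Lucene.Net `Document`/`Field` API; the second ID and the helper `find_by_field` are assumptions for illustration. Two documents live in the same store even though they share only the `ID` field.

```python
# Two documents in the same store with different sets of fields.
# The second ID is made up for this example.
documents = [
    {"ID": "xxx-yyy-999", "Title": "My best doc"},
    {"ID": "xxx-yyy-998", "Owner": "Ash", "Locale": "en-US"},
]


def find_by_field(docs, field, value):
    """Return documents that have the field and whose value matches exactly."""
    return [doc for doc in docs if field in doc and doc[field] == value]
```

A lookup on `Owner` simply skips the first document, which has no such field; no schema change is needed to add a new field to later documents.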
9. Lucene – A Document Store (Contd.)
• Denormalized (No Referential Integrity)
• Deletion – Done through a flag
• Compact reclaims deleted space
• Update is Delete + Insert
• Boost = Ranking
• Unicode compliant
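
The deletion, compaction, and update points above can be sketched as a toy store in Python. This only illustrates the idea and does not reflect how Lucene lays out its index; the class and method names are hypothetical.

```python
class TinyStore:
    """Toy append-only store: deletes set a flag, compact drops flagged slots."""

    def __init__(self):
        self._slots = []  # each slot is [deleted_flag, document_dict]

    def add(self, doc):
        self._slots.append([False, doc])

    def delete(self, doc_id):
        # Deletion only marks the slot; the space is not reclaimed yet.
        for slot in self._slots:
            if not slot[0] and slot[1]["ID"] == doc_id:
                slot[0] = True

    def update(self, doc):
        # Update is Delete + Insert.
        self.delete(doc["ID"])
        self.add(doc)

    def compact(self):
        # Reclaim the space held by slots flagged as deleted.
        self._slots = [slot for slot in self._slots if not slot[0]]

    def live_docs(self):
        return [doc for deleted, doc in self._slots if not deleted]

    def slot_count(self):
        return len(self._slots)
```

After one `add` and one `update` of the same ID, the store holds two slots but only one live document; `compact()` brings the slot count back to one.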
10. Book consulted for Search
• Book on version 3.0
• ~ 500 pages
• Very useful
12. Crawlers
• Platform
  • Site Crawler
    • Module and Tab Metadata
    • Module Content (ModuleSearchBase/ISearchable)
• Commercial Edition
  • File Crawler
    • Uses IFilter to extract text from PDF/Office files
  • URL Crawler
    • Internal and External URLs
13. Search Entities
• SearchType
  • Distinguishes Crawlers
• SearchDocument
  • Properties for a Content
  • Stored in the Index
• SearchQuery
  • Parameters to execute a Query
• SearchResult
  • Derived from SearchDocument
14. Search Entities – Indexing vs. Querying
15. Controllers
• SearchController
  • For Querying
• InternalSearchController
  • For Adding / Updating / Deleting
• LuceneController
  • Interacts with Lucene
16. Ranking = Boosting
• Doc and/or Field can be boosted in Lucene
• DNN does Field boosts (Default - 10)
  • Title (50)
  • Tag (40)
  • Keyword (35)
  • Description (20)
  • Author (15)
• Configured manually by HostSettings
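
The effect of the field boosts listed above can be sketched in Python. This is a deliberately simplified scoring function, not Lucene's actual ranking formula (which also weighs term frequency across the index and field length); the function name and the `Body` field are assumptions for illustration. Only the boost values come from the slide.

```python
# Boost values from the slide; any other field gets the default of 10.
FIELD_BOOSTS = {
    "Title": 50,
    "Tag": 40,
    "Keyword": 35,
    "Description": 20,
    "Author": 15,
}
DEFAULT_BOOST = 10


def score(doc, term):
    """Sum, over all fields, of (occurrences of term) * (boost of that field)."""
    term = term.lower()
    total = 0
    for field, text in doc.items():
        hits = text.lower().split().count(term)
        total += hits * FIELD_BOOSTS.get(field, DEFAULT_BOOST)
    return total
```

With these weights, one match in `Title` (50) outranks two matches in an unboosted field such as `Body` (2 x 10 = 20).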
17. Synonyms and Ignore Words
• Synonyms are injected into Index
• Ignore Words are removed from Index
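
A rough Python sketch of what happens to text on its way into the index: ignore words are dropped and synonyms are added next to the original word. The word lists and the `analyze` function are hypothetical and do not reflect DNN's actual configuration or analyzer chain.

```python
# Hypothetical word lists; a real site would configure its own.
IGNORE_WORDS = {"the", "a", "of"}
SYNONYMS = {"car": ["automobile"]}


def analyze(text):
    """Return the tokens that would be written to the index for this text."""
    tokens = []
    for word in text.lower().split():
        if word in IGNORE_WORDS:
            continue  # ignore words never reach the index
        tokens.append(word)
        # Inject synonyms so a search for either word finds the document.
        tokens.extend(SYNONYMS.get(word, []))
    return tokens
```

Because the synonym is stored in the index itself, a query for "automobile" matches a document that only ever said "car".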
18. Stemming
• Converts words to their root form
• PorterStemFilter is used
  • country, countries = countri
  • breathe, breathes, breathing, breathed = breath
  • fishing, fished = fish (the Porter algorithm leaves "fisher" unchanged)
19. Security Trimming
• Done through Collectors (Callback)
• Each Doc found is sent to Collector
• Collector accepts or rejects each Doc per Permission
  • Site Crawler - Module / Tab Permission
  • File Crawler - Folder Permission
  • User Crawler – Profile Permission
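
The collector callback pattern can be sketched in Python as follows. This is an illustration of the idea only, not the Lucene.Net `Collector` class or DNN's permission model; the class name, the `view_roles` field, and the role names are assumptions.

```python
class PermissionCollector:
    """Receives every matching document and keeps only those the user may view."""

    def __init__(self, user_roles):
        self.user_roles = set(user_roles)
        self.accepted = []

    def collect(self, doc):
        # Accept when the user holds at least one role allowed to view the doc.
        if self.user_roles & set(doc["view_roles"]):
            self.accepted.append(doc)


def trimmed_search(matches, user_roles):
    """Send each raw match through the collector and return what survives."""
    collector = PermissionCollector(user_roles)
    for doc in matches:
        collector.collect(doc)
    return collector.accepted


# Hypothetical raw matches, before trimming.
matches = [
    {"id": 1, "view_roles": ["All Users"]},
    {"id": 2, "view_roles": ["Administrators"]},
]
```

The same raw matches produce different result lists for different users, because trimming happens per document at query time rather than at indexing time.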
20. Module Integration
• ModuleSearchBase
  • New abstract class with just one method
  • Defined in BusinessControllerClass
• GetModifiedSearchDocuments
  • Returns New, Changed and Deleted content
  • Delta based
  • Granular Permission, Localization, etc.
• ISearchable continues to work (no delta)
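
The delta-based contract can be sketched in Python. This is an analogue of the idea, not DNN's C# `ModuleSearchBase`; the class names, the `since` parameter, and the article fields are assumptions for illustration. The point is that the module returns only what changed after the last crawl, including deleted items.

```python
from abc import ABC, abstractmethod
from datetime import datetime


class ModuleSearchBaseSketch(ABC):
    """Python stand-in for an abstract base class with a single method."""

    @abstractmethod
    def get_modified_search_documents(self, since):
        """Return content added, changed, or deleted after `since`."""


class ArticleModule(ModuleSearchBaseSketch):
    def __init__(self, articles):
        # Each article: {"id", "modified" (datetime), "deleted" (bool)}
        self.articles = articles

    def get_modified_search_documents(self, since):
        # Delta based: unchanged articles are not returned.
        return [a for a in self.articles if a["modified"] > since]


# Hypothetical module content.
articles = [
    {"id": 1, "modified": datetime(2013, 5, 1), "deleted": False},
    {"id": 2, "modified": datetime(2013, 7, 1), "deleted": False},
    {"id": 3, "modified": datetime(2013, 7, 2), "deleted": True},
]
module = ArticleModule(articles)
```

A crawl with `since` set to the previous run returns articles 2 and 3 only; the deleted flag on article 3 tells the indexer to remove it.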
21. New Crawler (How to)
• Define a new SearchType
  • Optionally use IsPrivate to hide from site search
• Implement BaseResultController (2 methods)
  • HasViewPermission
  • GetDocUrl
• Create Scheduled Task
  • Call AddSearchDocuments to inject content
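
The how-to steps above can be sketched in Python. This mirrors the shape of the steps and is not DNN's C# API; every class, the "ticket" search type, the URL pattern, and the in-memory index are hypothetical.

```python
from abc import ABC, abstractmethod


class BaseResultControllerSketch(ABC):
    """Python stand-in for a result controller with two methods."""

    @abstractmethod
    def has_view_permission(self, doc, user):
        """Return True when the user may see this search result."""

    @abstractmethod
    def get_doc_url(self, doc):
        """Return the URL that the search result links to."""


class TicketResultController(BaseResultControllerSketch):
    def has_view_permission(self, doc, user):
        # Assumption for this example: only the owner may see a ticket.
        return doc["owner"] == user

    def get_doc_url(self, doc):
        return "/tickets/" + str(doc["id"])


class FakeIndex:
    """In-memory stand-in for the search index."""

    def __init__(self):
        self.docs = []

    def add_search_documents(self, docs):
        self.docs.extend(docs)


def run_scheduled_crawl(index, tickets):
    """Body of the scheduled task: tag content with its search type and inject it."""
    index.add_search_documents([dict(t, search_type="ticket") for t in tickets])
```

The scheduled task only pushes documents into the index; permission checks and URL resolution are deferred to the result controller at query time.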
23. Recap
• New Search uses Lucene.Net
• Platform has Site Crawler
• Commercial has URL and File Crawlers
• Modules to implement ModuleSearchBase
• New Crawler implements BaseResultController
24. THANKS TO ALL OF OUR GENEROUS SPONSORS!