JAB2012 Smart Search Presentation

Smart Search and Beyond
Presentation given at J and Beyond, Bad Nauheim, Germany, May 2012.

  1. Smart Search and Beyond
  2. Who?
     Chris Davenport, Production Leadership Team
  3. Solving the search problem
  4. Old Joomla Search Sucks!
     Cannot rank by relevance across content types
     Only very crude filtering
     Can be slow to search
  5. Table of Contents
     01 Smart Search so far
     02 Smart Search in action
     03 Smart Search under the hood
     04 Smart Search tips and tricks
     05 Smart Search where next?
  6. A Short History
     ‣ Old Joomla Search
       • Introduced in Mambo
       • Largely unchanged since
     ‣ JXTended Finder for Joomla 1.5
     ‣ Finder Integration Working Group
       • Smart Search for Joomla 2.5
     ‣ Search Working Group
  7. Smart Search for Joomla 2.5
     ‣ Separate index
     ‣ Auto-completion
     ‣ Faceted search
     ‣ Relevancy ordering
     ‣ Did you mean?
     ‣ ...and more besides
  8. Table of Contents (as on slide 5)
  9. Auto-completion
  10. Another example
  11. Another example
  12. Table of Contents (as on slide 5)
  13. Under the hood
  14. A problem in two halves
  15. First half: Indexing
      Raw data → INDEX
  16. Second half: Querying
      Search queries → INDEX → Search results
  17. Search results
      Search results are rendered purely from data in the index, not the raw data.
  18. Indexing
  19. Indexing
      ‣ Parsing
      ‣ Tokenisation
      ‣ Token aggregation
      ‣ Filtration
      ‣ Stemming
      ‣ Analysis
      ‣ Term weighting
      ‣ Classification
  20. Terms index
  21. Parsing
      ‣ Extract plain text from raw data
        • HTML, RTF supported out-of-the-box
        • PDF, MS Word could be supported
      ‣ For example, HTML
        • Essentially the same as PHP strip_tags
  22. Tokenisation
      ‣ Fold to lowercase
      ‣ Special handling for plus, dash, comma, dot and quotes
      ‣ Remove non-alphanumerics
      ‣ Replace multiple spaces with one space
      ‣ Special support for Chinese
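
The tokenisation steps on slide 22 can be sketched in a few lines. This is a minimal Python illustration, not the Smart Search implementation (which is PHP). The function name `tokenise` is hypothetical. The slide does not say what the "special handling" for plus, dash, comma, dot and quotes is, so the sketch simply treats them like any other non-alphanumeric and turns them into separators. Chinese support is left out.

```python
import re


def tokenise(text: str) -> list[str]:
    """Split text into lowercase tokens (illustrative sketch only)."""
    # Fold to lowercase.
    lowered = text.lower()
    # Replace anything that is not a letter, digit or whitespace with a space.
    # ASSUMPTION: punctuation is treated as a plain separator here.
    cleaned = re.sub(r"[^\w\s]|_", " ", lowered)
    # Replace runs of whitespace with a single space.
    collapsed = re.sub(r"\s+", " ", cleaned).strip()
    return collapsed.split(" ") if collapsed else []


if __name__ == "__main__":
    print(tokenise("On a  Clear-Disk, you CAN seek forever!"))
```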
  23. Token aggregation
      On a clear disk you can seek forever

      One word     Two words      Three words
      on           on a           on a clear
      a            a clear        a clear disk
      clear        clear disk     clear disk you
      disk         disk you       disk you can
      you          you can        you can seek
      can          can seek       can seek forever
      seek         seek forever
      forever
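
The aggregation shown on slide 23 amounts to collecting every run of one, two or three consecutive tokens. A minimal Python sketch follows. The function name `aggregate` and the `max_len` parameter are hypothetical, and the limit of three words is inferred from the slide's example rather than stated on it.

```python
def aggregate(tokens: list[str], max_len: int = 3) -> list[str]:
    """Return every phrase of 1..max_len consecutive tokens, in order."""
    phrases = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            # Skip phrases that would run past the end of the token list.
            if start + length <= len(tokens):
                phrases.append(" ".join(tokens[start:start + length]))
    return phrases


if __name__ == "__main__":
    tokens = "on a clear disk you can seek forever".split()
    for phrase in aggregate(tokens):
        print(phrase)
```

For the eight tokens of the example sentence this yields 8 single words, 7 two-word phrases and 6 three-word phrases.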
  24. Filtration
      ‣ “Stop word removal”
        • Not removed, just given a low weight
      ‣ jos_finder_terms_common
      ‣ English only
        • Other languages need to add their common words to the table
  25. Stemming
      fishing, fished, fisher, fish → fish
  26. Stemming
      ‣ “Snowball” is used by default
        • Danish, German, English, Spanish, Finnish, French, Hungarian, Italian, Norwegian, Dutch, Portuguese, Romanian, Russian, Swedish and Turkish
        • BUT it requires a PHP extension
      ‣ “English only” uses a pure PHP stemmer
        • Recommended for all English sites
  27. Morphological analysis
      ‣ Currently uses Soundex
      ‣ Not used in search as such
      ‣ Used for the “Did you mean?” feature
      ‣ If no search results found, then...
        • Match on Soundex code
        • Return nearest term/phrase by Levenshtein distance
  28. Term weighting

      Context          Multiplier
      Title            1.7
      Text             0.7
      Meta             1.2
      Path             2.0
      Miscellaneous    0.3
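
To make the multipliers on slide 28 concrete, here is a small Python sketch of how a term's weight might be accumulated across contexts. The multiplier values come from the slide. Everything else is an assumption: the slide does not say how the multipliers are combined, so the additive formula, the function name `term_score` and the context keys are purely illustrative.

```python
# Multipliers taken from the slide; the dictionary keys are hypothetical names.
CONTEXT_MULTIPLIERS = {
    "title": 1.7,
    "text": 0.7,
    "meta": 1.2,
    "path": 2.0,
    "misc": 0.3,
}


def term_score(occurrences: dict[str, int]) -> float:
    """Sum multiplier * occurrence count over contexts.

    ASSUMPTION: a simple additive score, chosen only to show the effect
    of the multipliers. The real calculation is not given in the slides.
    """
    return sum(CONTEXT_MULTIPLIERS[context] * count
               for context, count in occurrences.items())


if __name__ == "__main__":
    # A term found once in the title and three times in the body text.
    print(term_score({"title": 1, "text": 3}))
```

Under this assumption a single hit in a path (2.0) outweighs two hits in body text (1.4), which is the kind of bias the multipliers are meant to produce.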
  29. Classification
  30. Taxonomies
      ‣ “Content maps” in Administrator
      ‣ Basis for faceted search
      ‣ Multi-level taxonomies not fully supported (yet)
  31. Taxonomies - drop-downs
  32. Taxonomies - checkboxes
  33. Taxonomies - links
  34. Database ERD
  35. Smart Search Plug-ins

      /plugins
        /content
          /finder
        /system
          /highlight
        /finder
          /categories
          /contacts
          /content
          /newsfeeds
          /weblinks
  36. Smart Search Plug-ins

      content/finder            finder/[type]
      onContentBeforeSave       onFinderBeforeSave
      onContentAfterSave        onFinderAfterSave
      onContentAfterDelete      onFinderAfterDelete
      onContentChangeState      onFinderChangeState
      onCategoryChangeState     onFinderCategoryChangeState
  37. Query parsing

                             URI argument        Query string
      Terms                  q=Some+text         Some text
      Phrases                q="Some+text"       "Some text"
      Logical operators      q=This+and+that     This and that
      Before a date          d1=2012-05-16       before:2012-05-16
      After a date           d2=2012-05-18       after:2012-05-18
      Content type filter    t[]=98233           type:Articles
      Taxonomy filter        t[]=30922           author:Chris Davenport
      Static filter          f=2
      Highlight              qh=Some+text
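
The URI arguments listed on slide 37 can be assembled into a query string with the Python standard library. The sketch below is only an illustration: the function name `build_search_query` and its parameters are hypothetical, the argument names (`q`, `d1`, `d2`, `t[]`, `f`) are the ones shown on the slide, and the taxonomy IDs are the slide's example values, which will differ on any real site.

```python
from urllib.parse import urlencode


def build_search_query(text=None, before=None, after=None,
                       taxonomy_ids=(), static_filter=None) -> str:
    """Build the query-string part of a Smart Search URL (sketch)."""
    pairs = []
    if text is not None:
        pairs.append(("q", text))          # search terms or a quoted phrase
    if before is not None:
        pairs.append(("d1", before))       # "before a date" filter
    if after is not None:
        pairs.append(("d2", after))        # "after a date" filter
    for node_id in taxonomy_ids:
        pairs.append(("t[]", str(node_id)))  # content type / taxonomy filters
    if static_filter is not None:
        pairs.append(("f", str(static_filter)))
    # Keep the square brackets readable instead of percent-encoding them.
    return urlencode(pairs, safe="[]")


if __name__ == "__main__":
    print(build_search_query("Some text", before="2012-05-16",
                             taxonomy_ids=[98233, 30922]))
```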
  38. Results rendering
      ‣ com_finder
        • search (search results page)
          ‣ default.php
          ‣ form.php
          ‣ default_results.php
          ‣ default_result.php
          ‣ default_[type].php (for custom types)
      ‣ mod_finder (search module)
          ‣ default.php
  39. Layout overrides example
  40. Alternative override
  41. Table of Contents (as on slide 5)
  42. Tips and tricks
  43. Tips and tricks
      ‣ HTML Parser
        • Invalid HTML can confuse the parser
        • Invalid UTF8 is ignored
        • Text in attributes is ignored
  44. When to do a purge
      ‣ Indexing is incremental, so most of the time you don't need to.
      ‣ Changes to taxonomies that do not involve changes to content items
      ‣ Changes to term weights
      ‣ Changing the stemmer
      ‣ Changes to content items that do not trigger the standard content events
      ‣ IMPORTANT
        • If you have static filters they will be lost when you do a purge.
  45. Tuning Smart Search
      ‣ Use the CLI for indexing
        • http://docs.joomla.org/Setting_up_automatic_Smart_Search_indexing
      ‣ Out of memory issues
        • Please report out of memory issues so we can understand them better.
        • Reduce batch size
          ‣ Default is 50. Drop it to 5 or even 1.
        • Terms per batch
          ‣ Can be increased BUT NEEDS APACHE SERVER CONFIG CHANGE
  46. Table of Contents (as on slide 5)
  47. Where next?
  48. Search Working Group
      ‣ Meeting at J and Beyond
        • 19 May 2012 11:30 AM
      ‣ Stable ready for merge July 2012
      ‣ Joomla 3.0 release September 2012
      ‣ Meeting at Joomla World Conference
        • San Jose, California, November 2012
  49. Improved language support
      ‣ Improve common word support
      ‣ Improve stemmer support
        • Native PHP stemmers?
      ‣ Improve morphological coding
        • Non-English alternatives to Soundex
      ‣ Mixed language content items
        • Language tagging of tokens/terms?
  50. Other possibilities
      ‣ Preserve static filters on purge/index
      ‣ Decouple indexing via message queues
      ‣ Easier support for range queries
      ‣ Search logging via JLog
      ‣ Variable-length token aggregation
      ‣ Multi-level taxonomies
      ‣ Add parsers for PDF, MS Word
  51. Search API
      ‣ Very important going forward
      ‣ Too big a leap for Joomla 3.0
      ‣ Develop in parallel during 3.x cycle
      ‣ Use in Smart Search for Joomla 4.0
  52. Documentation
      http://docs.joomla.org/Category:Smart_Search
  53. Questions?
  54. Don't forget
      Search Working Group Meeting
      Saturday 19 May 2012, 11:30 AM
  55. Image Credits
      ‣ Haystack - Mark Duncan, CC-BY-SA 2.0 Generic
        http://commons.wikimedia.org/wiki/File%3AHaystack_-_geograph.org.uk_-_462934.jpg
      ‣ Under the hood - ilovebutter, CC-BY 2.0 Generic
        http://commons.wikimedia.org/wiki/File:Trabant_601_S_of_Trabi_Safari_in_Dresden_8.jpg
      ‣ Child sucking thumb - Thahira, CC-BY-SA 3.0 Unported
        http://commons.wikimedia.org/wiki/File:Sucking_finger.jpg
      ‣ Future car - Arthur C. Bade (1899–1975), Science and Mechanics Publishing, Public domain
        http://commons.wikimedia.org/wiki/File:Car_of_the_Future_1950_unrestored.jpg
      ‣ Magician - Kellar: Levitation, magician poster, ca. 1894, CC-BY 2.0 Generic
        http://commons.wikimedia.org/wiki/File:Flickr_-_%E2%80%A6trialsanderrors_-_Kellar,_Levitation,_magician_poster,_ca._1894.jpg
      ‣ Index pages - Starbäck (1828-1885) and Föreningens Boktryckeri, Norrköping, Sweden (scanned by Ristesson Ent.), Public domain
        http://commons.wikimedia.org/wiki/File:Index_Pages.jpg
      ‣ Twenty Questions - DuMont Television/Rosen Studios, New York (photographer), Public domain
        http://commons.wikimedia.org/wiki/File:20_questions_1954.JPG
      ‣ Linnaeus taxonomy - Public domain
        http://commons.wikimedia.org/wiki/File:Linnaeus_-_Regnum_Animale_%281735%29.png
      All other images are Copyright (C) 2012 Chris Davenport unless I've accidentally missed crediting them.