JAB2012 Smart Search Presentation



Smart Search and Beyond

Presentation given at J and Beyond, Bad Nauheim, Germany, May 2012.




JAB2012 Smart Search Presentation Transcript

  • 1. Smart Search and Beyond
  • 2. Who? Chris Davenport, Production Leadership Team
  • 3. Solving the search problem
  • 4. Old Joomla Search Sucks!
      ‣ Cannot rank by relevance across content types
      ‣ Only very crude filtering
      ‣ Can be slow to search
  • 5. Table of Contents
      01 Smart Search so far
      02 Smart Search in action
      03 Smart Search under the hood
      04 Smart Search tips and tricks
      05 Smart Search where next?
  • 6. A Short History
      ‣ Old Joomla Search
          • Introduced in Mambo
          • Largely unchanged since
      ‣ JXTended Finder for Joomla 1.5
      ‣ Finder Integration Working Group
          • Smart Search for Joomla 2.5
      ‣ Search Working Group
  • 7. Smart Search for Joomla 2.5
      ‣ Separate index
      ‣ Auto-completion
      ‣ Facetted search
      ‣ Relevancy ordering
      ‣ Did you mean?
      ‣ ...and more besides
  • 8. Table of Contents
      01 Smart Search so far
      02 Smart Search in action
      03 Smart Search under the hood
      04 Smart Search tips and tricks
      05 Smart Search where next?
  • 9. Auto-completion
  • 10. Another example
  • 11. Another example
  • 12. Table of Contents
      01 Smart Search so far
      02 Smart Search in action
      03 Smart Search under the hood
      04 Smart Search tips and tricks
      05 Smart Search where next?
  • 13. Under the hood
  • 14. A problem in two halves
  • 15. First half: Indexing (diagram: Raw data → INDEX)
  • 16. Second half: Querying (diagram: Search queries → INDEX → Search results)
  • 17. Search results
      Search results are rendered purely from data in the index, not the raw data.
  • 18. Indexing
  • 19. Indexing
      ‣ Parsing
      ‣ Tokenisation
      ‣ Token aggregation
      ‣ Filtration
      ‣ Stemming
      ‣ Analysis
      ‣ Term weighting
      ‣ Classification
  • 20. Terms index
  • 21. Parsing
      ‣ Extract plain text from raw data
          • HTML, RTF supported out-of-the-box
          • PDF, MS Word could be supported
      ‣ For example, HTML
          • Essentially the same as PHP strip_tags
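
The parsing step on slide 21 can be illustrated with a minimal Python sketch. It is not Joomla's parser (which is PHP); it only shows the general idea of keeping text nodes while dropping tags and attribute values. The names `TextExtractor` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; tags and their attributes are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; attribute values never arrive here.
        self.chunks.append(data)


def extract_text(html):
    """Return the plain text of an HTML fragment (illustrative only)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(extract_text('<p>Smart <a href="#" title="ignored">Search</a></p>'))
```

As in the slide, only element text survives; anything held in attributes such as `title` is not indexed by this sketch.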
  • 22. Tokenisation
      ‣ Fold to lowercase
      ‣ Special handling for plus, dash, comma, dot and quotes
      ‣ Remove non-alphanumerics
      ‣ Replace multiple spaces with one space
      ‣ Special support for Chinese
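
A rough Python sketch of three of the tokenisation steps on slide 22: lowercase folding, removal of non-alphanumerics, and whitespace collapsing. The special handling for plus, dash, comma, dot and quotes, and the Chinese support, are not described in the slides and are not reproduced here; the sketch simply drops those characters, which is an assumption. `tokenise` is a hypothetical name.

```python
import re


def tokenise(text):
    """Split text into lowercase word tokens (simplified illustration)."""
    text = text.lower()                       # fold to lowercase
    text = re.sub(r"[^\w\s]|_", "", text)     # remove non-alphanumerics (assumed: simply dropped)
    text = re.sub(r"\s+", " ", text).strip()  # replace multiple spaces with one space
    return text.split(" ") if text else []


if __name__ == "__main__":
    print(tokenise("On a  CLEAR disk, you can seek forever!"))
```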
  • 23. Token aggregation
      Example: On a clear disk you can seek forever
      ‣ on, on a, on a clear
      ‣ a, a clear, a clear disk
      ‣ clear, clear disk, clear disk you
      ‣ disk, disk you, disk you can
      ‣ you, you can, you can seek
      ‣ can, can seek, can seek forever
      ‣ seek, seek forever
      ‣ forever
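
The aggregation shown on slide 23 amounts to emitting every run of one, two or three consecutive tokens. A minimal Python sketch follows; the function name `aggregate_tokens` and the `max_words` parameter are hypothetical, and the limit of three words is read off the slide's example rather than from the Joomla source.

```python
def aggregate_tokens(tokens, max_words=3):
    """Return every phrase of 1..max_words consecutive tokens, in order."""
    phrases = []
    for start in range(len(tokens)):
        for length in range(1, max_words + 1):
            if start + length > len(tokens):
                break  # not enough tokens left for a longer phrase
            phrases.append(" ".join(tokens[start:start + length]))
    return phrases


if __name__ == "__main__":
    for phrase in aggregate_tokens("on a clear disk you can seek forever".split()):
        print(phrase)
```

For the eight-word example this yields 8 single tokens, 7 two-word phrases and 6 three-word phrases.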
  • 24. Filtration
      ‣ “Stop word removal”
          • Not removed, just given a low weight
      ‣ jos_finder_terms_common
      ‣ English only
          • Other languages need to add their common words to the table
  • 25. Stemming (diagram: fishing, fished, fisher, fish → fish)
  • 26. Stemming
      ‣ “Snowball” is used by default
          • Danish, German, English, Spanish, Finnish, French, Hungarian, Italian, Norwegian, Dutch, Portuguese, Romanian, Russian, Swedish and Turkish
          • BUT it requires a PHP extension
      ‣ “English only” uses a pure PHP stemmer
          • Recommended for all English sites
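
To make the idea of stemming concrete, here is a deliberately naive suffix stripper in Python. It is not the Snowball algorithm and not Joomla's English stemmer; it is a toy that happens to reproduce the fishing/fished/fisher example from slide 25. The suffix list and the `naive_stem` name are assumptions made for illustration.

```python
# Toy suffix list, chosen only to cover the slide's example words.
SUFFIXES = ("ing", "ed", "er", "s")


def naive_stem(word, min_stem=3):
    """Strip one common English suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ("fishing", "fished", "fisher", "fish"):
        print(w, "->", naive_stem(w))
```

A real stemmer applies many more rules and exceptions; this sketch will over-stem or under-stem plenty of ordinary words.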
  • 27. Morphological analysis
      ‣ Currently uses Soundex
      ‣ Not used in search as such
      ‣ Used for the “Did you mean?” feature
      ‣ If no search results found, then...
          • Match on Soundex code
          • Return nearest term/phrase by Levenshtein distance
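
The “Did you mean?” fallback on slide 27 can be sketched in Python as a two-stage lookup: filter indexed terms by Soundex code, then pick the closest by Levenshtein distance. This is an illustration under assumptions, not Joomla's code: it uses the standard American Soundex rules (PHP's built-in may differ in edge cases), works on an in-memory list rather than the index tables, and all function names are hypothetical.

```python
# Letter -> digit table for standard American Soundex.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word):
    """Return the 4-character Soundex code of a word ('' if it has no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with the same code
            prev = digit
    return (code + "000")[:4]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def did_you_mean(query, indexed_terms):
    """Suggest the indexed term that sounds like the query and is closest to it."""
    code = soundex(query)
    candidates = [t for t in indexed_terms if soundex(t) == code]
    if not candidates:
        return None
    return min(candidates, key=lambda t: levenshtein(query.lower(), t.lower()))


if __name__ == "__main__":
    print(did_you_mean("serch", ["search", "surge", "smart", "index"]))
```

Matching on the Soundex code first keeps the candidate set small, so the more expensive edit-distance comparison only runs on terms that already sound alike.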
  • 28. Term weighting
      Context          Multiplier
      Title            1.7
      Text             0.7
      Meta             1.2
      Path             2.0
      Miscellaneous    0.3
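
One way to read the multiplier table on slide 28 is sketched below in Python. The multipliers are taken from the slide; how Joomla actually combines them with term frequency is not stated in the presentation, so the formula here (occurrence count times multiplier, summed over contexts) is purely an assumption, as are the names `MULTIPLIERS` and `term_weight`.

```python
# Context multipliers as listed on the slide.
MULTIPLIERS = {
    "title": 1.7,
    "text": 0.7,
    "meta": 1.2,
    "path": 2.0,
    "misc": 0.3,
}


def term_weight(occurrences):
    """Sum count * multiplier over contexts (assumed formula, for illustration)."""
    return sum(MULTIPLIERS[context] * count for context, count in occurrences.items())


if __name__ == "__main__":
    # A term appearing once in the title and twice in the body text.
    print(term_weight({"title": 1, "text": 2}))
```

Whatever the exact formula, the table implies that a match in the path or title counts for more than the same match in body text.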
  • 29. Classification
  • 30. Taxonomies
      ‣ “Content maps” in Administrator
      ‣ Basis for facetted search
      ‣ Multi-level taxonomies not fully supported (yet)
  • 31. Taxonomies - drop-downs
  • 32. Taxonomies - checkboxes
  • 33. Taxonomies - links
  • 34. Database ERD
  • 35. Smart Search Plug-ins
      /plugins
          /content
              /finder
          /system
              /highlight
          /finder
              /categories
              /contacts
              /content
              /newsfeeds
              /weblinks
  • 36. Smart Search Plug-ins
      content/finder            finder/[type]
      onContentBeforeSave       onFinderBeforeSave
      onContentAfterSave        onFinderAfterSave
      onContentAfterDelete      onFinderAfterDelete
      onContentChangeState      onFinderChangeState
      onCategoryChangeState     onFinderCategoryChangeState
  • 37. Query parsing
                             URI argument       Query string
      Terms                  q=Some+text        Some text
      Phrases                q=”Some+text”      “Some text”
      Logical operators      q=This+and+that    This and that
      Before a date          d1=2012-05-16      before:2012-05-16
      After a date           d2=2012-05-18      after:2012-05-18
      Content type filter    t[]=98233          type:Articles
      Taxonomy filter        t[]=30922          author:Chris Davenport
      Static filter          f=2
      Highlight              qh=Some+text
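
The URI arguments in the slide 37 table can be decoded with a few lines of Python. This sketch only unpacks the arguments named on the slide (`q`, `d1`, `d2`, `t[]`, `f`, `qh`) into a dictionary; it does not resolve taxonomy ids to names or evaluate logical operators, and the function name and the keys of the returned dictionary are hypothetical.

```python
from urllib.parse import parse_qs


def parse_search_args(query_string):
    """Decode Smart Search style URI arguments into a plain dict (illustrative)."""
    args = parse_qs(query_string)

    def first(name):
        # parse_qs returns a list per key; take the first value if present.
        values = args.get(name)
        return values[0] if values else None

    static_filter = first("f")
    return {
        "terms": first("q"),                                   # q=Some+text
        "before": first("d1"),                                 # d1=2012-05-16
        "after": first("d2"),                                  # d2=2012-05-18
        "taxonomy_ids": [int(t) for t in args.get("t[]", [])],  # t[]=98233&t[]=30922
        "static_filter": int(static_filter) if static_filter is not None else None,
        "highlight": first("qh"),                              # qh=Some+text
    }


if __name__ == "__main__":
    print(parse_search_args("q=Some+text&d1=2012-05-16&t[]=98233&t[]=30922&f=2"))
```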
  • 38. Results rendering
      ‣ com_finder (search results page)
          • search
              ‣ default.php
              ‣ form.php
              ‣ default_results.php
              ‣ default_result.php
              ‣ default_[type].php (for custom types)
      ‣ mod_finder (search module)
          ‣ default.php
  • 39. Layout overrides example
  • 40. Alternative override
  • 41. Table of Contents
      01 Smart Search so far
      02 Smart Search in action
      03 Smart Search under the hood
      04 Smart Search tips and tricks
      05 Smart Search where next?
  • 42. Tips and tricks
  • 43. Tips and tricks
      ‣ HTML Parser
          • Invalid HTML can confuse the parser
          • Invalid UTF8 is ignored
          • Text in attributes is ignored
  • 44. When to do a purge
      ‣ Indexing is incremental so most of the time you don't need to.
      ‣ Changes to taxonomies that do not involve changes to content items
      ‣ Changes to term weights
      ‣ Changing the stemmer
      ‣ Changes to content items that do not trigger the standard content events
      ‣ IMPORTANT
          • If you have static filters they will be lost when you do a purge.
  • 45. Tuning Smart Search
      ‣ Use the CLI for indexing
          • http://docs.joomla.org/Setting_up_automatic_Smart_Search_indexing
      ‣ Out of memory issues
          • Please report out of memory issues so we can understand them better.
          • Reduce batch size
              ‣ Default is 50. Drop it to 5 or even 1.
          • Terms per batch
              ‣ Can be increased BUT NEEDS APACHE SERVER CONFIG CHANGE
  • 46. Table of Contents
      01 Smart Search so far
      02 Smart Search in action
      03 Smart Search under the hood
      04 Smart Search tips and tricks
      05 Smart Search where next?
  • 47. Where next?
  • 48. Search Working Group
      ‣ Meeting at J and Beyond
          • 19 May 2012 11:30 AM
      ‣ Stable ready for merge July 2012
      ‣ Joomla 3.0 release September 2012
      ‣ Meeting at Joomla World Conference
          • San Jose, California, November 2012
  • 49. Improved language support
      ‣ Improve common word support
      ‣ Improve stemmer support
          • Native PHP stemmers?
      ‣ Improve morphological coding
          • Non-English alternatives to Soundex
      ‣ Mixed language content items
          • Language tagging of tokens/terms?
  • 50. Other possibilities
      ‣ Preserve static filters on purge/index
      ‣ Decouple indexing via message queues
      ‣ Easier support for range queries
      ‣ Search logging via JLog
      ‣ Variable-length token aggregation
      ‣ Multi-level taxonomies
      ‣ Add parsers for PDF, MS Word
  • 51. Search API
      ‣ Very important going forward
      ‣ Too big a leap for Joomla 3.0
      ‣ Develop in parallel during 3.x cycle
      ‣ Use in Smart Search for Joomla 4.0
  • 52. Documentation: http://docs.joomla.org/Category:Smart_Search
  • 53. Questions?
  • 54. Don't forget: Search Working Group Meeting, Saturday 19 May 2012, 11:30 AM
  • 55. Image Credits
      ‣ Haystack - Mark Duncan, CC-BY-SA 2.0 Generic
        http://commons.wikimedia.org/wiki/File%3AHaystack_-_geograph.org.uk_-_462934.jpg
      ‣ Under the hood - ilovebutter, CC-BY 2.0 Generic
        http://commons.wikimedia.org/wiki/File:Trabant_601_S_of_Trabi_Safari_in_Dresden_8.jpg
      ‣ Child sucking thumb - Thahira, CC-BY-SA 3.0 Unported
        http://commons.wikimedia.org/wiki/File:Sucking_finger.jpg
      ‣ Future car - Arthur C. Bade (1899–1975), Science and Mechanics Publishing, Public domain
        http://commons.wikimedia.org/wiki/File:Car_of_the_Future_1950_unrestored.jpg
      ‣ Magician - Kellar: Levitation, magician poster, ca. 1894, CC-BY 2.0 Generic
        http://commons.wikimedia.org/wiki/File:Flickr_-_%E2%80%A6trialsanderrors_-_Kellar,_Levitation,_magician_poster,_ca._1894.jpg
      ‣ Index pages - Starbäck (1828-1885) and Föreningens Boktryckeri, Norrköping, Sweden (scanned by Ristesson Ent.), Public domain
        http://commons.wikimedia.org/wiki/File:Index_Pages.jpg
      ‣ Twenty Questions - DuMont Television/Rosen Studios, New York (photographer), Public domain
        http://commons.wikimedia.org/wiki/File:20_questions_1954.JPG
      ‣ Linnaeus taxonomy - Public domain
        http://commons.wikimedia.org/wiki/File:Linnaeus_-_Regnum_Animale_%281735%29.png
      All other images are Copyright (C) 2012 Chris Davenport unless I've accidentally missed crediting them.