JAB2012 Smart Search Presentation


Smart Search and Beyond
Presentation given at J and Beyond, Bad Nauheim, Germany, May 2012.

Published in: Technology

  1. 1. Smart Search and Beyond
  2. 2. Who? Chris Davenport, Production Leadership Team
  3. 3. Solving the search problem
  4. 4. Old Joomla Search Sucks!
       ‣ Cannot rank by relevance across content types
       ‣ Only very crude filtering
       ‣ Can be slow to search
  5. 5. Table of Contents: 01 Smart Search so far; 02 Smart Search in action; 03 Smart Search under the hood; 04 Smart Search tips and tricks; 05 Smart Search where next?
  6. 6. A Short History
       ‣ Old Joomla Search
           • Introduced in Mambo
           • Largely unchanged since
       ‣ JXTended Finder for Joomla 1.5
       ‣ Finder Integration Working Group
           • Smart Search for Joomla 2.5
       ‣ Search Working Group
  7. 7. Smart Search for Joomla 2.5
       ‣ Separate index
       ‣ Auto-completion
       ‣ Facetted search
       ‣ Relevancy ordering
       ‣ Did you mean?
       ‣ ...and more besides
  8. 8. Table of Contents: 01 Smart Search so far; 02 Smart Search in action; 03 Smart Search under the hood; 04 Smart Search tips and tricks; 05 Smart Search where next?
  9. 9. Auto-completion
  10. 10. Another example
  11. 11. Another example
  12. 12. Table of Contents: 01 Smart Search so far; 02 Smart Search in action; 03 Smart Search under the hood; 04 Smart Search tips and tricks; 05 Smart Search where next?
  13. 13. Under the hood
  14. 14. A problem in two halves
  15. 15. First half: Indexing (diagram: Raw data → INDEX)
  16. 16. Second half: Querying (diagram: Search queries → INDEX → Search results)
  17. 17. Search results: Search results are rendered purely from data in the index, not the raw data.
  18. 18. Indexing
  19. 19. Indexing: Parsing, Tokenisation, Token aggregation, Filtration, Stemming, Analysis, Term weighting, Classification
  20. 20. Terms index
  21. 21. Parsing
        ‣ Extract plain text from raw data
            • HTML, RTF supported out-of-the-box
            • PDF, MS Word could be supported
        ‣ For example, HTML
            • Essentially the same as PHP strip_tags
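
The parsing step can be illustrated with a minimal sketch. It is written in Python rather than PHP and is not the Smart Search parser; the class name `PlainTextExtractor` and the function `html_to_text` are hypothetical. It keeps only text nodes, so tags and attribute values are dropped, which is roughly what `strip_tags` does.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collects text nodes; tags and their attributes are discarded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags (entities are already decoded).
        self.chunks.append(data)


def html_to_text(html):
    """Return the plain text of an HTML fragment (illustrative only)."""
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the markup.
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    print(html_to_text('<p>Hello <a title="ignored" href="#">world</a></p>'))
```

Note that the `title` attribute text never reaches the output, so anything stored only in attributes would not be indexed by a parser of this kind.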
  22. 22. Tokenisation
        ‣ Fold to lowercase
        ‣ Special handling for plus, dash, comma, dot and quotes
        ‣ Remove non-alphanumerics
        ‣ Replace multiple spaces with one space
        ‣ Special support for Chinese
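
A rough sketch of the simpler tokenisation steps, in Python for illustration only. The function name `tokenise` is hypothetical, and the special handling for plus, dash, comma, dot, quotes and Chinese is deliberately left out: here every non-alphanumeric character is simply treated as a separator.

```python
import re


def tokenise(text):
    """Lowercase, strip non-alphanumerics, collapse whitespace, split."""
    text = text.lower()
    # Replace anything that is not a letter, digit or whitespace with a space.
    # (\w also matches "_", so underscores are replaced explicitly.)
    text = re.sub(r"[^\w\s]|_", " ", text)
    # str.split() with no argument collapses runs of whitespace.
    return text.split()


if __name__ == "__main__":
    print(tokenise("On a  CLEAR disk, you can seek forever!"))
```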
  23. 23. Token aggregation
        Example: "On a clear disk you can seek forever"
        ‣ Two-word phrases: on a, a clear, clear disk, disk you, you can, can seek, seek forever
        ‣ Three-word phrases: on a clear, a clear disk, clear disk you, disk you can, you can seek, can seek forever
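
Token aggregation of this kind can be sketched as a sliding window over the token list. This is an illustration in Python, not the Smart Search code; the function name `aggregate` and the `max_words` parameter are assumptions, with the limit of three words taken from the phrases shown on the slide.

```python
def aggregate(tokens, max_words=3):
    """Return every run of 1..max_words consecutive tokens as a phrase."""
    phrases = []
    for size in range(1, max_words + 1):
        # Slide a window of `size` tokens across the list.
        for start in range(len(tokens) - size + 1):
            phrases.append(" ".join(tokens[start:start + size]))
    return phrases


if __name__ == "__main__":
    tokens = "on a clear disk you can seek forever".split()
    for phrase in aggregate(tokens):
        print(phrase)
```

For the eight-word example this yields 8 single terms, 7 two-word phrases and 6 three-word phrases, which gives a feel for how quickly the terms index grows with a fixed window.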
  24. 24. Filtration
        ‣ "Stop word removal"
            • Not removed, just given a low weight
        ‣ jos_finder_terms_common
        ‣ English only
            • Other languages need to add their common words to the table
  25. 25. Stemming: fishing, fished, fish, fisher → fish
  26. 26. Stemming
        ‣ "Snowball" is used by default
            • Danish, German, English, Spanish, Finnish, French, Hungarian, Italian, Norwegian, Dutch, Portuguese, Romanian, Russian, Swedish and Turkish
            • BUT it requires a PHP extension
        ‣ "English only" uses a pure PHP stemmer
            • Recommended for all English sites
  27. 27. Morphological analysis
        ‣ Currently uses Soundex
        ‣ Not used in search as such
        ‣ Used for the "Did you mean?" feature
        ‣ If no search results found, then...
            • Match on Soundex code
            • Return nearest term/phrase by Levenshtein distance
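
The "Did you mean?" fallback described above can be sketched as follows. This is a self-contained Python illustration, not the Smart Search implementation: the function names are hypothetical, `soundex` follows the common American Soundex rules (which may differ in detail from PHP's `soundex()`), and the candidate terms are passed in as a plain list rather than read from the index.

```python
# Soundex digit for each consonant; vowels, h, w and y have no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(word):
    """Four-character Soundex code, e.g. a letter followed by three digits."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != previous:
            result += code
        if c not in "hw":  # h and w do not separate equal codes
            previous = code
    return (result + "000")[:4]


def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            row.append(min(previous_row[j] + 1,                 # deletion
                           row[j - 1] + 1,                      # insertion
                           previous_row[j - 1] + (ca != cb)))   # substitution
        previous_row = row
    return previous_row[-1]


def did_you_mean(query, indexed_terms):
    """Suggest the indexed term that sounds alike and is closest in spelling."""
    target = soundex(query)
    candidates = [t for t in indexed_terms if soundex(t) == target]
    if not candidates:
        return None
    return min(candidates, key=lambda t: levenshtein(query, t))


if __name__ == "__main__":
    print(did_you_mean("serch", ["serge", "search", "smart"]))
```

Matching on the Soundex code first keeps the candidate set small, so the more expensive edit-distance comparison only runs on terms that already sound similar.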
  28. 28. Term weighting
        Context         Multiplier
        Title           1.7
        Text            0.7
        Meta            1.2
        Path            2.0
        Miscellaneous   0.3
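
To show how context multipliers like these could influence ranking, here is a small Python sketch. The multiplier values come from the slide; the scoring rule (multiplier times number of occurrences, summed over contexts) and the names `MULTIPLIERS` and `weighted_score` are assumptions made for illustration, not the formula Smart Search actually uses.

```python
# Context multipliers as listed on the slide.
MULTIPLIERS = {
    "title": 1.7,
    "text": 0.7,
    "meta": 1.2,
    "path": 2.0,
    "misc": 0.3,
}


def weighted_score(occurrences):
    """Sum of (multiplier x occurrence count) over the contexts of one term.

    `occurrences` maps a context name to how often the term appears there.
    """
    return sum(MULTIPLIERS[context] * count
               for context, count in occurrences.items())


if __name__ == "__main__":
    # A term found once in the title outranks one found twice in body text.
    print(weighted_score({"title": 1}) > weighted_score({"text": 2}))
```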
  29. 29. Classification
  30. 30. Taxonomies
        ‣ "Content maps" in Administrator
        ‣ Basis for facetted search
        ‣ Multi-level taxonomies not fully supported (yet)
  31. 31. Taxonomies - drop-downs
  32. 32. Taxonomies - checkboxes
  33. 33. Taxonomies - links
  34. 34. Database ERD
  35. 35. Smart Search Plug-ins
        /plugins
            /content
                /finder
            /system
                /highlight
            /finder
                /categories
                /contacts
                /content
                /newsfeeds
                /weblinks
  36. 36. Smart Search Plug-ins
        content/finder            finder/[type]
        onContentBeforeSave       onFinderBeforeSave
        onContentAfterSave        onFinderAfterSave
        onContentAfterDelete      onFinderAfterDelete
        onContentChangeState      onFinderChangeState
        onCategoryChangeState     onFinderCategoryChangeState
  37. 37. Query parsing
                              URI argument       Query string
        Terms                 q=Some+text        Some text
        Phrases               q="Some+text"      "Some text"
        Logical operators     q=This+and+that    This and that
        Before a date         d1=2012-05-16      before:2012-05-16
        After a date          d2=2012-05-18      after:2012-05-18
        Content type filter   t[]=98233          type:Articles
        Taxonomy filter       t[]=30922          author:Chris Davenport
        Static filter         f=2
        Highlight             qh=Some+text
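
As an illustration of the URI arguments listed above, the following Python sketch assembles a query string from them. The parameter names `q`, `d1`, `d2`, `t[]` and `f` are taken from the slide; the helper `build_search_query` and its keyword arguments are hypothetical, and the script only builds a string without contacting any site.

```python
from urllib.parse import urlencode


def build_search_query(terms, before=None, after=None,
                       taxonomy_ids=(), static_filter=None):
    """Build the query-string part of a Smart Search URL (illustrative)."""
    params = [("q", terms)]
    if before:
        params.append(("d1", before))      # "before a date", as on the slide
    if after:
        params.append(("d2", after))       # "after a date", as on the slide
    for node_id in taxonomy_ids:
        params.append(("t[]", str(node_id)))  # one t[] entry per filter
    if static_filter is not None:
        params.append(("f", str(static_filter)))
    # urlencode turns spaces into "+" and percent-encodes the brackets.
    return urlencode(params)


if __name__ == "__main__":
    print(build_search_query("Some text", before="2012-05-16",
                             taxonomy_ids=[98233]))
```

The brackets in `t[]` come out percent-encoded (`t%5B%5D`), which is an equivalent spelling of the same parameter name.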
  38. 38. Results rendering
        ‣ com_finder
            • search (search results page)
                ‣ default.php
                ‣ form.php
                ‣ default_results.php
                ‣ default_result.php
                ‣ default_[type].php (for custom types)
        ‣ mod_finder (search module)
            ‣ default.php
  39. 39. Layout overrides example
  40. 40. Alternative override
  41. 41. Table of Contents: 01 Smart Search so far; 02 Smart Search in action; 03 Smart Search under the hood; 04 Smart Search tips and tricks; 05 Smart Search where next?
  42. 42. Tips and tricks
  43. 43. Tips and tricks
        ‣ HTML Parser
            • Invalid HTML can confuse the parser
            • Invalid UTF8 is ignored
            • Text in attributes is ignored
  44. 44. When to do a purge
        ‣ Indexing is incremental, so most of the time you don't need to.
        ‣ Changes to taxonomies that do not involve changes to content items
        ‣ Changes to term weights
        ‣ Changing the stemmer
        ‣ Changes to content items that do not trigger the standard content events
        ‣ IMPORTANT
            • If you have static filters they will be lost when you do a purge.
  45. 45. Tuning Smart Search
        ‣ Use the CLI for indexing
            • http://docs.joomla.org/Setting_up_automatic_Smart_Search_indexing
        ‣ Out of memory issues
            • Please report out of memory issues so we can understand them better.
            • Reduce batch size
                ‣ Default is 50. Drop it to 5 or even 1.
            • Terms per batch
                ‣ Can be increased BUT NEEDS APACHE SERVER CONFIG CHANGE
  46. 46. Table of Contents: 01 Smart Search so far; 02 Smart Search in action; 03 Smart Search under the hood; 04 Smart Search tips and tricks; 05 Smart Search where next?
  47. 47. Where next?
  48. 48. Search Working Group
        ‣ Meeting at J and Beyond
            • 19 May 2012 11:30 AM
        ‣ Stable ready for merge July 2012
        ‣ Joomla 3.0 release September 2012
        ‣ Meeting at Joomla World Conference
            • San Jose, California, November 2012
  49. 49. Improved language support
        ‣ Improve common word support
        ‣ Improve stemmer support
            • Native PHP stemmers?
        ‣ Improve morphological coding
            • Non-English alternatives to Soundex
        ‣ Mixed language content items
            • Language tagging of tokens/terms?
  50. 50. Other possibilities
        ‣ Preserve static filters on purge/index
        ‣ Decouple indexing via message queues
        ‣ Easier support for range queries
        ‣ Search logging via JLog
        ‣ Variable-length token aggregation
        ‣ Multi-level taxonomies
        ‣ Add parsers for PDF, MS Word
  51. 51. Search API
        ‣ Very important going forward
        ‣ Too big a leap for Joomla 3.0
        ‣ Develop in parallel during 3.x cycle
        ‣ Use in Smart Search for Joomla 4.0
  52. 52. Documentation: http://docs.joomla.org/Category:Smart_Search
  53. 53. Questions?
  54. 54. Don't forget: Search Working Group Meeting, Saturday 19 May 2012, 11:30 AM
  55. 55. Image Credits
        ‣ Haystack - Mark Duncan, CC-BY-SA 2.0 Generic
          http://commons.wikimedia.org/wiki/File%3AHaystack_-_geograph.org.uk_-_462934.jpg
        ‣ Under the hood - ilovebutter, CC-BY 2.0 Generic
          http://commons.wikimedia.org/wiki/File:Trabant_601_S_of_Trabi_Safari_in_Dresden_8.jpg
        ‣ Child sucking thumb - Thahira, CC-BY-SA 3.0 Unported
          http://commons.wikimedia.org/wiki/File:Sucking_finger.jpg
        ‣ Future car - Arthur C. Bade (1899–1975), Science and Mechanics Publishing - Public domain
          http://commons.wikimedia.org/wiki/File:Car_of_the_Future_1950_unrestored.jpg
        ‣ Magician - Kellar: Levitation, magician poster, ca. 1894 - CC-BY 2.0 Generic
          http://commons.wikimedia.org/wiki/File:Flickr_-_%E2%80%A6trialsanderrors_-_Kellar,_Levitation,_magician_poster,_ca._1894.jpg
        ‣ Index pages - Starbäck (1828-1885) and Föreningens Boktryckeri, Norrköping, Sweden (scanned by Ristesson Ent.) - Public domain
          http://commons.wikimedia.org/wiki/File:Index_Pages.jpg
        ‣ Twenty Questions - DuMont Television/Rosen Studios, New York-photographer. - Public domain
          http://commons.wikimedia.org/wiki/File:20_questions_1954.JPG
        ‣ Linnaeus taxonomy - Public domain
          http://commons.wikimedia.org/wiki/File:Linnaeus_-_Regnum_Animale_%281735%29.png
        All other images are Copyright (C) 2012 Chris Davenport unless I've accidentally missed crediting them.