JAB2012 Smart Search Presentation

Smart Search and Beyond
Presentation given at J and Beyond, Bad Nauheim, Germany, May 2012.

Transcript

  • 1. Smart Search and Beyond
  • 2. Who? Chris Davenport, Production Leadership Team
  • 3. Solving the search problem
  • 4. Old Joomla Search Sucks!
    ‣ Cannot rank by relevance across content types
    ‣ Only very crude filtering
    ‣ Can be slow to search
  • 5. Table of Contents: 01 Smart Search so far; 02 Smart Search in action; 03 Smart Search under the hood; 04 Smart Search tips and tricks; 05 Smart Search where next?
  • 6. A Short History
    ‣ Old Joomla Search
      • Introduced in Mambo
      • Largely unchanged since
    ‣ JXTended Finder for Joomla 1.5
    ‣ Finder Integration Working Group
      • Smart Search for Joomla 2.5
    ‣ Search Working Group
  • 7. Smart Search for Joomla 2.5
    ‣ Separate index
    ‣ Auto-completion
    ‣ Facetted search
    ‣ Relevancy ordering
    ‣ Did you mean?
    ‣ ...and more besides
  • 8. Table of Contents: 01 Smart Search so far; 02 Smart Search in action; 03 Smart Search under the hood; 04 Smart Search tips and tricks; 05 Smart Search where next?
  • 9. Auto-completion
  • 10. Another example
  • 11. Another example
  • 12. Table of Contents: 01 Smart Search so far; 02 Smart Search in action; 03 Smart Search under the hood; 04 Smart Search tips and tricks; 05 Smart Search where next?
  • 13. Under the hood
  • 14. A problem in two halves
  • 15. First half: Indexing [diagram: raw data is processed into the INDEX]
  • 16. Second half: Querying [diagram: search queries go to the INDEX, which returns search results]
  • 17. Search results: Search results are rendered purely from data in the index, not the raw data.
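The two halves described above can be sketched in a few lines. This is a minimal Python illustration, not Smart Search's PHP implementation: the names build_index, query, postings and stored are hypothetical, and the data layout is an assumption. The point it shows is that the query side reads only the index, including the titles it displays.

```python
from collections import defaultdict

def build_index(raw_items):
    """First half: turn raw items into an index (postings plus stored titles)."""
    postings = defaultdict(set)   # term -> ids of items containing it
    stored = {}                   # id -> title kept in the index for display
    for item_id, (title, body) in raw_items.items():
        stored[item_id] = title
        for term in (title + " " + body).lower().split():
            postings[term].add(item_id)
    return postings, stored

def query(index, text):
    """Second half: answer a query using only the index, never the raw items."""
    postings, stored = index
    terms = text.lower().split()
    if not terms:
        return []
    matches = set(postings.get(terms[0], set()))
    for term in terms[1:]:
        matches &= postings.get(term, set())
    return [stored[item_id] for item_id in sorted(matches)]

raw_items = {
    1: ("Smart Search", "a separate index with relevancy ordering"),
    2: ("Old Search", "introduced in Mambo"),
}
index = build_index(raw_items)
print(query(index, "search index"))
```

Once the index is built, raw_items could be discarded and query would still work, which is the property slide 17 describes.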
  • 18. Indexing
  • 19. Indexing: Parsing, Tokenisation, Token aggregation, Filtration, Stemming, Analysis, Term weighting, Classification
  • 20. Terms index
  • 21. Parsing
    ‣ Extract plain text from raw data
      • HTML, RTF supported out-of-the-box
      • PDF, MS Word could be supported
    ‣ For example, HTML
      • Essentially the same as PHP strip_tags
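As a rough illustration of the parsing step, the sketch below extracts plain text from HTML with Python's standard html.parser module. It is an analogue of the idea, not the Smart Search parser or PHP's strip_tags; the names TextExtractor and html_to_text are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text between tags; tags and attributes are dropped."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

print(html_to_text('<p>Smart <b>Search</b> for <a title="CMS">Joomla</a></p>'))
```

Text held in attributes (the title above) never reaches the output, which matches the tip later in the deck that text in attributes is ignored.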
  • 22. Tokenisation
    ‣ Fold to lowercase
    ‣ Special handling for plus, dash, comma, dot and quotes
    ‣ Remove non-alphanumerics
    ‣ Replace multiple spaces with one space
    ‣ Special support for Chinese
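Three of the steps above (lowercase folding, removing non-alphanumerics, collapsing spaces) can be sketched as follows. This is a simplified Python illustration with a hypothetical tokenise function. It assumes punctuation is replaced by a space rather than deleted, and it does not attempt the special handling for plus, dash, comma, dot and quotes or the Chinese support, since the slide does not spell those rules out.

```python
import re

def tokenise(text):
    """Lowercase, drop non-alphanumerics, collapse spaces, split into tokens."""
    text = text.lower()
    # Assumption: punctuation (and underscore) becomes a space, not nothing.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Replace runs of whitespace with a single space.
    text = re.sub(r"\s+", " ", text).strip()
    return text.split(" ") if text else []

print(tokenise("On a clear disk, you can SEEK   forever!"))
```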
  • 23. Token aggregation
    Example: "On a clear disk you can seek forever"
    ‣ Two-word phrases: on a, a clear, clear disk, disk you, you can, can seek, seek forever
    ‣ Three-word phrases: on a clear, a clear disk, clear disk you, disk you can, you can seek, can seek forever
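Token aggregation of this kind amounts to generating word n-grams. A minimal Python sketch follows; the function name aggregate and the max_len parameter are hypothetical, and the limit of three words per phrase is an assumption read off the example on the slide.

```python
def aggregate(tokens, max_len=3):
    """Return every run of 1..max_len consecutive tokens as a phrase."""
    phrases = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length <= len(tokens):
                phrases.append(" ".join(tokens[start:start + length]))
    return phrases

print(aggregate("on a clear disk".split()))
```

Indexing the phrases alongside the single tokens is what lets a quoted phrase query be answered from the index alone.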
  • 24. Filtration
    ‣ “Stop word removal”
      • Not removed, just given a low weight
    ‣ jos_finder_terms_common
    ‣ English only
      • Other languages need to add their common words to the table
  • 25. Stemming: fishing, fished, fisher, fish → fish
  • 26. Stemming
    ‣ “Snowball” is used by default
      • Danish, German, English, Spanish, Finnish, French, Hungarian, Italian, Norwegian, Dutch, Portuguese, Romanian, Russian, Swedish and Turkish
      • BUT it requires a PHP extension
    ‣ “English only” uses a pure PHP stemmer
      • Recommended for all English sites
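To make the idea of stemming concrete, here is a deliberately crude suffix stripper in Python. It is a toy, not the Snowball algorithm and not the pure PHP stemmer mentioned above; the suffix list and the minimum stem length of three are assumptions chosen only to reproduce the "fish" example.

```python
# Toy suffix list; real stemmers use far richer, language-specific rules.
SUFFIXES = ("ing", "ed", "er", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ("fishing", "fished", "fisher", "fish")])
```

Because all four forms reduce to the same stem, a search for any one of them can match content containing the others.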
  • 27. Morphological analysis
    ‣ Currently uses Soundex
    ‣ Not used in search as such
    ‣ Used for the “Did you mean?” feature
    ‣ If no search results found, then...
      • Match on Soundex code
      • Return nearest term/phrase by Levenshtein distance
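The "Did you mean?" fallback described above can be sketched as: filter the indexed terms to those sharing the query's Soundex code, then pick the one with the smallest Levenshtein distance. The Python below is an illustration under that reading. It uses a common American Soundex variant and a standard dynamic-programming edit distance; the did_you_mean function and its inputs are hypothetical, not Smart Search code.

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Four-character Soundex code, e.g. 'Robert' -> 'R163'."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    previous = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        if ch not in "hw":          # h and w do not separate equal codes
            previous = code
    return (result + "000")[:4]

def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,         # deletion
                               current[j - 1] + 1,      # insertion
                               previous[j - 1] + cost)) # substitution
        previous = current
    return previous[-1]

def did_you_mean(query, terms):
    """Suggest the indexed term that sounds alike and is closest in spelling."""
    candidates = [t for t in terms if soundex(t) == soundex(query)]
    if not candidates:
        return None
    return min(candidates, key=lambda t: levenshtein(query, t))

print(did_you_mean("serch", ["search", "smart", "stemming"]))
```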
  • 28. Term weighting
    Context        Multiplier
    Title          1.7
    Text           0.7
    Meta           1.2
    Path           2.0
    Miscellaneous  0.3
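One way to read the multiplier table is that each occurrence of a term counts for more or less depending on where it appears. The Python sketch below uses the multipliers from the slide, but the formula (occurrence count times multiplier, summed over contexts) is an assumption for illustration; the slide does not state how Smart Search combines them, and term_weight is a hypothetical name.

```python
# Multipliers taken from the slide; "misc" stands for "Miscellaneous".
MULTIPLIERS = {"title": 1.7, "text": 0.7, "meta": 1.2, "path": 2.0, "misc": 0.3}

def term_weight(occurrences):
    """occurrences maps a context name to how often the term appears there.

    Assumed formula: sum of (count * multiplier) over all contexts.
    """
    return sum(MULTIPLIERS[context] * count
               for context, count in occurrences.items())

# A term found once in the title and twice in the body text.
print(term_weight({"title": 1, "text": 2}))
```

Under this reading, a single hit in the path (2.0) outweighs two hits in the body text (1.4), which is the kind of ordering effect the multipliers are meant to produce.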
  • 29. Classification
  • 30. Taxonomies
    ‣ “Content maps” in Administrator
    ‣ Basis for facetted search
    ‣ Multi-level taxonomies not fully supported (yet)
  • 31. Taxonomies - drop-downs
  • 32. Taxonomies - checkboxes
  • 33. Taxonomies - links
  • 34. Database ERD
  • 35. Smart Search Plug-ins
    /plugins/content/finder
    /plugins/system/highlight
    /plugins/finder/categories
    /plugins/finder/contacts
    /plugins/finder/content
    /plugins/finder/newsfeeds
    /plugins/finder/weblinks
  • 36. Smart Search Plug-ins
    content/finder           finder/[type]
    onContentBeforeSave      onFinderBeforeSave
    onContentAfterSave       onFinderAfterSave
    onContentAfterDelete     onFinderAfterDelete
    onContentChangeState     onFinderChangeState
    onCategoryChangeState    onFinderCategoryChangeState
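The pairing of events on slide 36 suggests a relay: the content/finder plug-in listens for standard content events and re-raises each one under its finder name for the type-specific plug-ins. The Python sketch below illustrates that relay pattern only. The event names come from the slide, while relay, articles_handler and the handler signature are hypothetical and do not reflect the Joomla plug-in API.

```python
# Content event -> finder event, as paired on the slide.
FINDER_EVENTS = {
    "onContentBeforeSave": "onFinderBeforeSave",
    "onContentAfterSave": "onFinderAfterSave",
    "onContentAfterDelete": "onFinderAfterDelete",
    "onContentChangeState": "onFinderChangeState",
    "onCategoryChangeState": "onFinderCategoryChangeState",
}

def relay(content_event, handlers, *args):
    """Forward a content event to type-specific handlers under its finder name."""
    finder_event = FINDER_EVENTS.get(content_event)
    if finder_event is None:
        return []                      # not an event the indexer cares about
    return [handler(finder_event, *args) for handler in handlers]

def articles_handler(event, item):
    # Hypothetical stand-in for a finder/[type] plug-in.
    return f"articles plug-in handled {event} for {item}"

print(relay("onContentAfterSave", [articles_handler], "item 42"))
```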
  • 37. Query parsing
                           URI argument       Query string
    Terms                  q=Some+text        Some text
    Phrases                q=”Some+text”      “Some text”
    Logical operators      q=This+and+that    This and that
    Before a date          d1=2012-05-16      before:2012-05-16
    After a date           d2=2012-05-18      after:2012-05-18
    Content type filter    t[]=98233          type:Articles
    Taxonomy filter        t[]=30922          author:Chris Davenport
    Static filter          f=2
    Highlight              qh=Some+text
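The URI arguments in the table can be unpacked with any query-string parser. The Python sketch below uses the standard urllib.parse.parse_qs to do so. The argument names (q, d1, d2, t[], f) are taken from the slide; the function name parse_search_uri and the shape of the returned dictionary are hypothetical, and this is not how com_finder itself parses requests.

```python
from urllib.parse import parse_qs

def parse_search_uri(query_string):
    """Split a search URI query string into its Smart Search arguments."""
    args = parse_qs(query_string)
    return {
        "terms": args.get("q", [""])[0],             # "+" decodes to a space
        "before": args.get("d1", [None])[0],
        "after": args.get("d2", [None])[0],
        "taxonomy_ids": [int(v) for v in args.get("t[]", [])],
        "static_filter": args.get("f", [None])[0],
    }

print(parse_search_uri("q=Some+text&d1=2012-05-16&t[]=98233&t[]=30922"))
```

The t[] argument may repeat, which is how a content type filter and a taxonomy filter can be combined in one request.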
  • 38. Results rendering
    ‣ com_finder (search results page)
      • search
        ‣ default.php
        ‣ form.php
        ‣ default_results.php
        ‣ default_result.php
        ‣ default_[type].php (for custom types)
    ‣ mod_finder (search module)
      • default.php
  • 39. Layout overrides example
  • 40. Alternative override
  • 41. Table of Contents: 01 Smart Search so far; 02 Smart Search in action; 03 Smart Search under the hood; 04 Smart Search tips and tricks; 05 Smart Search where next?
  • 42. Tips and tricks
  • 43. Tips and tricks
    ‣ HTML Parser
      • Invalid HTML can confuse the parser
      • Invalid UTF8 is ignored
      • Text in attributes is ignored
  • 44. When to do a purge
    ‣ Indexing is incremental, so most of the time you don't need to.
    ‣ Changes to taxonomies that do not involve changes to content items
    ‣ Changes to term weights
    ‣ Changing the stemmer
    ‣ Changes to content items that do not trigger the standard content events
    ‣ IMPORTANT
      • If you have static filters they will be lost when you do a purge.
  • 45. Tuning Smart Search
    ‣ Use the CLI for indexing
      • http://docs.joomla.org/Setting_up_automatic_Smart_Search_indexing
    ‣ Out of memory issues
      • Please report out of memory issues so we can understand them better.
      • Reduce batch size
        ‣ Default is 50. Drop it to 5 or even 1.
      • Terms per batch
        ‣ Can be increased BUT NEEDS APACHE SERVER CONFIG CHANGE
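The batch size advice is easier to picture with a small example of batched processing. The Python sketch below is generic, not the Smart Search indexer: batches is a hypothetical helper, and only the default of 50 comes from the slide. The assumption behind the advice is that fewer items are held in memory at once when batches are smaller, at the cost of more passes.

```python
def batches(items, batch_size=50):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

item_ids = list(range(120))
print(len(list(batches(item_ids))))        # number of batches at the default size
print(len(list(batches(item_ids, 5))))     # many more, smaller batches
```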
  • 46. Table of Contents: 01 Smart Search so far; 02 Smart Search in action; 03 Smart Search under the hood; 04 Smart Search tips and tricks; 05 Smart Search where next?
  • 47. Where next?
  • 48. Search Working Group
    ‣ Meeting at J and Beyond
      • 19 May 2012 11:30 AM
    ‣ Stable ready for merge July 2012
    ‣ Joomla 3.0 release September 2012
    ‣ Meeting at Joomla World Conference
      • San Jose, California, November 2012
  • 49. Improved language support
    ‣ Improve common word support
    ‣ Improve stemmer support
      • Native PHP stemmers?
    ‣ Improve morphological coding
      • Non-English alternatives to Soundex
    ‣ Mixed language content items
      • Language tagging of tokens/terms?
  • 50. Other possibilities
    ‣ Preserve static filters on purge/index
    ‣ Decouple indexing via message queues
    ‣ Easier support for range queries
    ‣ Search logging via JLog
    ‣ Variable-length token aggregation
    ‣ Multi-level taxonomies
    ‣ Add parsers for PDF, MS Word
  • 51. Search API
    ‣ Very important going forward
    ‣ Too big a leap for Joomla 3.0
    ‣ Develop in parallel during 3.x cycle
    ‣ Use in Smart Search for Joomla 4.0
  • 52. Documentation: http://docs.joomla.org/Category:Smart_Search
  • 53. Questions?
  • 54. Don't forget: Search Working Group Meeting, Saturday 19 May 2012, 11:30 AM
  • 55. Image Credits
    ‣ Haystack - Mark Duncan CC-BY-SA 2.0 Generic http://commons.wikimedia.org/wiki/File%3AHaystack_-_geograph.org.uk_-_462934.jpg
    ‣ Under the hood - ilovebutter CC-BY 2.0 Generic http://commons.wikimedia.org/wiki/File:Trabant_601_S_of_Trabi_Safari_in_Dresden_8.jpg
    ‣ Child sucking thumb - Thahira CC-BY-SA 3.0 Unported http://commons.wikimedia.org/wiki/File:Sucking_finger.jpg
    ‣ Future car - Arthur C. Bade (1899–1975), Science and Mechanics Publishing - Public domain http://commons.wikimedia.org/wiki/File:Car_of_the_Future_1950_unrestored.jpg
    ‣ Magician - Kellar: Levitation, magician poster, ca. 1894 - CC-BY 2.0 Generic http://commons.wikimedia.org/wiki/File:Flickr_-_%E2%80%A6trialsanderrors_-_Kellar,_Levitation,_magician_poster,_ca._1894.jpg
    ‣ Index pages - Starbäck (1828-1885) and Föreningens Boktryckeri, Norrköping, Sweden (scanned by Ristesson Ent.) - Public domain http://commons.wikimedia.org/wiki/File:Index_Pages.jpg
    ‣ Twenty Questions - DuMont Television/Rosen Studios, New York-photographer. - Public domain http://commons.wikimedia.org/wiki/File:20_questions_1954.JPG
    ‣ Linnaeus taxonomy - Public domain http://commons.wikimedia.org/wiki/File:Linnaeus_-_Regnum_Animale_%281735%29.png
    ‣ All other images are Copyright (C) 2012 Chris Davenport unless I've accidentally missed crediting them.