Find it, possibly also near you!

Apache Solr is a state-of-the-art, high-performance, scalable search server that you can use in your (PHP) application to provide a feature-rich search experience. Besides full-text search, it provides spell checking, highlighting, facets and powerful functions that put it in the realm of a general information retrieval engine, replacing complex database queries you would otherwise need.

Use cases include e-commerce, real-estate database search, intranets/extranets, content management systems, document management systems and anything else that offers exploration of structured and/or unstructured information. The recent addition of geo-aware features also makes location searches possible.

  1. 1. Find it, possibly also near you! Paul Borgermans
  2. 2. About me ● Currently employed by eZ Systems http://ez.no ● Active in open source community for a while – Squid http proxy server (about 15 years ago) – PHP based CMS solutions (mostly eZ Publish) – executive committee ● Currently fancying: – PHP as the master glue language for almost everything – Apache Lucene family of projects (mainly Solr) – NoSQL (Not only SQL) and scalable architectures – CMS systems & information management
  3. 3. Outline ● Overview of Apache Solr ● Concepts & internals ● How to use it with PHP ● Use cases & tips ● Resources
  4. 4. Overview of Apache Solr
  5. 5. Apache Solr Curriculum Vitae ● Open source Apache Lucene project, started by Yonik Seeley ● Standalone, enterprise grade search server built on top of Lucene ● Lives in a Java servlet container ● Access through a REST-ful API – HTTP – Primary payload in requests: XML – Other response formats: PHP, JSON, …
  6. 6. Used by .. And many more ...
  7. 7. Solr in a nutshell ● State of the art, advanced full text search and information retrieval ● Fast, scalable with native replication features ● Flexible configuration ● Document oriented storage ● Geospatial search ● Native cloud features
  8. 8. Full text search main features ● Tuneable relevancy ranking on top of internal similarity algorithms ● Highlighting ● Sorting ● Filtering ● “Drill-down” navigation (facets) ● Automatic related content ● Spell checking ● Multilingual text analysis
  9. 9. At a glance ..
  10. 10. Tunable relevancy ranking ● “Boosting” at index and query time – certain types of content – certain parts of content (“fields”) – page-rank like if the content has relations ● Elevate request component – predefined “pages/documents” to the top when certain keywords are entered ● With customised functions – more recent articles – proximity (geolocations)
  11. 11. Filtering ● Does not influence the relevancy ● Narrows down the scope ● Very powerful: full boolean, wildcards, fuzzy, and unlimited combinations ● Ranges (dates, numbers, alphanumeric, ...) Also for implementing security!
  12. 12. Facets ● Alongside the main query, “facet fields” may be defined, usually operating on meta-data: – Type of content – Publication year – Keywords – Author .... ● The result set is returned with the number of hits within each “facet” ● You can use the selected facet as a subsequent filter
  13. 13. Facets: example
  14. 14. Automatic related content (“More Like This”) ● The search engine itself determines which terms of a page are important and runs a query with them ● All other normal features can be used – Filtering – Sorting – Facets
  15. 15. Spell checking ● Two possible strategies – Dictionary look-up – Using the indexed words themselves (recommended) ● Possible “Google” approach using the “best guess” – Search for “Grein botle” => suggests “Green bottle” ● Let Solr return individual keyword suggestions => more client side processing required
  16. 16. Multilingual features ● Adapted tokenizers ● Stemming (reducing words to common form) – Reduces some spelling errors too! – May decrease accuracy ● Different algorithms per language ● Normalisation (“latin 1 characters”) – élève = eleve, Spaß = spass, ...
  17. 17. Geospatial search
  18. 18. Performance ● Solr employs intelligent caches – filters – queries – internal indexes ● Optimized for search/retrieval ● Possible autowarming on start up ● When updates are done, caches are reconstructed on the fly in the background
  19. 19. Performance (2) ● Replication – master-slave for now – works across platforms with same configuration – no native OS features needed (or rsync) – more cloud features under development ● Sharding (client driven)
  20. 20. Concepts and internals
  21. 21. The Solr/Lucene index ● Inverted index ● Holds a collection of “documents” (hello NoSQL) ● Document – Collection of fields – Flexible schema! – Unique ID (user defined) ● Solr uses an XML-based config file: schema.xml
  22. 22. Fields ● Various field types, derived from base classes ● Indexed – contains the inverted index – usually analyzed & tokenized – makes it searchable and sortable ● Stored – contains also the original content – content can be part of the request response ● Can be multi-valued! – opens possibilities beyond full text search
  23. 23. Field definitions: schema.xml ● Field types – text – numerical – dates – location – … (about 25 in total) ● Actual fields (name, definition, properties) ● Dynamic fields ● Copy fields (as aggregators)
  24. 24. schema.xml: simple field type examples
      <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
      <!-- boolean type: "true" or "false" -->
      <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
      <!-- A Trie based date field for faster date range queries and date faceting. -->
      <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
      <!-- A text field that only splits on whitespace for exact matching of words -->
      <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        </analyzer>
      </fieldType>
  25. 25. schema.xml: more complex field type
      <!-- A general unstemmed text field - good if one does not know the language of the field -->
      <fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>
  26. 26. Huh?
  27. 27. Analysis ● Solr does not really search your text, but rather the terms that result from the analysis of text ● Typically a chain of – Character filter(s) – Tokenisation – Filter A – Filter B – …
  28. 28. Solr comes with many tokenizers and filters ● Some are language specific ● Others are very specialised ● It is very important to get this right; otherwise, you may not get what you expect!
  29. 29. Text analysis examples (field type “text”; resulting terms by position) ● iPad => position 1: i; position 2: pad, ipad ● élève. => position 1: elev ● PowerShot => position 1: power; position 2: shot, powershot
  30. 30. Character filters ● Used to cleanup text before tokenizing – HTMLStripCharFilter (strips html, xml, js, css) – MappingCharFilter (normalisation of characters, removing accents) – Regular expression filter
  31. 31. Tokenizers ● Convert text to tokens (terms) ● You can define only one per field/analyzer ● Examples – WhitespaceTokenizer (splits on white space) – StandardTokenizer – CJK variants
  32. 32. Additional filters ● Many possible per field/analyzer ● Many delivered with Solr out of the box ● If not enough, write a tiny bit of Java or look for contributions ● Examples ...
  33. 33. Phonetic filters ● PhoneticFilterFactory ● “sounds like” transformations and matching ● Algorithms: – Metaphone – Double Metaphone – Soundex – Refined Soundex
  34. 34. Reversing Filter ● Reverses the order of characters ● Use: allow “leading wildcards” ● *thing => gniht* ● A lot faster (prefixes)
  35. 35. Synonyms ● Inject synonyms for certain terms ● Language specific ● Best used for query time analysis – may inflate the search index too much – decreases relevancy
  36. 36. Stemming ● Reduce terms to their root form – Plural forms – Conjugations ● Language specific (or not relevant, CJK) ● Many specialised stemmers available – Most European languages – Dutch (!)
  37. 37. Copy fields ● Analysis is done differently for – searching/filtering – faceting/sorting ● Stemming and not stemming in different fields can increase relevance of results ● Use copy fields in schema.xml or do it client side
  38. 38. Geospatial search ● Solr dedicated fields – Latitude Longitude type ● Special geospatial functions in filtering & boosting – Haversine distance (geosphere) – Simple ranges (squares in 2-D) – Special query constructs (upcoming)
  39. 39. How to use it with PHP
  40. 40. Get the data and feed it ● Most *AMP applications have databases ● Map your data to a “document model” – denormalization, flattening – most DB fields can be fed unaltered, Solr takes care of the rest ● Send it through HTTP as XML ● One constraint: it must be UTF-8!
  41. 41. Searching ● Construct a GET/POST query ● Base parameters – “q” for query text – “start” for offset – “rows” for max number of results to return Example: http://localhost:8983/solr/select/?q=content&version=2.2&start=0&rows=10&indent=on&wt=php
  42. 42. Searching (2) ● Additional parameters – response format (wt): php (= array()), json, ... – type of search handler (qt) – highlighting (hl.*) – facets (f.<fieldName>.<FacetParam>=<value>) – spellcheck (spellcheck) – …
  43. 43. PHP client side ● Roll your own classes & functions – Not difficult, it's REST after all – Some cURL, XML, JSON or native PHP array parsing ● Use existing libraries – PECL: http://pecl.php.net/package/solr – http://code.google.com/p/solr-php-client/ (follows ZF coding standards) – eZ Components: ezcSearch ● PHP CMSs usually come with their own – eZ Publish, Drupal, Symfony ...
  44. 44. Use-cases & tips
  45. 45. Indexing binary files ● Solr includes the Apache Tika libraries – convert about any format to plain text – you can activate a dedicated requesthandler for it OR ● Use it standalone (command line) for integration into existing code See: http://lucene.apache.org/tika/
  46. 46. Integrate legacy data ● Use the Solr Data Import Handler ● Able to index DB's directly – define the schema to use (including possible joins) – fire simple requests to Solr to actually index/update ● Also XML feeds, files (csv), ...
  47. 47. e-Commerce ● If you want to sell, make sure users find the products they want – Use facets (categories, drill-down, …) – Push high margin / hot / new products with elevation – Pay a lot of attention to index and query time analysis ● Feed additional meta-data and use it to tune – Ratings – Analytics (Google, Omniture, ...)
  48. 48. Have multilingual content? ● Multi-core configuration – Set up a dedicated Solr core per language – Each has its own schema definitions, while you can still use common field names ● If using one index – Use dynamic fields and create language specific analyzers for dedicated language suffixes/prefixes
  49. 49. Resources ● Solr: wiki, mailing lists, downloads http://lucene.apache.org/solr/ ● Free book, articles (by core Solr devs) http://www.lucidimagination.com/ ● Bother me ;)
  50. 50. Thank you! Questions? email: paul dot borgermans at gmail dot com http://twitter.com/paulborgermans Please rate this talk/slides: http://joind.in/talk/view/1504
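
The search request described on slides 41 and 42 can be sketched as follows. This is a minimal illustration, written in Python only to keep it short; the resulting HTTP request is the same whichever client language sends it. The base URL and the `q`, `start`, `rows` and `wt` parameters come from slide 41. The `fq` (filter query), `facet` and `facet.field` parameters are standard Solr request parameters that the slides do not name explicitly, and the field names `year` and `author` are hypothetical.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Base URL as shown on slide 41; adjust host, port and path to your own setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select/"


def build_select_url(query, start=0, rows=10, wt="json",
                     filters=(), facet_fields=()):
    """Build a Solr select URL with optional filters and facet fields."""
    params = [("q", query), ("start", start), ("rows", rows), ("wt", wt)]
    # Each filter narrows the result set without influencing relevancy.
    for filter_query in filters:
        params.append(("fq", filter_query))
    # Facet counts are only requested when at least one facet field is given.
    if facet_fields:
        params.append(("facet", "true"))
        for field_name in facet_fields:
            params.append(("facet.field", field_name))
    return SOLR_SELECT_URL + "?" + urlencode(params)


# Hypothetical fields: "year" as a range filter, "author" as a facet.
url = build_select_url(
    "content",
    filters=["year:[2005 TO 2010]"],
    facet_fields=["author"],
)
print(url)
```

Passing the parameters as a list of pairs rather than a dictionary keeps repeated keys such as `fq` and `facet.field` intact, which matters as soon as more than one filter or facet is used.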

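Slide 40 says documents are sent to Solr through HTTP as UTF-8 encoded XML. A minimal sketch of building such an update message, again in Python for brevity, could look like this. It assumes Solr's XML update format (`<add>`, `<doc>`, `<field name="...">`); the field names `id`, `title` and `keywords` are hypothetical and would have to exist in your own schema.xml.

```python
import xml.etree.ElementTree as ET


def build_add_message(documents):
    """Serialise a list of flat dictionaries into a Solr <add> message."""
    add = ET.Element("add")
    for document in documents:
        doc = ET.SubElement(add, "doc")
        for name, value in document.items():
            # A list or tuple becomes one <field> element per value,
            # which is how multi-valued fields are fed.
            values = value if isinstance(value, (list, tuple)) else [value]
            for single_value in values:
                field = ET.SubElement(doc, "field", name=name)
                field.text = str(single_value)
    # Solr requires UTF-8, so the result is returned as UTF-8 bytes.
    return ET.tostring(add, encoding="utf-8")


# Hypothetical document, flattened from whatever the database holds.
payload = build_add_message([
    {"id": 42, "title": "élève", "keywords": ["solr", "php"]},
])
print(payload.decode("utf-8"))
```

The bytes would then be POSTed to the update handler of your Solr instance (an assumption here: typically a URL ending in `/update`, with an XML content type), followed by a commit so the new documents become searchable.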
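
Slide 38 mentions the Haversine distance as the basis for geospatial filtering and boosting. The sketch below is the textbook formula written out in Python, not Solr's own implementation; it can be handy for checking distances on the client side. The mean Earth radius of 6371 km and the example coordinates are assumptions made for illustration.

```python
import math

# Mean Earth radius in kilometres (an assumed, commonly used value).
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding errors before taking asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Approximate coordinates for two cities, used only as sample input.
distance = haversine_km(50.85, 4.35, 52.37, 4.90)
print(f"{distance:.1f} km")
```

The "simple ranges" alternative on the same slide trades this accuracy for speed: a bounding box on latitude and longitude needs only two range filters and no trigonometry.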