Find it, possibly also near you!

Apache Solr is a state-of-the-art, high-performance, scalable search server you can use in your (PHP) application to provide a feature-rich search experience. Besides full-text search, it also provides spell checking, highlighting, facets and powerful functions that put it in the realm of a general information retrieval engine, replacing complex database queries you would otherwise need.

Use cases range from e-commerce, real-estate database search, intranets/extranets, content management systems and document management systems to anything that offers exploration of structured and/or unstructured information. The recent addition of geo-aware features makes location-based search possible as well.

Transcript

  • 1. Find it, possibly also near you! Paul Borgermans
  • 2. About me ● Currently employed by eZ Systems http://ez.no ● Active in open source community for a while – Squid http proxy server (about 15 years ago) – PHP based CMS solutions (mostly eZ Publish) – executive committee ● Currently fancying: – PHP as the master glue language for almost everything – Apache Lucene family of projects (mainly Solr) – NoSQL (Not only SQL) and scalable architectures – CMS systems & information management
  • 3. Outline ● Overview of Apache Solr ● Concepts & internals ● How to use it with PHP ● Use cases & tips ● Resources
  • 4. Overview of Apache Solr
  • 5. Apache Solr Curriculum Vitae ● Open source Apache Lucene project, started by Yonik Seeley ● Standalone, enterprise grade search server built on top of Lucene ● Lives in a Java servlet container ● Access through a REST-ful API – HTTP – Primary payload in requests: XML – Other response formats: PHP, JSON, …
  • 6. Used by .. And many more ...
  • 7. Solr in a nutshell ● State of the art, advanced full text search and information retrieval ● Fast, scalable with native replication features ● Flexible configuration ● Document oriented storage ● Geospatial search ● Native cloud features
  • 8. Full text search main features ● Tunable relevancy ranking on top of internal similarity algorithms ● Highlighting ● Sorting ● Filtering ● “Drill-down” navigation (facets) ● Automatic related content ● Spell checking ● Multilingual text analysis
  • 9. At a glance ..
  • 10. Tunable relevancy ranking ● “Boosting” at index and query time – certain types of content – certain parts of content (“fields”) – page-rank like if the content has relations ● Elevate request component – predefined “pages/documents” to the top when certain keywords are entered ● With customised functions – more recent articles – proximity (geolocations)
  • 11. Filtering ● Does not influence the relevancy ● Narrows down the scope ● Very powerful: full boolean, wildcards, fuzzy, and unlimited combinations ● Ranges (dates, numbers, alphanumeric, ...) Also for implementing security!
  • 12. Facets ● Along with the main query, “facet fields” may be defined, usually operating on meta-data: – Type of content – Publication year – Keywords – Author .... ● The result set is returned with the number of hits within each “facet” ● You can use the selected facet as a subsequent filter
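
The facet mechanics on slide 12 can be sketched in a few lines of Python. This is an in-memory illustration only, not how Solr computes facets internally; the sample documents and the helper names `facet_counts` and `apply_filter` are assumptions made up for this example.

```python
from collections import Counter

# Hypothetical sample documents with meta-data fields.
DOCS = [
    {"id": 1, "type": "article", "year": 2009, "author": "anna"},
    {"id": 2, "type": "article", "year": 2010, "author": "bert"},
    {"id": 3, "type": "blog", "year": 2010, "author": "anna"},
    {"id": 4, "type": "product", "year": 2010, "author": "carl"},
]


def facet_counts(docs, field):
    """Count how many documents in the result set carry each value of `field`."""
    return dict(Counter(doc[field] for doc in docs))


def apply_filter(docs, field, value):
    """Narrow the result set to documents whose `field` equals `value`."""
    return [doc for doc in docs if doc[field] == value]


# First request: facet on "type" alongside the main result set.
type_facet = facet_counts(DOCS, "type")

# The user picks "article": reuse the facet value as a filter, then facet again.
articles = apply_filter(DOCS, "type", "article")
year_facet = facet_counts(articles, "year")
```

Each click on a facet value simply adds one more filter and recomputes the counts on the narrower result set, which is what produces the drill-down effect.
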
  • 13. Facets: example
  • 14. Automatic related content (“More Like This”) ● Search engine determines itself which are the important terms of a page and performs a query ● All other normal features can be used – Filtering – Sorting – Facets
  • 15. Spell checking ● Two possible strategies – Dictionary look-up – Using the indexed words themselves (recommended) ● Possible “Google” approach using the “best guess” – Search for “Grein botle” => suggests “Green bottle” ● Let Solr return individual keyword suggestions => more client side processing required
  • 16. Multilingual features ● Adapted tokenizers ● Stemming (reducing words to common form) – Reduces some spelling errors too! – May decrease accuracy ● Different algorithms per language ● Normalisation (“latin 1 characters”) – élève = eleve, Spaß = spass, ...
  • 17. Geospatial search
  • 18. Performance ● Solr employs intelligent caches – filters – queries – internal indexes ● Optimized for search/retrieval ● Possible autowarming on start up ● When updates are done, caches are reconstructed on the fly in the background
  • 19. Performance (2) ● Replication – master-slave for now – works across platforms with same configuration – no native OS features needed (or rsync) – more cloud features under development ● Sharding (client driven)
  • 20. Concepts and internals
  • 21. The Solr/Lucene index ● Inverted index ● Holds a collection of “documents” (hello NoSQL) ● Document – Collection of fields – Flexible schema! – Unique ID (user defined) ● Solr uses an XML-based config file: schema.xml
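
For readers new to the term, an inverted index maps each term to the documents that contain it. The toy version below is a conceptual sketch only: Lucene's real index also stores positions, frequencies and more, and the function names and sample documents here are invented for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, term):
    """Look up a single term; unknown terms yield an empty result."""
    return index.get(term.lower(), [])


# Hypothetical documents keyed by a user-defined unique ID.
docs = {"doc1": "Green bottle", "doc2": "Green apple", "doc3": "Red apple"}
index = build_inverted_index(docs)
```

A lookup such as `search(index, "green")` never scans the document texts; it only reads the entry for that term.
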
  • 22. Fields ● Various field types, derived from base classes ● Indexed – contains the inverted index – usually analyzed & tokenized – makes it searchable and sortable ● Stored – contains also the original content – content can be part of the request response ● Can be multi-valued! – opens possibilities beyond full text search
  • 23. Field definitions: schema.xml ● Field types – text – numerical – dates – location – … (about 25 in total) ● Actual fields (name, definition, properties) ● Dynamic fields ● Copy fields (as aggregators)
  • 24. schema.xml: simple field type examples
      <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
      <!-- boolean type: "true" or "false" -->
      <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
      <!-- A Trie based date field for faster date range queries and date faceting. -->
      <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
      <!-- A text field that only splits on whitespace for exact matching of words -->
      <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        </analyzer>
      </fieldType>
  • 25. schema.xml: more complex field type
      <!-- A general unstemmed text field - good if one does not know the language of the field -->
      <fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>
  • 26. Huh?
  • 27. Analysis ● Solr does not really search your text, but rather the terms that result from the analysis of text ● Typically a chain of – Character filter(s) – Tokenisation – Filter A – Filter B – …
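
The chain on slide 27 (character filter, then one tokenizer, then token filters in order) can be mimicked in plain Python. This is a rough sketch of the idea, not Solr's implementation: the function names and the tiny stop word list are assumptions, and the accent stripping shown here does not cover mappings such as ß to ss.

```python
import unicodedata


def strip_accents(text):
    """Character filter: decompose characters and drop combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def whitespace_tokenize(text):
    """Tokenizer: split the cleaned text on white space."""
    return text.split()


def lowercase(tokens):
    """Token filter A: lowercase every token."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "of"})):
    """Token filter B: drop tokens found in a (hypothetical) stop word list."""
    return [token for token in tokens if token not in stopwords]


def analyze(text):
    """Run the full chain: character filter -> tokenizer -> filters in order."""
    tokens = whitespace_tokenize(strip_accents(text))
    for token_filter in (lowercase, remove_stopwords):
        tokens = token_filter(tokens)
    return tokens
```

The terms returned by `analyze` are what would end up in the index, which is why a query has to pass through a compatible chain to match them.
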
  • 28. Solr comes with many tokenizers and filters ● Some are language specific ● Others are very specialised ● It is very important to get this right; otherwise you may not get what you expect!
  • 29. Text analysis examples (field type “text”)
      String       Term position 1   Term position 2
      iPad      => i                 pad, ipad
      élève.    => elev
      PowerShot => power             shot, powershot
  • 30. Character filters ● Used to clean up text before tokenizing – HTMLStripCharFilter (strips html, xml, js, css) – MappingCharFilter (normalisation of characters, removing accents) – Regular expression filter
  • 31. Tokenizers ● Convert text to tokens (terms) ● You can define only one per field/analyzer ● Examples – WhitespaceTokenizer (splits on white space) – StandardTokenizer – CJK variants
  • 32. Additional filters ● Many possible per field/analyzer ● Many delivered with Solr out of the box ● If not enough, write a tiny bit of Java or look for contributions ● Examples ...
  • 33. Phonetic filters ● PhoneticFilterFactory ● “sounds like” transformations and matching ● Algorithms: – Metaphone – Double Metaphone – Soundex – Refined Soundex
  • 34. Reversing Filter ● Reverses the order of characters ● Use: allow “leading wildcards” ● *thing => gniht* ● A lot faster (prefixes)
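
Why reversing helps can be shown with a small sketch: once terms are stored reversed and sorted, a leading-wildcard query becomes a prefix lookup that can jump straight to the matching range. The code below is a conceptual illustration with invented function names and sample terms, not Solr's actual filter.

```python
import bisect


def index_reversed(terms):
    """Index-time step: store every term reversed, in sorted order."""
    return sorted(term[::-1] for term in terms)


def leading_wildcard_search(reversed_terms, pattern):
    """Answer a '*suffix' query as a prefix lookup on the reversed terms."""
    if not pattern.startswith("*"):
        raise ValueError("expected a leading wildcard such as '*thing'")
    prefix = pattern[1:][::-1]  # '*thing' -> 'gniht'
    # Binary search to the first candidate, then read while the prefix matches.
    start = bisect.bisect_left(reversed_terms, prefix)
    matches = []
    for term in reversed_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term[::-1])
    return sorted(matches)


reversed_terms = index_reversed(["something", "nothing", "thinking", "anything"])
matches = leading_wildcard_search(reversed_terms, "*thing")
```

Without the reversed copy, every term in the index would have to be checked for the suffix.
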
  • 35. Synonyms ● Inject synonyms for certain terms ● Language specific ● Best used for query time analysis – may inflate the search index too much – decreases relevancy
  • 36. Stemming ● Reduce terms to their root form – Plural forms – Conjugations ● Language specific (or not relevant, CJK) ● Many specialised stemmers available – Most European languages – Dutch (!)
  • 37. Copy fields ● Analysis is done differently for – searching/filtering – faceting/sorting ● Stemming and not stemming in different fields can increase relevance of results ● Use copy fields in schema.xml or do it client side
  • 38. Geospatial search ● Solr dedicated fields – Latitude Longitude type ● Special geospatial functions in filtering & boosting – Haversine distance (geosphere) – Simple ranges (squares in 2-D) – Special query constructs (upcoming)
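
The Haversine distance mentioned on slide 38 is the great-circle distance between two latitude/longitude points on a sphere. A minimal Python sketch follows; the Earth radius constant, the helper `within_radius` and the approximate city coordinates are assumptions for illustration, and Solr's own geospatial functions are not shown.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def within_radius(points, center, radius_km):
    """Filter named points to those within `radius_km` of `center` (lat, lon)."""
    return [name for name, (lat, lon) in points.items()
            if haversine_km(center[0], center[1], lat, lon) <= radius_km]


# Hypothetical data set with approximate coordinates.
CITIES = {
    "Brussels": (50.85, 4.35),
    "Antwerp": (51.22, 4.40),
    "Paris": (48.86, 2.35),
}
nearby = within_radius(CITIES, CITIES["Brussels"], 100)
```

The same distance value can serve either as a filter, as here, or as an input to boosting so that closer results rank higher.
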
  • 39. How to use it with PHP
  • 40. Get the data and feed it ● Most *AMP applications have databases ● Map your data to a “document model” – denormalization, flattening – most DB fields can be fed unaltered, Solr takes care of the rest ● Send it through HTTP as XML ● One constraint: it must be UTF-8!
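
As a sketch of the "document model" step, the snippet below flattens a record into Solr's XML add message (`<add><doc><field name="...">`) and encodes it as UTF-8. It is written in Python for brevity; the same structure applies from PHP. The function name and the sample record are assumptions, and actually posting the payload to Solr's update handler is left out.

```python
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Serialize a list of dicts into a UTF-8 encoded Solr XML add message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list or tuple becomes a multi-valued field: one element per value.
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(item)
    # Serialize to text first, then encode explicitly: the payload must be UTF-8.
    return ET.tostring(add, encoding="unicode").encode("utf-8")


# Hypothetical denormalized record, e.g. a DB row joined with its tags.
payload = build_add_message([
    {"id": 42, "title": "élève & co", "tag": ["school", "pupil"]},
])
```

Letting the XML library do the escaping avoids broken requests when field values contain characters such as `&` or `<`.
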
  • 41. Searching ● Construct a GET/POST query ● Base parameters – “q” for query text – “start” for offset – “rows” for max number of results to return Example: http://localhost:8983/solr/select/?q=content&version=2.2&start=0&rows=10&indent=on&wt=php
  • 42. Searching (2) ● Additional parameters – response format (wt): php = array(), json, ... – type of search handler (qt) – highlighting (hl.*) – facets (f.<fieldName>.<FacetParam>=<value>) – spellcheck (spellcheck) – …
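
Putting slides 41 and 42 together, a search request is just a URL with query parameters. The sketch below builds one in Python (a PHP version would look much the same with `http_build_query`). The base URL is the one from slide 41; `fq`, `facet`, `facet.field`, `hl` and `hl.fl` are standard Solr parameters not all spelled out on the slides, and the function name and field names are assumptions.

```python
from urllib.parse import urlencode

SOLR_SELECT_URL = "http://localhost:8983/solr/select/"  # as on slide 41


def build_select_url(query, start=0, rows=10, wt="json",
                     filters=(), facet_fields=(), highlight_fields=()):
    """Build a Solr select URL from the base and optional parameters."""
    params = [("q", query), ("start", start), ("rows", rows), ("wt", wt)]
    # Each filter query narrows the result set without affecting relevancy.
    params += [("fq", fq) for fq in filters]
    if facet_fields:
        params.append(("facet", "true"))
        params += [("facet.field", field) for field in facet_fields]
    if highlight_fields:
        params.append(("hl", "true"))
        params.append(("hl.fl", ",".join(highlight_fields)))
    return SOLR_SELECT_URL + "?" + urlencode(params)


basic_url = build_select_url("content")
faceted_url = build_select_url(
    "green bottle",
    filters=["type:article"],   # hypothetical field name
    facet_fields=["author"],    # hypothetical field name
)
```

Using a list of pairs rather than a dict allows repeated keys such as several `fq` or `facet.field` entries in one request.
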
  • 43. PHP client side ● Roll your own classes & functions – Not difficult, it's REST after all – Some cURL, XML, JSON or native PHP array parsing ● Use existing libraries – PECL: http://pecl.php.net/package/solr – http://code.google.com/p/solr-php-client/ (follows ZF coding standards) – eZ Components: ezcSearch ● PHP CMSs usually come with their own – eZ Publish, Drupal, Symfony ...
  • 44. Use-cases & tips
  • 45. Indexing binary files ● Solr includes the Apache Tika libraries – convert about any format to plain text – you can activate a dedicated requesthandler for it OR ● Use it standalone (command line) for integration into existing code See: http://lucene.apache.org/tika/
  • 46. Integrate legacy data ● Use the Solr Data Import Handler ● Able to index DB's directly – define the schema to use (including possible joins) – fire simple requests to Solr to actually index/update ● Also XML feeds, files (csv), ...
  • 47. e-Commerce ● If you want to sell, make sure users find the products they want – Use facets (categories, drill-down, …) – Push high margin / hot / new products with elevation – Pay a lot of attention to index and query time analysis ● Feed additional meta-data and use it to tune – Ratings – Analytics (Google, Omniture, ...)
  • 48. Have multilingual content? ● Multi-core configuration – Set up a dedicated Solr core per language – Each has its own schema definitions, while you can still use common field names ● If using one index – Use dynamic fields and create language specific analyzers for dedicated language suffixes/prefixes
  • 49. Resources ● Solr: wiki, mailing lists, downloads http://lucene.apache.org/solr/ ● Free book, articles (by core Solr devs) http://www.lucidimagination.com/ ● Bother me ;)
  • 50. Thank you! Questions? email: paul dot borgermans at gmail dot com http://twitter.com/paulborgermans Please rate this talk/slides: http://joind.in/talk/view/1504