After a thorough overview of the main features and benefits of Apache Solr (an open source search server), the architecture of Solr and strategies for adopting it in your PHP application and data model will be presented. The main lessons learned from dealing with a mix of structured and unstructured content, multilingual aspects, tuning and the various state-of-the-art features of Solr will be shared as well.
1. Get the most out of Solr search with PHP
Paul Borgermans
2. About me
● Active in open source community for a while
● Squid Proxy server (about 15y ago)
● PHP based CMS solutions (mostly eZ Publish)
● Currently fancying:
● PHP as the master glue language for almost everything
● Apache Lucene family of projects (mainly Solr)
● NoSQL (Not only SQL) and scalable architectures
● CMS systems & all kinds of challenges in information management
3. Outline
● Overview of Apache Solr
● How to use it with PHP (1)
● Concepts & internals
● How to use it with PHP (2)
● Miscellaneous tips
● Resources
5. Solr Curriculum Vitae
● Open source Apache Lucene subproject
● Standalone, enterprise grade search server
built on top of Lucene
● Lives in a Java servlet container
● Access through a REST-ful API
● HTTP
● Primary payload in requests: XML
● Other response formats: PHP, JSON, …
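A minimal sketch of consuming two of these response formats from PHP. It assumes the request was sent with wt=json or wt=phps (PHP serialized). The response body is a hand-written stand-in, not output from a real server, and the function names are hypothetical.

```php
<?php
// Hypothetical, hand-written response body for illustration only.
$jsonBody = '{"responseHeader":{"status":0,"QTime":2},'
          . '"response":{"numFound":1,"start":0,'
          . '"docs":[{"id":"42","title":"Hello Solr"}]}}';

// wt=json: decode the body into nested PHP arrays.
function decodeJsonResponse(string $body): array
{
    $data = json_decode($body, true);
    if (!is_array($data)) {
        throw new RuntimeException('Response is not valid JSON');
    }
    return $data;
}

// wt=phps: the body is a PHP-serialized array.
// Object instantiation is disabled because the data comes over the network.
function decodePhpsResponse(string $body): array
{
    $data = unserialize($body, ['allowed_classes' => false]);
    if (!is_array($data)) {
        throw new RuntimeException('Response is not a serialized array');
    }
    return $data;
}

$fromJson = decodeJsonResponse($jsonBody);
// Simulate a wt=phps body by serializing the same structure locally.
$fromPhps = decodePhpsResponse(serialize($fromJson));

echo $fromJson['response']['numFound'], " document(s) found\n";
```

Both decoders yield the same nested array, so the rest of the client code does not need to care which format was requested.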
6. Solr in a nutshell
● State of the art, advanced full text search
and information retrieval
● Fast, scalable with native replication features
● Flexible configuration
● Document oriented storage
● Extensible (if you know a bit of Java), but
usually not needed
7. Full text search main features
● Tunable relevancy ranking on top of internal similarity algorithms
● Highlighting
● Sorting
● Filtering
● “Drill-down” navigation (facets)
● Automatic related content
● Spell checking
● Multilingual text analysis
9. Tunable relevancy ranking
● “Boosting” at index and query time
● certain types of content
● certain parts of content (“fields”)
● PageRank-like boosting if the content has relations
● Elevate request component
● pushes predefined “pages/documents” to the top when
certain keywords are entered
● With customised functions
● more recent articles
● proximity (geolocations)
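Query-time boosting can be sketched as a set of request parameters for the dismax query parser. The field names title, body and published are assumptions, as is the function name; the recency function follows the recip(ms(NOW,field),...) pattern documented for Solr function queries.

```php
<?php
// Builds the query string for a boosted search (hypothetical helper).
function buildBoostedQuery(string $keywords): string
{
    $params = [
        'q'       => $keywords,
        'defType' => 'dismax',
        // Field boosts: a match in "title" weighs three times a match in "body".
        'qf'      => 'title^3 body',
        // Boost function that favours recently published documents.
        // 3.16e-11 is roughly 1 / (milliseconds in one year).
        'bf'      => 'recip(ms(NOW,published),3.16e-11,1,1)',
    ];
    return http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

$queryString = buildBoostedQuery('solr php');
echo $queryString, "\n";
```

Index-time boosts are set on the document or field when it is fed to Solr instead, so they need a reindex to change, while the parameters above can be tuned per request.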
10. Filtering
● Does not influence the relevancy
● Narrows down the scope
● Very powerful: full boolean, wildcards, fuzzy,
and unlimited combinations
● Ranges (dates, numbers, alphanumeric, ...)
Also for implementing security!
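A sketch of how filters could be sent from PHP, using Solr's fq parameter once per filter. The field names (type, published, read_groups) and the group IDs are assumptions. Note that http_build_query() encodes arrays with indexed keys (fq[0], fq[1]), which Solr does not read as repeated parameters, so the helper below expands them by hand.

```php
<?php
// Encodes parameters, repeating the name for array values (fq=..&fq=..).
function buildSolrQueryString(array $params): string
{
    $pairs = [];
    foreach ($params as $name => $value) {
        foreach ((array) $value as $single) {
            $pairs[] = rawurlencode((string) $name) . '=' . rawurlencode((string) $single);
        }
    }
    return implode('&', $pairs);
}

// Hypothetical: groups the current user belongs to.
$userGroups = [1, 5];

$queryString = buildSolrQueryString([
    'q'  => 'green bottle',
    'fq' => [
        'type:article',                                        // narrow the scope
        'published:[NOW-1YEAR TO NOW]',                        // date range
        'read_groups:(' . implode(' OR ', $userGroups) . ')',  // security filter
    ],
]);

echo $queryString, "\n";
```

Because filters do not influence scoring, the security filter can be appended to every request without changing the ranking of the documents the user is allowed to see.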
11. Facets
● Alongside the main query, “facet fields” may be defined,
usually operating on meta-data:
● Type of content
● Publication year
● Keywords
● Author ....
● The result set is returned with the number of hits
within each “facet”
● You can use the selected facet as a subsequent filter
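The facet round trip can be sketched as follows: request a facet field, read the counts, and turn a selected value into a filter. The author field and the sample response are made up; the sketch assumes the default wt=json layout, where each facet field is a flat list alternating term and count.

```php
<?php
// Request parameters that switch faceting on for one field.
$params = ['q' => 'solr', 'facet' => 'true', 'facet.field' => 'author', 'wt' => 'json'];
$queryString = http_build_query($params, '', '&', PHP_QUERY_RFC3986);

// Hypothetical facet section of a response, hand-written for illustration.
$body = '{"facet_counts":{"facet_fields":{"author":["borgermans",12,"doe",3]}}}';

// Converts the flat [term, count, term, count, ...] list into term => count.
function facetCounts(array $response, string $field): array
{
    $flat = $response['facet_counts']['facet_fields'][$field] ?? [];
    $counts = [];
    for ($i = 0; $i + 1 < count($flat); $i += 2) {
        $counts[(string) $flat[$i]] = (int) $flat[$i + 1];
    }
    return $counts;
}

// Builds the filter for a facet value the user clicked on.
function facetFilter(string $field, string $term): string
{
    $escaped = str_replace(['\\', '"'], ['\\\\', '\\"'], $term);
    return $field . ':"' . $escaped . '"';
}

$counts = facetCounts(json_decode($body, true), 'author');
print_r($counts);
echo facetFilter('author', 'doe'), "\n";
```

The string returned by facetFilter() would be sent as an additional filter on the next request, which is what produces the drill-down effect.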
13. Automatic related content (“More Like This”)
● The search engine itself determines the
important terms of a page and
runs a query with them
● All other normal features can be used
● Filtering
● Sorting
● Facets
15. Spell checking
● Two possible strategies
● Dictionary look-up
● Using the indexed words themselves
(recommended)
● Possible “Google” approach using the “best
guess”
● Search for “Grein botle”
=> suggests “Green bottle”
● Let Solr return individual keyword suggestions
=> more client side processing required
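Both approaches can be read from the same response. The sketch below assumes a request with spellcheck=true and spellcheck.collate=true, and a sample body modelled on the flat name/value list that Solr 1.4 returns with wt=json; other versions or json.nl settings lay this section out differently. The function name is hypothetical.

```php
<?php
// Hypothetical spellcheck section, hand-written for illustration.
$body = '{"spellcheck":{"suggestions":['
      . '"grein",{"numFound":1,"startOffset":0,"endOffset":5,"suggestion":["green"]},'
      . '"botle",{"numFound":1,"startOffset":6,"endOffset":11,"suggestion":["bottle"]},'
      . '"collation","green bottle"]}}';

// Splits the flat list into the "best guess" and per-keyword suggestions.
function parseSpellcheck(array $response): array
{
    $flat = $response['spellcheck']['suggestions'] ?? [];
    $result = ['collation' => null, 'perWord' => []];
    for ($i = 0; $i + 1 < count($flat); $i += 2) {
        $name  = $flat[$i];
        $value = $flat[$i + 1];
        if ($name === 'collation') {
            $result['collation'] = $value;                   // whole corrected query
        } elseif (is_array($value) && isset($value['suggestion'])) {
            $result['perWord'][$name] = $value['suggestion']; // candidates per keyword
        }
    }
    return $result;
}

$spell = parseSpellcheck(json_decode($body, true));
echo 'Did you mean: ', $spell['collation'], "\n";
```

Using the collation needs almost no client code, while the per-keyword lists leave the choice of replacement to the application.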
16. Multilingual features
● Adapted tokenizers
● Stemming (reducing words to common form)
● Reduces some spelling errors too!
● May decrease accuracy
● Different algorithms per language
● Normalisation (“latin 1 characters”)
● élève = eleve, Spaß = spass, ...
17. Performance
● Solr employs intelligent caches
● filters
● queries
● internal indexes
● Optimized for search/retrieval
● Possible autowarming on start up
● When updates are done, caches are reconstructed
on the fly in the background
18. Performance (2)
● Replication
● master-slave for now
● works across platforms with same
configuration
● no native OS features needed (or rsync)
● more cloud features under development
● Sharding (client driven)
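Client-driven sharding means the PHP side decides where each document goes and which servers take part in a search. A rough sketch, with made-up host names and function names; the shards request parameter is the one Solr uses for distributed search.

```php
<?php
// Hypothetical shard locations (host:port/path, without the scheme).
$shards = ['solr1.example.com:8983/solr', 'solr2.example.com:8983/solr'];

// Indexing: pick a shard from a stable hash of the unique ID,
// so the same document always lands on the same server.
function shardForDocument(string $id, array $shards): string
{
    $hash = crc32($id) & 0x7fffffff; // keep it non-negative on 32-bit builds
    return $shards[$hash % count($shards)];
}

// Searching: any shard can coordinate when given the full shard list.
function distributedSearchParams(string $query, array $shards): array
{
    return ['q' => $query, 'shards' => implode(',', $shards)];
}

echo shardForDocument('article-17', $shards), "\n";
print_r(distributedSearchParams('solr', $shards));
```

A plain modulo scheme like this reshuffles most documents when a shard is added, so it suits a fixed number of shards.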
22. PHP: the client side
● Roll your own classes
● Not difficult, it's REST after all
● Some cURL, XML, JSON or native PHP array parsing
● Use existing libraries
● PECL: http://pecl.php.net/package/solr
● http://code.google.com/p/solr-php-client/
(follows ZF coding standards)
● eZ Components: ezcSearch
● PHP CMSs and frameworks usually come with their own integration
● eZ Publish, Drupal, Symfony ...
23. What's next?
● Getting data into Solr
● Basic searches
● Advanced requests
● But first something on the concepts and
internals
25. The Solr/Lucene index
● Inverted index
● Holds a collection of “documents”
● Document
● Collection of fields
● Flexible schema!
● Unique ID (user defined)
● Solr uses an XML-based config file:
schema.xml
26. Fields
● Various field types, derived from base classes
● Indexed
● contains the inverted index
● usually analyzed & tokenized
● makes it searchable and sortable
● Stored
● also contains the original content
● content can be part of the request response
● Can be multi-valued!
● opens possibilities beyond full text search
27. Field definitions: schema.xml
● Field types
● text
● numerical
● dates
● location
● … (about 25 in total)
● Actual fields (name, definition, properties)
● Dynamic fields
● Copy fields (as aggregators)
28. schema.xml: simple field type examples
<fieldType name="string" class="solr.StrField"
           sortMissingLast="true" omitNorms="true"/>
<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" class="solr.BoolField"
           sortMissingLast="true" omitNorms="true"/>
<!-- A Trie based date field for faster date range
     queries and date faceting. -->
<fieldType name="tdate" class="solr.TrieDateField"
           omitNorms="true" precisionStep="6"
           positionIncrementGap="0"/>
<!-- A text field that only splits on whitespace for exact
     matching of words -->
<fieldType name="text_ws" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
29. schema.xml: more complex field type
<!-- A general unstemmed text field - good if one does not know the language of the field -->
<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
            enablePositionIncrements="false"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"
            expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
31. Analysis
● Solr does not really search your text, but rather
the terms that result from the analysis of text
● Typically a chain of
● Character filter(s)
● Tokenisation
● Filter A
● Filter B
● …
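To make the chain concrete, here is a toy re-implementation in PHP. It is not how Solr is implemented and it does no stemming; the character map, the stop word list and the function name are all assumptions chosen for the example.

```php
<?php
// Toy analysis chain: character filter, tokenizer, then two token filters.
function analyze(string $text, array $stopwords = ['the', 'a', 'of']): array
{
    // 1. Character filter: fold a few accented characters to ASCII.
    $text = strtr($text, ['é' => 'e', 'è' => 'e', 'É' => 'E', 'ß' => 'ss']);

    // 2. Tokenisation: split on whitespace.
    $tokens = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // 3. Filter A: lowercase every token.
    $tokens = array_map('strtolower', $tokens);

    // 4. Filter B: drop stop words.
    return array_values(array_diff($tokens, $stopwords));
}

print_r(analyze('The élève of Spaß'));
```

The terms that come out of the last step are what ends up in the index, which is why the query text has to go through a compatible chain to match them.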
32. Solr comes with many tokenizers and filters
● Some are language specific
● Others are very specialised
● It is very important to get this right,
otherwise you may not get what you
expect!
33. Text analysis examples
String           Field type “text”: term position 1, term position 2
iPad         =>  i, pad (additional term: ipad)
élève.       =>  elev
PowerShot    =>  power, shot (additional term: powershot)
Let's have a look: http://localhost:8983/solr/admin
34. Character filters
● Used to cleanup text before tokenizing
● HTMLStripCharFilter (strips html, xml, js,
css)
● MappingCharFilter (normalisation of
characters, removing accents)
● Regular expression filter
35. Tokenizers
● Convert text to tokens (terms)
● You can define only one per field/analyzer
● Examples
● WhitespaceTokenizer (splits on white
space)
● StandardTokenizer
● CJK variants
36. Additional filters
● Many possible per field/analyzer
● Many delivered with Solr out of the box
● If not enough, write a tiny bit of Java or
look for contributions
● Examples ...
38. Reversing Filter
● Reverses the order of characters
● Use: allow “leading wildcards”
● *thing => gniht*
● A lot faster (prefixes)
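The idea behind the trick, illustrated in plain PHP: if terms are stored reversed, a leading wildcard becomes a trailing one, and a prefix lookup is cheap. In Solr this happens on the server side; the function names below are hypothetical and the sketch only handles single-byte (ASCII) terms.

```php
<?php
// Rewrites "*thing" into the prefix pattern "gniht*" for the reversed terms.
function leadingWildcardToPrefix(string $pattern): string
{
    if ($pattern === '' || $pattern[0] !== '*') {
        return $pattern; // nothing to rewrite
    }
    return strrev(substr($pattern, 1)) . '*';
}

// Checks a term against a leading-wildcard pattern via its reversed form.
function matchesLeadingWildcard(string $term, string $pattern): bool
{
    $prefix = rtrim(leadingWildcardToPrefix($pattern), '*');
    return strncmp(strrev($term), $prefix, strlen($prefix)) === 0;
}

echo leadingWildcardToPrefix('*thing'), "\n";
var_dump(matchesLeadingWildcard('something', '*thing'));
```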
39. Synonyms
● Inject synonyms for certain terms
● Language specific
● Best used for query time analysis
● index-time synonyms may inflate the search index too much
● may decrease relevancy
40. Stemming
● Reduce terms to their root form
● Language specific (or not relevant, CJK)
● Many specialised stemmers available
● Most European languages
41. Copy fields
● Analysis is done differently for
● searching/filtering
● faceting/sorting
● Stemming and not stemming in different fields
can increase relevance of results
● Use copy fields in schema.xml or do it client side
43. Get the data and feed it
● Most *AMP applications have databases
● Map your data to a “document model”
● denormalization, flattening
● most DB fields can be fed unaltered, Solr
takes care of the rest
● One constraint: it must be UTF-8!
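A minimal sketch of the mapping step: a flattened database row (here a made-up article with its tags already joined in) is turned into Solr's XML update message, with a UTF-8 check on every value. The function names and the row are assumptions; sending the XML to the update handler is left out.

```php
<?php
// Renders one <field> element, rejecting values that are not valid UTF-8.
function toSolrFieldXml(string $name, $value): string
{
    $text = (string) $value;
    if (preg_match('//u', $text) !== 1) {
        throw new InvalidArgumentException("Field '$name' is not valid UTF-8");
    }
    return '<field name="' . htmlspecialchars($name, ENT_QUOTES | ENT_XML1, 'UTF-8') . '">'
         . htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8')
         . '</field>';
}

// Renders an <add> message; array values become repeated (multi-valued) fields.
function toSolrAddXml(array $documents): string
{
    $xml = '<add>';
    foreach ($documents as $fields) {
        $xml .= '<doc>';
        foreach ($fields as $name => $value) {
            foreach ((array) $value as $single) {
                $xml .= toSolrFieldXml((string) $name, $single);
            }
        }
        $xml .= '</doc>';
    }
    return $xml . '</add>';
}

// Hypothetical denormalized row: the tags come from a joined table.
$row = ['id' => 17, 'title' => 'Tom & Jerry', 'tags' => ['cartoon', 'classic']];
$xml = toSolrAddXml([$row]);
echo $xml, "\n";
```

Rejecting invalid input early is a design choice for the sketch; an application with legacy Latin-1 data would more likely convert it before this step.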
44. Snippets (1)
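// Method signatures only; the bodies are not shown on the slide
class eZSolrDoc
{
    // PHP 4 style constructor, named after the class
    function eZSolrDoc( $boost = false ) { /* ... */ }
    public function setBoost( $boost = false ) { /* ... */ }
    public function addField( $name, $content, $boost = false ) { /* ... */ }
    public function docToXML() { /* ... */ }
}
class eZSolr
{
    public function addDocs( $docs = array(), $commit = true,
                             $optimize = false, $commitWithin = 0 ) { /* ... */ }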
.....
45. Searching
● Construct a GET/POST query
● Base parameters
● “q” for query text
● “start” for offset
● “rows” for max number of results to
return
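The base parameters map directly onto pagination. A small sketch, assuming a select handler at http://localhost:8983/solr/select and a hypothetical helper name:

```php
<?php
// Builds a search URL for a given result page (pages start at 1).
function buildSearchUrl(string $baseUrl, string $query, int $page = 1, int $perPage = 10): string
{
    $params = [
        'q'     => $query,
        'start' => max(0, ($page - 1) * $perPage), // offset of the first result
        'rows'  => $perPage,                       // maximum number of results
        'wt'    => 'json',
    ];
    return $baseUrl . '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

// Assumed base URL of the select handler.
$url = buildSearchUrl('http://localhost:8983/solr/select', 'green bottle', 3);
echo $url, "\n";
```

The same parameters can be sent in a POST body when the query text gets long.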
49. Indexing binary files
● Solr 1.4 includes the Apache Tika libraries
● converts almost any format to plain text
● you can activate a dedicated
request handler for it
OR
● Use it standalone (command line) for
integration into existing code
See: http://lucene.apache.org/tika/
50. Integrate legacy data
● Use the Solr Data Import Handler
● Able to index databases directly
● define the schema to use (including
possible joins)
● fire simple requests to Solr to actually
index/update
● Also XML feeds, files (csv), ...
51. Have multilingual content?
● Multi-core configuration
● Set up a dedicated Solr core per language
● Each has its own schema definitions, while
you can still use common field names
● If using one index
● Use dynamic fields and create language
specific analyzers for dedicated language
suffixes/prefixes
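For the single-index approach, the client has to route each field to its language variant. A sketch under an assumed naming scheme, field name plus language suffix (title_fr), which would be matched by dynamic field patterns such as *_fr in schema.xml; the function names and the language list are hypothetical.

```php
<?php
// Returns the language-specific field name, e.g. "title" + "fr" => "title_fr".
function localizedField(string $field, string $lang, array $supported = ['en', 'fr', 'nl']): string
{
    if (!in_array($lang, $supported, true)) {
        throw new InvalidArgumentException("No analyzer configured for language '$lang'");
    }
    return $field . '_' . $lang;
}

// Renames all fields of a document except shared ones such as the unique ID.
function localizeDocument(array $fields, string $lang, array $shared = ['id']): array
{
    $doc = [];
    foreach ($fields as $name => $value) {
        $key = in_array($name, $shared, true) ? $name : localizedField($name, $lang);
        $doc[$key] = $value;
    }
    return $doc;
}

print_r(localizeDocument(['id' => '7', 'title' => 'Bonjour'], 'fr'));
```

The same helper can be used at query time to pick the field to search in, so that the query text goes through the analyzer of the matching language.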