Find it,
possibly also near you!
       Paul Borgermans
About me
●   Currently employed by eZ Systems http://ez.no
●   Active in open source community for a while
     –   Squid http proxy server (about 15 y ago)
     –   PHP based CMS solutions (mostly eZ Publish)
     –              executive committee

●   Currently fancying :
     –   PHP as the master glue language for almost everything
     –   Apache Lucene family of projects (mainly Solr)
     –   NoSQL (Not only SQL) and scalable architectures
     –   CMS systems & information management
Outline
●   Overview of Apache Solr
●   Concepts & internals
●   How to use it with PHP
●   Use cases & tips
●   Resources
Overview of Apache Solr
Apache Solr Curriculum Vitae
●   Open source Apache Lucene project,
    started by Yonik Seeley
●   Standalone, enterprise grade search
    server built on top of Lucene
●   Lives in a Java servlet container
●   Access through a REST-ful API
        –   HTTP
        –   Primary payload in requests: XML
        –   Other response formats: PHP, JSON, …
Used by ..




And many more ...
Solr in a nutshell
●   State of the art, advanced full text search and
    information retrieval
●   Fast, scalable with native replication features
●   Flexible configuration
●   Document oriented storage
●   Geospatial search
●   Native cloud features
Full text search main features
●   Tunable relevancy ranking on top of internal
    similarity algorithms
●   Highlighting
●   Sorting
●   Filtering
●   “Drill-down” navigation (facets)
●   Automatic related content
●   Spell checking
●   Multilingual text analysis
At a glance ..
Tunable relevancy ranking
●   “Boosting” at index and query time
        –   certain types of content
        –   certain parts of content (“fields”)
        –   page-rank-like boosting if the content has relations
●   Elevate request component
        –   predefined “pages/documents” to the top when certain
              keywords are entered
●   With customised functions
        –   more recent articles
        –   proximity (geolocations)
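As a sketch, boosting more recent articles can be done with a function query on a date field when using the dismax handler (the field names title, body and published are hypothetical):

```
# weight title matches over body matches, and boost recent documents
q=bottle&defType=dismax
&qf=title^2.0 body^1.0
&bf=recip(ms(NOW,published),3.16e-11,1,1)
```

The recip(ms(NOW,…),3.16e-11,1,1) form is the usual recency boost: it decays smoothly from 1 for brand-new documents towards 0 for old ones.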
Filtering
●   Does not influence the relevancy
●   Narrows down the scope
●   Very powerful: full boolean, wildcards,
    fuzzy, and unlimited combinations
●   Ranges (dates, numbers,
    alphanumeric, ...)


     Also for implementing security!
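For example, filters are passed as one or more fq parameters alongside the main query; each is cached independently of the relevancy scoring (field names here are hypothetical):

```
q=bottle
&fq=publication_date:[NOW-1YEAR TO NOW]    # date range filter
&fq=section:(news OR blog)                 # boolean filter
&fq=read_group:(2 OR 5)                    # security: groups this user belongs to
```

Because the security filter is just another fq, documents outside the user's groups never appear in any result set.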
Facets
●   Along the main query, “facet fields” may be defined,
    usually operating on meta-data:
        –   Type of content
        –   Publication year
        –   Keywords
        –   Author ....
●   The result set is returned along with the number
    of hits within each “facet”
●   You can use the selected facet as a subsequent filter
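A sketch of the request flow (facet field names hypothetical):

```
# initial query: ask for facet counts on two meta-data fields
q=bottle&facet=true&facet.field=content_type&facet.field=publication_year

# the user picks a facet value; re-issue the query with it as a filter
q=bottle&facet=true&facet.field=publication_year&fq=content_type:article
```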
Facets: example
Automatic related content (“More Like This”)
●   Search engine determines itself which are the
    important terms of a page and performs a query
●   All other normal features can be used
       –   Filtering
       –   Sorting
       –   Facets
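A minimal request using the MoreLikeThis component on the standard handler (the id value and the fields in mlt.fl are hypothetical):

```
# find documents similar to a given one, based on its important terms
q=id:article-42&mlt=true&mlt.fl=title,body&mlt.mintf=1&mlt.mindf=1
```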
Spell checking
●   Two possible strategies
        –   Dictionary look-up
        –   Using the indexed words themselves (recommended)
●   Possible “Google” approach using the “best guess”
        –   Search for “Grein botle”
             =>        suggests “Green bottle”
●   Let Solr return individual keyword suggestions
      => more client side processing required
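Both strategies map onto request parameters; the “best guess” style corresponds to asking for a collation (a sketch):

```
q=grein+botle&spellcheck=true&spellcheck.count=5&spellcheck.collate=true
# spellcheck.collate asks Solr for a single rewritten query (“green bottle”),
# while the per-term suggestions are what you process client side
```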
Multilingual features
●   Adapted tokenizers
●   Stemming (reducing words to common form)
        –   Reduces some spelling errors too!
        –   May decrease accuracy
●   Different algorithms per language
●   Normalisation (“latin 1 characters”)
        –   élève = eleve, Spaß = spass, ...
Geospatial search
Performance
●   Solr employs intelligent caches
        –   filters
        –   queries
        –   internal indexes
●   Optimized for search/retrieval
●   Possible autowarming on start up
●   When updates are done, caches are
    reconstructed on the fly in the background
Performance (2)
●   Replication
        –   master-slave for now
        –   works across platforms with the same configuration
        –   no native OS features needed (or rsync)
        –   more cloud features under development
●   Sharding (client driven)
Concepts and internals
The Solr/Lucene index
●   Inverted index
●   Holds a collection of “documents” (hello NoSQL)
●   Document
        –   Collection of fields
        –   Flexible schema!
        –   Unique ID (user defined)
●   Solr uses an XML-based config file:

    schema.xml
Fields
●   Various field types, derived from base classes
●   Indexed
        –    contains the inverted index
        –    usually analyzed & tokenized
        –    makes it searchable and sortable
●   Stored
        –    also contains the original content
        –    content can be part of the request response
●   Can be multi-valued!
        –    opens possibilities beyond full text search
Field definitions: schema.xml
●   Field types
        –   text
        –   numerical
        –   dates
        –   location
        –   … (about 25 in total)
●   Actual fields (name, definition, properties)
●   Dynamic fields
●   Copy fields (as aggregators)
schema.xml: simple field type examples
    <fieldType name="string" class="solr.StrField"
               sortMissingLast="true" omitNorms="true"/>

    <!-- boolean type: "true" or "false" -->
    <fieldType name="boolean" class="solr.BoolField"
               sortMissingLast="true" omitNorms="true"/>

    <!-- A Trie based date field for faster date range
         queries and date faceting. -->
    <fieldType name="tdate" class="solr.TrieDateField"
               omitNorms="true" precisionStep="6"
               positionIncrementGap="0"/>

    <!-- A text field that only splits on whitespace
         for exact matching of words -->
    <fieldType name="text_ws" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>
schema.xml: more complex field type

    <!-- A general unstemmed text field - good if one
         does not know the language of the field -->
    <fieldType name="textgen" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="false"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1"
                catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
Huh?
Analysis
●   Solr does not really search your text, but rather
    the terms that result from the analysis of text
●   Typically a chain of
        –   Character filter(s)
        –   Tokenisation
        –   Filter A
        –   Filter B
        –   …
Solr comes with many tokenizers and filters

●   Some are language specific
●   Others are very specialised
●   It is very important to get this right

    otherwise, you may not get what you expect!
Text analysis examples
Field type “text”:

String       terms at position 1    terms at position 2

iPad         i                      pad, ipad

élève.       elev

PowerShot    power                  shot, powershot
Character filters
●   Used to cleanup text before tokenizing
       –   HTMLStripCharFilter (strips html, xml, js, css)
       –   MappingCharFilter (normalisation of characters,
            removing accents)
       –   Regular expression filter
Tokenizers
●   Convert text to tokens (terms)
●   You can define only one per field/analyzer
●   Examples
        –   WhitespaceTokenizer (splits on white space)
        –   StandardTokenizer
        –   CJK variants
Additional filters
●   Many possible per field/analyzer
●   Many delivered with Solr out of the box
●   If not enough, write a tiny bit of Java or look for
    contributions



●   Examples ...
Phonetic filters
●   PhoneticFilterFactory
●   “sounds like” transformations and matching
●   Algorithms:
       –   Metaphone
       –   Double Metaphone
       –   Soundex
       –   Refined Soundex
Reversing Filter
●   Reverses the order of characters
●   Use: allow “leading wildcards”
●   *thing => gniht*
●   A lot faster (prefixes)
Synonyms
●   Inject synonyms for certain terms
●   Language specific
●   Best used for query time analysis
       –   may inflate the search index too much
       –   decreases relevancy
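The synonyms.txt format (these entries are the stock examples shipped with Solr):

```
# comma-separated groups are expanded into each other at query time
ipod, i-pod, i pod

# "=>" maps the left-hand terms onto the right-hand replacements only
foozball => foosball
```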
Stemming
●   Reduce terms to their root form
       –   Plural forms
       –   Conjugations
●   Language specific (or not relevant, CJK)
●   Many specialised stemmers available
       –   Most European languages
       –   Dutch (!)
Copy fields
●   Analysis is done differently for
        –   searching/filtering
        –   faceting/sorting
●   Stemming and not stemming in different fields
    can increase relevance of results

●   Use copy fields in schema.xml or do it client
    side
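A schema.xml sketch with a stemmed field for searching and a raw string copy for faceting/sorting (field names hypothetical):

```
<!-- analyzed field for full text search -->
<field name="title"       type="text"   indexed="true" stored="true"/>
<!-- untouched copy for exact faceting and sorting -->
<field name="title_facet" type="string" indexed="true" stored="false"/>

<copyField source="title" dest="title_facet"/>
```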
Geospatial search
●   Solr dedicated fields
        –   Latitude Longitude type
●   Special geospatial functions in filtering &
    boosting
        –   Haversine distance (geosphere)
        –   Simple ranges (squares in 2-D)
        –   Special query constructs (upcoming)
How to use it with PHP
Get the data and feed it
●   Most *AMP applications have databases
●   Map your data to a “document model”
       –   denormalization, flattening
       –   most DB fields can be fed unaltered, Solr takes
            care of the rest
●   Send it through HTTP as XML

●   One constraint: it must be UTF-8!
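A minimal update payload, POSTed to /solr/update (field names hypothetical):

```
<add>
  <doc>
    <field name="id">article-42</field>
    <field name="title">Green bottles</field>
    <field name="body">Ten green bottles hanging on the wall ...</field>
  </doc>
</add>
```

Sending a separate <commit/> to the same URL makes the new documents visible to searches.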
Searching
  ●   Construct a GET/POST query
  ●   Base parameters
            –   “q” for query text
            –   “start” for offset
            –   “rows” for max number of results to return
Example:
http://localhost:8983/solr/select/?q=content&version=2.2&start=0&rows=10&indent=on&wt=php
Searching (2)
●   Additional parameters
         –   response format (wt)
                   ●   php = array(), json, ...
         –   type of search handler (qt)
         –   highlighting (hl.*)
         –   facets (f.<fieldName>.<FacetParam>=<value>)
         –   spellcheck (spellcheck)
         –   …
PHP client side
●   Roll your own classes & functions
         –   Not difficult, it's REST after all
         –   Some cURL, XML, JSON or native PHP array parsing
●   Use existing libraries
         –   PECL: http://pecl.php.net/package/solr
         –   http://code.google.com/p/solr-php-client/
                    (follows ZF coding standards)
         –   eZ Components: ezcSearch
●   PHP CMSs usually come with their own
         –   eZ Publish, Drupal, Symfony ...
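A minimal hand-rolled sketch of the roll-your-own approach: build a /select URL with http_build_query and decode a wt=json response into plain PHP arrays (the helper name and the canned response body are hypothetical; a real call would fetch the body with cURL or file_get_contents):

```php
<?php
// Build a Solr /select URL from an array of request parameters.
function solr_select_url($base, array $params) {
    // http_build_query URL-encodes the parameter values for us
    return rtrim($base, '/') . '/select/?' . http_build_query($params);
}

$url = solr_select_url('http://localhost:8983/solr', array(
    'q' => 'green bottle', 'start' => 0, 'rows' => 10, 'wt' => 'json',
));

// Sample response body as Solr would return it with wt=json
$body = '{"response":{"numFound":2,"docs":[{"id":"article-42"}]}}';

// json_decode with true gives nested associative arrays
$result = json_decode($body, true);
echo $result['response']['numFound'], "\n"; // total number of hits
```

The same structure comes back with wt=php / wt=phps, but JSON keeps the parsing one safe function call.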
Use-cases & tips
Indexing binary files
●   Solr includes the Apache Tika libraries
        –   converts almost any format to plain text
        –   you can activate a dedicated requesthandler for it

                 OR
●   Use it standalone (command line) for integration into
    existing code

       See: http://lucene.apache.org/tika/
Integrate legacy data
●   Use the Solr Data Import Handler
●   Able to index databases directly
        –   define the schema to use (including possible
             joins)
        –   fire simple requests to Solr to actually
               index/update
●   Also XML feeds, files (csv), ...
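A data-config.xml sketch for the Data Import Handler (connection details, table and column names are hypothetical):

```
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/shop"
              user="reader" password="..."/>
  <document>
    <entity name="product" query="SELECT id, name, price FROM product">
      <field column="id"    name="id"/>
      <field column="name"  name="title"/>
      <field column="price" name="price"/>
    </entity>
  </document>
</dataConfig>
```

Indexing is then triggered with a simple request such as /dataimport?command=full-import.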
e-Commerce
●   If you want to sell, make sure users find the products
    they want
        –   Use facets (categories, drill-down, …)
        –   Push high margin / hot / new products with elevation
        –   Pay a lot of attention to index and query time analysis
●   Feed additional meta-data and use it to tune
        –   Ratings
        –   Analytics (Google, Omniture, ...)
Have multilingual content?
●   Multi-core configuration
        –   Set up a dedicated Solr core per language
        –   Each has its own schema definitions, while you
             can still use common field names
●   If using one index
        –   Use dynamic fields and create language-specific
             analyzers keyed on dedicated language
             suffixes/prefixes
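A schema.xml sketch of the single-index variant (the suffixes and the text_en / text_de field types are hypothetical; each type would carry its own language analyzer chain):

```
<!-- any field ending in _en or _de gets the matching language analysis -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_de" type="text_de" indexed="true" stored="true"/>
```

The client then feeds title_en, title_de, ... and queries the field matching the user's language.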
Resources
●   Solr: wiki, mailing lists, downloads
    http://lucene.apache.org/solr/
●   Free book, articles (by core Solr devs)
    http://www.lucidimagination.com/
●   Bother me ;)
Thank you!

                Questions?

email: paul dot borgermans at gmail dot com
      http://twitter.com/paulborgermans

        Please rate this talk/slides:
        http://joind.in/talk/view/1504
