Get the most out of
Solr search with PHP
      Paul Borgermans
About me

●   Active in open source community for a while
    ●   Squid Proxy server (about 15y ago)
    ●   PHP based CMS solutions (mostly eZ Publish)
●   Currently fancying:
    ●   PHP as the master glue language for almost everything
    ●   Apache Lucene family of projects (mainly Solr)
    ●   NoSQL (Not only SQL) and scalable architectures
    ●   CMS systems & all kinds of challenges in information
        management
Outline


●   Overview of Apache Solr
●   How to use it with PHP (1)
●   Concepts & internals
●   How to use it with PHP (2)
●   Miscellaneous tips
●   Resources
Overview of Apache
       Solr
Solr Curriculum Vitae

●   Open source Apache Lucene subproject
●   Standalone, enterprise grade search server
    built on top of Lucene
●   Lives in a Java servlet container
●   Access through a REST-ful API
    ●   HTTP
    ●   Primary payload in requests: XML
    ●   Other response formats: PHP, JSON, …
Solr in a nutshell

●   State of the art, advanced full text search
    and information retrieval
●   Fast, scalable with native replication features
●   Flexible configuration
●   Document oriented storage
●   Extensible (if you know a bit of Java), but
    usually not needed
Full text search main features
●   Tunable relevancy ranking on top of internal similarity
    algorithms
●   Highlighting
●   Sorting
●   Filtering
●   “Drill-down” navigation (facets)
●   Automatic related content
●   Spell checking
●   Multilingual text analysis
At a glance ..
Tunable relevancy ranking
●   “Boosting” at index and query time
    ●   certain types of content
    ●   certain parts of content (“fields”)
    ●   page-rank-like boosting if the content has relations

●   Elevate request component
    ●   predefined “pages/documents” to the top when
        certain keywords are entered
●   With customised functions
    ●   more recent articles
    ●   proximity (geolocations)
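
As a rough sketch of query-time boosting on certain fields, the helper below builds a "qf" (query fields) value for Solr's dismax query parser. The field names title and body, the weights, and the helper name buildQueryFields are assumptions made for illustration, not something taken from the slides.

```php
<?php
// Hypothetical helper: turn a field => weight map into a dismax "qf" value.
function buildQueryFields(array $weights): string
{
    $parts = [];
    foreach ($weights as $field => $weight) {
        // "field^weight" is the Lucene/Solr boost syntax.
        $parts[] = $field . '^' . $weight;
    }
    return implode(' ', $parts);
}

// Assumed field names: matches in "title" count twice as much as in "body".
$qf = buildQueryFields(['title' => 2, 'body' => 1]);

$params = [
    'q'       => 'green bottle',
    'defType' => 'dismax',
    'qf'      => $qf,
];
echo http_build_query($params, '', '&'), "\n";
```

Index-time boosts would instead be attached to the document or its fields when feeding data.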
Filtering

●   Does not influence the relevancy
●   Narrows down the scope
●   Very powerful: full boolean, wildcards, fuzzy,
    and unlimited combinations
●   Ranges (dates, numbers, alphanumeric, ...)


       Also for implementing security!
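
A minimal sketch of filtering from PHP, assuming hypothetical field names (type, year, allowed_groups). Filters go into the "fq" parameter, which may be repeated, so the query string is built by hand here: http_build_query() would emit fq[0]=...&fq[1]=... instead of a repeated fq.

```php
<?php
// Hypothetical helper: build a query string in which list values
// become repeated parameters (fq=...&fq=...).
function buildSolrQueryString(array $params): string
{
    $pairs = [];
    foreach ($params as $name => $value) {
        foreach ((array) $value as $item) {
            $pairs[] = rawurlencode((string) $name) . '=' . rawurlencode((string) $item);
        }
    }
    return implode('&', $pairs);
}

$queryString = buildSolrQueryString([
    'q'  => 'solr',
    'fq' => [
        'type:article',                      // narrow down to one content type
        'year:[2008 TO 2010]',               // range filter
        'allowed_groups:(editors OR staff)', // security: only what this user may see
    ],
]);
echo $queryString, "\n";
```

Because filters do not take part in scoring, the security filter can be added to every request without changing the ranking.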
Facets

●   Alongside the main query, “facet fields” may be defined,
    usually operating on meta-data:
    ●   Type of content
    ●   Publication year
    ●   Keywords
    ●   Author ....
●   The result set is returned together with the number of hits
    within each “facet”
●   You can use the selected facet as a subsequent filter
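
A sketch of reading facet counts in PHP. With wt=json, Solr returns each facet field as a flat list of alternating values and counts by default. The response fragment below is made up, and the field name type and the helper facetListToMap are assumptions.

```php
<?php
// Hypothetical helper: convert the flat [value, count, value, count, ...]
// facet list into a value => count map.
function facetListToMap(array $flat): array
{
    $map = [];
    for ($i = 0; $i + 1 < count($flat); $i += 2) {
        $map[$flat[$i]] = $flat[$i + 1];
    }
    return $map;
}

// Made-up fragment shaped like the "facet_counts" part of a JSON response
// to a request with facet=true&facet.field=type.
$json = '{"facet_counts":{"facet_fields":{"type":["article",12,"blog",5]}}}';
$response = json_decode($json, true);
$typeFacet = facetListToMap($response['facet_counts']['facet_fields']['type']);

foreach ($typeFacet as $value => $count) {
    // A selected facet value can be sent back as a filter, e.g. fq=type:article
    echo $value, ' (', $count, ') -> fq=type:', $value, "\n";
}
```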
Facets: example
Automatic related content (“More Like This”)

   ●   Search engine determines itself which are
       the important terms of a page and
       performs a query
   ●   All other normal features can be used
       ●   Filtering
       ●   Sorting
       ●   Facets
Spell checking
●   Two possible strategies
    ●   Dictionary look-up
    ●   Using the indexed words themselves
        (recommended)
●   Possible “Google” approach using the “best
    guess”
    ●   Search for “Grein botle”
        => suggests “Green bottle”
●   Let Solr return individual keyword suggestions
      => more client side processing required
Multilingual features

●   Adapted tokenizers
●   Stemming (reducing words to common form)
    ●   Reduces some spelling errors too!
    ●   May decrease accuracy
●   Different algorithms per language
●   Normalisation (“Latin-1 characters”)
    ●   élève = eleve, Spaß = spass, ...
Performance

●   Solr employs intelligent caches
    ●   filters
    ●   queries
    ●   internal indexes
●   Optimized for search/retrieval
●   Possible autowarming on start up
●   When updates are done, caches are reconstructed
    on the fly in the background
Performance (2)

●   Replication
    ●   master-slave for now
    ●   works across platforms with same
        configuration
    ●   no native OS features (or rsync) needed
    ●   more cloud features under development
●   Sharding (client driven)
How to use it with PHP

        part 1
Installation of backend: 4 easy steps


●   Download from
    http://lucene.apache.org/solr/index.html
    and unpack
●   Make sure you have a Java VM >= 1.5
    ●   $ java -version
    ●   Sun/IBM recommended
    ●   gcj won't do!
●   $ java -jar start.jar
●   http://localhost:8983/solr/admin
Voila!
PHP: the client side

●   Roll your own classes
    ●   Not difficult, it's REST after all
    ●   Some cURL, XML, JSON or native PHP array parsing
●   Use existing libraries
    ●   PECL: http://pecl.php.net/package/solr
    ●   http://code.google.com/p/solr-php-client/
        (follows ZF coding standards)
    ●   eZ Components: ezcSearch
●   PHP CMSs usually come with their own
    ●   eZ Publish, Drupal, Symfony ...
What's next?

●   Getting data into Solr
●   Basic searches
●   Advanced requests


●   But first something on the concepts and
    internals
Concepts and internals
The Solr/Lucene index

●   Inverted index
●   Holds a collection of “documents”
●   Document
    ●   Collection of fields
    ●   Flexible schema!
    ●   Unique ID (user defined)
●   Solr uses an XML-based config file:

    schema.xml
Fields

●   Various field types, derived from base classes
●   Indexed
    ●   contains the inverted index
    ●   usually analyzed & tokenized
    ●   makes it searchable and sortable
●   Stored
    ●   contains also the original content
    ●   content can be part of the request response
●   Can be multi-valued!
    ●   opens possibilities beyond full text search
Field definitions: schema.xml

●   Field types
    ●   text
    ●   numerical
    ●   dates
    ●   location
    ●   … (about 25 in total)
●   Actual fields (name, definition, properties)
●   Dynamic fields
●   Copy fields (as aggregators)
schema.xml: simple field type examples

    <fieldType name="string" class="solr.StrField"
               sortMissingLast="true" omitNorms="true"/>

    <!-- boolean type: "true" or "false" -->
    <fieldType name="boolean" class="solr.BoolField"
               sortMissingLast="true" omitNorms="true"/>

    <!-- A Trie based date field for faster date range queries and date faceting. -->
    <fieldType name="tdate" class="solr.TrieDateField"
               omitNorms="true" precisionStep="6" positionIncrementGap="0"/>

    <!-- A text field that only splits on whitespace for exact matching of words -->
    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>
schema.xml: more complex field type

    <!-- A general unstemmed text field - good if one does not know the language of the field -->
    <fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
                enablePositionIncrements="false"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"
                splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"
                expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
                enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"
                splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
Huh?
Analysis

●   Solr does not really search your text, but rather
    the terms that result from the analysis of text
●   Typically a chain of
    ●   Character filter(s)
    ●   Tokenisation
    ●   Filter A
    ●   Filter B
    ●   …
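
To make the chain concrete, here is a toy analysis pipeline in PHP. It only illustrates the order of the steps and is not how Solr implements analysis (that is configured in schema.xml). The character map, the stop words and the function name analyze are assumptions.

```php
<?php
// Toy analysis chain: character filter -> tokenizer -> filter A -> filter B.
function analyze(string $text, array $stopWords): array
{
    // 1. Character filter: map a few accented characters to ASCII (tiny sample map).
    $text = strtr($text, ['é' => 'e', 'è' => 'e', 'ß' => 'ss']);

    // 2. Tokenizer: split on whitespace.
    $tokens = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    // 3. Filter A: lower-case every token.
    $tokens = array_map('strtolower', $tokens);

    // 4. Filter B: drop stop words.
    return array_values(array_diff($tokens, $stopWords));
}

$terms = analyze('The élève had Spaß', ['the', 'had']);
print_r($terms);
```

The search then runs against the resulting terms, which is why the same (or a compatible) chain has to be applied at index and at query time.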
Solr comes with many tokenizers and filters

   ●   Some are language specific
   ●   Others are very specialised
   ●   It is very important to get this right

       otherwise, you may not get what you
       expect!
Text analysis examples

  String       Field type “text”   term position 1   term position 2

  iPad         =>                  i                 pad, ipad
  élève.       =>                  elev
  PowerShot    =>                  power             shot, powershot

Let's have a look: http://localhost:8983/solr/admin
Character filters

●   Used to clean up text before tokenizing
    ●   HTMLStripCharFilter (strips html, xml, js,
        css)
    ●   MappingCharFilter (normalisation of
        characters, removing accents)
    ●   Regular expression filter
Tokenizers

●   Convert text to tokens (terms)
●   You can define only one per field/analyzer
●   Examples
    ●   WhitespaceTokenizer (splits on white
        space)
    ●   StandardTokenizer
    ●   CJK variants
Additional filters

●   Many possible per field/analyzer
●   Many delivered with Solr out of the box
●   If not enough, write a tiny bit of Java or
    look for contributions




●   Examples ...
Phonetic filters

●   PhoneticFilterFactory
●   “sounds like” transformations and matching
●   Algorithms:
    ●   Metaphone
    ●   Double Metaphone
    ●   Soundex
    ●   Refined Soundex
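
PHP ships with soundex() and metaphone(), which give a feel for what a phonetic key looks like. This is only an illustration of the idea; the keys produced by Solr's PhoneticFilterFactory may differ in detail.

```php
<?php
// Words that sound alike are reduced to the same phonetic key.
$names = ['Robert', 'Rupert', 'Rubin'];
foreach ($names as $name) {
    echo $name, ': soundex=', soundex($name), ' metaphone=', metaphone($name), "\n";
}

// "Robert" and "Rupert" share a Soundex key, so a search for one can match the other.
$sameKey = soundex('Robert') === soundex('Rupert');
var_dump($sameKey);
```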
Reversing Filter

●   Reverses the order of characters
●   Use: allow “leading wildcards”
●   *thing => gniht*
●   A lot faster (prefixes)
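
The trick can be sketched in a few lines of PHP. This is a toy illustration under the assumption of ASCII-only terms, not Solr's implementation; the sample terms and the helper name reverseLeadingWildcard are made up.

```php
<?php
// Rewrite a leading-wildcard pattern into a prefix pattern on reversed terms.
function reverseLeadingWildcard(string $pattern): string
{
    if ($pattern !== '' && $pattern[0] === '*') {
        // strrev() works on bytes, so this sketch is only safe for ASCII terms.
        return strrev(substr($pattern, 1)) . '*';
    }
    return $pattern;
}

// Terms are stored reversed at index time ...
$reversedIndex = array_map('strrev', ['something', 'nothing', 'thinker']);

// ... so "*thing" becomes the cheap prefix lookup "gniht*".
$prefix = rtrim(reverseLeadingWildcard('*thing'), '*');
$hits = array_filter(
    $reversedIndex,
    fn($term) => strncmp($term, $prefix, strlen($prefix)) === 0
);
$matches = array_map('strrev', array_values($hits));
print_r($matches);
```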
Synonyms

●   Inject synonyms for certain terms
●   Language specific
●   Best used for query time analysis
    ●   may inflate the search index too much
    ●   decreases relevancy
Stemming

●   Reduce terms to their root form
●   Language specific (or not relevant, CJK)
●   Many specialised stemmers available
    ●   Most European languages
Copy fields

●   Analysis is done differently for
    ●   searching/filtering
    ●   faceting/sorting
●   Stemming and not stemming in different fields
    can increase relevance of results




●   Use copy fields in schema.xml or do it client side
How to use it with PHP

       part 2
Get the data and feed it

●   Most *AMP applications have databases
●   Map your data to a “document model”
    ●   denormalization, flattening
    ●   most DB fields can be fed unaltered, Solr
        takes care of the rest


●   One constraint: it must be UTF-8!
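
A sketch of the mapping step, assuming a hypothetical joined database row (article plus author name plus a comma-separated tag list). The column and field names are invented for illustration. The UTF-8 check relies on preg_match() with the /u modifier, which fails on invalid UTF-8 input.

```php
<?php
// Hypothetical mapping of one joined DB row to a flat Solr document.
function rowToDocument(array $row): array
{
    $doc = [
        'id'     => 'article-' . $row['id'],    // unique ID, user defined
        'title'  => $row['title'],
        'author' => $row['author_name'],        // denormalized from a joined table
        'tags'   => explode(',', $row['tags']), // multi-valued field
    ];

    // Reject anything that is not valid UTF-8 before it reaches Solr.
    array_walk_recursive($doc, function ($value) {
        if (preg_match('//u', (string) $value) !== 1) {
            throw new InvalidArgumentException('Field value is not valid UTF-8');
        }
    });

    return $doc;
}

$doc = rowToDocument([
    'id'          => 42,
    'title'       => 'Élève',
    'author_name' => 'Paul',
    'tags'        => 'solr,php',
]);
print_r($doc);
```

In a real application, data in another encoding would be converted rather than rejected.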
Snippets (1)
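   class eZSolrDoc
   {
        // PHP 4-style constructor; on current PHP versions this would be __construct()
        function eZSolrDoc( $boost = false ) { /* ... */ }

        // Set the document-level boost
        public function setBoost ( $boost = false ) { /* ... */ }

        // Add a field value, optionally with its own boost
        public function addField ( $name, $content, $boost = false ) { /* ... */ }

        // Serialize the document to XML for Solr
        public function docToXML() { /* ... */ }
   }

   class eZSolr
   {
        // Send documents to Solr, optionally committing and optimizing
        public function addDocs ( $docs = array(), $commit = true,
                                  $optimize = false, $commitWithin = 0 ) { /* ... */ }

        // .....
   }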
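
For readers rolling their own classes, the following sketch shows what an XML update message for Solr can look like. It is not the eZSolrDoc implementation; the function name docsToAddXml and the sample fields are assumptions, and boosts are left out for brevity.

```php
<?php
// Build an <add> message from a list of field => value(s) maps.
function docsToAddXml(array $docs): string
{
    $xml = '<add>';
    foreach ($docs as $fields) {
        $xml .= '<doc>';
        foreach ($fields as $name => $values) {
            // Multi-valued fields are sent as repeated <field> elements.
            foreach ((array) $values as $value) {
                $xml .= '<field name="'
                      . htmlspecialchars((string) $name, ENT_QUOTES | ENT_XML1, 'UTF-8')
                      . '">'
                      . htmlspecialchars((string) $value, ENT_QUOTES | ENT_XML1, 'UTF-8')
                      . '</field>';
            }
        }
        $xml .= '</doc>';
    }
    return $xml . '</add>';
}

$xml = docsToAddXml([
    ['id' => 'article-42', 'title' => 'Solr & PHP', 'tags' => ['solr', 'php']],
]);
echo $xml, "\n";
```

The message would then be POSTed to the update handler (for example with cURL and a text/xml content type), followed by a commit to make the documents searchable.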
Searching

●   Construct a GET/POST query
●   Base parameters
    ●   “q” for query text
    ●   “start” for offset
    ●   “rows” for max number of results to
        return
Searching (2)

●   Additional parameters
    ●   response format (wt)
        ●   php = array(), json, ...
    ●   type of search handler (qt)
    ●   highlighting (hl.*)
    ●   facets (facet.*, per field: f.<fieldName>.<FacetParam>=<value>)
    ●   spellcheck (spellcheck)
    ●   …
Example: http://localhost:8983/solr/select/?q=content&version=2.2&start=0&rows=10&indent=on&wt=php
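
A minimal sketch of a basic search from PHP, using the JSON response format so the result can be read with json_decode(). The response body below is made up so that the snippet runs without a Solr server, and the helper name extractDocs is an assumption.

```php
<?php
// Base URL as in the example above; adjust for your installation.
$baseUrl = 'http://localhost:8983/solr/select';
$params  = ['q' => 'content', 'start' => 0, 'rows' => 10, 'wt' => 'json'];
$url     = $baseUrl . '?' . http_build_query($params, '', '&');

// Hypothetical helper: pull the result documents out of a JSON response body.
function extractDocs(string $body): array
{
    $data = json_decode($body, true);
    return $data['response']['docs'] ?? [];
}

// With a running Solr, $body could come from file_get_contents($url) or cURL.
// A made-up response body stands in for it here.
$body = '{"response":{"numFound":1,"start":0,"docs":[{"id":"article-42"}]}}';
$docs = extractDocs($body);

echo $url, "\n";
print_r($docs);
```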
Searching (3): a utility class
Some more tips
Indexing binary files

●   Solr 1.4 includes the Apache Tika libraries
    ●   convert about any format to plain text
    ●   you can activate a dedicated
        requesthandler for it

           OR
●   Use it standalone (command line) for
    integration into existing code
         See: http://lucene.apache.org/tika/
Integrate legacy data

●   Use the Solr Data Import Handler
●   Able to index DB's directly
    ●   define the schema to use (including
        possible joins)
    ●   fire simple requests to Solr to actually
        index/update
●   Also XML feeds, files (csv), ...
Have multilingual content?

●   Multi-core configuration
    ●   Setup a dedicated Solr core per language
    ●   Each has its own schema definitions, while
        you can still use common field names
●   If using one index
    ●   Use dynamic fields and create language
        specific analyzers for dedicated language
        suffixes/prefixes
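
On the PHP side, the single-index approach boils down to a naming convention. The sketch below assumes a suffix convention (title_en, title_fr, ...) matched in schema.xml by dynamic fields such as *_en or *_fr. The convention and the helper name localizedFieldName are assumptions.

```php
<?php
// Hypothetical naming convention: <base>_<two-letter language code>.
function localizedFieldName(string $base, string $language): string
{
    if (preg_match('/^[a-z]{2}$/', $language) !== 1) {
        throw new InvalidArgumentException("Unexpected language code: $language");
    }
    return $base . '_' . $language;
}

// One document carrying the same logical field in two languages.
$doc = [
    'id'                              => 'article-42',
    localizedFieldName('title', 'en') => 'Green bottle',
    localizedFieldName('title', 'fr') => 'Bouteille verte',
];
print_r($doc);
```

At query time the same helper can pick the field to search, based on the language of the visitor.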
Resources

●   Solr: wiki, mailing lists, downloads
    http://lucene.apache.org/solr/
●   Free book, articles (by core Solr devs)
    http://www.lucidimagination.com/
●   Ask me ;)
Thank you!


                Questions?


email: paul dot borgermans at gmail dot com
     http://twitter.com/paulborgermans

 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 

Get the most out of Solr search with PHP

  • 1. Get the most out of Solr search with PHP Paul Borgermans
  • 2. About me ● Active in open source community for a while ● Squid Proxy server (about 15y ago) ● PHP based CMS solutions (mostly eZ Publish) ● Currently fancying : ● PHP as the master glue language for almost everything ● Apache Lucene family of projects (mainly Solr) ● NoSQL (Not only SQL) and scalable architectures ● CMS systems & all kinds of challenges in information management
  • 3. Outline ● Overview of Apache Solr ● How to use it with PHP (1) ● Concepts & internals ● How to use it with PHP (2) ● Miscellaneous tips ● Resources
  • 5. Solr Curriculum Vitae ● Open source Apache Lucene subproject ● Standalone, enterprise grade search server built on top of Lucene ● Lives in a Java servlet container ● Access through a REST-ful API ● HTTP ● Primary payload in requests: XML ● Other response formats: PHP, JSON, …
  • 6. Solr in a nutshell ● State of the art, advanced full text search and information retrieval ● Fast, scalable with native replication features ● Flexible configuration ● Document oriented storage ● Extensible (if you know a bit of Java), but usually not needed
  • 7. Full text search main features ● Tunable relevancy ranking on top of internal similarity algorithms ● Highlighting ● Sorting ● Filtering ● “Drill-down” navigation (facets) ● Automatic related content ● Spell checking ● Multilingual text analysis
  • 9. Tunable relevancy ranking ● “Boosting” at index and query time ● certain types of content ● certain parts of content (“fields”) ● page-rank like if the content has relations ● Elevate request component ● predefined “pages/documents” to the top when certain keywords are entered ● With customised functions ● more recent articles ● proximity (geolocations)
  • 10. Filtering ● Does not influence the relevancy ● Narrows down the scope ● Very powerful: full boolean, wildcards, fuzzy, and unlimited combinations ● Ranges (dates, numbers, alphanumeric, ...) ● Also for implementing security!
  • 11. Facets ● Alongside the main query, “facet fields” may be defined, usually operating on metadata: ● Type of content ● Publication year ● Keywords ● Author ... ● The result set is returned with the number of hits within each “facet” ● You can use the selected facet as a subsequent filter
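The facet workflow above (request facet counts, then reuse a chosen facet value as a filter) can be sketched as a small query-string builder. This is a minimal sketch: the function name `buildFacetParams` and the field names are hypothetical, and only the standard Solr parameters `q`, `facet`, `facet.field` and `fq` are assumed. The string is built by hand because `facet.field` and `fq` may be repeated, which `http_build_query` does not express directly.

```php
<?php
// Hypothetical helper: builds the query string for a faceted Solr search.
// $facetFields: fields to count hits on; $selected: field => value pairs
// the user already clicked, sent as filter queries (fq).
function buildFacetParams(string $query, array $facetFields, array $selected = []): string
{
    $parts = [
        'q=' . rawurlencode($query),
        'facet=true',
    ];
    foreach ($facetFields as $field) {
        $parts[] = 'facet.field=' . rawurlencode($field);
    }
    foreach ($selected as $field => $value) {
        // Quote the value so it is matched as one phrase.
        // Assumption: $value contains no double quotes itself.
        $parts[] = 'fq=' . rawurlencode($field . ':"' . $value . '"');
    }
    return implode('&', $parts);
}

echo buildFacetParams('solr', ['author', 'year'], ['author' => 'Paul Borgermans']), "\n";
// prints: q=solr&facet=true&facet.field=author&facet.field=year&fq=author%3A%22Paul%20Borgermans%22
```

Because the selected facet travels as `fq` rather than as part of `q`, it narrows the result set without influencing the relevancy ranking.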
  • 13. Automatic related content (“More Like This”) ● The search engine itself determines the important terms of a page and runs a query on them ● All other normal features can be used ● Filtering ● Sorting ● Facets
  • 14. Automatic related content (“More Like This”)
  • 15. Spell checking ● Two possible strategies ● Dictionary look-up ● Using the indexed words themselves (recommended) ● Possible “Google” approach using the “best guess” ● Search for “Grein botle” => suggests “Green bottle” ● Let Solr return individual keyword suggestions => more client side processing required
  • 16. Multilingual features ● Adapted tokenizers ● Stemming (reducing words to common form) ● Reduces some spelling errors too! ● May decrease accuracy ● Different algorithms per language ● Normalisation (“latin 1 characters”) ● élève = eleve, Spaß = spass, ...
  • 17. Performance ● Solr employs intelligent caches ● filters ● queries ● internal indexes ● Optimized for search/retrieval ● Possible autowarming on start up ● When updates are done, caches are reconstructed on the fly in the background
  • 18. Performance (2) ● Replication ● master-slave for now ● works across platforms with same configuration ● no native OS features needed (or rsync) ● more cloud features under development ● Sharding (client driven)
  • 19. How to use it with PHP, part 1
  • 20. Installation of backend: 4 easy steps ● Download from http://lucene.apache.org/solr/index.html and unpack ● Make sure you have a Java VM >= 1.5 ● $ java -version ● Sun/IBM recommended ● gcj won't do! ● $ java -jar start.jar ● http://localhost:8983/solr/admin
  • 22. PHP: the client side ● Roll your own classes ● Not difficult, it's REST after all ● Some cURL, XML, JSON or native PHP array parsing ● Use existing libraries ● PECL: http://pecl.php.net/package/solr ● http://code.google.com/p/solr-php-client/ (follows ZF coding standards) ● eZ Components: ezcSearch ● PHP CMSs usually come with their own ● eZ Publish, Drupal, Symfony ...
  • 23. What's next? ● Getting data into Solr ● Basic searches ● Advanced requests ● But first something on the concepts and internals
  • 25. The Solr/Lucene index ● Inverted index ● Holds a collection of “documents” ● Document ● Collection of fields ● Flexible schema! ● Unique ID (user defined) ● Solr uses an XML-based config file: schema.xml
  • 26. Fields ● Various field types, derived from base classes ● Indexed ● contains the inverted index ● usually analyzed & tokenized ● makes it searchable and sortable ● Stored ● also contains the original content ● content can be part of the request response ● Can be multi-valued! ● opens possibilities beyond full text search
  • 27. Field definitions: schema.xml ● Field types ● text ● numerical ● dates ● location ● … (about 25 in total) ● Actual fields (name, definition, properties) ● Dynamic fields ● Copy fields (as aggregators)
  • 28. schema.xml: simple field type examples
      <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
      <!-- boolean type: "true" or "false" -->
      <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
      <!-- A Trie based date field for faster date range queries and date faceting. -->
      <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
      <!-- A text field that only splits on whitespace for exact matching of words -->
      <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        </analyzer>
      </fieldType>
  • 29. schema.xml: more complex field type
      <!-- A general unstemmed text field - good if one does not know the language of the field -->
      <fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>
  • 30. Huh?
  • 31. Analysis ● Solr does not really search your text, but rather the terms that result from the analysis of text ● Typically a chain of ● Character filter(s) ● Tokenisation ● Filter A ● Filter B ● …
  • 32. Solr comes with many tokenizers and filters ● Some are language specific ● Others are very specialised ● It is very important to get this right; otherwise you may not get what you expect!
  • 33. Text analysis examples (field type “text”)
      String       term position 1    term position 2
      iPad      => i                  pad, ipad
      élève.    => elev
      PowerShot => power              shot, powershot
      Let's have a look: http://localhost:8983/solr/admin
  • 34. Character filters ● Used to clean up text before tokenizing ● HTMLStripCharFilter (strips HTML, XML, JS, CSS) ● MappingCharFilter (normalisation of characters, removing accents) ● Regular expression filter
  • 35. Tokenizers ● Convert text to tokens (terms) ● You can define only one per field/analyzer ● Examples ● WhitespaceTokenizer (splits on white space) ● StandardTokenizer ● CJK variants
  • 36. Additional filters ● Many possible per field/analyzer ● Many delivered with Solr out of the box ● If not enough, write a tiny bit of Java or look for contributions ● Examples ...
  • 37. Phonetic filters ● PhoneticFilterFactory ● “sounds like” transformations and matching ● Algorithms: ● Metaphone ● Double Metaphone ● Soundex ● Refined Soundex
  • 38. Reversing Filter ● Reverses the order of characters ● Use: allow “leading wildcards” ● *thing => gniht* ● A lot faster (prefixes)
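The leading-wildcard trick above can be illustrated in a few lines of PHP. This is a sketch of the idea only: it assumes a hypothetical field whose terms are stored as plain reversed strings, and the helper name `toReversedPrefix` is made up. Solr's own reversing filter works inside the analysis chain on the server, so client code like this is not normally required.

```php
<?php
// Hypothetical helper: rewrites a leading-wildcard term ("*thing") into a
// prefix query ("gniht*") for a field assumed to hold reversed terms.
function toReversedPrefix(string $term): string
{
    if ($term === '' || $term[0] !== '*') {
        return $term; // not a leading wildcard, leave untouched
    }
    // Split into UTF-8 characters so multi-byte letters survive the reversal.
    $chars = preg_split('//u', substr($term, 1), -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return $term; // invalid UTF-8, do not attempt a rewrite
    }
    return implode('', array_reverse($chars)) . '*';
}

echo toReversedPrefix('*thing'), "\n"; // prints: gniht*
```

A prefix query only has to walk one contiguous range of the sorted term dictionary, which is why the reversed form is so much cheaper than scanning every term for a matching suffix.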
  • 39. Synonyms ● Inject synonyms for certain terms ● Language specific ● Best used for query time analysis ● may inflate the search index too much ● decreases relevancy
  • 40. Stemming ● Reduce terms to their root form ● Language specific (or not relevant, CJK) ● Many specialised stemmers available ● Most European languages
  • 41. Copy fields ● Analysis is done differently for ● searching/filtering ● faceting/sorting ● Stemming and not stemming in different fields can increase relevance of results ● Use copy fields in schema.xml or do it client side
  • 42. How to use it with PHP, part 2
  • 43. Get the data and feed it ● Most *AMP applications have databases ● Map your data to a “document model” ● denormalization, flattening ● most DB fields can be fed unaltered, Solr takes care of the rest ● One constraint: it must be UTF-8!
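The mapping step above (flatten relational data into one document and guarantee UTF-8) might look like the following. This is a minimal sketch under stated assumptions: the column names (`id`, `title`, `body`), the `article-` ID prefix, the `tags` field and both function names are hypothetical, and non-UTF-8 input is assumed to be ISO-8859-1. The conversion path needs the mbstring extension.

```php
<?php
// Hypothetical mapping: flattens one article row and its related tags
// (normally a separate table) into a single document array.
function rowToDocument(array $row, array $tags): array
{
    return [
        'id'    => 'article-' . $row['id'],    // user-defined unique ID
        'title' => toUtf8($row['title']),
        'body'  => toUtf8($row['body']),
        'tags'  => array_map('toUtf8', $tags), // multi-valued field
    ];
}

// Returns the value unchanged if it is valid UTF-8; otherwise converts it,
// assuming the source encoding is ISO-8859-1.
function toUtf8(string $value): string
{
    if (preg_match('//u', $value) === 1) {
        return $value;
    }
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

$doc = rowToDocument(
    ['id' => 42, 'title' => 'Solr and PHP', 'body' => 'Full text search'],
    ['search', 'php']
);
echo $doc['id'], "\n"; // prints: article-42
```

Joining the tags into the document at index time is the denormalization the slide refers to: the index holds everything a query needs, so no join is required at search time.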
  • 44. Snippets (1)
      class eZSolrDoc
      {
          function eZSolrDoc( $boost = false )
          public function setBoost ( $boost = false )
          public function addField ( $name, $content, $boost = false )
          public function docToXML()
      }
      class eZSolr
      {
          public function addDocs ( $docs = array(), $commit = true, $optimize = false, $commitWithin = 0 )
          .....
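The slide only lists method signatures, so the following is an independent sketch of what serialising documents to Solr's XML update format can look like, not the eZ Publish code. The function name `docsToAddXml` is hypothetical; the `<add>`, `<doc>` and `<field name="...">` elements are Solr's standard update message.

```php
<?php
// Sketch of an <add> message builder (not the eZSolrDoc implementation).
// Each document is an array of field name => value, or a list of values
// for multi-valued fields.
function docsToAddXml(array $docs): string
{
    $xml = '<add>';
    foreach ($docs as $doc) {
        $xml .= '<doc>';
        foreach ($doc as $name => $values) {
            // A scalar becomes a one-element list; a list yields repeated fields.
            foreach ((array) $values as $value) {
                $xml .= '<field name="'
                      . htmlspecialchars((string) $name, ENT_QUOTES | ENT_XML1, 'UTF-8')
                      . '">'
                      . htmlspecialchars((string) $value, ENT_NOQUOTES | ENT_XML1, 'UTF-8')
                      . '</field>';
            }
        }
        $xml .= '</doc>';
    }
    return $xml . '</add>';
}

echo docsToAddXml([['id' => '1', 'title' => 'A & B']]), "\n";
// prints: <add><doc><field name="id">1</field><field name="title">A &amp; B</field></doc></add>
```

The resulting string would typically be POSTed to the Solr update handler, followed by a commit, which is what the `$commit` argument of `addDocs` in the snippet suggests.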
  • 45. Searching ● Construct a GET/POST query ● Base parameters ● “q” for query text ● “start” for offset ● “rows” for max number of results to return
  • 46. Searching (2) ● Additional parameters ● response format (wt) ● php = array(), json, ... ● type of search handler (qt) ● highlighting (hl.*) ● facets (f.<fieldName>.<FacetParam>=<value>) ● spellcheck (spellcheck) ● … Example: http://localhost:8983/solr/select/?q=content&version=2.2&start=0&rows=10&indent=on&wt=php
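The base and additional parameters above can be assembled with `http_build_query`. This is a minimal sketch: the function name `buildSelectUrl` is hypothetical, and it defaults to `wt=json` rather than the `wt=php` shown in the slide's example URL, so that the response can be read with `json_decode`.

```php
<?php
// Hypothetical helper: builds a Solr select URL from the base parameters
// (q, start, rows) plus any additional ones (wt, hl, spellcheck, ...).
function buildSelectUrl(
    string $base,
    string $q,
    int $start = 0,
    int $rows = 10,
    array $extra = []
): string {
    $params = array_merge(
        ['q' => $q, 'start' => $start, 'rows' => $rows, 'wt' => 'json'],
        $extra // later keys override the defaults, e.g. ['wt' => 'php']
    );
    return rtrim($base, '/') . '/select?'
         . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

echo buildSelectUrl('http://localhost:8983/solr', 'content', 0, 10, ['hl' => 'true']), "\n";
// prints: http://localhost:8983/solr/select?q=content&start=0&rows=10&wt=json&hl=true
```

Pass flag values as the strings `'true'` and `'false'`: `http_build_query` renders PHP booleans as `1` and `0`.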
  • 47. Searching (3): a utility class
  • 49. Indexing binary files ● Solr 1.4 includes the Apache Tika libraries ● converts almost any format to plain text ● you can activate a dedicated request handler for it OR ● Use it standalone (command line) for integration into existing code See: http://lucene.apache.org/tika/
  • 50. Integrate legacy data ● Use the Solr Data Import Handler ● Able to index databases directly ● define the schema to use (including possible joins) ● fire simple requests to Solr to actually index/update ● Also XML feeds, files (CSV), ...
  • 51. Have multilingual content? ● Multi-core configuration ● Set up a dedicated Solr core per language ● Each has its own schema definitions, while you can still use common field names ● If using one index ● Use dynamic fields and create language specific analyzers for dedicated language suffixes/prefixes
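For the single-index option above, the client has to pick the right language-specific field name when indexing and searching. This is a minimal sketch, assuming a hypothetical suffix convention (`title_fr`, `title_nl`, ...) in which each suffix matches a dynamic field bound to a language-specific analyzer in schema.xml; the function name and the list of languages are made up for illustration.

```php
<?php
// Hypothetical naming convention: "<field>_<lang>", where schema.xml is
// assumed to declare matching dynamic fields such as "*_fr" or "*_nl".
function localizedFieldName(
    string $field,
    string $lang,
    array $supported = ['en', 'fr', 'nl']
): string {
    $lang = strtolower($lang);
    if (!in_array($lang, $supported, true)) {
        throw new InvalidArgumentException("Unsupported language: $lang");
    }
    return $field . '_' . $lang;
}

echo localizedFieldName('title', 'FR'), "\n"; // prints: title_fr
```

The same helper serves both sides: use it when building documents and again when choosing the field to query, so the index-time and query-time analysis match.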
  • 52. Resources ● Solr: wiki, mailing lists, downloads http://lucene.apache.org/solr/ ● Free book, articles (by core Solr devs) http://www.lucidimagination.com/ ● Ask me ;)
  • 53. Thank you! Questions? email: paul dot borgermans at gmail dot com http://twitter.com/paulborgermans