Language support in searching Drupal with SOLR - Drupalcamp London 2013
LANGUAGE SUPPORT IN SEARCHING DRUPAL WITH SOLR
Kalle Varisvirta
SOLR? What’s that and why do I care?
§ SOLR is an open source search platform, optimized for full-text searching, hit highlighting, faceted search and a lot more
§ It is in a different league from Drupal’s internal search; it blows you away when you compare the two
§ Integrates with Drupal in many ways and can be used in many ways – we’re focusing on the actual search functionality
SOLR
§ Since it’s written in Java, it needs a Java-capable web server and ships with one, Jetty
§ Very easy to configure and start, even for a Drupal developer used to drush etc.
§ Integrates for searching through the “Apache SOLR search integration” module, sponsored by Acquia
How does Drupal integrate with SOLR?
§ Basically the module replaces Drupal’s internal search indexing and instead uses a SOLR schema (schema.xml) that ships with the module
§ The schema defines the mandatory node fields in Drupal and uses SOLR’s cool dynamic field definitions to accommodate all your FieldAPI fields
So, what does SOLR do?
§ Obviously it first looks at the type of the field; the behavior differs for different field types
§ For text it does a lot: it makes your text searchable by first processing it in many ways and then indexing it
§ The behavior differs between languages – and we’ll come to that later – but here’s the basic process for a popular example language: English
SOLR processing
§ First it tokenizes the text by whitespace
§ Then it removes the stop words (words not to index, e.g. “and” or “or”)
§ Then it splits words by case change, numerics and a couple of other rules, e.g. “PowerShot” => indexed as “Power” and “Shot”
§ Then it lowercases everything
§ Then it stems the words, reducing inflected words to their stems, e.g. “stemming” => “stem”
§ Then it removes duplicate tokens
SOLR processing
FreeAir X500 Wireless Router is a powerful wireless solution well suited for the home or office.
SOLR processing
Separated by whitespace.
FreeAir X500 Wireless Router is a powerful wireless solution well suited for the home or office
SOLR processing
Stop words removed.
FreeAir X500 Wireless Router powerful wireless solution suited home office
SOLR processing
Words split, but not FreeAir, since it’s on the protected words list.
FreeAir X 500 Wireless Router powerful wireless solution suited home office
SOLR processing
Everything in lowercase.
freeair x 500 wireless router powerful wireless solution suited home office
SOLR processing
Stemmed.
freeair x 500 wireless router power wireless solut suit home offic
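The walkthrough above can be approximated in a few lines of Python. This is only a toy sketch, not SOLR's implementation: the stop word set, the protected word list, the splitting regex and the function names are all assumptions made for illustration. Stemming is left out on purpose, because a real stemmer (e.g. Snowball) is far more involved than a sketch can show.

```python
import re

# Hypothetical word lists for illustration; in SOLR these live in config files.
STOP_WORDS = {"is", "a", "for", "the", "or", "well"}
PROTECTED_WORDS = {"FreeAir"}


def split_word(token):
    """Split on case changes and letter/digit boundaries, unless protected."""
    if token in PROTECTED_WORDS:
        return [token]
    # Digit runs, a capital followed by lowercase letters, capital runs,
    # or lowercase runs. ASCII only, which is enough for this English example.
    return re.findall(r"[0-9]+|[A-Z][a-z]+|[A-Z]+|[a-z]+", token)


def analyze(text):
    """Toy analysis chain: tokenize, remove stop words, split, lowercase."""
    tokens = [t.strip(".,;:!?") for t in text.split()]  # whitespace tokenizer
    tokens = [t for t in tokens if t and t.lower() not in STOP_WORDS]
    tokens = [part for t in tokens for part in split_word(t)]
    return [t.lower() for t in tokens]
```

Calling `analyze()` on the example sentence yields the same tokens as the "Everything in lowercase" slide; the stemming step would then reduce e.g. "powerful" to "power".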
Searching from SOLR
§ Now when you search from SOLR, it applies parts of the same magic to your query text
§ This way you’ll match the indexed document even if you wrote it a bit differently
§ “Office capable wireless routers” will match our indexed document just nicely: not on every word, but on enough words close enough to each other that it’ll be a good match and rank high on SOLR’s relevance score
Apache SOLR integration
§ All the special configuration you need for SOLR to run a site (in English) ships with the Apache SOLR search integration module
§ Just copy the files to SOLR and you’re good to go
§ The rest of the presentation presumes you’re using this module to connect to SOLR; if you’re using Search API Solr search, you’re out of luck and will have to do a lot more manual work, check out http://drupal.org/node/1210810
SO, MY SOLR SEARCH WORKS WELL WITH MY ENGLISH CONTENT
But, then, this is Europe
We do use a lot of other languages here too… and then, things get a bit more complicated
SOLR schema has to be language-aware
§ Stemming, stop words, compound words and such are all language dependent
§ The main SOLR indexing and querying configuration, schema.xml, needs to be language specific
§ Schema.xml is a long, complicated XML document and any error in it will prevent SOLR from starting
Here’s an example schema.xml
<?xml version="1.0" encoding="UTF-8"?>
<!--
 This is the Solr schema file. This file should be named "schema.xml" and
 should be in the conf directory under the solr home
 (i.e. ./solr/conf/schema.xml by default)
 or located where the classloader for the Solr webapp can find it.

 For more information, on how to customize this file, please see
 http://wiki.apache.org/solr/SchemaXml
-->
<schema name="drupal-3.0-0-solr3" version="1.3">
  <!-- attribute "name" is the name of this schema and is only used for display purposes.
       Applications should change this to reflect the nature of the search collection.
       version="1.2" is Solr's version number for the schema syntax and semantics. It should
       not normally be changed by applications.
       1.0: multiValued attribute did not exist, all fields are multiValued by nature
       1.1: multiValued attribute introduced, false by default
       1.2: omitTermFreqAndPositions attribute introduced, true by default except for text fields.
       1.3: removed optional field compress feature
  -->
  <types>
    <!-- field type definitions. The "name" attribute is
         just a label to be used by field definitions. The "class"
         attribute and any other attributes determine the real
         behavior of the fieldType.
         Class names starting with "solr" refer to java classes in the
         org.apache.solr.analysis package.
    -->

    <!-- The StrField type is not analyzed, but indexed/stored verbatim.
         - StrField and TextField support an optional compressThreshold which
         limits compression (if enabled in the derived fields) to values which
         exceed a certain size (in characters).
    -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
There’s help available
§ There are two modules on Drupal.org to make your life easier, Apache SOLR Multilingual and Apache SOLR config generator
§ Combined, they will enable you to
  § Have a multi-language site with SOLR search optimized for each language
  § Generate configuration for such a multi-language site, or even for a site with one non-English language
Apache SOLR multilingual
§ Apache SOLR multilingual will separate the Drupal node fields per language and store them in SOLR in different fields
§ That way you can have a different configuration for the same Drupal field in different languages
§ It’ll handle the spell checking too
§ Apache SOLR config generator will then generate a suitable starting point for your SOLR configuration files
… but it doesn’t do everything
§ It ships with stop word lists for the most common languages, the ISO Latin mapping list for German (the module author speaks German) and some other files
§ Most of the language-specific word lists, such as protwords (usually site-specific anyway), ISO mappings, synonyms and compound word lists, you’ll have to provide yourself
§ Some languages need a different stemmer to work properly; the configuration generator uses SnowballPorterFilterFactory
Stop words
§ All languages need a stop word list; these are the “and, or, then” words you don’t index at all
§ Needless to say, they are language specific
§ Luckily you’ll find most of them either in the Apache SOLR multilingual module or somewhere online
ISO mapping
§ This covers the special letters in some languages and how to convert them for better matching
§ This is usually done for accents and such, which guide the pronunciation of the word and don’t change its meaning (e.g. café => cafe, in both indexing and querying)
§ Umlauts and similar letters (ä, ö, å) do change the meaning and are usually NOT replaced
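The folding rule described here can be sketched in Python. SOLR itself works from a mapping file; the sketch below swaps in Unicode decomposition to get a similar effect, and the `KEEP` set and the function name are assumptions for illustration only.

```python
import unicodedata

# Letters that change the meaning of a word and must survive (assumed list):
# ä ö å Ä Ö Å, written as escapes so the file encoding cannot alter them.
KEEP = set("\u00e4\u00f6\u00e5\u00c4\u00d6\u00c5")


def fold_accents(text):
    """Strip combining accents from every character except those in KEEP."""
    text = unicodedata.normalize("NFC", text)  # make precomposed letters comparable
    out = []
    for ch in text:
        if ch in KEEP:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)
```

The same function has to run on both the indexed text and the query text, otherwise "café" and "cafe" would still not meet.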
Protected words
§ As stated earlier, protected words are the words you don’t want the indexer to split or stem
§ Usually trademarks, product names and such
§ These are usually site-specific, for obvious reasons
§ This also means you’ll have to write this list yourself, though it’s usually not a long one
Synonyms
§ Synonyms are good if you want to make sure your results are found even if the users don’t use the same word
§ Also language specific and not easy to find for smaller languages
§ Here’s an example:
GB,gib,gigabyte,gigabytes
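To show what such a line does, here is a minimal Python sketch that reads comma-separated synonym groups and expands tokens with them. It only handles the simple "all terms are equivalent" form shown above; the function names and the choice to lowercase everything are assumptions, not SOLR's own behavior.

```python
def parse_synonyms(lines):
    """Map every term to the full group of equivalent terms it belongs to."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        group = [term.strip().lower() for term in line.split(",") if term.strip()]
        for term in group:
            table.setdefault(term, set()).update(group)
    return table


def expand(tokens, table):
    """Replace each token with itself plus its synonyms, sorted for stable output."""
    return [sorted(table.get(t, {t})) for t in tokens]
```

With the example line loaded, a search for "gb" is expanded to all four terms, so a document that only says "gigabytes" is still found.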
Compound words
§ There’s also a file for splitting up compound words
§ For a lot of languages you don’t even need it, and for most only a small one is needed
§ But then there are some languages where you can’t do without one, like German or Finnish
§ Let’s look at an example
Compound words example
§ We did a Drupal site that is about food recipes
§ In English, searching for ‘soup’ would result in all the soups
  § Oxtail soup
  § Lentil soup
  § Goulash soup
  § Tomato soup
  … and so on
Compound words example
§ By searching for soup in Finnish, ‘keitto’, you’d normally get none of the following:
  § Häränhäntäkeitto
  § Linssikeitto
  § Gulassikeitto
  § Tomaattikeitto
  … see why?
Compound words
§ See, SOLR doesn’t do infix indexes, which means it doesn’t find words “within” other words*
§ So you’ll have to cut compound words apart to be able to reach the words inside them
* There is a way to do infix indexes in SOLR, but it’s so complicated that it’s not even funny. You’ll have to have two indexes, one the normal way and one in reverse, and then reverse the query to search from the reverse index.
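The idea of cutting compound words apart with a word list can be sketched as follows. This is a simplified illustration, similar in spirit to a dictionary-based compound word filter; the dictionary contents, the `min_part` threshold and the function name are made up for the example.

```python
# Hypothetical dictionary of word parts; a real list would be much longer.
DICTIONARY = {"keitto", "linssi", "tomaatti", "gulassi"}


def decompound(token, dictionary=DICTIONARY, min_part=4):
    """Return the token itself plus every dictionary word found inside it."""
    token = token.lower()
    parts = [token]
    for start in range(len(token)):
        for end in range(start + min_part, len(token) + 1):
            piece = token[start:end]
            if piece in dictionary and piece != token:
                parts.append(piece)
    return parts
```

Indexing "Tomaattikeitto" this way stores "tomaattikeitto", "tomaatti" and "keitto", so a search for ‘keitto’ now finds the recipe. The quality of the result depends entirely on the word list, which is why you need one per language.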
Some special languages
§ Chinese, Japanese and Korean have their own, different approach to indexing with SOLR: basically you don’t have to stem, only cut the words out of the sentences (whitespace doesn’t work like in the European languages)
§ For some languages, you can’t even find the basic stuff (try Mongolian for instance)
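Cutting real words out of a sentence needs a dictionary-based segmenter, which is beyond a short sketch. A commonly used simpler alternative is to index overlapping character pairs (bigrams) instead; the Python below illustrates that swapped-in technique only, and the function name is an assumption.

```python
def bigrams(text):
    """Return overlapping two-character tokens for each run without whitespace."""
    tokens = []
    for run in text.split():
        if len(run) < 2:
            tokens.append(run)  # a single character is kept as-is
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens
```

Because the query is cut into the same pairs, a two-character word inside a longer unspaced sentence can still be matched without knowing where the word boundaries are.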
Multilingual SOLR search
§ After adding all those word lists and retuning your search according to the examples in SOLR’s wiki and example configurations, you’ll have a working multi-language SOLR search
§ Let native speakers of each language use it and you’ll have some more tuning to do and words to add to those lists
§ Eventually your site will be the benchmark for functional searching – working multi-language searches are that rare
Apache SOLR integration
§ Apache SOLR integration is a module for integrating your search with SOLR from Drupal
§ It works well for English, even better if you tune the SOLR configuration a bit
§ Apache SOLR multilingual and config generator enable you to index content in multiple languages
§ If you’re using Search API Solr search, you’re in for a lot of manual labor
Apache SOLR multilingual
§ But you need to tune your settings by hand and you need the word lists
§ Word lists for stop words are easy to find for common languages
§ Other word lists you can only find for really popular languages
§ Protected words you’ll have to craft yourself