Language support in searching Drupal with SOLR - Drupalcamp London 2013

LANGUAGE SUPPORT IN
SEARCHING DRUPAL WITH
SOLR

Kalle Varisvirta

SOLR? What’s that and
why do I care?
§  SOLR is an open source search platform,
optimized for full-text searching, hit highlighting,
faceted search and a lot more
§  Not comparable to Drupal’s internal search; it
blows you away when you compare the two
§  Integrates with Drupal in many ways and can be
used in many ways – we’re focusing on the
actual search functionality

SOLR
§  Since it’s Java, it needs a Java-capable web
server and ships with one, Jetty
§  Very easy to configure and start, even for a
Drupal developer used to drush etc.
§  Integrates for searching through the “Apache SOLR
search integration” module, sponsored by
Acquia

How does Drupal integrate
with SOLR?
§  Basically the module replaces Drupal’s internal
search indexing and instead uses a SOLR
schema (schema.xml) that ships with the
module
§  It defines the mandatory node fields in Drupal
and uses SOLR’s cool dynamic field definitions
to accommodate all your FieldAPI fields

So, what does SOLR do?
§  First it looks at the type of the field; the
behavior differs for different field types
§  For text it does a lot: it makes your text
searchable by first processing it in many ways
and then indexing it
§  The behavior differs in different languages – and
we’ll come to that later – but here’s the basic
process for a popular language example:
English

SOLR processing
§  First it tokenizes the text by whitespace
§  Then it removes the stop words (words not to
index, e.g. “and” or “or”)
§  Then it splits words by case change, numerics
and a couple of other rules, e.g. “PowerShot”
=> indexed as “Power” and “Shot”
§  Then it lowercases everything
§  Then it stems the words, reducing inflected
words to their stems, e.g. “stemming” => “stem”
§  Then it removes duplicate tokens (the same
token at the same position)

SOLR processing

FreeAir X500 Wireless Router is a powerful wireless solution
well suited for the home or office.

SOLR processing

Separated by whitespace.

FreeAir X500 Wireless Router is a powerful wireless solution
well suited for the home or office

SOLR processing

Stop words removed.

FreeAir X500 Wireless Router powerful wireless solution
suited home office

SOLR processing

Words split, but not FreeAir, since it’s on the protected words list.

FreeAir X 500 Wireless Router powerful wireless solution
suited home office

SOLR processing

Everything in lowercase.

freeair x 500 wireless router powerful wireless solution
suited home office

SOLR processing

Stemmed.

freeair x 500 wireless router power wireless solut
suit home offic
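
The walk-through above can be sketched in a few lines of Python. This is a rough illustration, not SOLR's actual analyzer chain: the stop word list, the protected word list and the `toy_stem` suffix stripper are all assumptions made up for this example. A real Snowball stemmer does more; it would also turn "office" into "offic", which this toy version leaves alone.

```python
import re

# Assumed word lists for this example only; real ones live in SOLR's config files.
STOP_WORDS = {"is", "a", "well", "for", "the", "or", "and"}
PROTECTED_WORDS = {"FreeAir"}


def split_word(token):
    """Split on case changes and letter/digit boundaries, e.g. X500 -> X, 500."""
    if token in PROTECTED_WORDS:
        return [token]
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", token)


def toy_stem(token):
    """Hypothetical stand-in for a stemmer: strips a few English suffixes."""
    for suffix in ("ful", "ion", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    tokens = text.split()                                           # whitespace
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]     # stop words
    tokens = [part for t in tokens for part in split_word(t)]       # word splitting
    tokens = [t.lower() for t in tokens]                            # lowercase
    return [toy_stem(t) for t in tokens]                            # stemming


print(" ".join(analyze(
    "FreeAir X500 Wireless Router is a powerful wireless solution "
    "well suited for the home or office."
)))
```

Note that the word splitter here also drops the trailing period, since it only keeps runs of letters and digits.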

Searching from SOLR
§  Now when you search with SOLR, it applies parts
of the same processing to your query text
§  This way you’ll match the indexed document
even if you wrote it a bit differently
§  “Office capable wireless routers” will match our
indexed document nicely: not on every
word, but on enough words close to each other
that it’ll be a good match and rank high on
SOLR’s relevance score
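
Why the query still matches can be shown with a small sketch: run the same analysis on the document and on the query, then look at the terms they share. Everything here is an assumption for illustration (the stop word list, the `toy_stem` rules, the `matching_terms` helper), and it ignores what SOLR really does for ranking, such as term weighting and proximity.

```python
import re

STOP_WORDS = {"is", "a", "well", "for", "the", "or", "and"}  # assumed list


def toy_stem(token):
    """Hypothetical stemmer: drops a plural "s" and a few English suffixes."""
    if token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    for suffix in ("ful", "ion", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(w) for w in words if w not in STOP_WORDS]


def matching_terms(query, document):
    """Return the analyzed terms the query and the document have in common."""
    return sorted(set(analyze(query)) & set(analyze(document)))


DOCUMENT = ("FreeAir X500 Wireless Router is a powerful wireless solution "
            "well suited for the home or office.")

print(matching_terms("Office capable wireless routers", DOCUMENT))
```

Three of the four query terms match; "capable" does not, which is the "not on every word" part.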

Apache SOLR integration
§  All the special configuration you need for SOLR to
run a site (in English) ships with the Apache
SOLR search integration module
§  Just copy the files to SOLR and you’re good to go
§  The rest of the presentation presumes you’re
using this module to connect to SOLR. If you’re
using Search API Solr search, you’re out of luck and
will have to do a lot more manual work;
check out http://drupal.org/node/1210810

SO, MY SOLR SEARCH
WORKS WELL WITH MY
ENGLISH CONTENT

But, then, this is Europe

We do use a lot of other languages here too
… and then, things get a bit more complicated

SOLR schema has to be
language-aware
§  Stemming, stop words, compound words and
such are all language dependent
§  The main SOLR indexing and querying
configuration, schema.xml, needs to be
language specific
§  Schema.xml is a long, complicated XML
document and any error in it will prevent SOLR
from starting
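
Since a single XML error stops SOLR from starting, it can pay off to check the file before deploying it. The sketch below only tests that the XML is well-formed, using Python's standard library. It knows nothing about SOLR itself, so a file that passes can still be rejected for other reasons (an unknown class name, for instance). The `check_well_formed` helper and the sample strings are made up for this example.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if the XML parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text.encode("utf-8"))
    except ET.ParseError as err:
        return str(err)
    return None


# Assumed minimal samples, not a real Drupal schema.xml.
GOOD = ('<schema name="example" version="1.3"><types>'
        '<fieldType name="string" class="solr.StrField"/>'
        '</types></schema>')
BROKEN = ('<schema name="example" version="1.3"><types>'
          '<fieldType name="string" class="solr.StrField">'
          '</types></schema>')  # fieldType is never closed

print(check_well_formed(GOOD))
print(check_well_formed(BROKEN))
```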

Here’s an example
schema.xml
<?xml version="1.0" encoding="UTF-8"?>
<schema name="drupal-3.0-0-solr3" version="1.3">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

IN COMES THE APACHE
SOLR MULTILINGUAL

There’s help available
§  There are two modules on Drupal.org to make
your life easier, Apache SOLR Multilingual and
Apache SOLR config generator
§  Combined, they enable you to
§  Have a multi-language site with SOLR search
optimized for each language
§  Generate configuration for such a multi-language site,
or even a site with one non-English language

Apache SOLR multilingual
§  Apache SOLR multilingual will separate the Drupal
node fields per language and store them in SOLR
in different fields
§  That way you can have a different configuration
for the same Drupal field in different languages
§  It’ll handle the spell checking too
§  Apache SOLR config generator will then generate
a suitable starting point for your SOLR
configuration files
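
The idea of separate fields per language can be pictured like this. The field names (`title_fi`, `body_fi` and so on) and the `to_solr_document` helper are purely hypothetical; the module's real naming scheme may well differ. The point is only that the same Drupal field ends up under a language-specific name, so each name can get its own analysis settings.

```python
def to_solr_document(node):
    """Copy text fields into language-suffixed names, e.g. title -> title_fi."""
    lang = node["language"]
    doc = {"id": node["id"], "language": lang}
    for name in ("title", "body"):
        doc[f"{name}_{lang}"] = node[name]
    return doc


# Assumed sample nodes for illustration.
finnish = {"id": 1, "language": "fi", "title": "Tomaattikeitto", "body": "Helppo keitto"}
english = {"id": 2, "language": "en", "title": "Tomato soup", "body": "An easy soup"}

print(to_solr_document(finnish))
print(to_solr_document(english))
```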

… but it doesn’t do
everything
§  It ships with stop word lists for the most common
languages, the ISO Latin mapping list for German
(the module author speaks German) and some
other files
§  Most of the language-specific word lists, such
as protwords (usually site-specific anyway), ISO
mappings, synonyms and compound word lists,
you’ll have to provide yourself
§  Some languages need a different stemmer to work
properly; the configuration generator uses
SnowballPorterFilterFactory

Stop words
§  Every language needs a stop word list; these
are the “and, or, then” words you don’t index at
all
§  Needless to say, they are language specific
§  Luckily you’ll find most of them either in the
Apache SOLR multilingual module or
somewhere online

ISO mapping
§  This covers the special letters in some languages
and how to convert them for better matching
§  This is usually done for accents and such, which
guide the pronunciation of the word and
don’t change the meaning (e.g. café => cafe,
in both indexing and querying)
§  Letters such as ä, ö and å do change the meaning and
are usually NOT replaced
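
A minimal sketch of such a mapping, assuming a small hand-written table (`ACCENT_MAP` is an invented excerpt, not the contents of any shipped mapping file). Accented letters in the table are folded to their plain form; ä, ö and å are simply not in the table, so they pass through untouched.

```python
import unicodedata

# Assumed excerpt of a mapping: accented letter -> plain letter.
# ä, ö and å are deliberately absent, so they are left as they are.
ACCENT_MAP = {"é": "e", "è": "e", "ê": "e", "á": "a", "à": "a", "ç": "c", "ñ": "n"}


def fold_accents(text):
    """Apply the mapping character by character, at index time and query time."""
    text = unicodedata.normalize("NFC", text)  # make sure accents are single characters
    return "".join(ACCENT_MAP.get(ch, ch) for ch in text)


print(fold_accents("café"))
print(fold_accents("häränhäntäkeitto"))
```

Because the same function runs on both the indexed text and the query, "cafe" and "café" end up as the same term.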

Protected words
§  As stated earlier, protected words are the words
you don’t want the indexer to deform
§  Usually trademarks, product names and such
§  These are usually site-specific – for obvious
reasons
§  This also means you’ll have to be writing this list
yourself – not a long list usually though

Synonyms
§  Synonyms are good if you want to make sure
your results are found even if the users don’t
use the same word
§  Also language specific and not easy to find for
smaller languages
§  Here’s an example:
GB,gib,gigabyte,gigabytes
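
How a line like that gets used can be sketched as follows. The `load_synonyms` and `expand` helpers are invented for this example and only handle the simple comma-separated form shown above, where every term in a line is treated as equivalent to the others.

```python
def load_synonyms(lines):
    """Build a lookup table: each term maps to its whole synonym group."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        group = [term.strip().lower() for term in line.split(",")]
        for term in group:
            table[term] = group
    return table


def expand(token, table):
    """Return every term that should match this token."""
    token = token.lower()
    return table.get(token, [token])


table = load_synonyms(["GB,gib,gigabyte,gigabytes"])
print(expand("gigabyte", table))
print(expand("router", table))
```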

Compound words
§  There’s also a file to split up compound words
§  For a lot of languages you don’t even need it
and for most only a small one is needed
§  But then there are some languages where you can’t do
without one, like German or Finnish
§  Let’s look at an example

Compound words
example
§  We built a Drupal site about food recipes
§  In English, searching for ‘soup’ would return all
the soups
§  Oxtail soup
§  Lentil soup
§  Goulash soup
§  Tomato soup
… and so on

Compound words
example
§  Searching for soup in Finnish, ‘keitto’, you’d
normally get none of the following:
§  Häränhäntäkeitto
§  Linssikeitto
§  Gulassikeitto
§  Tomaattikeitto
… see why?

Compound words
§  See, SOLR doesn’t do infix indexes; that means
it doesn’t find words “within” other words*
§  So you’ll have to cut compound words into their
parts to be able to search for those parts

* There is a way to do infix indexes in SOLR, but that’s so complicated that it’s not even
funny. You’ll have to have two indexes, one the normal way and one in reverse and
then reverse the query to search from the reverse index.
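
Cutting compound words with a word list can be sketched roughly like this. The `COMPOUND_PARTS` list and the `decompound` helper are assumptions for illustration: the token is kept as it is, and every listed word found inside it is added as an extra token, so a search for ‘keitto’ can hit ‘Linssikeitto’.

```python
# Assumed excerpt of a compound word list for the recipe site.
COMPOUND_PARTS = ["keitto", "linssi", "tomaatti", "gulassi"]


def decompound(token):
    """Return the token plus every listed word that occurs inside it."""
    token = token.lower()
    parts = [part for part in COMPOUND_PARTS if part in token and part != token]
    return [token] + parts


for word in ("Linssikeitto", "Tomaattikeitto", "Gulassikeitto"):
    print(decompound(word))
```

A plain substring check like this is crude and will also fire on accidental matches inside unrelated words, which is one reason the word list needs tuning by native speakers.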

Some special languages
§  Chinese, Japanese and Korean have their own
approach to indexing with SOLR:
basically you don’t have to stem, only cut the
words out of the sentences (whitespace doesn’t
work like in the European languages)
§  For some languages, you can’t even find the
basic stuff (try Mongolian for instance)
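
One common way to handle text without whitespace between words is to index overlapping pairs of characters (bigrams) instead of trying to find real word boundaries. The sketch below shows that technique only; it is a swapped-in simplification, not necessarily what any given SOLR configuration does, and the `bigrams` helper is made up for this example.

```python
def bigrams(text):
    """Return overlapping two-character tokens, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


print(bigrams("東京都"))
```

The query gets the same treatment, so a two-character query matches any text containing that pair.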

Multilingual SOLR search
§  After adding all those word lists and tuning your
search according to the examples in SOLR’s wiki and
example configurations, you’ll have a working multi-
language SOLR search
§  Let native speakers of each language use it and you’ll
have some more tuning to do and words to add to
those lists
§  Eventually your site will be the benchmark for
functional searching – working multi-language
searches are that rare

Apache SOLR integration
§  Apache SOLR integration is a module for
integrating your search with SOLR from Drupal
§  It works well for English, even better if you tune
the SOLR configuration a bit
§  Apache SOLR multilingual and config generator
enable you to index content in multiple languages
§  If you’re using Search API Solr search, you’re in for
a lot of manual labor

Apache SOLR multilingual
§  But you need to tune your settings by hand and
you need the word lists
§  Word lists for stop words are easy to find for
common languages
§  Other word lists you can only find for really
popular languages
§  Protected words you’ll have to craft up yourself

THANK YOU FOR YOUR
TIME. QUESTIONS?

Better Business on the Internet
