LANGUAGE SUPPORT IN
SEARCHING DRUPAL WITH
SOLR

Kalle Varisvirta
SOLR? What’s that and
why do I care?
§  SOLR is an open source search platform,
    optimized for full-text searching, hit highlighting,
    faceted search and a lot more
§  Far beyond Drupal’s internal search; it
    blows you away when you compare the two
§  Integrates with Drupal in many ways and can be
    used in many ways; we’re focusing on the
    actual search functionality
SOLR
§  Since it’s Java, it needs a Java-capable web
    server and ships with one, Jetty
§  Very easy to configure and start, even for a
    Drupal developer used to drush etc.
§  Integrates for searching through the “Apache SOLR
    search integration” module, sponsored by
    Acquia
How does Drupal integrate
with SOLR?
§  Basically the module replaces Drupal’s internal
    search indexing and instead uses a SOLR
    schema (schema.xml) that ships with the
    module
§  It defines the mandatory node fields in Drupal
    and uses SOLR’s cool dynamic field definitions
    to accommodate all your FieldAPI fields
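
As a rough illustration of how dynamic field definitions work, the sketch below matches field names against wildcard patterns. The patterns and type names (`ts_*`, `ss_*`, `is_*`) and the helper `resolve_field_type` are assumptions chosen for illustration; the schema.xml that ships with the module defines its own.

```python
from fnmatch import fnmatchcase

# Hypothetical dynamic field patterns; the real schema.xml defines its own.
DYNAMIC_FIELDS = [
    ("ts_*", "text"),
    ("ss_*", "string"),
    ("is_*", "integer"),
]

def resolve_field_type(field_name, dynamic_fields=DYNAMIC_FIELDS):
    """Return the type of the first pattern matching field_name, or None."""
    for pattern, field_type in dynamic_fields:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None

print(resolve_field_type("ts_field_summary"))  # text
print(resolve_field_type("is_field_price"))    # integer
print(resolve_field_type("label"))             # None
```

The point is that a new FieldAPI field needs no schema change as long as its name matches one of the patterns.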
So, what does SOLR do?
§  First it looks at the type of the field; the
    behavior differs for different field types
§  For text it does a lot: it makes your text
    searchable by first processing it in many ways
    and then indexing it
§  The behavior differs between languages (we’ll
    come to that later), but here’s the basic
    process for a popular example language:
    English
SOLR processing
§  First it tokenizes the text by whitespace
§  Then it removes the stop words (words not to
    index, e.g. “and” or “or”)
§  Then it splits words by case change, numerics
    and a couple of other rules, e.g. “PowerShot”
    => indexed as “Power” and “Shot”
§  Then it lowercases everything
§  Then it stems the words, reducing inflected
    words to their stems, e.g. “stemming” => “stem”
§  Then it removes duplicate tokens
SOLR processing


  FreeAir X500 Wireless Router is a powerful wireless solution
  well suited for the home or office.
SOLR processing

 Separated by whitespace.

   FreeAir X500 Wireless Router is a powerful wireless solution
   well suited for the home or office
SOLR processing

 Stop words removed.

   FreeAir X500 Wireless Router powerful wireless solution
   suited home office
SOLR processing

 Words split, but not FreeAir, since it’s on the protected words list.

    FreeAir X 500 Wireless Router powerful wireless solution
    suited home office
SOLR processing

 Everything in lowercase.

   freeair x 500 wireless router powerful wireless solution
   suited home office
SOLR processing

 Stemmed.

   freeair x 500 wireless router power wireless solut 
   suit home offic
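
The walk-through above can be approximated in a few lines of Python. This is only a sketch: the stop word list, the protected word list and the suffix rules are assumptions made up for this example, and the toy `stem` function stands in for a real stemmer. The duplicate-removal step is left out.

```python
import re

# Illustrative word lists; a real site supplies its own stop and protected words.
STOP_WORDS = {"a", "and", "for", "is", "or", "the", "well"}
PROTECTED = {"FreeAir"}
# Toy suffix rules standing in for a real stemmer.
SUFFIXES = ("ful", "ion", "ed", "e")

def split_word(token):
    """Split on case changes and letter/digit boundaries, unless protected."""
    if token in PROTECTED:
        return [token]
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", token)

def stem(token):
    """Strip the first matching suffix if at least four characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = text.split()                                        # whitespace
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]  # stop words
    tokens = [part for t in tokens for part in split_word(t)]    # word split
    tokens = [t.lower() for t in tokens]                         # lowercase
    return [stem(t) for t in tokens]                             # stemming

text = ("FreeAir X500 Wireless Router is a powerful wireless solution "
        "well suited for the home or office.")
print(" ".join(analyze(text)))
# freeair x 500 wireless router power wireless solut suit home offic
```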
Searching from SOLR
§  Now when you search with SOLR, it applies parts
    of the same magic to your query text
§  This way you’ll match the indexed document
    even if you wrote it a bit differently
§  “Office capable wireless routers” will match our
    indexed document nicely: not on every
    word, but on enough words close to each other
    that it’s a good match and ranks high on
    SOLR’s relevance score
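
To see why the example query matches, here is a small self-contained sketch that runs the same simplified analysis over both the document and the query and compares the resulting terms. The stop words and suffix rules are illustrative assumptions, and the sketch does not model term proximity or SOLR's relevance scoring.

```python
import re

STOP_WORDS = {"a", "and", "for", "is", "or", "the", "well"}  # illustrative
SUFFIXES = ("ful", "ion", "ed", "s", "e")                    # toy stemmer rules

def stem(token):
    """Strip the first matching suffix if at least four characters remain."""
    for suffix in SUFFIXES:
        if suffix == "s" and token.endswith("ss"):
            continue  # keep words like "wireless" intact
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Lowercase, split into letter/digit runs, drop stop words, stem."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

document = ("FreeAir X500 Wireless Router is a powerful wireless solution "
            "well suited for the home or office.")
query = "Office capable wireless routers"

indexed_terms = set(analyze(document))
query_terms = analyze(query)
matched = [t for t in query_terms if t in indexed_terms]
print(query_terms)  # ['offic', 'capabl', 'wireless', 'router']
print(matched)      # ['offic', 'wireless', 'router']
```

Three of the four query terms hit the document even though the query says "Office" and "routers" where the document says "office" and "Router".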
Apache SOLR integration
§  All the special configuration you need for SOLR to
    run a site (in English) ships with the Apache
    SOLR search integration module
§  Just copy the files to SOLR and you’re good to go
§  The rest of the presentation presumes you’re
    using this module to connect to SOLR; if you’re
    using Search API Solr search, you’re out of luck and
    will have to do a lot more manual work,
    check out http://drupal.org/node/1210810
SO, MY SOLR SEARCH
WORKS WELL WITH MY
ENGLISH CONTENT
But, then, this is Europe


We do use a lot of other languages here too
… and then, things get a bit more complicated
SOLR schema has to be
language-aware
§  Stemming, stop words, compound words and
    such are all language dependent
§  The main SOLR indexing and querying
    configuration, schema.xml, needs to be
    language specific
§  Schema.xml is a long, complicated XML
    document and any error in it will prevent SOLR
    from starting
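
Since a broken schema.xml stops SOLR from starting, it can help to check that the file is at least well-formed XML before copying it over. The sketch below uses Python's standard library; the helper name `is_well_formed` is made up for this example, and the check only covers XML syntax, not whether SOLR accepts the field types and fields defined inside.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<schema><types></types></schema>"))  # True
print(is_well_formed("<schema><types></schema>"))          # False
```

For a file on disk, `ET.parse(path)` raises the same `ParseError` on malformed input.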
Here’s an example
schema.xml
<?xml version="1.0" encoding="UTF-8"?>
<!--
    This is the Solr schema file. This file should be named "schema.xml" and
    should be in the conf directory under the solr home
    (i.e. ./solr/conf/schema.xml by default)
    or located where the classloader for the Solr webapp can find it.

    For more information, on how to customize this file, please see
    http://wiki.apache.org/solr/SchemaXml
-->
<schema name="drupal-3.0-0-solr3" version="1.3">
    <!-- attribute "name" is the name of this schema and is only used for display purposes.
         Applications should change this to reflect the nature of the search collection.
         version="1.2" is Solr's version number for the schema syntax and semantics.  It should
         not normally be changed by applications.
         1.0: multiValued attribute did not exist, all fields are multiValued by nature
         1.1: multiValued attribute introduced, false by default
         1.2: omitTermFreqAndPositions attribute introduced, true by default except for text fields.
         1.3: removed optional field compress feature
    -->
    <types>
        <!-- field type definitions. The "name" attribute is
             just a label to be used by field definitions.  The "class"
             attribute and any other attributes determine the real
             behavior of the fieldType.
             Class names starting with "solr" refer to java classes in the
             org.apache.solr.analysis package.
        -->

        <!-- The StrField type is not analyzed, but indexed/stored verbatim.
             - StrField and TextField support an optional compressThreshold which
             limits compression (if enabled in the derived fields) to values which
             exceed a certain size (in characters).
        -->
        <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
IN COMES THE APACHE
SOLR MULTILINGUAL
There’s help available
§  There are two modules on Drupal.org to make
    your life easier, Apache SOLR Multilingual and
    Apache SOLR config generator
§  Combined, they enable you to
  §  Have a multi-language site with SOLR search
      optimized for each language
  §  Generate configuration for such a multi-language site,
      or even a site with one non-English language
Apache SOLR multilingual
§  Apache SOLR multilingual will separate the Drupal
    node fields per language and store them into SOLR
    in different fields
§  That way you can have different configuration setup
    for the same Drupal field in different languages
§  It’ll handle the spell checking too
§  Apache SOLR config generator will then generate
    a suitable starting point for your SOLR
    configuration files
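
The idea of storing each field per language can be sketched as follows. The naming scheme `<field>_<language code>` and the helper `language_fields` are assumptions for illustration only; the field names the module actually writes to SOLR may look different.

```python
def language_fields(node, base_fields=("title", "body")):
    """Copy each base field into a field name suffixed with the node language."""
    lang = node["language"]
    return {f"{name}_{lang}": node[name] for name in base_fields if name in node}

node = {"language": "fi", "title": "Tomaattikeitto", "body": "Helppo keitto."}
print(language_fields(node))
# {'title_fi': 'Tomaattikeitto', 'body_fi': 'Helppo keitto.'}
```

Because Finnish text ends up in its own fields, those fields can be given a Finnish stemmer and stop word list without affecting the English ones.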
… but it doesn’t do
everything
§  It ships with the stop word lists for the most common
    languages, the ISO Latin mapping list for German
    (the module author speaks German) and some
    other files
§  Most of the language-specific word lists, such
    as protwords (usually site-specific anyway), ISO
    mappings, synonyms and compound word lists,
    you’ll have to provide yourself
§  Some languages need a different stemmer to work
    properly; the configuration generator uses
    SnowballPorterFilterFactory
Stop words
§  Every language needs a stop word list; these
    are the “and, or, then” words you don’t index at
    all
§  Needless to say, they are language specific
§  Luckily you’ll find most of them either in the
    Apache SOLR multilingual module or
    somewhere online
ISO mapping
§  This covers the special letters in some languages
    and how to convert them for better matching
§  This is usually done for accents and such, which
    guide the pronunciation of the word and
    don’t change the meaning (e.g. café => cafe,
    in both indexing and querying)
§  Umlauts and similar letters (ä, ö, å) do change the meaning and
    usually are NOT replaced
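
A minimal sketch of this kind of mapping, using Unicode decomposition instead of a hand-written mapping file: accents are stripped, but the letters listed in `KEEP` are left alone. The `KEEP` set and the function name are assumptions for this example; which letters to preserve depends on the language.

```python
import unicodedata

# Letters that change the meaning of a word and must survive folding.
KEEP = set("äöåÄÖÅ")

def fold_accents(text, keep=KEEP):
    """Remove combining accents from every character not listed in keep."""
    out = []
    for ch in text:
        if ch in keep:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

print(fold_accents("caf\u00e9"))         # cafe
print(fold_accents("h\u00e4r\u00e4n"))   # härän
```

The same folding has to run on both the indexed text and the query, otherwise "café" and "cafe" still would not meet.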
Protected words
§  As stated earlier, protected words are the words
    you don’t want the indexer to deform
§  Usually trademarks, product names and such
§  These are usually site-specific – for obvious
    reasons
§  This also means you’ll have to be writing this list
    yourself – not a long list usually though
Synonyms
§  Synonyms are good if you want to make sure
    your results are found even if the users don’t
    use the same word
§  Also language specific and not easy to find for
    smaller languages
§  Here’s an example:
  GB,gib,gigabyte,gigabytes
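
To show what a line like this means, the sketch below turns comma-separated synonym groups into a lookup table where every term expands to its whole group. It handles only this simple comma-separated form, and the function name `parse_synonyms` is made up for the example.

```python
def parse_synonyms(lines):
    """Map every term in a comma-separated group to the whole group."""
    expansions = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        group = [term.strip().lower() for term in line.split(",")]
        for term in group:
            expansions[term] = group
    return expansions

synonyms = parse_synonyms(["GB,gib,gigabyte,gigabytes"])
print(synonyms["gigabyte"])  # ['gb', 'gib', 'gigabyte', 'gigabytes']
```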
Compound words
§  There’s also a file for splitting up compound words
§  For a lot of languages you don’t even need it
    and for most only a small one is needed
§  But then there are some languages where you can’t go
    without one, like German or Finnish
§  Let’s look at an example
Compound words
example
§  We did a Drupal site that is about food recipes
§  In English, searching for ‘soup’ would result in all
    the soups
   §  Oxtail soup
   §  Lentil soup
   §  Goulash soup
   §  Tomato soup
        … and so on
Compound words
example
§  By searching with soup in Finnish, ‘keitto’, you’d
    normally get none of the following:
   §  Häränhäntäkeitto
   §  Linssikeitto
   §  Gulassikeitto
   §  Tomaattikeitto
        … see why?
Compound words
§  See, SOLR doesn’t do infix indexes, which means
    it doesn’t find words “within” other words*
§  So you’ll have to cut compound words apart to be
    able to find the words inside them



* There is a way to do infix indexes in SOLR, but that’s so complicated that it’s not even
funny. You’ll have to have two indexes, one the normal way and one in reverse and
then reverse the query to search from the reverse index.
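
Cutting compound words with a word list can be sketched like this. The dictionary contents and the `decompound` helper are assumptions for illustration: the token is kept as it is and every dictionary word found inside it is added as an extra term.

```python
def decompound(token, dictionary, min_subword=3):
    """Return the token plus every dictionary word contained in it."""
    lower = token.lower()
    parts = [w for w in sorted(dictionary)
             if len(w) >= min_subword and w in lower and w != lower]
    return [lower] + parts

# Hypothetical compound word list for a Finnish recipe site.
WORDS = {"keitto", "linssi", "tomaatti"}

print(decompound("Tomaattikeitto", WORDS))
# ['tomaattikeitto', 'keitto', 'tomaatti']
print(decompound("Häränhäntäkeitto", WORDS))
# ['häränhäntäkeitto', 'keitto']
```

With "keitto" indexed as its own term, a search for "keitto" now finds both soups.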
Some special languages
§  Chinese, Japanese and Korean have their own
    different approach to indexing with SOLR:
    basically you don’t have to stem, but only cut the
    words out of the sentences (whitespace doesn’t
    work like in the European languages)
§  For some languages, you can’t even find the
    basic stuff (try Mongolian for instance)
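
One common way to cut text that has no spaces between words is to index overlapping pairs of characters. The sketch below shows that technique as a stand-in illustration; it is not necessarily what the configuration discussed in these slides uses, and the function name is made up.

```python
def bigrams(text):
    """Return overlapping two-character tokens for unsegmented text."""
    chars = [ch for ch in text if not ch.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

print(bigrams("日本語検索"))  # ['日本', '本語', '語検', '検索']
```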
Multilingual SOLR search
§  After adding all those word lists and retuning your
    search according to examples in SOLR’s wiki and
    example configurations, you’ll have a working multi-
    language SOLR search
§  Let native users of that language use it and you’ll
    have some more tuning to do and words to add to
    those lists
§  Eventually your site will be the benchmark for
    functional searching – working multi-language
    searches are that rare
A QUICK
RECAP
Apache SOLR integration
§  Apache SOLR integration is a module for
    integrating your Drupal search with SOLR
§  It works well for English, even better if you tune
    the SOLR configuration a bit
§  Apache SOLR multilingual and config generator
    enable you to index content in multiple languages
§  If you’re using Search API Solr search, you’re in for
    a lot of manual labor
Apache SOLR multilingual
§  But you need to tune your settings by hand and
    you need the word lists
§  Word lists for stop words are easy to find for
    common languages
§  Other word lists you can only find for really
    popular languages
§  Protected words you’ll have to craft yourself
THANK YOU FOR YOUR
TIME. QUESTIONS?
Better Business on the Internet

Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 

Language support in searching Drupal with SOLR - Drupalcamp London 2013

  • 1. LANGUAGE SUPPORT IN SEARCHING DRUPAL WITH SOLR Kalle Varisvirta
  • 2. SOLR? What’s that and why do I care? §  SOLR is an open source search platform, optimized for full-text searching, hit highlighting, faceted search and a lot more §  Far ahead of Drupal’s internal search; the difference blows you away when you compare the two §  Integrates with Drupal in many ways and can be used in many ways – we’re focusing on the actual search functionality
  • 3. SOLR §  Since it’s written in Java, it needs a Java-capable web server and ships with one, Jetty §  Very easy to configure and start, even for a Drupal developer used to drush etc. §  Integrates with Drupal’s search through the “Apache SOLR search integration” module, sponsored by Acquia
  • 4.
  • 5. How does Drupal integrate with SOLR §  Basically the module replaces Drupal’s internal search indexing and instead uses a SOLR schema (schema.xml) that ships with the module §  It defines the mandatory node fields in Drupal and uses SOLR’s cool dynamic field definitions to accommodate all your FieldAPI fields
  • 6. So, what does SOLR do? §  First it looks at the type of the field; the behavior differs between field types §  For text it does a lot: it makes your text searchable by first processing it in many ways and then indexing it §  The behavior differs between languages – and we’ll come to that later – but here’s the basic process for a popular example language: English
  • 7. SOLR processing §  First it tokenizes the text by whitespace §  Then it removes the stop words (words not to index, e.g. “and” or “or”) §  Then it splits words by case change, numerics and a couple of other rules, e.g. “PowerShot” => indexed as “Power” and “Shot” §  Then it stems the words, reducing inflected words to their stems, e.g. “stemming” => “stem” §  Then it removes duplicate tokens
  • 8. SOLR processing FreeAir X500 Wireless Router is a powerful wireless solution well suited for the home or office.
  • 9. SOLR processing Separated by whitespace. FreeAir X500 Wireless Router is a powerful wireless solution well suited for the home or office
  • 10. SOLR processing Stop words removed. FreeAir X500 Wireless Router powerful wireless solution suited home office
  • 11. SOLR processing Words split, but not FreeAir, since it’s on the protected words list. FreeAir X 500 Wireless Router powerful wireless solution suited home office
  • 12. SOLR processing Everything in lowercase. freeair x 500 wireless router powerful wireless solution suited home office
  • 13. SOLR processing Stemmed. freeair x 500 wireless router power wireless solut suit home offic
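The walk-through on slides 8 to 13 can be sketched in a few lines of Python. This is only a toy stand-in for SOLR's analysis chain, not its implementation: the stop word list, the protected word list and especially the `TOY_STEMS` lookup table are assumptions chosen so that the example sentence comes out as on the slides (a real setup uses a proper stemmer and much longer lists).

```python
import re

# Assumed toy lists, just large enough for the example sentence.
STOP_WORDS = {"is", "a", "well", "for", "the", "or"}
PROTECTED_WORDS = {"FreeAir"}
# Hypothetical lookup table standing in for a real stemmer.
TOY_STEMS = {"powerful": "power", "solution": "solut",
             "suited": "suit", "office": "offic"}


def tokenize(text):
    """Split on whitespace; stripping edge punctuation is a simplification."""
    tokens = (t.strip(".,;:!?") for t in text.split())
    return [t for t in tokens if t]


def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def split_words(tokens):
    """Split on case changes and letter/digit boundaries, except protected words."""
    out = []
    for t in tokens:
        if t in PROTECTED_WORDS:
            out.append(t)
        else:
            out.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+|\d+", t))
    return out


def analyze(text):
    tokens = tokenize(text)
    tokens = remove_stop_words(tokens)
    tokens = split_words(tokens)
    tokens = [t.lower() for t in tokens]
    return [TOY_STEMS.get(t, t) for t in tokens]


sentence = ("FreeAir X500 Wireless Router is a powerful wireless solution "
            "well suited for the home or office.")
print(" ".join(analyze(sentence)))
# freeair x 500 wireless router power wireless solut suit home offic
```

Each function corresponds to one slide, so printing the intermediate token lists reproduces the individual steps.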
  • 14. Searching from SOLR §  When you search in SOLR, it applies parts of the same magic to your query text §  This way you’ll match the indexed document even if you wrote it a bit differently §  “Office capable wireless routers” will match our indexed document nicely: not on every word, but on enough words that sit close to each other, so it’s a good match and ranks high on SOLR’s relevance score
  • 15. Apache SOLR integration §  All the special configuration SOLR needs to run a site (in English) ships with the Apache SOLR search integration module §  Just copy the files to SOLR and you’re good to go §  The rest of the presentation assumes you’re using this module to connect to SOLR. If you’re using Search API Solr search, you’re out of luck and will have to do a lot more manual work; check out http://drupal.org/node/1210810
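A minimal Python sketch of why that query matches: the query goes through (part of) the same processing and the resulting terms are compared with the indexed ones. The indexed terms are copied from the "Stemmed" slide; the `TOY_STEMS` table is a hypothetical stand-in for a real stemmer, and the sketch ignores term proximity and relevance scoring entirely.

```python
# Terms indexed for the example document (taken from the "Stemmed" slide).
INDEXED_TERMS = ["freeair", "x", "500", "wireless", "router", "power",
                 "wireless", "solut", "suit", "home", "offic"]

# Hypothetical lookup table standing in for a real stemmer.
TOY_STEMS = {"office": "offic", "capable": "capabl", "routers": "router"}


def analyze_query(query):
    """Lowercase the query, split it on whitespace and apply the toy stems."""
    return [TOY_STEMS.get(t, t) for t in query.lower().split()]


def matching_terms(query):
    """Return the analyzed query terms that also occur in the index."""
    indexed = set(INDEXED_TERMS)
    return [t for t in analyze_query(query) if t in indexed]


print(matching_terms("Office capable wireless routers"))
# ['offic', 'wireless', 'router']
```

Three of the four query words hit the document even though "Office" and "routers" were written differently from the indexed text; "capable" does not match at all.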
  • 16. SO, MY SOLR SEARCH WORKS WELL WITH MY ENGLISH CONTENT
  • 17. But then, this is Europe: we use a lot of other languages here too … and then things get a bit more complicated
  • 18. SOLR schema has to be language-aware §  Stemming, stopwords, compound words and such are all language dependent §  The main SOLR indexing and querying configuration, schema.xml, needs to be language specific §  schema.xml is a long, complicated XML document and any error in it will prevent SOLR from starting
  • 19. Here’s an example schema.xml

      <?xml version="1.0" encoding="UTF-8"?>
      <!--
       This is the Solr schema file. This file should be named "schema.xml" and
       should be in the conf directory under the solr home
       (i.e. ./solr/conf/schema.xml by default)
       or located where the classloader for the Solr webapp can find it.

       For more information, on how to customize this file, please see
       http://wiki.apache.org/solr/SchemaXml
      -->
      <schema name="drupal-3.0-0-solr3" version="1.3">
        <!-- attribute "name" is the name of this schema and is only used for display purposes.
             Applications should change this to reflect the nature of the search collection.
             version="1.2" is Solr's version number for the schema syntax and semantics. It should
             not normally be changed by applications.
             1.0: multiValued attribute did not exist, all fields are multiValued by nature
             1.1: multiValued attribute introduced, false by default
             1.2: omitTermFreqAndPositions attribute introduced, true by default except for text fields.
             1.3: removed optional field compress feature
           -->
        <types>
          <!-- field type definitions. The "name" attribute is
               just a label to be used by field definitions. The "class"
               attribute and any other attributes determine the real
               behavior of the fieldType.
               Class names starting with "solr" refer to java classes in the
               org.apache.solr.analysis package.
          -->

          <!-- The StrField type is not analyzed, but indexed/stored verbatim.
               - StrField and TextField support an optional compressThreshold which
               limits compression (if enabled in the derived fields) to values which
               exceed a certain size (in characters).
          -->
          <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
  • 20. IN COMES THE APACHE SOLR MULTILINGUAL
  • 21.
  • 22. There’s help available §  There are two modules on Drupal.org to make your life easier, Apache SOLR Multilingual and Apache SOLR config generator §  Combined, they enable you to §  Have a multi-language site with SOLR search optimized for each language §  Generate configuration for such a multi-language site, or even a site with one non-English language
  • 23. Apache SOLR multilingual §  Apache SOLR multilingual separates the Drupal node fields per language and stores them in SOLR in different fields §  That way you can have a different configuration for the same Drupal field in different languages §  It’ll handle the spell checking too §  Apache SOLR config generator then generates a suitable starting point for your SOLR configuration files
  • 24. … but it doesn’t do everything §  It ships with stop word lists for the most common languages, the ISO Latin mapping list for German (the module author speaks German) and some other files §  Most of the language-specific word lists, such as protwords (usually site-specific anyway), ISO mappings, synonyms and compound word lists, you’ll have to provide yourself §  Some languages need a different stemmer to work properly; the configuration generator uses SnowballPorterFilterFactory
  • 25. Stop words §  Every language needs a stop word list; these are the “and, or, then” words you don’t index at all §  Needless to say, they are language specific §  Luckily you’ll find most of them either in the Apache SOLR multilingual module or somewhere online
  • 26. ISO mapping §  This covers the special letters in some languages and how to convert them for better matching §  This is usually done for accents and the like, which guide the pronunciation of a word and don’t change its meaning (e.g. café => cafe, in both indexing and querying) §  Umlauts and similar letters (ä, ö, å) do change the meaning and usually are NOT replaced
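As a rough illustration of the mapping idea, here is a Python sketch that strips accents but leaves ä, ö and å alone. It uses Unicode decomposition from the standard library rather than a SOLR mapping file, and the `KEEP` set is an assumption suited to a Finnish or Swedish site; other languages would need a different set.

```python
import unicodedata

# Letters that change the meaning of a word and must survive (assumed set).
KEEP = set("äöåÄÖÅ")


def fold_accents(text):
    """Remove combining accents from every character except those in KEEP."""
    out = []
    # Normalize first so that "a" + combining diaeresis is seen as "ä".
    for ch in unicodedata.normalize("NFC", text):
        if ch in KEEP:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    return "".join(out)


print(fold_accents("café"))              # cafe
print(fold_accents("häränhäntäkeitto"))  # häränhäntäkeitto
```

The same function has to be applied to both the indexed text and the query, otherwise "cafe" and "café" still won't meet.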
  • 27. Protected words §  As stated earlier, protected words are the words you don’t want the indexer to deform §  Usually trademarks, product names and such §  These are usually site-specific – for obvious reasons §  This also means you’ll have to be writing this list yourself – not a long list usually though
  • 28. Synonyms §  Synonyms are good if you want to make sure your results are found even if the users don’t use the same word §  Also language specific and not easy to find for smaller languages §  Here’s an example: GB,gib,gigabyte,gigabytes
  • 29. Compound words §  There’s also a file to split up compound words §  For a lot of languages you don’t even need it, and for most only a small one is needed §  But then there are some languages where you can’t do without one, like German or Finnish §  Let’s look at an example
  • 30. Compound words example §  We built a Drupal site about food recipes §  In English, searching for ‘soup’ returns all the soups §  Oxtail soup §  Lentil soup §  Goulash soup §  Tomato soup … and so on
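To show what such a line does, here is a small Python sketch that reads comma-separated synonym groups and expands a token to every member of its group. The function names `parse_synonyms` and `expand` are made up for this illustration, and expanding every token to its whole group is just one possible policy.

```python
def parse_synonyms(lines):
    """Map every term to the full comma-separated group it belongs to."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        group = [term.strip().lower() for term in line.split(",")]
        for term in group:
            table[term] = group
    return table


def expand(tokens, table):
    """Replace each token by its synonym group, or keep it if it has none."""
    out = []
    for t in tokens:
        out.extend(table.get(t.lower(), [t.lower()]))
    return out


table = parse_synonyms(["GB,gib,gigabyte,gigabytes"])
print(expand(["500", "gigabytes"], table))
# ['500', 'gb', 'gib', 'gigabyte', 'gigabytes']
```

With the group applied, a document that says "gigabytes" is also found by someone searching for "GB".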
  • 31. Compound words example §  Searching for soup in Finnish, ‘keitto’, you’d normally get none of the following: §  Häränhäntäkeitto §  Linssikeitto §  Gulassikeitto §  Tomaattikeitto … see why?
  • 32. Compound words §  See, SOLR doesn’t do infix indexes, which means it doesn’t find words “within” other words* §  So you’ll have to split compound words to make their parts searchable * There is a way to do infix indexes in SOLR, but that’s so complicated that it’s not even funny. You’ll have to have two indexes, one the normal way and one in reverse and then reverse the query to search from the reverse index.
  • 33. Some special languages §  Chinese, Japanese and Korean need a different approach to indexing with SOLR: basically you don’t have to stem, only cut the words out of the sentences (whitespace doesn’t work like in the European languages) §  For some languages, you can’t even find the basic stuff (try Mongolian for instance)
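Cutting compound words with a word list can be sketched as follows in Python. It keeps the original token and adds every dictionary word found inside it, which is the general idea behind dictionary-based decompounding; the one-word dictionary and the `min_subword` limit are assumptions for the example, not settings from any shipped configuration.

```python
def decompound(token, dictionary, min_subword=3):
    """Return the token plus every dictionary word found inside it."""
    lower = token.lower()
    parts = [lower]
    for start in range(len(lower)):
        for end in range(start + min_subword, len(lower) + 1):
            piece = lower[start:end]
            if piece != lower and piece in dictionary:
                parts.append(piece)
    return parts


dictionary = {"keitto"}  # assumed compound word list with a single entry
for soup in ["Häränhäntäkeitto", "Linssikeitto", "Gulassikeitto", "Tomaattikeitto"]:
    print(decompound(soup, dictionary))
```

Every soup name now also yields the token "keitto", so a search for ‘keitto’ can find all four recipes.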
  • 34. Multilingual SOLR search §  After adding all those word lists and retuning your search according to the examples in SOLR’s wiki and example configurations, you’ll have a working multi-language SOLR search §  Let native speakers of each language use it and you’ll have some more tuning to do and words to add to those lists §  Eventually your site will be the benchmark for functional searching – working multi-language searches are that rare
  • 36. Apache SOLR integration §  Apache SOLR integration is a module for integrating your Drupal search with SOLR §  It works well for English, even better if you tune the SOLR configuration a bit §  Apache SOLR multilingual and config generator enable you to index content in multiple languages §  If you’re using Search API Solr search, you’re in for a lot of manual labor
  • 37. Apache SOLR multilingual §  But you need to tune your settings by hand and you need the word lists §  Word lists for stop words are easy to find for common languages §  Other word lists you can only find for really popular languages §  Protected words you’ll have to craft yourself
  • 38. THANK YOU FOR YOUR TIME. QUESTIONS?
  • 39. Better Business on the Internet