Contents

Full text search in Rails with Sunspot and Solr
Starting the engines
Listing 1 – sunspot.yml
Listing 2 – create_base_tables.rb
Listing 3 – category.rb
Listing 4 – product.rb
Searching
Listing 5 – products_controller.rb
Listing 6 – sunspot_hack.rb
Indexing
Image 1 – Solr schema browser
Image 2 – Viewing the analysis and search filters
Image 3 – Solr analyzer page
Customizing fields
Listing 7 – solr/conf/schema.xml excerpt
Listing 8 – solr/conf/schema.xml excerpt
Image 4 – Solr analyzer page
Partial matching
Listing 9 – solr/conf/schema.xml excerpt
Image 5 – Analyzer output with partial matching enabled
Faceting
Listing 10 – products_controller.rb excerpt
Listing 11 – product.rb excerpt
Listing 12 – products/index.html.haml excerpt
Image 6 – Faceting information
Conclusion

Full text search in Rails with Sunspot and Solr

Everyone wants their database to run everything as fast as possible. The usual advice is to query less, add caching layers and index the columns being searched, but another option is to not use the database at all and pick a tool better suited to your querying needs.

When querying for text in our databases, we often reach for “LIKE” searches. LIKE searches only perform well when there is an index on the field and the query is written in a way that can use it. Imagine you have a field “name” containing the text “Battlestar Galactica”. This query can use the index:

SELECT p.* FROM products p WHERE p.name LIKE 'Battlestar%'

The database can optimize this query and use the index to find the expected row. But what if the query looks like this one?

SELECT p.* FROM products p WHERE p.name LIKE '%Galactica'

Database indexes usually match from left to right, so unless you have a nasty trick up your sleeve, this query will look at ALL the rows in the products table and run a match against every “name” value before returning a result. And that’s Really Bad News for you, as the DBA will probably come after you with a morning star. In short, querying with “LIKE” when what you need is full text search isn’t nice.

That’s where full text search solutions come in. Tools like Solr give you optimized text searches, input filtering, categorization and even features like Google’s “Did you mean?”.

In this tutorial you’ll learn how to add full text search capabilities to your Rails application using Sunspot and Solr. We’ll also delve a little into Solr’s configuration and learn how to use specific tokenizers and filters to clean input, perform partial matching of words and facet results.

This project uses Rails 3 and Ruby 1.9.2. You’ll find a Gemfile and an “.rvmrc” with all dependencies declared, so it should be easy to follow along or set up your environment based on them (if you’re not using RVM, this is a GREAT time to learn it). You can probably follow this tutorial with an earlier Rails version and without Bundler or RVM, since the models and most of the code look exactly the same in Rails 2, and Sunspot is compatible with Rails 2 too.

The source code for this application is available on GitHub - https://github.com/mauricio/sunspot_tutorial

Starting the engines

Download the Sunspot source code from GitHub - https://github.com/outoftime/sunspot

Enter the project folder and go to “sunspot/solr-1.3”; inside it you should see a “solr” folder. Copy that folder into your project’s root. This is where the general Solr configuration is going to live. Don’t worry about these files just yet, we’ll get to them later in the tutorial.

Now create a “sunspot.yml” file under your project’s “config” folder. Here’s a sample:

Listing 1 – sunspot.yml

development:
  solr:
    hostname: localhost
    port: 8980
    log_level: INFO
  auto_commit_after_delete_request: true
test:
  solr:
    hostname: localhost
    port: 8981
    log_level: OFF
production:
  solr:
    hostname: localhost
    port: 8982
    log_level: WARNING
  auto_commit_after_request: true

You can have a different configuration for each environment you run. To see all the configuration options, go to the Sunspot source code and open the “sunspot_rails/lib/sunspot/rails/configuration.rb” file.

Now we’ll create two models, Product and Category, starting with the migration that sets up their tables:

rails g migration create_base_tables

Listing 2 – create_base_tables.rb

class CreateBaseTables < ActiveRecord::Migration

  def self.up
    create_table :categories do |t|
      t.string :name, :null => false
    end

    create_table :products do |t|
      t.string  :name, :null => false
      t.decimal :price, :scale => 2, :precision => 16, :null => false
      t.text    :description
      t.integer :category_id, :null => false
    end

    add_index :products, :category_id
  end

  def self.down
    drop_table :products
    drop_table :categories
  end

end
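Run the migration to create both tables before moving on:

rake db:migrate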
Now we move on to the basic models, starting with Category:

Listing 3 – category.rb

class Category < ActiveRecord::Base

  has_many :products

  validates_presence_of :name
  validates_uniqueness_of :name, :allow_blank => true

  searchable :auto_index => true, :auto_remove => true do
    text :name
  end

  def to_s
    self.name
  end

end

Here in the Category class we see our first reference to Sunspot: the “searchable” method, where we configure the fields that should be indexed by Solr. In Category there’s only one field that’s useful at the moment, “name”, so we tell Sunspot to index it as “text” (you usually don’t want your text indexed as “string”, since a string field is only a hit on a full match).

The :auto_index and :auto_remove options let Sunspot automatically send your model to Solr whenever it is created, updated or destroyed. Both default to “false”, which means you would have to send your data to Solr manually, so unless you really want to do that, keep both values as “true” in your models.
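If you ever do turn them off, Sunspot’s core API lets you index and commit by hand. A minimal sketch (these are standard Sunspot module methods, but double-check them against your Sunspot version):

# Index a record manually and commit so it becomes searchable
category = Category.create!(:name => 'Board Games')
Sunspot.index(category)
Sunspot.commit

# Remove it from the index when needed
Sunspot.remove(category)
Sunspot.commit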
Now let’s look at the Product class:

Listing 4 – product.rb

class Product < ActiveRecord::Base

  belongs_to :category

  validates_presence_of :name, :description, :category_id, :price
  validates_uniqueness_of :name, :allow_blank => true

  searchable :auto_index => true, :auto_remove => true do
    text :name, :boost => 2.0
    text :description
    float :price
    integer :category_id
  end

  def to_s
    self.name
  end

end

In the Product class things are a little different: we have more fields (and more kinds of fields) being indexed. “float” and “integer” are pretty self explanatory, but the “name” field has some black magic floating around it, the “:boost” parameter. Boosting a field at indexing time means that a match in that specific field has more relevance than a match found somewhere else.

Imagine you’re looking for Iron Maiden’s “Powerslave” album. You go to Iron Maiden’s online store and search for “powerslave”, hoping the album will be the first hit, but then you see “Live After Death” before “Powerslave”. Why did that happen? The “Live After Death” album contains the song “Powerslave” in its track listing, so it’s as much of a match as the actual “Powerslave” album. What we need is to tell the search tool that a match on an album name has higher relevance than a hit in the track listing.

Boosting reduces these issues. Some fields are inherently more important than others, and you tell that to Solr by configuring a “:boost” value for them. When something matches on them, the relevance of that match is increased and it should come up before the other results.

Searching

Now let’s take a look at the ProductsController to see how we perform the search:

Listing 5 – products_controller.rb

class ProductsController < ApplicationController

  def index
    @products = if params[:q].blank?
      Product.all :order => 'name ASC'
    else
      Product.solr_search do |s|
        s.keywords params[:q]
      end
    end
  end

end

As you can see, searching is quite simple: you call the solr_search method and send in the text to be searched for. One thing I don’t like about Sunspot is that searches do not return an Array-like object; you get a Sunspot::Search::StandardSearch object that holds the records returned by the search in a “results” property. Here’s a simple way to fix this (I usually place the contents of this file in an initializer in “config/initializers”):

Listing 6 – sunspot_hack.rb

::Sunspot::Search::StandardSearch.class_eval do
  include Enumerable
  delegate(
    :current_page,
    :per_page,
    :total_entries,
    :total_pages,
    :offset,
    :previous_page,
    :next_page,
    :out_of_bounds?,
    :each,
    :in_groups_of,
    :blank?,
    :[],
    :to => :results)
end

This simple monkeypatch makes the search object itself behave like an Enumerable/Array, so you can navigate the results directly without calling the “results” method. The methods used by will_paginate helpers are also delegated, so you can pass this object straight to a will_paginate call in your view and it’s just going to work.
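With the patch in place, a quick sanity check in the Rails console might look like this (a sketch; the “battlestar” keyword assumes the seed data from the tutorial’s repository):

search = Product.solr_search { |s| s.keywords 'battlestar' }

# Enumerable behavior comes from the monkeypatch
search.each { |product| puts product.name }

# Pagination methods are delegated to the underlying results collection
search.total_entries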
Indexing

Now that the models are in place, we can start fine tuning the Solr indexing process. The first thing to understand is what happens when you send text to Solr to be indexed, so let’s get into the tool, starting the server:

rake sunspot:solr:run

This rake task starts Solr in the foreground (to start it in the background you’d use “sunspot:solr:start”). With Solr running, you should add some data to the database; this tutorial’s project on GitHub contains a “seed.rb” file with some basic data for testing, just copy it over to your project.

Also copy “lib/tasks/db.rake” from the project to yours. It contains a “db:prepare” task that truncates the database, seeds it and then indexes all items in Solr, and we’re going to be reindexing data a lot.

With everything copied, run the “db:prepare” task:

rake db:prepare

This will add the categories and products to your database and also index them in Solr. If the task ran successfully, head to the Solr administration interface at this URL:

http://localhost:8980/solr/admin/schema.jsp

Once there, click on “FIELDS”, then on “NAME_TEXT”; you should see a screen just like the one in Image 1:

Image 1 – Solr schema browser

If you don’t see all the fields shown in this image, your “rake db:prepare” command has probably failed, or Solr wasn’t running when you called it.

What we see here is information about the fields we’re indexing. This specific field contains the data from the name properties of both the Category and Product classes, as you can tell from the top 10 terms.

The name field is not indexed by its full content, as a relational database would usually do; the text is broken into tokens by the solr.StandardTokenizerFactory class in Solr. This class receives our text, like “Battlestar Galactica: The Boardgame”, and turns it into:

[“Battlestar”, “Galactica”, “The”, “Boardgame”]

This is what gets indexed and, ultimately, searched by Solr. If you open the web application now and search for “battle”, you won’t get any matches; if you search for “Battlestar”, you get the two products that match the name.

Everything in Solr indexing revolves around building the best “tokens” for your input.
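As a rough Ruby analogy (an illustration of the idea only, not what Solr actually runs):

# Splitting text on non-alphanumeric boundaries, much like StandardTokenizerFactory
"Battlestar Galactica: The Boardgame".scan(/[[:alnum:]]+/)
# => ["Battlestar", "Galactica", "The", "Boardgame"]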
You have to teach Solr to crunch your data in a way that makes sense and is easy to search for, and you do that by adding filters to the indexing process. While on the same page as Image 1, click on the “DETAILS” link as shown in Image 2:

Image 2 – Viewing the analysis and search filters

Each field in Solr has two analyzers: the “index” analyzer, which prepares input to be indexed, and the “query” analyzer, which prepares search input before the search is performed. Unless you have a special need, both are usually the same.

In our current configuration we have the same two filters in both analyzers. The StandardFilterFactory removes punctuation from our input (that’s why the “:” in “Battlestar Galactica: The Boardgame” is not in our tokens) and the LowerCaseFilterFactory lowercases all input, so searches for “baTTle”, “BATTLE” and “BaTtLe” all work.

Before we move on to add more filters to our analyzers, let’s take a look at the analyzer screen in the Solr admin at http://localhost:8980/solr/admin/analysis.jsp?highlight=on

On this screen we can see how our input is transformed into tokens by the configured analyzers.

Image 3 – Solr analyzer page

Here we have selected the “name_text” field. In “Field value (Index)” you enter the values you’d send to be indexed, just as they would come from your model property; in “Field value (Query)” you enter the values you’d use to search.

Once you type and hit “Analyze” you should see the output just below the form, as in Image 3. The output shows how your input is transformed into tokens by the tokenizer and filters, so you can easily experiment with new filters and check whether the output matches what you expect. This analysis view is your best friend when debugging search and indexing issues or trying out ways to improve how Solr indexes and matches your data.

Customizing fields

Now that you have an idea of how indexing and searching work, let’s start customizing the fields in Solr. Open the “solr/conf/schema.xml” file and look for this definition:

Listing 7 – solr/conf/schema.xml excerpt

<fieldtype class="solr.TextField" positionIncrementGap="100" name="text">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

If you look back at Image 1, where we saw the “name_text” configuration, you’ll see that the field type is “text”. The excerpt above is the configuration for all fields of type “text”, which means that if we add more filters here we affect all fields of this type. This greatly simplifies configuration: we don’t have to define explicit settings for every single field in our models, we just reuse this same “text” config for every field that should be indexed as text.

But that’s a lot of talking, let’s get into action! Let’s start by looking at our indexed data from before:

[“battlestar”, “galactica”, “the”, “boardgame”]

The “the” is mostly useless: it shows up in almost every property, and no one is ever going to search for “the” (oh yeah, there might be that ONE guy who does). In Information Retrieval lingo, “the” is a stop word: it carries no meaning by itself and doesn’t represent valuable information for our indexer. Removing stop words from your input improves both performance and the relevance of your results.
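In plain Ruby terms, the idea is simply this (illustration only, Solr performs it inside the analyzer chain):

STOP_WORDS = %w[the a an and of]
tokens = ["battlestar", "galactica", "the", "boardgame"]
tokens.reject { |token| STOP_WORDS.include?(token) }
# => ["battlestar", "galactica", "boardgame"]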
Since this is such a common need, Solr already ships with a filter that removes stop words from your data, the solr.StopFilterFactory. Let’s add it to our config:

Listing 8 – solr/conf/schema.xml excerpt

<fieldtype class="solr.TextField" positionIncrementGap="100" name="text">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldtype>

If you look at the “solr/conf” folder you’ll see a “stopwords.txt” file that already contains most of the common English stop words. You can add or remove words as needed, and if you’re not indexing English text you can just remove the English words and add your language’s stop words. Make this change in your “solr/conf/schema.xml” file, restart Solr and open the analyzer again:

Image 4 – Solr analyzer page

As you can see, in the last step the “the” was removed from both the index input and the query input. We’re keeping only the pieces of information that are really useful, which makes our index smaller and also speeds up searching.

While you weren’t looking, we also added two other filters. solr.ISOLatin1AccentFilterFactory removes accents from words in Latin-based languages like Portuguese: if the input is “não”, it becomes “nao”. And after that there’s solr.TrimFilterFactory, which removes unnecessary whitespace from our tokens.
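The accent filter’s effect is similar in spirit to ActiveSupport’s transliteration helper, if you want to play with the idea in a Rails console (transliterate is a standard ActiveSupport method, though its exact behavior varies between Rails versions):

ActiveSupport::Inflector.transliterate("não")  # => "nao"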
Partial matching

Another common need is matching only part of a word, usually a prefix. At the beginning of the tutorial we saw that searching for “battle” doesn’t yield any results, while “battlestar” does. This happens because Solr, by default, only sees a match if it’s a full match: the word you enter must be exactly the same as a token in the index, and if there is no exact match, Solr will tell you there are no results.

If you look at Lucene’s query parser syntax - http://lucene.apache.org/java/2_9_1/queryparsersyntax.html (Solr is, in a way, a web interface to Lucene) you’ll see that you can use the “*” operator to perform a partial match. We could search for “battle*” and get the results we expect, but this kind of wildcard matching is slow and could become a bottleneck for your application, so we have to figure out another way.

When all you need is prefix matching, the solr.EdgeNGramFilterFactory is your best friend. It breaks words into prefixes that are added to the index as tokens of their own, so it looks like you have partial matching, when in fact each partial is a full token in the index. Here’s how the config looks in this case:

Listing 9 – solr/conf/schema.xml excerpt

<fieldtype class="solr.TextField" positionIncrementGap="100" name="text">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory"
      minGramSize="3"
      maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldtype>

As you can see, we now have two <analyzer> sections in our <fieldtype>, one for “index” and one for “query”. This is needed because we don’t want our search input to be transformed for partial matching too: if the user searches for “battle”, it makes no sense to show him results for “bat”, so the generation of word prefixes should happen only at indexing time.

Now restart your Solr instance and run the analyzer form again; you should see something like Image 5:

Image 5 – Analyzer output with partial matching enabled

Looking at the output, “battlestar” became:

[“bat”, “batt”, “battl”, “battle”, “battles”, “battlest”, “battlesta”, “battlestar”]

Now, if you search for “battle”, you’ll find all products that have “battle” as a prefix of any of their words, and the search input is not affected by the change.
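One caveat: documents already in the index were tokenized by the old analyzer chain, so after changing schema.xml you need to reindex your data before existing records become findable by prefix. With this tutorial’s setup, just rerun the task we created earlier:

rake db:prepare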
Faceting

Faceting results is YACF (Yet Another Cool Feature) you get when using Solr and Sunspot. “What does that mean?”, you might ask. It means Solr can organize your results based on one of their properties and tell you how many results matched for each property value.

“I still don’t get it”, you might be thinking. In our Product model we’re indexing the “category_id” property; we’ll tell Sunspot to facet our search on the “category_id” field, and Sunspot will tell us how many matches each category had, even when we’re paginating the results. Let’s see how the searching code changes:

Listing 10 – products_controller.rb excerpt

  def index
    @page = (params[:page] || 1).to_i
    @products = if params[:q].blank?
      Product.paginate :order => 'name ASC', :per_page => 3, :page => @page
    else

      result = Product.solr_search do |s|
        s.keywords params[:q]
        unless params[:category_id].blank?
          s.with( :category_id ).equal_to( params[:category_id].to_i )
        else
          s.facet :category_id
        end
        s.paginate :per_page => 3, :page => @page
      end

      if result.facet( :category_id )
        @facet_rows = result.facet(:category_id).rows
      end

      result
    end
  end

The search code changed quite a bit. If there’s a “category_id” parameter we use it to filter the search; if there isn’t, we perform faceting with the “s.facet :category_id” call. There’s also a slight change to the Product class:

Listing 11 – product.rb excerpt

  searchable :auto_index => true, :auto_remove => true do
    text :name, :boost => 2.0
    text :description
    float :price
    integer :category_id, :references => ::Category
  end

We’ve added “:references => ::Category” to the “:category_id” field configuration, so Sunspot knows this field is in fact a foreign key to another object. This allows Sunspot to load the categories for the facets automatically.

The “result.facet(:category_id)” call asks the search object for the facets returned for the :category_id field. Each row in this list contains an “instance” (in our case, a Category object) and a “count”, the number of hits in that specific facet.
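Before wiring up the view, you can inspect the facet rows in the console. A sketch (the actual category names depend on your seed data):

result = Product.solr_search do |s|
  s.keywords 'battle'
  s.facet :category_id
end

result.facet(:category_id).rows.each do |row|
  # row.instance is the Category loaded via :references, row.count is the hit count
  puts "#{row.instance.name}: #{row.count}"
end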
Once you have the rows, you can use them in the view:

Listing 12 – products/index.html.haml excerpt

  - if !@facet_rows.blank? && @facet_rows.size > 1
    %ul
      - for row in @facet_rows
        %li= link_to( "#{row.instance} (#{row.count})", products_path( :q => params[:q], :category_id => row.instance ) )

If there are facets available, we turn them into links that let the user filter by each specific facet. Each row object has an instance and a count, and we use both in the interface to tell the user which category it is and how many hits it had. Here’s how the user interface looks:

Image 6 – Faceting information

And now you finally have search functionality in a Rails project, with partial matching, faceting, pagination and input cleanup. Just forget that you ever ran a “SELECT p.* FROM products p WHERE p.name LIKE '%battle%'” and be happy to be using a great full text search solution.

Conclusion

Hopefully this tutorial is enough to get you up and running with Solr. For more advanced features, I recommend searching the Solr wiki (http://wiki.apache.org/solr/FrontPage) and reading “Solr 1.4 – Enterprise Search Server” by David Smiley and Eric Pugh (http://www.amazon.com/gp/product/1847195881?ie=UTF8&tag=ultimaspalavr-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=1847195881).

More Related Content

More from Maurício Linhares

Unindo Ruby e Java através de uma arquitetura orientada a serviços na OfficeDrop
Unindo Ruby e Java através de uma arquitetura orientada a serviços na OfficeDropUnindo Ruby e Java através de uma arquitetura orientada a serviços na OfficeDrop
Unindo Ruby e Java através de uma arquitetura orientada a serviços na OfficeDropMaurício Linhares
 
Mixing Ruby and Java in a Service Oriented Architecture at OfficeDrop
Mixing Ruby and Java in a Service Oriented Architecture at OfficeDropMixing Ruby and Java in a Service Oriented Architecture at OfficeDrop
Mixing Ruby and Java in a Service Oriented Architecture at OfficeDropMaurício Linhares
 
Curso java 08 - mais sobre coleções
Curso java   08 - mais sobre coleçõesCurso java   08 - mais sobre coleções
Curso java 08 - mais sobre coleçõesMaurício Linhares
 
Curso java 06 - mais construtores, interfaces e polimorfismo
Curso java   06 - mais construtores, interfaces e polimorfismoCurso java   06 - mais construtores, interfaces e polimorfismo
Curso java 06 - mais construtores, interfaces e polimorfismoMaurício Linhares
 
Curso java 05 - herança, classes e métodos abstratos
Curso java   05 - herança, classes e métodos abstratosCurso java   05 - herança, classes e métodos abstratos
Curso java 05 - herança, classes e métodos abstratosMaurício Linhares
 
Curso java 04 - ap is e bibliotecas
Curso java   04 - ap is e bibliotecasCurso java   04 - ap is e bibliotecas
Curso java 04 - ap is e bibliotecasMaurício Linhares
 
Curso java 01 - molhando os pés com java
Curso java   01 - molhando os pés com javaCurso java   01 - molhando os pés com java
Curso java 01 - molhando os pés com javaMaurício Linhares
 
Curso java 03 - métodos e parâmetros
Curso java   03 - métodos e parâmetrosCurso java   03 - métodos e parâmetros
Curso java 03 - métodos e parâmetrosMaurício Linhares
 
Outsourcing e trabalho remoto para a nuvem
Outsourcing e trabalho remoto para a nuvemOutsourcing e trabalho remoto para a nuvem
Outsourcing e trabalho remoto para a nuvemMaurício Linhares
 
Aulas de Java Avançado 2- Faculdade iDez 2010
Aulas de Java Avançado 2- Faculdade iDez 2010Aulas de Java Avançado 2- Faculdade iDez 2010
Aulas de Java Avançado 2- Faculdade iDez 2010Maurício Linhares
 

More from Maurício Linhares (20)

Mercado de TI
Mercado de TIMercado de TI
Mercado de TI
 
Unindo Ruby e Java através de uma arquitetura orientada a serviços na OfficeDrop
Unindo Ruby e Java através de uma arquitetura orientada a serviços na OfficeDropUnindo Ruby e Java através de uma arquitetura orientada a serviços na OfficeDrop
Unindo Ruby e Java através de uma arquitetura orientada a serviços na OfficeDrop
 
Mixing Ruby and Java in a Service Oriented Architecture at OfficeDrop
Mixing Ruby and Java in a Service Oriented Architecture at OfficeDropMixing Ruby and Java in a Service Oriented Architecture at OfficeDrop
Mixing Ruby and Java in a Service Oriented Architecture at OfficeDrop
 
Aprendendo ruby
Aprendendo rubyAprendendo ruby
Aprendendo ruby
 
Curso java 07 - exceções
Curso java   07 - exceçõesCurso java   07 - exceções
Curso java 07 - exceções
 
Curso java 08 - mais sobre coleções
Curso java   08 - mais sobre coleçõesCurso java   08 - mais sobre coleções
Curso java 08 - mais sobre coleções
 
Curso java 06 - mais construtores, interfaces e polimorfismo
Curso java   06 - mais construtores, interfaces e polimorfismoCurso java   06 - mais construtores, interfaces e polimorfismo
Curso java 06 - mais construtores, interfaces e polimorfismo
 
Curso java 05 - herança, classes e métodos abstratos
Curso java   05 - herança, classes e métodos abstratosCurso java   05 - herança, classes e métodos abstratos
Curso java 05 - herança, classes e métodos abstratos
 
Curso java 04 - ap is e bibliotecas
Curso java   04 - ap is e bibliotecasCurso java   04 - ap is e bibliotecas
Curso java 04 - ap is e bibliotecas
 
Curso java 01 - molhando os pés com java
Curso java   01 - molhando os pés com javaCurso java   01 - molhando os pés com java
Curso java 01 - molhando os pés com java
 
Curso java 02 - variáveis
Curso java   02 - variáveisCurso java   02 - variáveis
Curso java 02 - variáveis
 
Curso java 03 - métodos e parâmetros
Curso java   03 - métodos e parâmetrosCurso java   03 - métodos e parâmetros
Curso java 03 - métodos e parâmetros
 
Extreme programming
Extreme programmingExtreme programming
Extreme programming
 
Feature Driven Development
Feature Driven DevelopmentFeature Driven Development
Feature Driven Development
 
Migrando pra Scala
Migrando pra ScalaMigrando pra Scala
Migrando pra Scala
 
Outsourcing e trabalho remoto para a nuvem
Outsourcing e trabalho remoto para a nuvemOutsourcing e trabalho remoto para a nuvem
Outsourcing e trabalho remoto para a nuvem
 
Mercado hoje
Mercado hojeMercado hoje
Mercado hoje
 
Análise de sistemas oo 1
Análise de sistemas oo   1Análise de sistemas oo   1
Análise de sistemas oo 1
 
Revisão html e java script
Revisão html e java scriptRevisão html e java script
Revisão html e java script
 
Aulas de Java Avançado 2- Faculdade iDez 2010
Aulas de Java Avançado 2- Faculdade iDez 2010Aulas de Java Avançado 2- Faculdade iDez 2010
Aulas de Java Avançado 2- Faculdade iDez 2010
 

Recently uploaded

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 

Recently uploaded (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Full text search in Rails with Sunspot and Solr

  • 1. TOC quot; 1-3quot; Full text search in in Rails with Sunspot and Solr PAGEREF _Toc156559729 2<br />Starting the engines PAGEREF _Toc156559730 3<br />Listing 1 – sunspot.yml PAGEREF _Toc156559731 3<br />Listing 2 – create_base_tables.rb PAGEREF _Toc156559732 3<br />Listing 3 – category.rb PAGEREF _Toc156559733 5<br />Listing 4 – product.rb PAGEREF _Toc156559734 5<br />Searching PAGEREF _Toc156559735 6<br />Listing 4 – products_controller.rb PAGEREF _Toc156559736 6<br />Listing 5 – sunspot_hack.rb PAGEREF _Toc156559737 7<br />Indexing PAGEREF _Toc156559738 7<br />Image 1 – Solr schema browser PAGEREF _Toc156559739 8<br />Image 2 – Viewing the analysis and search filters PAGEREF _Toc156559740 9<br />Image 3 – Solr analyzer page PAGEREF _Toc156559741 10<br />Customizing fields PAGEREF _Toc156559742 10<br />Listing 6 – solr/conf/schema.xml except PAGEREF _Toc156559743 10<br />Listing 7 – solr/config/schema.xml except PAGEREF _Toc156559744 11<br />Image 4 – Solr analyzer page PAGEREF _Toc156559745 12<br />Partial matching PAGEREF _Toc156559746 12<br />Listing 8 – solr/config/schema.xml except PAGEREF _Toc156559747 13<br />Image 5 – Analyzer output with partial matching enabled PAGEREF _Toc156559748 14<br />Faceting PAGEREF _Toc156559749 14<br />Listing 9 – products_controller.rb except PAGEREF _Toc156559750 15<br />Listing 10 – product.rb except PAGEREF _Toc156559751 15<br />Listing 11 – products/index.html.haml except PAGEREF _Toc156559752 16<br />Image 6 – Faceting information PAGEREF _Toc156559753 16<br />Conclusion PAGEREF _Toc156559754 16<br />Full text search in in Rails with Sunspot and Solr<br />Everyone wants to take their databases to run everything as fast as possible. We usually say query less, add more caching mechanisms, add indexes to the columns being searched, but another solution is not to use the database at all and look for better solutions for your querying needs. <br />When querying for text in our databases, we’re often doing “LIKE” searches. Like searches are only performant if we have an index in that field and the query is written in a way that the index is used. Imagine that you have a field “name” and it contains the text “Battlestar Galactica”. This query would be able to run and use the index:<br />SELECT p.* FROM products p WHERE p.name LIKE “Battlestar%”<br />The database would be able to optimize this query and use the index to find the expected row. But, what if the query was like this one:<br />SELECT p.* FROM products p WHERE p.name LIKE “%Galactica”<br />Database indexes usually match from left to right, so, unless you have a nasty trick under your sleeve, this query will just look at ALL the rows in the products table and perform a match on every “name” column before returning a result. And that’s Really Bad News for you, as the DBA will probably come for you holding a Morning Star to beat you badly. So, querying with “LIKE” when you what you need is full text search isn’t nice. <br />That’s where full text search based solutions come in for help. Tools like Solr allow you to perform optimized text searches, filter input, categorization and even features like Google’s “Did you mean?”.<br />In this tutorial you’ll learn how to add full text searching capabilities to your Rails application using Sunpot and Solr. 
We will also delve a little bit into Solr’s configuration and learn how to use specific tokenizers to clear input, perform partial matching of words and faceting results.<br />This project uses Rails 3 and Ruby 1.9.2, you’ll find a Gemfile and and “.rvmrc” with all dependencies declared, it should be pretty easy to follow or setup your environment based on it (if you’re not using RVM, that’s a GREAT time to learn using it).<br />You can possibly follow this tutorial with a previous Rails version and without bundler or RVM, given all models and most of the code will look exactly the same in Rails 2 and Sunspot is compatible to Rails 2 too.<br />The source code for this application is available at GitHub here - https://github.com/mauricio/sunspot_tutorial <br />Starting the engines<br />Download the Sunspot source code from Github - https://github.com/outoftime/sunspot<br />Enter the project folder and go to “sunspot/solr-1.3”, inside that folder you should see a “solr” folder, copy this folder into your project’s folder. This is where the general Solr configuration is going to live, don’t worry about these files just yet, we’ll get to them later in this tutorial.<br />Now create a “sunspot.yml” file under your project’s “config” folder, here’s a sample:<br />Listing 1 – sunspot.yml<br />development:<br /> solr:<br /> hostname: localhost<br /> port: 8980<br /> log_level: INFO<br /> auto_commit_after_delete_request: true<br />test:<br /> solr:<br /> hostname: localhost<br /> port: 8981<br /> log_level: OFF<br />production:<br /> solr:<br /> hostname: localhost<br /> port: 8982<br /> log_level: WARNING<br /> auto_commit_after_request: true <br />You can have different configurations for every environment you’re running. To see all configuration options, go to the Sunspot source code and head to the “sunspot_rails/lib/sunspot/rails/configuration.rb” file.<br />Now we’ll create two models, Product and Category, so let’s start by creating the migration that will setup them:<br />rails g migration create_base_tables<br />Listing 2 – create_base_tables.rb<br />class CreateBaseTables < ActiveRecord::Migration<br /> def self.up<br /> create_table :categories do |t|<br /> t.string :name, :null => false<br /> end<br /> <br /> create_table :products do |t|<br /> t.string :name, :null => false<br /> t.decimal :price, :scale => 2, :precision => 16, :null => false<br /> t.text :description<br /> t.integer :category_id, :null => false<br /> end<br /> <br /> add_index :products, :category_id<br /> <br /> end<br /> def self.down<br /> drop_table :categories<br /> drop_table :products<br /> end<br /> <br />end<br />Now we move on to the basic models, starting with the Category model:<br />Listing 3 – category.rb<br />class Category < ActiveRecord::Base<br /> <br /> has_many :products<br /> <br /> validates_presence_of :name<br /> validates_uniqueness_of :name, :allow_blank => true<br /> <br /> searchable :auto_index => true, :auto_remove => true do<br /> text :name<br /> end<br /> <br /> def to_s<br /> self.name<br /> end<br /> <br />end<br />Here in the Category class we see our first reference to Sunspot, the “searchable” method, where we configure the fields that should be indexed by Solr. At the Category class, there’s only one field that’s useful at this moment, the “name”, so we tell Sunspot to configure the field name to be indexed as “text” (you usually don’t want your text indexed as “string”, as it will only be a hit in a full match). 
<br />The :auto_index and :auto_remove options are there to let Sunspot automatically send your model to be indexed at Solr when it is created/updated/destroyed. The default is “false” for both values, which means you have to manually send your data to Solr and unless you really want to do that, you should keep both of these values as “true” in your models.<br />Now lets look at the Product class:<br />Listing 4 – product.rb<br />class Product < ActiveRecord::Base<br /> <br /> belongs_to :category<br /> <br /> validates_presence_of :name, :description, :category_id, :price<br /> validates_uniqueness_of :name, :allow_blank => true<br /> <br /> searchable :auto_index => true, :auto_remove => true do<br /> text :name, :boost => 2.0<br /> text :description<br /> float :price<br /> integer :category_id<br /> end<br /> <br /> def to_s<br /> self.name<br /> end<br /> <br />end<br />In our Product class things are a little bit different, we have more fields (and more kinds) being indexed. “float” and “integer” are pretty self explanatory, but the “name” field has some black magic floating around, with the “boost” parameter. Boosting a field when indexing means that if the match is in that specific field, it has more “relevance” than if found somewhere else.<br />Imagine that you’re looking for Iron Maiden’s “Powerslave” album. You go to Iron Maiden’s Online Store and search for “powerslave”, hoping that the album will be the first hit, but then you see “Live After Dead” before “Powerslave”. Why did it happen? The “Live After Dead” album contains the “Powerslave” song in it’s track listing, so it’s a match as much as the real “Powerslave” album. What we need here is to tell the search tool that if a match is on an album name, it has higher relevance than if the hit is in the track listing.<br />Boosting allows you to reduce these issues. Some fields are inherently more important than others and you can tell that to Solr by configuring a “:boost” value for them. When something matches on them, the relevance of that match will be improved and it should come up before the other results in search.<br />Searching<br />Now let’s take a look at the ProductsController to see how we perform the search:<br />Listing 4 – products_controller.rb<br />class ProductsController < ApplicationController<br /> <br /> def index<br /> @products = if params[:q].blank?<br /> Product.all :order => 'name ASC'<br /> else<br /> Product.solr_search do |s|<br /> s.keywords params[:q]<br /> end<br /> end<br /> end<br />end<br />As you can see, searching is quite simple, you just call the solr_search method and send in the text to be searched for. 
One thing that I don’t like about Sunspot is that searches do not return an Array like object, you get a Sunspot::Search::StandardSearch object that has, as a property, the results array which contains the records returned by the search.<br />Here’s a simple way to fix this issue (I usually place the contents of this file inside an initializer in “config/initializers”):<br />Listing 5 – sunspot_hack.rb<br />::Sunspot::Search::StandardSearch.class_eval do<br /> include Enumerable<br /> delegate( <br /> :current_page, <br /> :per_page, <br /> :total_entries, <br /> :total_pages, <br /> :offset, <br /> :previous_page, <br /> :next_page, <br /> :out_of_bounds?,<br /> :each,<br /> :in_groups_of,<br /> :blank?,<br /> :[],<br /> :to => :results)<br />end<br />This simple monkeypatch makes the search object itself behave like an Enumerable/Array and you can use it to navigate directly in the results, without having to call the “results” method. The methods usually used by will_paginate helpers are also included so you can pass this object to a will_paginate call in your view and it’s just going to work.<br />Indexing<br />Now that all the models are in place, we can start fine tuning the Solr indexing process. First thing to understand here is what happens when you send text to be indexed by Solr, let’s get into the tool, starting the server:<br />rake sunspot:solr:run<br />This rake task starts Solr in the foreground (if you wanted to start it in the background, you’d use “sunspot:solr:start”). With Solr running, you should add some data to the database, this tutorial’s project on Github contains a “seed.rb” file with some basic data for testing, just copy it over your project. <br />Also copy the “lib/tasks/db.rake” from the project to your project, it contains a “db:prepare” task that truncates the database, seeds it and then indexes all items in Solr and we’re doing to be reindexing data a lot.<br />With everything copied, run the “db:prepare” task:<br />rake db:prepare<br />This will add the categories and products to your database and also index them in Solr. If this task did run successfully, head to the Solr administration interface, at this URL:<br />http://localhost:8980/solr/admin/schema.jsp<br />Once you go to it, click on the “FIELDS”, then on “NAME_TEXT”, you should see a screen just like the one in image 1:<br />Image 1 – Solr schema browser<br />If you don’t see all the fields that are available in this image, your “rake db:prepare” command has probably failed or Solr wasn’t running when you called it. <br />What we see here is the information about the fields we’re indexing. This specific field contains all data from the name properties from both Category and Product classes, as you can notice from the top 10 terms.<br />The name field is not indexed by it’s full content, as a relational database would usually do, the text is broken into tokens, by the solr.StandardTokenizerFactory class in Solr. This class receives our text, like “Battlestar Galactica: The Boardgame” and turns it into:<br />[“Battlestar”, “Galactica”, “The”, “Boardgame”]<br />This is what gets indexed and, ultimately, searched by Solr. If you open the web application now and try to search for “battle”, you won’t have any matches. If you search for “Battlestar”, you get the two products that match the name.<br />Everything when indexing information in Solr revolves around building the best “tokens” available for your input. 
You have to teach Solr to crunch your data in a way that makes sense and makes it easy to search for, and adding filters to the indexing process does this. While in the same page as Image 1 above, click on the “DETAILS” links as shown in Image 2:<br />Image 2 – Viewing the analysis and search filters<br />Each field in Solr has two analyzers, one is the “index” analyzer, that prepares the input to be indexed and the other is the “query” analyzer that prepares the search input to finally perform a search. Unless you have some special need, both of them are usually the same. <br />In our current configuration, we have the same two filters for both of the analyzers. The StandardFilterFactory filter removes punctuation characters from our input (the “:” in “Battlestar Galactica: The Boardgame” is not in our tokens) and the LowerCaseFilterFactory makes all input lowercased so we can search with “baTTle”, “BATTLE”, “BaTtLe” and they’re all going to work.<br />Before we move on to add more filters to our analyzers, let’s take a look at the analyzer screen in Solr Admin at - http://localhost:8980/solr/admin/analysis.jsp?highlight=on<br />In this screen we see how our input is going to be transformed into tokens by the configured analyzers.<br />Image 3 – Solr analyzer page<br />In this screen we have selected the “name_text” field in Solr. In the “Field value (Index)” you enter the values you’re sending to be indexed, just like you would send from your model property, in the “Field value (Query)” you enter the values you’d use to search.<br />Once you type and hit “Analyze” you should see the output just below the form as we see in Image 3. This output shows how your input is transformed into tokens by the tokenizer and filters, this way you can easily experiment by adding more filters and seeing if the output really matches the way you’d expect it to. This analysis view is your best friend when debugging search/indexing related issues or trying out ways to improve the way Solr indexes and matches your data.<br />Customizing fields<br />Now that you have an idea about how the indexing and searching process work, let’s start to customize the fields in Solr, open up the “solr/conf/schema.xml” file and look for this reference:<br />Listing 6 – solr/conf/schema.xml except<br /> <fieldtype class=quot; solr.TextFieldquot; positionIncrementGap=quot; 100quot; name=quot; textquot; ><br /> <analyzer><br /> <tokenizer class=quot; solr.StandardTokenizerFactoryquot; /><br /> <filter class=quot; solr.StandardFilterFactoryquot; /><br /> <filter class=quot; solr.LowerCaseFilterFactoryquot; /><br /> </analyzer><br /> </fieldtype><br />If you look at Image 1, where we saw the “name_text” configuration, you’ll see that the field type is “text”, this except above is the configuration for all fields of type “text”, which means that if we add more filters here we’ll affect all fields of this type. This greatly simplifies the way we configure the tool, as we don’t have to define explicit configurations for every single field that our models have, we can just reuse this same “text” config for all fields that are supposed to be indexed as text.<br />But that’s a lot of talking, let’s get into action!<br />Let’s start the job by looking at our indexed data from before:<br />[“battlestar”, “galactica”, “the”, “boardgame”]<br />The “the” is mostly useless, as it’s going to be available in almost all properties and no one is ever going to search for “the” (oh yeah, there might be that ONE guy that does it). 
But that's a lot of talking, let's get into action! Let's start by looking at our indexed data from before:

["battlestar", "galactica", "the", "boardgame"]

The "the" is mostly useless: it will show up in almost every property and no one is ever going to search for "the" (well, there might be that ONE guy who does). In Information Retrieval lingo, "the" is a stop word; it usually carries no meaning by itself and doesn't represent valuable information for our indexer. Removing stop words from your input improves both performance and the relevance of your results.

Since this is a common operation, Solr already ships with a filter capable of removing stop words from your data, the solr.StopFilterFactory. Let's add it to our config:

Listing 7 – solr/conf/schema.xml excerpt

<fieldtype class="solr.TextField" positionIncrementGap="100" name="text">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldtype>

If you look at the "solr/conf" folder you'll see a "stopwords.txt" file that already contains most of the common English stop words. You can add or remove words as needed, and if you're not indexing English text you can just replace the English words with your language's stop words. Now make this change in your "solr/conf/schema.xml" file, stop and start Solr again, and open the analyzer:

Image 4 – Solr analyzer page

As you can see, in the last step the "the" was removed from both the index input and the query input. We're keeping only the pieces of information that are really useful, which makes our index smaller and also speeds up searching.

While you were not looking, we also added two other filters: solr.ISOLatin1AccentFilterFactory, which removes accents from words in Latin-based languages such as Portuguese (if the input is "não", it becomes "nao"), and solr.TrimFilterFactory, which removes unnecessary whitespace from our tokens.
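To make the whole chain concrete, here is roughly how an input like "The Pokémon Trading Game" would flow through this analyzer (a sketch based on the filters above; use the analysis page to confirm the exact output on your setup):

tokenizer (Standard):    [The] [Pokémon] [Trading] [Game]
StandardFilterFactory:   [The] [Pokémon] [Trading] [Game]
LowerCaseFilterFactory:  [the] [pokémon] [trading] [game]
StopFilterFactory:       [pokémon] [trading] [game]
ISOLatin1AccentFilter:   [pokemon] [trading] [game]
TrimFilterFactory:       [pokemon] [trading] [game]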
Partial matching

Another pretty common need is being able to match only part of a word, usually a prefix. At the beginning of the tutorial we saw that searching for "battle" doesn't yield any results, while "battlestar" does. This happens because Solr, by default, only sees a match if it's a full match: the word you entered must be exactly the same as a token in the index, and if there is no exact match, Solr will tell you there are no results.

If you look at Lucene's query parser syntax - http://lucene.apache.org/java/2_9_1/queryparsersyntax.html (Solr is, roughly speaking, a web front end to Lucene) you'll see that you can use the "*" operator to perform a partial match. We could search for "battle*" and this would yield the results we expect, but this kind of partial matching is slow and could become a bottleneck for your application, so we have to figure out another way to do it.

When all you need is prefixed partial matching, the solr.EdgeNGramFilterFactory is your best friend. It breaks words into pieces that are then added to the index, so it looks like you have partial matching, but in fact the partials are tokens themselves in the index. Let's see how our config looks in this case:

Listing 8 – solr/conf/schema.xml excerpt

<fieldtype class="solr.TextField" positionIncrementGap="100" name="text">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory"
            minGramSize="3"
            maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldtype>

As you can see, we now have two <analyzer> sections in our <fieldtype>: one for "index" and one for "query". This is needed because we don't want our search parameters to be transformed for partial matching. If the user is searching for "battle", it doesn't make sense to show him results for "bat", so the generation of word pieces should happen only when indexing.

Now restart your Solr instance and run the analyzer form again; you should see something like Image 5:

Image 5 – Analyzer output with partial matching enabled

Looking at the output, "battlestar" became:

["bat", "batt", "battl", "battle", "battles", "battlest", "battlesta", "battlestar"]

Now, if you search for "battle", you'll find all products that have "battle" as a prefix of any of their words, and the search input itself is not affected by this change.
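Remember that analyzer changes only apply to newly indexed data, so reindex first ("rake db:prepare" does this for the tutorial's data). After that, the prefix search that returned nothing earlier should start finding results; a quick console sketch (the names depend on your seed data):

# Before the EdgeNGram filter this returned an empty array:
Product.solr_search { |s| s.keywords "battle" }.results.map(&:name)
# => the two "Battlestar" products from the seed data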
Faceting

Faceting of results is YACF (Yet Another Cool Feature) you get when using Solr and Sunspot. "What does that mean?", you might ask. It means that Solr can organize your results based on one of their properties and tell you how many results matched each property value.

"I still don't get it", you might be thinking. In our Product model we're indexing the "category_id" property; we'll tell Sunspot to facet our search on the "category_id" field, and Sunspot will tell us how many matches each category had, even when we're paginating the results. Let's see how our searching code changes:

Listing 9 – products_controller.rb excerpt

  def index
    @page = (params[:page] || 1).to_i
    @products = if params[:q].blank?
      Product.paginate :order => 'name ASC', :per_page => 3, :page => @page
    else
      result = Product.solr_search do |s|
        s.keywords params[:q]
        if params[:category_id].blank?
          # no category selected, ask Solr for the facet counts
          s.facet :category_id
        else
          # a facet link was clicked, filter by that category
          s.with(:category_id).equal_to(params[:category_id].to_i)
        end
        s.paginate :per_page => 3, :page => @page
      end

      @facet_rows = result.facet(:category_id).rows if result.facet(:category_id)

      result
    end
  end

The search code changed quite a bit: if there's a "category_id" parameter we use it to filter the search, and if there isn't we perform faceting with the "s.facet :category_id" call. There's also a slight change to the "product.rb" class:

Listing 10 – product.rb excerpt

  searchable :auto_index => true, :auto_remove => true do
    text :name, :boost => 2.0
    text :description
    float :price
    integer :category_id, :references => ::Category
  end

We've added ":references => ::Category" to the ":category_id" field configuration so Sunspot knows this field is, in fact, a foreign key to another object; this allows Sunspot to load the categories for the facets automatically.

The "result.facet(:category_id)" call asks the search object for the array of facets returned for the :category_id field in this search. Each row in this list contains an "instance" (which, in our case, is a Category object) and a "count", the number of hits in that specific facet. Once we have the rows, we can use them in our view:

Listing 11 – products/index.html.haml excerpt

  - if !@facet_rows.blank? && @facet_rows.size > 1
    %ul
      - for row in @facet_rows
        %li= link_to( "#{row.instance} (#{row.count})", products_path( :q => params[:q], :category_id => row.instance ) )

If there are facets available, we use them to render links that let the user filter on each specific facet. Each row object has an instance and a count, and we use both in the interface to tell the user which category it is and how many hits it had. Here's how our user interface looks:

Image 6 – Faceting information

And now you finally have search functionality in your Rails project, with partial matching, faceting, pagination and input cleanup. Just forget that you ever ran a "SELECT p.* FROM products p WHERE p.name LIKE '%battle%'" and be happy using a great full text search solution.

Conclusion

Hopefully this tutorial is enough to get you up and running with Solr. For more advanced features, I'd recommend searching the Solr wiki (http://wiki.apache.org/solr/FrontPage) and buying "Solr 1.4 – Enterprise Search Server" by David Smiley and Eric Pugh (http://www.amazon.com/gp/product/1847195881?ie=UTF8&tag=ultimaspalavr-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=1847195881).