Assamese Search Engine Using Solr, by Moinuddin Ahmed (moin)

Moinuddin Ahmed, guided by Dr. Pushpak Bhattacharyya, IIT Bombay. 7/15/2012

Outline
• Solr Introduction
• Lucene vs. Solr
• Solr Features
• Indexing in Solr
• Querying in Solr
• Assamese Search Engine
• Monolingual Search
• Cross-lingual Search
• Conclusions
• Future Work

What is Solr?
• Solr is an open source enterprise search platform from the Apache Lucene project.
• Solr = Lucene + added features.
• Allows for faster, more comprehensive searches on a large volume of data.

Lucene vs. Solr
• Lucene is a library, while Solr is a web application that uses the Lucene library.
• Built on top of Lucene, Solr extends it with a set of robust features such as:
  • Hit highlighting
  • Index replication
  • Faceted searching
  • Distributed searching

Features
• Hit highlighting: shows a snippet of each document in the search results, taken from the text surrounding the search terms.
• Faceted search: clusters search results into drill-down categories. Users can then "categorize" by applying specific constraints to the search results.
• Distributed searching: the presence of the shards parameter in a request causes that request to be distributed across all shards in the list.
• Request parameters: a number of optional request parameters can be passed to the request handler to control what information is returned.
• External XML configuration: Solr is flexible and adaptable through XML configuration.

Example of Faceted Searching
• In the example, Manufacturer is the facet; Dell and HP are constraints, each shown with a facet count.
• Faceted search is a technique for accessing information organized by facets.
• Faceted search helps users who think in terms of attribute specifications as filtering criteria.

Faceted Searching (contd.)
• Imagine a situation where the client wants the number of companies in each city in which the query found companies.
• Solr has to return the number of documents that share the same field value.
• The chosen facet value is then used to construct a filter query that matches that value in the index.

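The two steps above (counting documents per field value, then filtering on a chosen value) can be sketched in a few lines of Python. This is only an in-memory illustration, not how Solr computes facets internally; the company names, the cities and the `city` field are made up for the example. In Solr itself the equivalent request would use the `facet=true`, `facet.field=city` and `fq=city:Guwahati` parameters.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many matching documents share each value of `field`."""
    return Counter(doc[field] for doc in docs if field in doc)


def apply_filter(docs, field, value):
    """Mimic a filter query: keep only documents whose `field` equals `value`."""
    return [doc for doc in docs if doc.get(field) == value]


# Hypothetical result set for some query (names and cities are invented).
results = [
    {"name": "A Ltd", "city": "Guwahati"},
    {"name": "B Ltd", "city": "Mumbai"},
    {"name": "C Ltd", "city": "Guwahati"},
]

counts = facet_counts(results, "city")                  # facet counts per city
narrowed = apply_filter(results, "city", "Guwahati")    # drill down on one value
```

Here `counts` plays the role of the facet counts shown next to each constraint, and `narrowed` is the result list after the user picks one constraint.
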
Distributed Search
• When an index becomes too large to fit on a single system, it can be split into multiple shards.
• A single shard receives the query and distributes it to the other shards.
• Solr can query and merge results across those shards.

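A minimal Python sketch of the query-then-merge idea follows. It assumes each shard is just a dictionary of document id to text and scores documents by raw term frequency; both are simplifications chosen for illustration and do not reflect Solr's real scoring or network protocol. The function names are hypothetical.

```python
def search_shard(shard_docs, term):
    """Score each document in one shard by how often `term` occurs in it."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return hits


def distributed_search(shards, term, rows=10):
    """Send the query to every shard, then merge the partial results by score."""
    all_hits = []
    for shard_docs in shards:
        all_hits.extend(search_shard(shard_docs, term))
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(all_hits, key=lambda hit: (-hit[0], hit[1]))[:rows]


# Two toy shards holding three documents in total.
shards = [
    {"d1": "solr search solr"},
    {"d2": "solr index", "d3": "lucene"},
]
merged = distributed_search(shards, "solr")
```

The merge step is the part that matters: every shard returns its own ranked hits, and the coordinating node combines them into a single ranked list.
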
STARTING UP THE SOLR SERVER
• Solr 1.4.1 uses the Jetty 6.1.3 server.
• Solr is started with the following command:
    java -jar start.jar
• This starts the Jetty application server on port 8983.

Indexing can be done in two ways:
• Command line:
    java -jar post.jar *.xml
• A framework such as Nutch:
    bin/nutch solrindex <solr url> <crawldb> -linkdb <linkdb> (<segment> ... | -dir <segments>)

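The XML files posted on the command line use Solr's update format, in which an `<add>` element wraps one `<doc>` per document and each `<doc>` holds `<field name="...">` elements. As a rough sketch, such a file could be generated with Python's standard library; the function name and the sample field values below are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> message from a list of {field name: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; the field names must exist in schema.xml.
xml_payload = build_add_xml([{"id": "doc1", "title": "Hello"}])
```

Writing `xml_payload` to a file and passing that file to post.jar would then index the document, provided the field names match the schema.
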
Schema.xml
This file contains all of the details about:
• which fields the documents can contain;
• how those fields should be dealt with when adding documents to the index or when querying those fields.

1) DATA TYPE
    <types>
      <fieldType name="string" class="solr.StrField"/>
      <fieldType name="long" class="solr.LongField"/>
      <fieldType name="float" class="solr.FloatField"/>
      <fieldType name="text" class="solr.TextField"/>
    </types>
The <types> section allows one to define:
1. a list of <fieldType> declarations;
2. the underlying Solr class that should be used for each type.

2) Fields
    <field name="id" type="string" indexed="true" stored="true" multiValued="true"/>
The <fields> section lists the individual <field> declarations one wishes to use in documents. Each <field> has a name, which is used to reference it when adding documents or executing searches, and an associated type, which names the fieldType to use for this field.

Some common options that fields can have are:
• default: the default value for this field if none is provided while adding documents.
• indexed=true|false: true if this field should be "indexed". If (and only if) a field is indexed, it is searchable, sortable, and facetable.
• stored=true|false: true if the value of the field should be retrievable during a search.
• multiValued=true|false: true if this field may contain multiple values per document, i.e. if it can appear multiple times in a document.

How to add analyzers in a field?
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="assamese_stop_words.txt"/>
        <filter class="solr.AssameseStemFilterFactory"/>
      </analyzer>
    </fieldType>

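The analyzer above is a chain: the tokenizer splits the text, and each filter then transforms the token stream in order. A minimal Python sketch of the same three stages is given below. It only mirrors the order of operations; the stop word set and the toy stemmer are placeholders invented for the example, not the contents of assamese_stop_words.txt or the logic of the Assamese stem filter.

```python
def analyze(text, stop_words, stem):
    """Run a whitespace tokenizer, a stop word filter, then a stemmer."""
    tokens = text.split()  # whitespace tokenizer
    # Stop filter; lowercasing before the lookup mimics ignoreCase="true".
    tokens = [t for t in tokens if t.lower() not in stop_words]
    return [stem(t) for t in tokens]  # stem filter


def toy_stem(token):
    """Placeholder stemmer: strip a trailing 's' from longer tokens."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


terms = analyze("The engines index the pages", {"the"}, toy_stem)
```

Because the stemmer is passed in as a function, the same pipeline shape works for either language: only the stop word list and the stemmer change.
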
SOLR REQUEST HANDLER
• A SolrRequestHandler is a Solr plugin that defines the logic executed for any request.
• It can be configured in solrconfig.xml or specified directly in the URL/user interface.
List of request handlers utilized:
• StandardRequestHandler
• DisMaxRequestHandler
• LukeRequestHandler
• MoreLikeThisHandler

DisMaxRequestHandler
• It is designed to process simple user-entered phrases and to search for the individual words across several fields, using different weightings (boosts) based on the significance of each field.
• Some parameters of DisMaxRequestHandler: qf (query fields), fl (field list), pf (phrase fields), bq (boost query), etc.
Example:
    <requestHandler name="dismax" class="solr.DisMaxRequestHandler">
      <lst name="defaults">
        <str name="fl">title,content,anchor,host,url</str>
        <str name="qf">url^3.0 anchor content^10.0 title^3.0 host^2.0</str>
      </lst>
    </requestHandler>

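To make the effect of the qf boosts concrete, here is a simplified Python sketch of the "disjunction max" idea: a word is scored in every query field, and the best boosted field score wins, optionally topped up by a `tie` fraction of the other fields. The per-field score used here is a plain term count, which is an assumption made for illustration and not Lucene's real scoring formula; the function names are hypothetical.

```python
def parse_qf(qf):
    """Parse 'url^3.0 anchor content^10.0' into {field: boost} (default 1.0)."""
    boosts = {}
    for item in qf.split():
        field, _, boost = item.partition("^")
        boosts[field] = float(boost) if boost else 1.0
    return boosts


def dismax_score(doc, word, boosts, tie=0.0):
    """Best boosted field score for `word`, plus `tie` times the other fields."""
    field_scores = [
        boost * doc.get(field, "").lower().split().count(word.lower())
        for field, boost in boosts.items()
    ]
    best = max(field_scores, default=0.0)
    return best + tie * (sum(field_scores) - best)


boosts = parse_qf("url^3.0 anchor content^10.0 title^3.0 host^2.0")
doc = {"title": "Solr search", "content": "Solr is a search platform"}
score = dismax_score(doc, "solr", boosts)
```

With the boosts from the example configuration, a match in content outweighs a match in title, so the content field decides the score unless `tie` is raised above zero.
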
Response Writers
• A QueryResponseWriter is a Solr plugin that defines the response format for any request.
• The default is the XML Response Writer.
• Several other response formats are available, such as XSLT.

XSLT RESPONSE WRITER
• The XSLT Response Writer captures the output of the XML Response Writer and applies an XSLT transform to it.
    http://localhost:8983/solr/select/?q=<user query>&wt=xslt&tr=example.xsl
Parameters:
• wt: the response writer to use.
• tr: selects the XSLT transformation to use, which must be found in Solr's conf/xslt directory.
The Content-Type of the response is set according to the <xsl:output> statement in the XSLT transform, for example:
    <xsl:output media-type="text/html"/>

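Since the user query must be URL-encoded before it is placed in such a request, a small helper is handy. The sketch below builds the select URL with Python's standard library; the function name and the default base URL (taken from the example above) are assumptions, and nothing is actually sent to a server.

```python
from urllib.parse import urlencode


def xslt_select_url(query, stylesheet, base="http://localhost:8983/solr/select/"):
    """Build a select URL that asks for the XSLT writer and a given stylesheet."""
    params = {"q": query, "wt": "xslt", "tr": stylesheet}
    # urlencode percent-encodes non-ASCII text as UTF-8 and turns spaces into '+'.
    return base + "?" + urlencode(params)


url = xslt_select_url("solr search", "example.xsl")
```

The same helper works for Assamese queries, because `urlencode` encodes non-ASCII characters as UTF-8 percent escapes.
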
INDEXING
• For Assamese monolingual search: indexed around 500 Assamese text files and about 120 URLs up to depth 3.
• For cross-lingual search: indexed a few English URLs.

Analyzers used
• Assamese stemmer: suffix stripping (rule based) + dictionary look-up; accuracy: 80%.
• English: Porter stemmer.
• Both Assamese and English use the whitespace tokenizer.
• Stop words are removed in both languages.

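One plausible way to combine suffix stripping with a dictionary look-up is sketched below in Python. How the two are actually combined in the Assamese stemmer is not stated in the slides, so the control flow here is an assumption: a word found in the dictionary is returned unchanged, and a suffix is stripped only if the remainder is a dictionary word. The four suffixes and the two dictionary words are a tiny illustrative sample, not the stemmer's real rule set or lexicon.

```python
# Illustrative sample only: a plural marker and three case markers.
SUFFIXES = ["বোৰ", "ৰ", "ত", "ক"]


def stem(word, dictionary, suffixes=SUFFIXES):
    """Strip the longest suffix whose remainder is a known dictionary word."""
    if word in dictionary:
        return word  # already a root form; protects roots that end in a suffix
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            root = word[: -len(suffix)]
            if root in dictionary:
                return root
    return word  # no rule applies; leave the word untouched


dictionary = {"মানুহ", "ঘৰ"}
plural_root = stem("মানুহ" + "বোৰ", dictionary)
locative_root = stem("ঘৰ" + "ত", dictionary)
```

The dictionary check is what keeps a root such as ঘৰ from being truncated even though it ends in a character that is also a suffix.
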
Future work
• Parsing the query programmatically.
• Building the resources for adding the translation and transliteration modules to the monolingual pipeline.

CONCLUSION
• Solr uses the Lucene search library and extends it with a set of robust features.
• Solr's powerful external configuration allows it to be tailored to almost any type of application.
• So Solr is preferable if a programmer wants to add its extra functionality to an existing application.