Apache Solr Ratification




     cdevecchi@gmail.com
Solr – What is it?


• Apache Project
• Open source search engine based on Lucene
• XML/HTTP and JSON APIs
Features


•   Lemmatization

•   Hit Highlighting

•   Dictionaries

•   Geosearch

•   Faceted Search

•   Caching

•   Index Replication and Database Integration
Characteristics


•   Java -> Tomcat / JBoss / Jetty

•   Schema

•   SolrJ client

•   JMX statistics
Query

• Highlighting
   – Enabled per query (hl=true)


• Text Analysis
   – Uses dictionaries and thesauri
   – Relevancy searches
   – Spelling suggestions
   – Search by similarity (“More like this”)
   – Fuzzy (Damerau-Levenshtein distance)
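The fuzzy matching mentioned above relies on an edit distance between terms. As a rough illustration only, here is a minimal sketch of the restricted Damerau-Levenshtein distance (also called optimal string alignment), which counts insertions, deletions, substitutions, and swaps of adjacent characters. The function name `osa_distance` is hypothetical, and this is not how Solr or Lucene computes fuzzy matches internally.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the number of edits needed to turn a[:i] into b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Swap of two adjacent characters counts as a single edit.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("solr", "slor"))       # adjacent swap
print(osa_distance("kitten", "sitting"))  # substitutions plus an insertion
```

With plain Levenshtein distance the swapped pair in "slor" would cost two edits; counting it as one is what the "Damerau" part adds.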
Query


• Querying data
   – Words
   – Words by field
   – Ordered results (sort)


• Faceted Search
   – Categories
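The query styles listed above (plain words, words restricted to a field, sorted results) map onto request parameters. The sketch below builds such URLs with the standard library; the helper `build_query` and the field names `name` and `price` are assumptions for illustration, and the endpoint is the `localhost:8983/solr/select` address used in this deck's facet example.

```python
from urllib.parse import urlencode

# Endpoint taken from the deck's facet example; adjust for a real install.
BASE_URL = "http://localhost:8983/solr/select"


def build_query(q, sort=None, rows=10):
    """Return a select URL for query string q (hypothetical helper)."""
    params = {"q": q, "rows": rows}
    if sort:
        params["sort"] = sort  # e.g. "price asc" or "price desc"
    return BASE_URL + "?" + urlencode(params)


word_query = build_query("video")                            # words
field_query = build_query("name:video")                      # words by field
sorted_query = build_query("name:video", sort="price asc")   # sorted

print(word_query)
print(field_query)
print(sorted_query)
```

`urlencode` takes care of escaping the colon and the space, so the query strings can be written in their readable form.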
Query


• Faceted Search: could the queries become a problem?
   – Example


   http://localhost:8983/solr/select?

   q=video&rows=0&facet=true&facet.field=inStock

   &facet.query=price:[*+TO+500]

   &facet.query=price:[500+TO+*]

   &facet.prefix=xx&facet.limit=5&facet.mincount=1
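To make the example URL easier to read, the sketch below splits it into its individual parameters using only the Python standard library. Nothing is sent to a server; the URL is the one from the slide, joined onto one line.

```python
from urllib.parse import urlsplit, parse_qs

url = (
    "http://localhost:8983/solr/select?"
    "q=video&rows=0&facet=true&facet.field=inStock"
    "&facet.query=price:[*+TO+500]"
    "&facet.query=price:[500+TO+*]"
    "&facet.prefix=xx&facet.limit=5&facet.mincount=1"
)

# parse_qs decodes "+" as a space and groups repeated parameters in a list.
params = parse_qs(urlsplit(url).query)

for name, values in params.items():
    print(name, values)
```

Read this way, the request asks for no documents (`rows=0`), only facet counts: one count per value of `inStock`, plus one count for each of the two `facet.query` price ranges. Because square brackets denote inclusive ranges, a price of exactly 500 would be counted in both.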
Data indexing




• Native Solr XML
• CSV
• Database (DIH)
• Rich Documents
• Crawler
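For the native XML option, documents are wrapped in an `<add>` message containing one `<doc>` per document and one `<field>` per value. The sketch below builds such a message from Python dictionaries; the helper name `to_solr_add_xml` and the sample fields (`id`, `name`, `price`) are assumptions, and real field names must match the schema.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(docs):
    """Serialize a list of dicts as a Solr <add> message (hypothetical helper)."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, <, > on output
    return ET.tostring(add, encoding="unicode")


xml_payload = to_solr_add_xml([
    {"id": "1", "name": "video card", "price": 450},
])
print(xml_payload)
```

The resulting string would typically be sent to the update handler in an HTTP POST, followed by a commit so the documents become searchable.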
Index




• Is the index growing larger than you expected?


• It can be tuned through:
   – Index segment size
   – Index segment merging
Collections



• It is possible to create separate indexes per document type
Data Replication


• Master / Slave
   - Index
   - Config files
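Assuming the HTTP-based ReplicationHandler (rather than the older script-based replication) is configured, replication can be inspected and triggered through commands on the `/replication` endpoint. The sketch below only builds the URLs; the host names and the helper `replication_url` are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical hosts; replace with the real master and slave addresses.
MASTER = "http://master-host:8983/solr"
SLAVE = "http://slave-host:8983/solr"


def replication_url(base_url, command, **extra):
    """Build a ReplicationHandler URL for the given command."""
    params = {"command": command, **extra}
    return f"{base_url}/replication?{urlencode(params)}"


version_url = replication_url(MASTER, "indexversion")  # index version on master
status_url = replication_url(SLAVE, "details")         # replication status
pull_url = replication_url(SLAVE, "fetchindex")        # force a pull on the slave

for url in (version_url, status_url, pull_url):
    print(url)
```

Which files travel with the index (the "Config files" bullet) is decided in the handler's configuration on the master, not by these commands.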
Sharding

• ZooKeeper
  – http://hadoop.apache.org/zookeeper/
SolrCaching


• Caches searched documents
• Two implementations
   – solr.search.LRUCache (LRU = Least Recently Used, held in
     memory)
   – solr.search.FastLRUCache (since version 1.4)
• How to use
   – filterCache
   – queryResultCache
   – documentCache (loads everything into memory)
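To illustrate what "least recently used" means in practice, here is a minimal LRU cache sketch in Python. It is not Solr's Java implementation; the class, its methods, and the sample keys are hypothetical and only show the eviction policy.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache: when full, the entry unused for longest is evicted."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop least recently used


cache = LRUCache(capacity=2)
cache.put("q=video", [1, 2, 3])
cache.put("q=audio", [4])
cache.get("q=video")        # touch: "q=audio" is now the oldest entry
cache.put("q=cable", [5])   # cache is full, so "q=audio" is evicted
print(list(cache._items))
```

The capacity plays the role of a cache size setting: a larger value keeps more results in memory at the cost of heap space.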
Cluster – Carrot2



• Search Results Clustering Engine

• Search in many nodes




•   Live Demo

     – http://search.carrot2.org/stable/search
Crawling


• Apache Nutch
  – Searching, parsing, and parallel or distributed indexing
  – Many formats
     • e.g. plain text, html, xml, zip, .doc, javascript, rss, pdf, etc.
  – Cluster
  – MapReduce
  – Distributed Filesystem (via hadoop)
Backup / Snapshot



• Triggered by scripts (solr-tools)

• Index snapshots

• Differential backups

   – $solr_data/yyyymmdd
Architecture (Master/Slave)
Architecture (Distributed Index)
Indexing Tests



• Indexing tests
   • XML documents of 7k each, with 111 fields


• 1.2 million docs in the index


• VM -> 2 GB RAM, 2.33 GHz processor
Indexing Tests




[Chart: indexing test results; data labels as extracted: 90, 44]
Search Tests
QPS




[Chart: queries per second (QPS); data labels as extracted: 61, 0, 37, 5, 38, 38]
References




•   http://lucene.apache.org/solr/

•   http://wiki.apache.org/solr/

•   http://project.carrot2.org/

•   http://download.carrot2.org/head/manual/index.html#chapter.introduction

•   http://wiki.apache.org/solr/ZooKeeperIntegration
