Solr

Published in: Technology

  1. Apache Solr Ratification (cdevecchi@gmail.com)
  2. Solr – What is it?
     • Apache project
     • Open-source search engine based on Lucene
     • XML/HTTP and JSON APIs
  3. Features
     • Lemmatization
     • Hit highlighting
     • Dictionaries
     • Geosearch
     • Faceted search
     • Caching
     • Index replication and database integration
  4. Characteristics
     • Java: runs in a servlet container (Tomcat / JBoss / Jetty)
     • Schema
     • SolrJ client
     • JMX statistics
  5. Query
     • Highlighting
        – Enabled per query (hl=true)
     • Text analysis
        – Uses dictionaries and thesauri
        – Relevancy search
        – Spelling suggestions
        – Similarity search (“More Like This”)
        – Fuzzy search (Damerau-Levenshtein distance)
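
To make the fuzzy-search bullet on slide 5 concrete, here is a minimal Python sketch of an edit distance that counts adjacent transpositions as a single edit. It implements the restricted ("optimal string alignment") variant of Damerau-Levenshtein; the function name `osa_distance` is hypothetical, and this is an illustration of the metric, not the code Solr or Lucene actually runs.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions and swaps of two
    adjacent characters, each as one edit.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "lr" <-> "rl"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


print(osa_distance("solr", "sorl"))       # one transposition
print(osa_distance("kitten", "sitting"))  # classic Levenshtein example
```

A fuzzy query would then accept index terms whose distance to the query term stays under some threshold; the threshold itself is a tuning choice and is not specified on the slide.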
  6. Query
     • Querying data
        – Words
        – Words by field
        – Ordered (sort)
     • Faceted search
        – Categories
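
The three query styles on slide 6 can be sketched as request URLs built with Python's standard library. The host and path are the ones used in the deck's own example; the helper `select_url` and the field names `name` and `price` are assumptions made for illustration, so substitute fields from your own schema.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the deck's example URL.
SOLR_SELECT = "http://localhost:8983/solr/select"


def select_url(q, sort=None, rows=10):
    """Build a select URL; `q` is the query string, `sort` is optional."""
    params = [("q", q), ("rows", rows)]
    if sort:
        params.append(("sort", sort))
    return SOLR_SELECT + "?" + urlencode(params)


print(select_url("video"))                    # words
print(select_url("name:video"))               # words by field (assumed field)
print(select_url("video", sort="price asc"))  # ordered results (assumed field)
```

`urlencode` takes care of escaping the colon and the space in `price asc`, which is easy to get wrong when URLs are assembled by hand.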
  7. Query
     • Faceted search: could the queries become a problem?
        – Example: http://localhost:8983/solr/select?q=video&rows=0&facet=true&facet.field=inStock&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+*]&facet.prefix=xx&facet.limit=5&facet.mincount=1
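
The facet request on slide 7 is easier to read when its parameters are listed one per line. The sketch below rebuilds the same request with Python's standard library and decodes it again; the comments give the usual meaning of each parameter, and nothing here contacts a Solr server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Parameters taken from the slide's example URL.
params = [
    ("q", "video"),
    ("rows", 0),                          # return facet counts only, no documents
    ("facet", "true"),                    # switch faceting on
    ("facet.field", "inStock"),           # count documents per value of this field
    ("facet.query", "price:[* TO 500]"),  # count documents priced up to 500
    ("facet.query", "price:[500 TO *]"),  # count documents priced 500 and above
    ("facet.prefix", "xx"),               # keep only facet values starting with "xx"
    ("facet.limit", 5),                   # at most 5 values per facet field
    ("facet.mincount", 1),                # hide values with zero matches
]

facet_url = "http://localhost:8983/solr/select?" + urlencode(params)
print(facet_url)

# Decoding shows that the repeated facet.query parameter survives as a list.
decoded = parse_qs(urlsplit(facet_url).query)
print(decoded["facet.query"])
```

A list of pairs is used instead of a dict because `facet.query` appears twice, and a dict would silently keep only the last value.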
  8. Data indexing
     • Solr native XML
     • CSV
     • Database (DIH, the DataImportHandler)
     • Rich documents
     • Crawler
  9. Index
     • Is the index growing larger than expected?
     • It can be tuned through:
        – Index segment size
        – Merging of index segments
  10. Collections
      • Separate indexes can be created for each kind of document
  11. Data Replication
      • Master / Slave
         – Index
         – Config files
  12. Sharding
      • ZooKeeper
         – http://hadoop.apache.org/zookeeper/
  13. Solr Caching
      • Keeps retrieved documents in cache
      • Two implementations
         – solr.search.LRUCache (LRU = Least Recently Used, in memory)
         – solr.search.FastLRUCache (since version 1.4)
      • How to use
         – filterCache
         – queryResultCache
         – documentCache (loads everything into memory)
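
The "least recently used" policy named on slide 13 can be illustrated with a small Python sketch. This is a generic LRU cache written for explanation only; the class below is hypothetical and is not Solr's `LRUCache` or `FastLRUCache` implementation.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used entry."""

    def __init__(self, size):
        if size < 1:
            raise ValueError("size must be at least 1")
        self.size = size
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._items:
            return None  # cache miss
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.size:
            self._items.popitem(last=False)  # drop the least recently used


cache = LRUCache(size=2)
cache.put("q=video", ["doc1", "doc2"])
cache.put("q=audio", ["doc3"])
cache.get("q=video")            # touching it makes q=audio the oldest entry
cache.put("q=photo", ["doc4"])  # cache is full, so q=audio is evicted
print(cache.get("q=audio"))
print(cache.get("q=video"))
```

The same eviction idea applies whether the cached values are filter results, query result lists, or stored documents; what differs between the three caches on the slide is what gets used as key and value.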
  14. Cluster – Carrot2
      • Search results clustering engine
      • Searches across many nodes
      • Live demo
         – http://search.carrot2.org/stable/search
  15. Crawling
      • Apache Nutch
         – Searching, parsing, and parallel or distributed indexing
         – Many formats (e.g. plain text, HTML, XML, ZIP, .doc, JavaScript, RSS, PDF)
         – Cluster
         – MapReduce
         – Distributed filesystem (via Hadoop)
  16. Backup / Snapshot
      • Triggered by scripts (solr-tools)
      • Index snapshots
      • Differential backups
         – $solr_data/yyyymmdd
  17. Architecture (Master/Slave)
  18. Architecture (Distributed Index)
  19. Indexing Tests
      • XML documents of about 7 KB each, with 111 fields
      • 1.2 million docs in the index
      • VM: 2 GB RAM, 2.33 GHz processor
  20. Indexing Tests [chart not preserved; surviving values: 90, 44]
  21. Search Tests
  22. QPS [chart not preserved; surviving values: 61 037 538 38]
  23. References
      • http://lucene.apache.org/solr/
      • http://wiki.apache.org/solr/
      • http://project.carrot2.org/
      • http://download.carrot2.org/head/manual/index.html#chapter.introduction
      • http://wiki.apache.org/solr/ZooKeeperIntegration
