Enterprise Search Solution: Apache SOLR. What's available and why it's so cool


Solr is a highly scalable, fast, open source enterprise search platform from the Apache Lucene project. Let's explore why some of the largest Internet sites in the world rely on it, and which of its features make it so attractive.

Published in: Education, Technology

  1. Apache SOLR: Enterprise Search Solution (overview)
  2. Enterprise Search Server: the criteria
     • Fast
     • Flexible
     • Powerful
     • Scalable
     • Relevant results
     • Production ready and easy to deploy
  3. Why SOLR
     • Greater control over your website search
     • Caching, replication, distributed search
     • Really fast indexing and searching; indexes can be merged and optimized (index compaction)
     • Great admin interface, accessible over HTTP
     • Awesome community support
     • Support for integration with various other products
  4. SOLR Powered
     http://wiki.apache.org/solr/PublicServers/
     • whitehouse.gov • eBay • Instagram • The Guardian • Apple • Netflix • NASA
     • Shopper • CISCO • News.com • Disney • digg • Sears • AOL
  5. What is SOLR?
     http://lucene.apache.org/solr/
     • Very fast full text search engine
     • Based on Apache Lucene, a high-performance, full-featured text search engine library written entirely in Java
     In brief, Apache Solr exposes Lucene's Java API as REST-like APIs that can be called over HTTP from any programming language or platform.
  6. Features
     • Full text search
     • Faceted navigation
     • More items like this (recommendation) / related searches
     • Spell suggest / auto-complete
     • Custom document ranking/ordering
     • Snippet generation/highlighting
     • Geospatial search
  7. Spell Suggest/Auto-Complete
  8. Faceted navigation, paging
  9. Geospatial Search
  10. More Features ...
     • Database integration
     • Rich document (Word, PDF) handling
     • REST-like HTTP APIs with XML and JSON (so you can code in virtually any language)
     • Flexible configuration
     • Extensive plugin architecture for advanced customization
     • Scalable distributed search, dynamic clustering, index replication
  11. App Server Support
     • Apache Tomcat
     • Jetty
     • Resin
     • WebLogic™
     • WebSphere™
     • GlassFish
     • dmServer™
     • JBoss™
     ... and many more
  12. SOLR History
     • Developed at CNET Networks by Yonik Seeley
     • Donated to the ASF (Apache Software Foundation) in early 2006
     • Incubation period ended in January 2007 (v1.2 followed in June 2007)
     • Solr is now maintained as a subproject of Lucene
  13. Solr
     • Only one table (documents). No joins.
     • Each row is a document
     • A document can have multiple fields, and fields can have multiple values
       – e.g. tags, categories, ...
     • Fast for search (finding the documents)
     • Slow when returning large sets of data
     • Can scale to many millions of documents
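The document model on this slide can be sketched as plain data. This is a minimal illustration, assuming made-up field names (`id`, `title`, `tags`, `categories`) that do not come from the slides. A flat JSON array of documents like this is the general shape that Solr's JSON update handler accepts.

```python
import json

# Hypothetical documents: field names and values are illustrative only.
docs = [
    {"id": "1", "title": "Intro to Solr",
     "tags": ["search", "lucene"], "categories": ["tech"]},
    {"id": "2", "title": "Cooking with bananas",
     "tags": ["food"], "categories": ["recipes", "fruit"]},
]

def multi_valued_fields(doc):
    """Return the names of fields that hold more than one value."""
    return sorted(name for name, value in doc.items()
                  if isinstance(value, list) and len(value) > 1)

# One flat list of documents: a single "table", no joins.
payload = json.dumps(docs)

print(multi_valued_fields(docs[0]))  # ['tags']
```

Each document carries its related values inline as multi-valued fields, where a relational schema would use a join table.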
  14. Solr Architecture
     • Servlet container: Jetty, Tomcat ... any :)
       – Handles HTTP
     • Solr
       – Connectivity between the servlet container and Lucene
     • Lucene
       – Full text search framework
  15. SOLR Workflow
  16. How Lucene Works
     • Regular indexes repeat index data for each row
       key     ID
       banana  1
       banana  2
       banana  3
       cat     2
       cat     3
       dog     1
       dog     3
     • Inverted indexes reference the term once and then the matching documents
       Term    IDs
       banana  1,2,3
       cat     2,3
       dog     1,3
  17. Inverted Index Matching
     • Lucene uses bit vectors to quickly find all documents with the terms
       Term    IDs
       banana  1,2,3
       cat     2,3
       dog     1,3
     • Matching "cat banana"
       Document  1  2  3
       cat       0  1  1
       banana    1  1  1
       Match     0  1  1
     • Matching "dog cat"
       Document  1  2  3
       dog       1  0  1
       cat       0  1  1
       Match     0  0  1
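The two slides on inverted indexes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Lucene's actual implementation: the document texts are invented so that they reproduce the postings on the slides, and Python integers stand in for bit vectors.

```python
from collections import defaultdict

# Toy corpus, invented to reproduce the postings shown on the slides.
docs = {
    1: "banana dog",
    2: "banana cat",
    3: "banana cat dog",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def to_bits(doc_ids):
    """Encode a postings list as an int bitmask: bit (id - 1) is set per ID."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << (doc_id - 1)
    return bits

def match_all(index, terms, num_docs):
    """Return IDs of documents containing every term (bitwise AND)."""
    bits = (1 << num_docs) - 1  # start with every document set
    for term in terms:
        bits &= to_bits(index.get(term, []))
    return [i + 1 for i in range(num_docs) if (bits >> i) & 1]

index = build_inverted_index(docs)
print(index["banana"])                         # [1, 2, 3]
print(match_all(index, ["cat", "banana"], 3))  # [2, 3]
print(match_all(index, ["dog", "cat"], 3))     # [3]
```

The two `match_all` calls correspond to the "Match" rows in the slide's tables.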
  18. Scoring
     • Now that the documents are found, in what order should they be shown?
     • Lucene uses TF-IDF (Term Frequency-Inverse Document Frequency) to score the documents
       Term           IDs
       banana {1.28}  1 {2}, 2 {5}, 3 {1}
       cat {1.60}     2 {4}, 3 {2}
       dog {1.60}     1 {1}, 3 {6}
  19. Scoring Notes
     The goal of scoring is:
     • To boost the importance of documents where the word is mentioned often
     • To boost the importance of rare words (that don't appear in many documents)
     Solr also supports term boosts to increase the importance of one term over another.
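To make the two scoring goals concrete, here is a minimal sketch of TF-IDF ranking with optional term boosts. It uses a textbook formula, not Lucene's exact scoring function, so it will not reproduce the term weights shown on the slide. Reading the per-document numbers in braces as term frequencies is an assumption.

```python
import math

# Per-document counts from the scoring slide, read as {doc_id: term frequency}.
postings = {
    "banana": {1: 2, 2: 5, 3: 1},
    "cat": {2: 4, 3: 2},
    "dog": {1: 1, 3: 6},
}
NUM_DOCS = 3

def idf(term):
    """Textbook inverse document frequency: rarer terms weigh more."""
    return math.log(NUM_DOCS / len(postings[term])) + 1.0

def score(query_terms, boosts=None):
    """Sum tf * idf * boost per document; return (doc_id, score), best first."""
    boosts = boosts or {}
    totals = {}
    for term in query_terms:
        for doc_id, tf in postings.get(term, {}).items():
            weight = tf * idf(term) * boosts.get(term, 1.0)
            totals[doc_id] = totals.get(doc_id, 0.0) + weight
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

print([doc_id for doc_id, _ in score(["cat", "dog"])])               # [3, 2, 1]
print([doc_id for doc_id, _ in score(["cat", "dog"], {"dog": 10})])  # [3, 1, 2]
```

Boosting "dog" moves document 1 above document 2, because document 1 contains "dog" and document 2 does not.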
  20. Stemming, Stopwords, Synonyms
     • Stemming trims terms of their suffixes
       trimmed -> trim
       stemming -> stem
     • Stopwords remove common parts of speech that are not important
       the, and, for, it, ...
     • This is done with both the words in the document and the query terms
     • Solr supports search by a predefined list of synonyms
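The analysis steps on this slide can be sketched as a small pipeline. This is a rough illustration only: `naive_stem` is a hypothetical stand-in for a real stemmer such as Porter's and gets many English words wrong, and the synonym list is invented. The point is that the same function runs at index time and at query time.

```python
STOPWORDS = {"the", "and", "for", "it"}   # examples from the slide
SYNONYMS = {"notebook": "laptop"}         # hypothetical synonym list

def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Undo consonant doubling: "trimm" -> "trim", "stemm" -> "stem".
            if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word

def analyze(text):
    """Lowercase, drop stopwords, map synonyms, then stem each token."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return [naive_stem(SYNONYMS.get(t, t)) for t in tokens]

indexed_terms = analyze("The trimmed notebook")    # document side
query_terms = analyze("trimming for the laptops")  # query side
print(indexed_terms, query_terms)
```

Both calls produce the terms `trim` and `laptop`, which is why a query for "laptops" can find a document that only mentions "notebook".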
  21. Configuring Solr
     • schema.xml: contains all of the details about document structure, index-time and query-time processing
     • solrconfig.xml: contains most of the parameters for configuring Solr itself
  22. Query Syntaxes
     • RDBMS
       SELECT * FROM post
       WHERE (topic LIKE '%apache%' OR author LIKE '%bambr%')
          OR (topic LIKE '%solr%' OR author LIKE '%frank%')
       ORDER BY id DESC
     • SOLR
       Topic:"The Right Way" AND author:WrongGuy
  23. Querying Solr 1
     • Plain text search
       q=text:"I love android"
     • Expanding search to more fields
       q=title:android AND type:review AND price:[* TO 500]
     • Add facets
       facet=true&facet.field=product&facet.field=rating
     • Ordering results
       sort=score desc, price asc
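Because these are ordinary HTTP query parameters, any language with a URL encoder can build the request. The sketch below does so in Python without contacting a server. The host, port, and core name `products` are placeholders, not values from the slides.

```python
from urllib.parse import urlencode

# Assumed endpoint: host, port, and the core name "products" are placeholders.
SOLR_SELECT = "http://localhost:8983/solr/products/select"

def build_select_url(query, facet_fields=(), sort=None):
    """Build a /select URL; repeated parameters become repeated key=value pairs."""
    params = [("q", query), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facet_fields)
    if sort:
        params.append(("sort", sort))
    # urlencode escapes spaces, brackets, and colons so the query survives HTTP.
    return SOLR_SELECT + "?" + urlencode(params)

url = build_select_url(
    "title:android AND type:review AND price:[* TO 500]",
    facet_fields=["product", "rating"],
    sort="score desc, price asc",
)
print(url)
```

Passing the parameters as a list of pairs, not a dict, keeps both `facet.field` entries in the URL.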
  24. Querying Solr 2
     • Add facets for range queries
       facet.query=price:[* TO 100]&facet.query=price:[100 TO 200]&facet.query=price:[500 TO *]
     • Limiting results
       rows=15
     • Paginating on results
       start=25&rows=10
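Pagination and range facets are easy to get wrong by hand, so a pair of small helpers can generate them. This is a sketch with hypothetical helper names. It assumes 1-based page numbers and a numeric `price` field as on the slide.

```python
from urllib.parse import urlencode

def page_params(page, per_page=10):
    """Translate a 1-based page number into Solr's start/rows parameters."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    return [("start", str((page - 1) * per_page)), ("rows", str(per_page))]

def price_facet_queries(bounds):
    """Turn (low, high) pairs into facet.query parameters; None is open-ended."""
    def fmt(value):
        return "*" if value is None else str(value)
    return [("facet.query", f"price:[{fmt(low)} TO {fmt(high)}]")
            for low, high in bounds]

params = [("q", "*:*"), ("facet", "true")]
params += price_facet_queries([(None, 100), (100, 200), (500, None)])
params += page_params(page=3, per_page=10)  # third page: start=20, rows=10
print(urlencode(params))
```

Note that `start` is a zero-based offset into the result list, not a page number.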
  25. Querying Solr 3
     Advanced query parameters:
     • fq: filter query
       fq=type:review&fq=price:[* TO 500]
     • fl: restrict the fields to be returned
       fl=id,title,text
     • hl: highlight matches and generate snippets
       hl=true&hl.fl=title,text
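The sketch below pairs these parameters with a response, to show where highlight snippets end up. The sample response is made up for illustration and only follows the general shape of a Solr JSON response. It assumes the unique key field is named `id`, since highlights are keyed by the document's unique key.

```python
import json
from urllib.parse import urlencode

# Request side: two independent filters, a restricted field list, highlighting.
params = [
    ("q", "text:android"),
    ("fq", "type:review"),
    ("fq", "price:[* TO 500]"),
    ("fl", "id,title,text"),
    ("hl", "true"),
    ("hl.fl", "title,text"),
    ("wt", "json"),
]
query_string = urlencode(params)

# Response side: a made-up sample, not output captured from a real server.
sample_response = json.loads("""
{
  "response": {"numFound": 1, "start": 0,
               "docs": [{"id": "42", "title": "Android phone review"}]},
  "highlighting": {"42": {"title": ["<em>Android</em> phone review"]}}
}
""")

def docs_with_snippets(response):
    """Attach each document's highlight snippets (keyed by its id) to it."""
    highlights = response.get("highlighting", {})
    return [
        {**doc, "snippets": highlights.get(doc["id"], {})}
        for doc in response["response"]["docs"]
    ]

for doc in docs_with_snippets(sample_response):
    print(doc["id"], doc["snippets"].get("title"))
```

Sending each filter as its own `fq` parameter lets Solr cache and reuse the filters independently of the main query.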
  26. Solr Caching
     • External caching: Memcached, etc.
     • Internal caching, with different types of cache:
       1) FilterCache: used by filter queries (fq), sometimes for faceting too
       2) QueryResultCache: used for results returned by generic queries
       3) DocumentCache: used for the stored fields of documents
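As a rough picture of what a filter cache does, the sketch below keeps the set of matching document IDs per filter query and evicts the least recently used entry when full. It is a conceptual illustration, not Solr's implementation: the class, the `evaluate_filter` function, and the document IDs are all hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny least-recently-used cache, standing in for an internal Solr cache."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        if key in self._items:
            self.hits += 1
            self._items.move_to_end(key)     # mark as most recently used
            return self._items[key]
        self.misses += 1
        value = compute(key)
        self._items[key] = value
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the least recently used
        return value

def evaluate_filter(fq):
    """Hypothetical filter evaluation: filter query string -> matching doc IDs."""
    matches = {"type:review": {1, 3}, "price:[* TO 500]": {2, 3}}
    return frozenset(matches.get(fq, set()))

filter_cache = LRUCache(max_size=2)
for fq in ["type:review", "price:[* TO 500]", "type:review"]:
    doc_ids = filter_cache.get_or_compute(fq, evaluate_filter)

print(filter_cache.hits, filter_cache.misses)  # 1 2
```

The third lookup repeats the first filter, so it is served from the cache without being evaluated again.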
  27. Books
  28. Skype: dgolovkodimtkg@gmail.com