Enterprise Search Solution: Apache SOLR. What's available and why it's so cool

Solr is a highly scalable, fast, open source enterprise search platform from the Apache Lucene project. Let's explore its features and why some of the largest Internet sites in the world prefer it.

Transcript

  • 1. Apache SOLR: Enterprise Search Solution (overview)
  • 2. Enterprise Search Server
    The criteria ...
    • Fast
    • Flexible
    • Powerful
    • Scalable
    • Relevant results
    • Production ready and easy to deploy
  • 3. Why SOLR
    • Greater control over your website search
    • Caching, replication, distributed search
    • Really fast indexing and searching; indexes can be merged and optimized (index compaction)
    • Great admin interface that can be used over HTTP
    • Awesome community support
    • Support for integration with various other products
  • 4. SOLR Powered
    http://wiki.apache.org/solr/PublicServers/
    • whitehouse.gov
    • eBay
    • Instagram
    • The Guardian
    • Apple
    • Netflix
    • NASA
    • Shopper
    • CISCO
    • News.com
    • Disney
    • digg
    • Sears
    • AOL
  • 5. What is SOLR?
    • Very fast full-text search engine: http://lucene.apache.org/solr/
    • Based on Apache Lucene, a high-performance, full-featured text search engine library written entirely in Java
    In brief, Apache Solr exposes Lucene's Java API as REST-like APIs that can be called over HTTP from any programming language or platform.
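    Because Solr is called over plain HTTP, a query is just a URL. The following is a minimal sketch in Python, assuming a Solr instance at http://localhost:8983/solr that exposes the usual /select handler; the base URL and the helper name build_select_url are illustrative assumptions, not something from the slides.

```python
from urllib.parse import urlencode

# Assumed base URL of a local Solr instance; adjust for a real deployment.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(query, base=SOLR_BASE, **params):
    """Return a URL for the /select handler; wt=json asks for a JSON response."""
    all_params = {"q": query, "wt": "json"}
    all_params.update(params)
    # urlencode takes care of escaping quotes, colons and spaces in the query.
    return base + "/select?" + urlencode(all_params)


url = build_select_url('text:"I love android"', rows=10)
print(url)
```

    Any HTTP client in any language can then fetch that URL, which is the point the slide makes about language independence.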
  • 6. Features
    • Full-text search
    • Faceted navigation
    • More items like this (recommendation) / related searches
    • Spell suggest / auto-complete
    • Custom document ranking and ordering
    • Snippet generation and highlighting
    • Geospatial search
  • 7. Spell Suggest/Auto-Complete
  • 8. Faceted navigation, paging
  • 9. Geospatial Search
  • 10. More Features ...
    • Database integration
    • Rich document (Word, PDF) handling
    • REST-like HTTP/XML and JSON APIs (so you can code in virtually any language)
    • Flexible configuration
    • Extensive plugin architecture for advanced customization
    • Scalable distributed search, dynamic clustering, index replication
  • 11. App Server Support
    • Apache Tomcat
    • Jetty
    • Resin
    • WebLogic™
    • WebSphere™
    • GlassFish
    • dmServer™
    • JBoss™
    ... and many more
  • 12. SOLR History
    • Developed at CNET Networks by Yonik Seeley
    • Donated to the ASF (Apache Software Foundation) in early 2006
    • Incubation period ended in January 2007 (v1.2 released)
    • Solr is now maintained as a subproject of Lucene
  • 13. Solr
    • Only one table (documents). No joins.
    • Each row is a document
    • A document can have multiple fields, and fields can have multiple values (e.g. tags, categories, ...)
    • Fast for search (finding the documents)
    • Slow when returning large sets of data
    • Can scale to many millions of documents
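    To make the "one table of documents" model concrete, here is a small Python sketch of a document with multi-valued fields. The field names and values are hypothetical and do not come from a real schema.

```python
import json

# Hypothetical document; field names are illustrative only.
doc = {
    "id": "post-1",
    "title": "Getting started with Solr",
    "tags": ["search", "lucene", "solr"],   # multi-valued field
    "categories": ["Technology"],           # multi-valued field
}

# The whole index behaves like one flat list of such documents, with no joins.
docs = [doc]
payload = json.dumps(docs)
print(payload)
```

    Related data (tags, categories) lives inside the document itself rather than in a separate table, which is why no join is needed at query time.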
  • 14. Solr Architecture
    • Servlet container: Jetty, Tomcat ... any :)
      – Handles HTTP
    • Solr
      – Connectivity between the servlet container and Lucene
    • Lucene
      – Full-text search framework
  • 15. SOLR Workflow
  • 16. How Lucene Works
    • Regular indexes repeat index data for each row:
        key     ID
        banana  1
        banana  2
        banana  3
        cat     2
        cat     3
        dog     1
        dog     3
    • Inverted indexes reference the term once and then the matching documents:
        Term    IDs
        banana  1,2,3
        cat     2,3
        dog     1,3
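    The inverted index on this slide can be sketched in a few lines of Python. The three documents below are chosen so that the result matches the slide's table; this illustrates the data structure only and is not Lucene's actual on-disk format.

```python
from collections import defaultdict

# Doc ID -> terms it contains, chosen to reproduce the slide's table.
docs = {
    1: ["banana", "dog"],
    2: ["banana", "cat"],
    3: ["banana", "cat", "dog"],
}


def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        for term in set(documents[doc_id]):  # count each term once per document
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(docs)
print(index["banana"])  # [1, 2, 3]
```

    Each term is stored once, followed by the documents it appears in, so a lookup by term no longer has to scan every row.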
  • 17. Inverted Index Matching
    • Lucene uses bit vectors to quickly find all documents with terms
        Term    IDs
        banana  1,2,3
        cat     2,3
        dog     1,3
    Matching "cat banana":
        Document  1  2  3
        cat       0  1  1
        banana    1  1  1
        Match     0  1  1
    Matching "dog cat":
        Document  1  2  3
        dog       1  0  1
        cat       0  1  1
        Match     0  0  1
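    The bit-vector matching shown on this slide amounts to a bitwise AND of one vector per term. Below is a minimal Python sketch that uses plain integers as bit vectors; the function names are made up for illustration, and Lucene's real implementation is considerably more elaborate.

```python
# Posting lists from the slide: term -> IDs of documents containing it.
postings = {"banana": [1, 2, 3], "cat": [2, 3], "dog": [1, 3]}


def to_bits(doc_ids):
    """Encode a posting list as an integer bit vector (bit i set = doc i matches)."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def match_all(terms, postings):
    """Return the sorted IDs of documents that contain every term."""
    result = None
    for term in terms:
        bits = to_bits(postings.get(term, []))
        result = bits if result is None else result & bits
    if not result:
        return []
    return [i for i in range(result.bit_length()) if (result >> i) & 1]


print(match_all(["cat", "banana"], postings))  # [2, 3]
print(match_all(["dog", "cat"], postings))     # [3]
```

    The two results correspond to the "Match" rows 0 1 1 and 0 0 1 in the tables above.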
  • 18. Scoring
    • Now that the documents are found, in what order should they be shown?
    • Lucene uses TF-IDF (Term Frequency-Inverse Document Frequency) to score the documents
        Term            IDs
        banana {1.28}   1 {2}, 2 {5}, 3 {1}
        cat {1.60}      2 {4}, 3 {2}
        dog {1.60}      1 {1}, 3 {6}
  • 19. Scoring Notes
    The goal of scoring is:
    • To boost the importance of documents where the word is mentioned often
    • To boost the importance of rare words (that don't appear in many documents)
    Solr also supports term boosts to increase the importance of one term over another.
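    The two scoring goals can be illustrated with a deliberately simplified model: each matching document gets term frequency times term weight, summed over the query terms. The sketch below assumes that the braces on the scoring slide mean a per-term weight (next to the term) and a per-document term frequency (next to each ID); that reading is an interpretation, and Lucene's real formula contains additional factors.

```python
# Numbers copied from the scoring slide, read as: term weight, and
# per-document term frequency. This reading is an assumption.
term_weight = {"banana": 1.28, "cat": 1.60, "dog": 1.60}
term_freq = {
    "banana": {1: 2, 2: 5, 3: 1},
    "cat": {2: 4, 3: 2},
    "dog": {1: 1, 3: 6},
}


def rank(query_terms, boosts=None):
    """Rank documents matching any query term by the sum of tf * weight * boost."""
    boosts = boosts or {}
    scores = {}
    for term in query_terms:
        weight = term_weight.get(term, 0.0) * boosts.get(term, 1.0)
        for doc_id, tf in term_freq.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * weight
    return sorted(scores, key=scores.get, reverse=True)


print(rank(["banana", "dog"]))                      # [3, 2, 1]
print(rank(["banana", "dog"], boosts={"dog": 3}))   # [3, 1, 2]
```

    Boosting "dog" moves document 1 ahead of document 2, which is the effect of a term boost described on the slide.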
  • 20. Stemming, Stopwords, Synonyms
    • Stemming: terms are trimmed of suffixes
        trimmed -> trim
        stemming -> stem
    • Stopwords: common parts of speech that are not important are removed
        the, and, for, it, ...
    • This is done with both the words in the document and the query terms
    • Solr supports search by a predefined synonyms list
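    A toy analysis chain in Python shows how these three steps fit together. The suffix stripper below is a deliberately naive stand-in for a real stemmer such as Porter's, and the synonym entry is hypothetical; Solr's own analyzers are configured in the schema rather than written like this.

```python
STOPWORDS = {"the", "and", "for", "it"}   # examples from the slide
SYNONYMS = {"kitty": "cat"}               # hypothetical synonym list


def stem(word):
    """Toy suffix stripper; real stemmers have many more rules."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            # Undo consonant doubling: "trimm" -> "trim", "stemm" -> "stem".
            if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word


def analyze(text):
    """Lowercase, drop stopwords, map synonyms, then stem each token."""
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return [stem(t) for t in tokens]


print(analyze("The kitty trimmed it"))  # ['cat', 'trim']
```

    Running the same analyze function over both the indexed text and the query is what lets "trimmed" in a query match "trim" in a document.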
  • 21. Configuring Solr
    • schema.xml: contains all of the details about document structure, index-time and query-time processing
    • solrconfig.xml: contains most of the parameters for configuring Solr itself
  • 22. QUERY SYNTAXES (RDBMS)
        SELECT * FROM post
        WHERE (topic LIKE '%apache%' OR author LIKE '%bambr%')
           OR (topic LIKE '%solr%' OR author LIKE '%frank%')
        ORDER BY id DESC
    QUERY SYNTAXES (SOLR)
        topic:"The Right Way" AND author:WrongGuy
  • 23. Querying Solr 1
    • Plain text search: q=text:"I love android"
    • Expanding search to more fields: q=title:android AND type:review AND price:[* TO 500]
    • Add facets: facet.field=product&facet.field=rating
    • Ordering results: sort=score desc, price asc
  • 24. Querying Solr 2
    • Add facets for range queries: facet.query=price:[* TO 100]&facet.query=price:[100 TO 200]&facet.query=price:[500 TO *]
    • Limiting results: rows=15
    • Paginating on results: start=25&rows=10
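    Pagination and range facets are both mechanical to generate. The Python sketch below uses two hypothetical helpers, page_params and range_facets, to translate a page number into start/rows and a list of price boundaries into facet.query values. The boundaries are illustrative rather than the exact ones on the slide, and facet=true is added because faceting has to be switched on for the facet parameters to take effect.

```python
from urllib.parse import urlencode


def page_params(page, rows=10):
    """Translate a 1-based page number into start/rows parameters."""
    return {"start": (page - 1) * rows, "rows": rows}


def range_facets(field, bounds):
    """Build facet.query values for consecutive ranges, open at both ends."""
    edges = ["*"] + [str(b) for b in bounds] + ["*"]
    return [f"{field}:[{lo} TO {hi}]" for lo, hi in zip(edges, edges[1:])]


params = [("q", "title:android"), ("facet", "true")]
params += [("facet.query", fq) for fq in range_facets("price", [100, 200])]
params += page_params(3, rows=10).items()
print(urlencode(params))
```

    Here page 3 with 10 rows per page becomes start=20&rows=10, and the price boundaries 100 and 200 produce three facet.query ranges.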
  • 25. Querying Solr 3
    Advanced query parameters:
    • fq: filter query, one fq parameter per filter: fq=type:review&fq=price:[* TO 500]
    • fl: restrict the fields to be returned: fl=id,title,text
    • hl: highlighting of matches, snippet generation, etc.: hl=true&hl.fl=title,text
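    The parameters on this slide combine into one query string. A small Python sketch, with a hypothetical build_params helper, shows how several filters become repeated fq parameters while fl and hl.fl take comma-separated field lists.

```python
from urllib.parse import parse_qs, urlencode


def build_params(q, filters=(), fields=(), highlight_fields=()):
    """Assemble q, repeated fq, fl and highlighting parameters into a query string."""
    params = [("q", q)]
    params += [("fq", f) for f in filters]          # one fq per filter
    if fields:
        params.append(("fl", ",".join(fields)))
    if highlight_fields:
        params.append(("hl", "true"))
        params.append(("hl.fl", ",".join(highlight_fields)))
    return urlencode(params)


query_string = build_params(
    "text:android",
    filters=["type:review", "price:[* TO 500]"],
    fields=["id", "title", "text"],
    highlight_fields=["title", "text"],
)
print(query_string)
```

    Keeping each filter in its own fq parameter, rather than folding everything into q, is also what lets the filter cache reuse them individually.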
  • 26. Solr Caching
    • External caching: Memcached, etc.
    • Internal caching, with different types of cache:
      1) FilterCache: used by filter queries (fq), sometimes for faceting too
      2) QueryResultCache: used for results returned by generic queries
      3) DocumentCache
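    The idea behind the filter cache can be sketched as a small cache keyed by the filter query string, assuming least-recently-used eviction. The class below is a toy written for illustration only; Solr's own caches are configured in solrconfig.xml and are not implemented like this.

```python
from collections import OrderedDict


class LRUCache:
    """Toy least-recently-used cache; illustrative only."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry


# Cache the set of matching document IDs per filter query (hypothetical data).
filter_cache = LRUCache(max_size=2)
filter_cache.put("type:review", {2, 3})
filter_cache.put("price:[* TO 500]", {1, 3})
filter_cache.get("type:review")              # touch: now most recently used
filter_cache.put("type:article", {1})        # evicts "price:[* TO 500]"
print(filter_cache.get("price:[* TO 500]"))  # None
```

    A repeated filter such as type:review is then answered from memory instead of being evaluated against the index again.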
  • 27. Books
  • 28. Skype: dgolovko, dimtkg@gmail.com