Intro to Apache Lucene and Solr


Intro talk for UNC School of Information and Library Science. Covers basics of Lucene and Solr as well as info on Lucene/Solr jobs, opportunities, etc.

Published in: Technology
  • Rather than talk you through a lot of the features and functionality, let me show you
  • Do this. Example queries: ipod, 184-pin DDR. Cover: querying, scoring, faceting, clustering, function queries, spatial, grouping, more like this, indexing.
  • Intro to Apache Lucene and Solr

    1. Introduction to Open Source Search with Apache Lucene and Solr
       Grant Ingersoll
    2. The How Many Game
       How many of you:
         • Have taken a class in Information Retrieval (IR)?
         • Are doing work/research in IR?
         • Have heard of or are using Lucene?
         • Have heard of or are using Solr?
         • Are doing work on core IR algorithms such as compression techniques or scoring?
         • Are doing UI/application work/research as it relates to search?
    3. Topics
         • Brief Bio
         • Search 101 (skip?)
         • What is:
           • Apache Lucene
           • Apache Solr
         • What can they do?
           • Features and functionality
           • Intangibles
         • What’s new in Lucene and Solr?
         • How can they help my research/work/____?
    4. Brief Bio
         • Apache Lucene/Solr Committer
         • Apache Mahout co-founder
           • Scalable Machine Learning
         • Co-founder of Lucid Imagination
         • Previously worked at the Center for Natural Language Processing at Syracuse University with Dr. Liddy
         • Co-author of the upcoming “Taming Text” (Manning Publications)
    5. Search 101
         • Search tools are designed for dealing with fuzzy data/questions
           • Work well with structured and unstructured data
           • Perform well when dealing with large volumes of data
         • Many apps don’t need the limits that databases place on content
         • Search fits well alongside a DB too
         • Given a user’s information need (query), find and, optionally, score content relevant to that need
           • Many different ways to solve this problem, each with tradeoffs
         • What does “relevant” mean?
    6. Search 101
       Relevance
         • Vector Space Model (VSM) for relevance
           • Common across many search engines
           • Apache Lucene is a highly optimized implementation of the VSM
       Indexing
         • Finds and maps terms and documents
         • Conceptually similar to a book index
         • At the heart of fast search/retrieve
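To make the two ideas on this slide concrete, here is a minimal sketch of an inverted index (the "book index" that maps terms to documents) plus tf-idf cosine scoring in the spirit of the VSM. It is a toy under stated assumptions, not Lucene's implementation: the whitespace tokenizer, the smoothed idf formula, the function names, and the three sample documents are all illustrative choices.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()

def build_index(docs):
    """Map each term to {doc_id: term frequency}, much like a book index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def idf(index, term, num_docs):
    """Smoothed inverse document frequency: rarer terms weigh more."""
    return 1.0 + math.log(num_docs / (1 + len(index.get(term, {}))))

def search(index, num_docs, query):
    """Rank matching documents by cosine similarity of tf-idf vectors."""
    q_weights = {t: tf * idf(index, t, num_docs)
                 for t, tf in Counter(tokenize(query)).items()}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))

    # Squared length of every document vector, accumulated term by term.
    norms = defaultdict(float)
    for term, postings in index.items():
        w = idf(index, term, num_docs)
        for doc_id, tf in postings.items():
            norms[doc_id] += (tf * w) ** 2

    # Dot products: only documents in the query terms' postings are touched.
    dots = defaultdict(float)
    for term, qw in q_weights.items():
        for doc_id, tf in index.get(term, {}).items():
            dots[doc_id] += qw * tf * idf(index, term, num_docs)

    scores = {d: dot / (math.sqrt(norms[d]) * q_norm) for d, dot in dots.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical sample documents.
docs = {
    1: "apache lucene search library",
    2: "apache solr search server",
    3: "book index",
}
index = build_index(docs)
results = search(index, len(docs), "search server")
print(results)
```

The point of the index is visible in `search`: scoring walks only the postings of the query terms, so documents that share no term with the query (document 3 here) are never visited.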
    7. Apache Lucene in a Nutshell
         • Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
         • Fast and efficient scoring and indexing algorithms
         • Lots of contributions to make common tasks easier:
           • Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
         • Most widely deployed search library on the planet
    8. Lucene Basics
         • Content is modeled via Documents and Fields
           • Content can be text, integers, floats, dates, custom
           • Analysis can be employed to alter content before indexing
         • Searches are supported through a wide range of Query options
           • Keyword
           • Terms
           • Phrases
           • Wildcards
           • Many, many more
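The basics above can be sketched in a few lines: a document is a set of named fields, analysis rewrites field text into tokens before indexing, and term, phrase, and wildcard queries are different ways of matching those tokens. This is a conceptual model only, not Lucene's Java API; the stopword list, the helper names, and the sample document are assumptions made for illustration.

```python
import fnmatch
import re

# Hypothetical stopword list; real analyzers are configurable.
STOPWORDS = {"a", "an", "the", "of", "and"}

def analyze(text):
    """Tokenize on non-alphanumerics, lowercase, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def term_match(tokens, term):
    """Term query: the exact (lowercased) term occurs in the field."""
    return term.lower() in tokens

def phrase_match(tokens, phrase):
    """Phrase query: the analyzed phrase occurs as consecutive tokens."""
    words = analyze(phrase)
    n = len(words)
    return n > 0 and any(tokens[i:i + n] == words
                         for i in range(len(tokens) - n + 1))

def wildcard_match(tokens, pattern):
    """Wildcard query: some token matches a glob pattern such as 'sea*'."""
    return any(fnmatch.fnmatchcase(t, pattern.lower()) for t in tokens)

# A document is a set of named fields; only the text field is analyzed.
doc = {"title": "The Art of Open Source Search", "year": 2011}
title_tokens = analyze(doc["title"])

print(title_tokens)
print(phrase_match(title_tokens, "open source"))
print(wildcard_match(title_tokens, "sea*"))
```

Note that queries go through the same analysis as the indexed text, which is why "Art" matches the lowercased token. Dropping stopwords in this sketch also collapses token positions, so the phrase "art open" would match; a real analysis chain can keep track of the gaps, which this toy does not model.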
    9. Apache Solr in a Nutshell
         • Lucene-based Search Server + other features and functionality
         • Access Lucene over HTTP:
           • Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
         • Most programming tasks in Lucene are configuration tasks in Solr
         • Faceting (guided navigation, filters, etc.)
         • Replication and distributed search support
         • Lucene Best Practices
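Because Solr is reached over HTTP, a query is just a URL with parameters. The sketch below builds one with Python's standard library and does not contact a server. The parameters `q`, `wt`, `rows`, `facet`, and `facet.field` are standard Solr request parameters; the host and port come from the demo slide, while the `/select` handler path, the `cat` field name, and the helper name `solr_select_url` are assumptions that depend on how a given Solr instance is configured.

```python
from urllib.parse import urlencode

def solr_select_url(base, query, facet_field=None, rows=10):
    """Build a Solr search URL; faceting is added only when requested."""
    params = [("q", query), ("wt", "json"), ("rows", rows)]
    if facet_field:
        params += [("facet", "true"), ("facet.field", facet_field)]
    return base.rstrip("/") + "/select?" + urlencode(params)

# Assumed base URL and field name; adjust to the actual deployment.
url = solr_select_url("http://localhost:8983/solr", "ipod", facet_field="cat")
print(url)
```

Fetching that URL (for example with `urllib.request.urlopen`) against a running instance would return a JSON response whose facet section lists counts per value of the chosen field, which is what drives guided navigation in a UI.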
    10. A small sampling of Lucene/Solr-Powered Sites
    11. Features and Functionality
    12. Quick Solr/Lucene Demo
        Pre-reqs:
          • Apache Ant 1.7.x, Subversion (SVN)
        Command Line 1:
          svn co
          cd solr-trunk/solr/
          ant example
          cd example
          java -Dsolr.clustering.enabled=true -jar start.jar
        Command Line 2:
          cd exampledocs; java -jar post.jar *.xml
        http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
    13. Other Features
          • Data Import Handler
            • Database, Mail, RSS, etc.
          • Rich document support via Apache Tika
            • PDF, MS Office, Images, etc.
          • Replication for high query volume
          • Distributed search for large indexes
            • Production systems with 1B+ documents
          • Configurable Analysis chain and other extension points
            • Total control over tokenization, stemming, etc.
    14. Intangibles
          • Open Source
          • Flexible, non-restrictive license
            • Apache License v2 – non-viral
            • “Do what you want with the software, just don’t claim you wrote it”
          • Large community willing to help
          • Great place to learn about real world IR systems
          • Many books and other documentation
            • Lucene in Action by Hatcher, McCandless and Gospodnetic
    15. What’s New?
          • Codecs
            • Pluggable index formats
            • Provide different index compression techniques
          • Stats to enable alternate scoring approaches
            • BM25, language modeling, etc. (more work to be done here)
          • Faster
            • Java Strings are slow; convert to use byte arrays
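As an illustration of the kind of alternate scoring approach mentioned above, here is a minimal sketch of the classic Okapi BM25 formula. It shows why extra index statistics matter: beyond term and document frequency, BM25 needs each document's length and the average document length. The parameter defaults `k1=1.2` and `b=0.75` are common textbook values, and the function name and sample documents are hypothetical; this is not a description of how Lucene implements it.

```python
import math
from collections import Counter

def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score every document against the query with classic Okapi BM25."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized.values() for term in set(toks))

    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more occurrences.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores

# Hypothetical sample documents.
docs = {
    "a": "solr solr search",
    "b": "lucene search library for java applications",
    "c": "mahout machine learning",
}
scores = bm25_scores(docs, "solr search")
print(scores)
```

Unlike raw tf-idf, the term-frequency contribution saturates (controlled by `k1`) and is scaled by relative document length (controlled by `b`), so the short document "a" outranks the longer document "b" even on the shared term "search".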
    16. Other New Items
          • Many new Analyzers (tokenizers, etc.)
            • Richer language support (Hindi, Indonesian, Arabic, …)
          • Richer Geospatial (Local) Search capabilities
            • Score, filter, sort by distance
          • Results Grouping
            • Group related results
          • More Faceting Capabilities
            • Pivot
            • New underlying algorithms
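"Filter and sort by distance" can be sketched with the haversine great-circle formula. This is a brute-force illustration of the idea rather than how Solr's spatial support works internally; the function names are hypothetical and the coordinates are approximate values chosen for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearby(places, lat, lon, max_km):
    """Filter places to those within max_km, sorted nearest first."""
    hits = [(haversine_km(lat, lon, p["lat"], p["lon"]), p["name"])
            for p in places]
    return sorted((d, name) for d, name in hits if d <= max_km)

# Approximate coordinates, for illustration only.
places = [
    {"name": "Raleigh", "lat": 35.7796, "lon": -78.6382},
    {"name": "Syracuse", "lat": 43.0481, "lon": -76.1474},
    {"name": "Chapel Hill", "lat": 35.9132, "lon": -79.0558},
]
# Search around a point near Durham, NC, with a 50 km radius.
hits = nearby(places, 35.9940, -78.8986, max_km=50)
print([name for _, name in hits])
```

Scoring by distance would follow the same pattern: instead of only sorting, the computed distance would be folded into each document's relevance score.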
    17. How can Lucene/Solr help me?
    18. Job Trends
    19. Other Things that Can Help
          • Nutch
            • Crawling
          • Mahout
            • Machine learning (clustering, classification, others)
          • OpenNLP
            • Part of Speech, Parsers, Named Entity Recognition
          • Open Relevance Project
            • Relevance Judgments
    20. Resources
          • {java-user|solr-user}
          • @gsingers