Intro to Apache Lucene and Solr

Intro talk for UNC School of Information and Library Science. Covers basics of Lucene and Solr as well as info on Lucene/Solr jobs, opportunities, etc.

  • Rather than talk you through a lot of the features and functionality, let me show you
  • Do this. Example queries: ipod; 184-pin DDR. Cover: querying, scoring, faceting, clustering, function queries, spatial, grouping, more like this, indexing

Presentation Transcript

  • Introduction to Open Source Search with Apache Lucene and Solr
    Grant Ingersoll
  • The How Many Game
    How many of you:
    Have taken a class in Information Retrieval (IR)?
    Are doing work/research in IR?
    Have heard of or are using Lucene?
    Have heard of or are using Solr?
    Are doing work on core IR algorithms such as compression techniques or scoring?
    Are doing UI/Application work/research as they relate to search?
  • Topics
    Brief Bio
    Search 101 (skip?)
    What is:
    Apache Lucene
    Apache Solr
    What can they do?
    Features and functionality
    What’s new in Lucene and Solr?
    How can they help my research/work/____?
  • Brief Bio
    Apache Lucene/Solr Committer
    Apache Mahout co-founder
    Scalable Machine Learning
    Co-founder of Lucid Imagination
    Previously worked at Center for Natural Lang. Processing at Syracuse Univ. with Dr. Liddy
    Co-Author of upcoming “Taming Text” (Manning Publications)
  • Search 101
    Search tools are designed for dealing with fuzzy data/questions
    Works well with structured and unstructured data
    Performs well when dealing with large volumes of data
    Many apps don’t need the limits that databases place on content
    Search fits well alongside a DB too
    Given a user’s information need (query), find and, optionally, score content relevant to that need
    Many different ways to solve this problem, each with tradeoffs
    What does “relevant” mean?
  • Search 101
    Vector Space Model (VSM) for relevance
    Common across many search engines
    Apache Lucene is a highly optimized implementation of the VSM
    Inverted index
    Finds and maps terms and documents
    Conceptually similar to a book index
    At the heart of fast search/retrieve
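To make the two ideas on this slide concrete, here is a minimal Python sketch of an inverted index with TF-IDF cosine scoring. It is a toy under stated assumptions: the corpus, the whitespace tokenizer, and the function names (`build_index`, `search`) are made up for illustration, and the weighting is the textbook form rather than Lucene's actual scoring formula.

```python
import math
from collections import Counter, defaultdict

# Made-up toy corpus: document id -> text.
DOCS = {
    1: "apache lucene is a search library",
    2: "apache solr is a search server built on lucene",
    3: "mahout is a machine learning library",
}

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()

def build_index(docs):
    """Map each term to {doc_id: term frequency}, much like a book index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def idf(index, term, n_docs):
    """Inverse document frequency: rarer terms carry more weight."""
    df = len(index.get(term, {}))
    return math.log(n_docs / df) if df else 0.0

def search(index, docs, query):
    """Rank documents by cosine similarity between TF-IDF vectors."""
    n = len(docs)
    q_vec = {t: tf * idf(index, t, n) for t, tf in Counter(tokenize(query)).items()}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    if q_norm == 0:
        return []
    # Dot products: only the postings of the query terms are visited.
    dots = defaultdict(float)
    for term, q_w in q_vec.items():
        for doc_id, tf in index.get(term, {}).items():
            dots[doc_id] += q_w * tf * idf(index, term, n)
    # Squared length of every document vector, needed for normalization.
    sq = defaultdict(float)
    for term, postings in index.items():
        w = idf(index, term, n)
        for doc_id, tf in postings.items():
            sq[doc_id] += (tf * w) ** 2
    ranked = [(d, dot / (q_norm * math.sqrt(sq[d]))) for d, dot in dots.items() if dot > 0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))

if __name__ == "__main__":
    for doc_id, score in search(build_index(DOCS), DOCS, "lucene search"):
        print(doc_id, round(score, 4), DOCS[doc_id])
```

The lookup touches only the postings lists for the query terms, which is why the inverted index sits at the heart of fast retrieval; the cosine step is what turns a match into a relevance score.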
  • Apache Lucene in a Nutshell
    Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
    Fast and efficient scoring and indexing algorithms
    Lots of contributions to make common tasks easier:
    Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
    Most widely deployed search library on the planet
  • Lucene Basics
    Content is modeled via Documents and Fields
    Content can be text, integers, floats, dates, custom
    Analysis can be employed to alter content before indexing
    Searches are supported through a wide range of Query options
    Many, many more
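The document/field model and the role of analysis can be sketched in a few lines of Python. This is only an illustration of the concepts, not Lucene's API: the stopword list, the sample document, and the helper names (`analyze`, `index_document`, `term_match`, `range_match`) are all assumptions made for the example.

```python
import re
from datetime import date

# Assumed, tiny stopword list for the toy analyzer.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}

def analyze(text):
    """Toy analysis chain: tokenize, lowercase, drop stopwords."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

# A document is a set of named fields; values can be text, numbers, or dates.
# The values here are made up for illustration.
DOC = {
    "title": "Intro to Apache Lucene and Solr",
    "pages": 24,
    "price": 0.0,
    "added": date(2020, 1, 1),
}

def index_document(doc, text_fields):
    """Return the indexed form: text fields are analyzed, other values kept as-is."""
    return {name: analyze(value) if name in text_fields else value
            for name, value in doc.items()}

def term_match(indexed, field, term):
    """True if the query term, analyzed the same way, occurs in the field."""
    return any(t in indexed[field] for t in analyze(term))

def range_match(indexed, field, low, high):
    """True if a numeric or date field falls within [low, high]."""
    return low <= indexed[field] <= high

if __name__ == "__main__":
    indexed = index_document(DOC, text_fields={"title"})
    print(indexed["title"])
    print(term_match(indexed, "title", "LUCENE"))
    print(range_match(indexed, "pages", 10, 50))
```

Running the query term through the same analysis as the indexed text is the design point worth noticing: it is what lets "LUCENE" match "Lucene".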
  • Apache Solr in a Nutshell
    Lucene-based Search Server + other features and functionality
    Access Lucene over HTTP:
    Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
    Most programming tasks in Lucene are configuration tasks in Solr
    Faceting (guided navigation, filters, etc.)
    Replication and distributed search support
    Lucene Best Practices
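Since Solr is reached over HTTP, a query is just a URL with parameters. The sketch below builds one in Python using the standard Solr parameters `q`, `wt`, `rows`, `facet`, and `facet.field`. The host, port, and `/solr/select` path assume a locally running example server, and the `cat` facet field assumes the example schema; adjust both for a real installation.

```python
from urllib.parse import urlencode

# Assumption: a local Solr example server; newer releases include a core
# or collection name in the path.
SOLR_SELECT = "http://localhost:8983/solr/select"

def build_query_url(query, facet_field=None, rows=10, base=SOLR_SELECT):
    """Build a Solr select URL asking for JSON, optionally with field faceting."""
    params = [("q", query), ("wt", "json"), ("rows", rows)]
    if facet_field:
        params += [("facet", "true"), ("facet.field", facet_field)]
    return base + "?" + urlencode(params)

if __name__ == "__main__":
    # Printing only; fetching would need a running server, for example
    # urllib.request.urlopen(url) followed by json.load on the response.
    print(build_query_url("ipod", facet_field="cat"))
    print(build_query_url("184-pin DDR"))
```

No Lucene code is written on the client side: faceting is switched on with two request parameters, which is what "programming tasks become configuration tasks" looks like in practice.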
  • A small sampling of Lucene/Solr-Powered Sites
  • Features and Functionality
  • Quick Solr/Lucene Demo
    Prerequisites: Apache Ant 1.7.x, Subversion (SVN)
    Command Line 1:
    svn co
    ant example
    cd example
    java -Dsolr.clustering.enabled=true -jar start.jar
    Command Line 2:
    cd exampledocs; java -jar post.jar *.xml
  • Other Features
    Data Import Handler
    Database, Mail, RSS, etc.
    Rich document support via Apache Tika
    PDF, MS Office, Images, etc.
    Replication for high query volume
    Distributed search for large indexes
    Production systems with 1B+ documents
    Configurable Analysis chain and other extension points
    Total control over tokenization, stemming, etc.
  • Intangibles
    Open Source
    Flexible, non-restrictive license
    Apache License v2 – non-viral
    “Do what you want with the software, just don’t claim you wrote it”
    Large community willing to help
    Great place to learn about real world IR systems
    Many books and other documentation
    Lucene in Action by Hatcher, McCandless and Gospodnetic
  • What’s New?
    Pluggable Index Formats
    Provide different index compression techniques
    Stats to enable alternate scoring approaches
    BM25, Lang. Modeling, etc. -- More work to be done here
    Java Strings are slow; convert to use byte arrays
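To show why extra index statistics matter for alternate scoring, here is a minimal Python sketch of a per-term BM25 score. It uses the commonly cited form of the formula with typical defaults `k1=1.2` and `b=0.75`; the function names and the sample numbers are assumptions for illustration, not Lucene's implementation.

```python
import math

def bm25_idf(n_docs, doc_freq):
    """IDF component; the +1 inside the log keeps the value non-negative."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document.

    Statistics needed from the index: term frequency, document length,
    average document length, document count, and document frequency.
    """
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return bm25_idf(n_docs, doc_freq) * tf * (k1 + 1) / (tf + length_norm)

if __name__ == "__main__":
    # Made-up statistics: 1000 documents, term appears in 10 of them.
    for tf in (1, 2, 10):
        print(tf, round(bm25_term_score(tf, 100, 100, 1000, 10), 3))
```

Unlike raw TF-IDF, the score saturates as term frequency grows and is discounted for longer documents, which is only possible if the index exposes document length and collection-level averages.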
  • Other New Items
    Many new Analyzers (tokenizers, etc.)
    Richer Language support (Hindi, Indonesian, Arabic, …)
    Richer Geospatial (Local) Search capabilities
    Score, filter, sort by distance
    Results Grouping
    Group Related Results
    More Faceting Capabilities
    New underlying algorithms
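Filtering and sorting by distance, followed by grouping, can be sketched as below. This is a plain-Python illustration of the ideas rather than how Solr implements them: the haversine formula stands in for whatever distance function the engine uses, and the place records, coordinates, and helper names (`within`, `group_by`) are made up for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical records with approximate coordinates.
PLACES = [
    {"name": "Library A", "city": "Chapel Hill", "lat": 35.9132, "lon": -79.0558},
    {"name": "Library B", "city": "Raleigh", "lat": 35.7796, "lon": -78.6382},
    {"name": "Library C", "city": "Raleigh", "lat": 35.7850, "lon": -78.6700},
    {"name": "Library D", "city": "Syracuse", "lat": 43.0481, "lon": -76.1474},
]

def within(places, lat, lon, radius_km):
    """Filter to places inside the radius, sorted nearest first."""
    hits = [(haversine_km(lat, lon, p["lat"], p["lon"]), p) for p in places]
    hits = [(d, p) for d, p in hits if d <= radius_km]
    return sorted(hits, key=lambda pair: pair[0])

def group_by(hits, field, limit=1):
    """Collapse sorted hits into groups, keeping the top `limit` names per group."""
    groups = {}
    for _dist, place in hits:
        bucket = groups.setdefault(place[field], [])
        if len(bucket) < limit:
            bucket.append(place["name"])
    return groups

if __name__ == "__main__":
    hits = within(PLACES, 35.9132, -79.0558, radius_km=100)
    for dist, place in hits:
        print(round(dist, 1), place["name"])
    print(group_by(hits, "city"))
```

Grouping after the distance sort means each group is represented by its nearest member, which is the behavior a user typically expects from "one result per city" style listings.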
  • How can Lucene/Solr help me?
  • Job Trends
  • Other Things that Can Help
    Machine learning (clustering, classification, others)
    Part of Speech, Parsers, Named Entity Recognition
    Open Relevance Project
    Relevance Judgments
  • Resources