Intro to Apache Lucene and Solr

Intro talk for UNC School of Information and Library Science. Covers basics of Lucene and Solr as well as info on Lucene/Solr jobs, opportunities, etc.

  • Rather than talk you through a lot of the features and functionality, let me show you
  • Do this
    Example queries: ipod, 184-pin DDR
    Cover: querying, scoring, faceting, clustering, function queries, spatial, grouping, more like this, indexing

Presentation Transcript

  • Introduction to Open Source Search with Apache Lucene and Solr
    Grant Ingersoll
  • The How Many Game
    How many of you:
    Have taken a class in Information Retrieval (IR)?
    Are doing work/research in IR?
    Have heard of or are using Lucene?
    Have heard of or are using Solr?
    Are doing work on core IR algorithms such as compression techniques or scoring?
    Are doing UI/Application work/research as they relate to search?
  • Topics
    Brief Bio
    Search 101 (skip?)
    What is:
    Apache Lucene
    Apache Solr
    What can they do?
    Features and functionality
    Intangibles
    What’s new in Lucene and Solr?
    How can they help my research/work/____?
  • Brief Bio
    Apache Lucene/Solr Committer
    Apache Mahout co-founder
    Scalable Machine Learning
    Co-founder of Lucid Imagination
    http://www.lucidimagination.com
    Previously worked at Center for Natural Lang. Processing at Syracuse Univ. with Dr. Liddy
    Co-Author of upcoming “Taming Text” (Manning Publications)
    http://www.manning.com/ingersoll
  • Search 101
    Search tools are designed for dealing with fuzzy data/questions
    Works well with structured and unstructured data
    Performs well when dealing with large volumes of data
    Many apps don’t need the limits that databases place on content
    Search fits well alongside a DB too
    Given a user’s information need (query), find and, optionally, score content relevant to that need
    Many different ways to solve this problem, each with tradeoffs
    What does “relevant” mean?
  • Search 101
    Relevance
    Vector Space Model (VSM) for relevance
    Common across many search engines
    Apache Lucene is a highly optimized implementation of the VSM
    Indexing
    Finds and maps terms and documents
    Conceptually similar to a book index
    At the heart of fast search/retrieve
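
To make the two ideas on this slide concrete, here is a minimal sketch of an inverted index (the "book index" that maps terms to documents) with a simplified TF-IDF cosine score in the spirit of the VSM. The function names, the whitespace tokenizer, and the exact weighting are illustrative assumptions, not Lucene's actual implementation or scoring formula.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a real analyzer does much more."""
    return text.lower().split()

def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}, like a book index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def idf(index, num_docs, term):
    """Rarer terms get a higher weight (one simple IDF variant, assumed here)."""
    df = len(index.get(term, {}))
    return math.log(1 + num_docs / df) if df else 0.0

def search(index, num_docs, query):
    """Rank documents by cosine similarity between TF-IDF vectors."""
    q_weights = {t: idf(index, num_docs, t) for t in set(tokenize(query))}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    if q_norm == 0:
        return []
    # Length of each document vector, used to normalise the dot product.
    doc_norms = defaultdict(float)
    for term, postings in index.items():
        weight = idf(index, num_docs, term)
        for doc_id, tf in postings.items():
            doc_norms[doc_id] += (tf * weight) ** 2
    # Only documents containing at least one query term are ever touched.
    dots = defaultdict(float)
    for term, q_weight in q_weights.items():
        for doc_id, tf in index.get(term, {}).items():
            dots[doc_id] += q_weight * tf * q_weight
    ranked = [(d, dot / (math.sqrt(doc_norms[d]) * q_norm)) for d, dot in dots.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

For example, with `docs = {1: "apple ipod music player", 2: "ddr memory module", 3: "ipod ipod case"}`, calling `search(build_index(docs), len(docs), "ipod")` returns documents 3 and 1 only, with document 3 ranked first because the term makes up more of its vector.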
  • Apache Lucene in a Nutshell
    http://lucene.apache.org/java
    Java based Application Programming Interface (API) for adding search and indexing functionality to applications
    Fast and efficient scoring and indexing algorithms
    Lots of contributions to make common tasks easier:
    Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
    Most widely deployed search library on the planet
  • Lucene Basics
    Content is modeled via Documents and Fields
    Content can be text, integers, floats, dates, custom
    Analysis can be employed to alter content before indexing
    Searches are supported through a wide range of Query options
    Keyword
    Terms
    Phrases
    Wildcards
    Many, many more
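
As a rough illustration of documents, fields, and a few of the query types listed above, the following toy sketch stores term positions per field and answers term, phrase, and wildcard queries. `TinyIndex` and its methods are hypothetical names invented for this example; Lucene's own API is Java and its index structures are far more sophisticated.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

class TinyIndex:
    """Toy positional index over document fields; not Lucene's data structure."""

    def __init__(self):
        # postings[field][term][doc_id] -> list of token positions
        self.postings = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

    def add(self, doc_id, fields):
        """Index a document given as {field name: value}."""
        for field, value in fields.items():
            for pos, term in enumerate(str(value).lower().split()):
                self.postings[field][term][doc_id].append(pos)

    def term(self, field, term):
        """Documents whose field contains the exact term."""
        return set(self.postings[field].get(term.lower(), {}))

    def phrase(self, field, phrase):
        """Documents whose field contains the terms at consecutive positions."""
        terms = phrase.lower().split()
        if not terms:
            return set()
        hits = set()
        for doc_id, positions in self.postings[field].get(terms[0], {}).items():
            for start in positions:
                if all(start + i in self.postings[field].get(t, {}).get(doc_id, [])
                       for i, t in enumerate(terms)):
                    hits.add(doc_id)
                    break
        return hits

    def wildcard(self, field, pattern):
        """Documents with any term matching a shell-style pattern such as 'ip*'."""
        hits = set()
        for term, docs in self.postings[field].items():
            if fnmatchcase(term, pattern.lower()):
                hits |= set(docs)
        return hits
```

A short usage example: after `idx = TinyIndex()`, `idx.add(1, {"name": "Apple iPod shuffle", "price": 49})` and `idx.add(2, {"name": "184-pin DDR memory", "price": 75})`, the call `idx.phrase("name", "ddr memory")` matches document 2 while `idx.phrase("name", "memory ddr")` matches nothing, because phrase queries depend on term order.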
  • Apache Solr in a Nutshell
    http://lucene.apache.org/solr
    Lucene-based Search Server + other features and functionality
    Access Lucene over HTTP:
    Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
    Most programming tasks in Lucene are configuration tasks in Solr
    Faceting (guided navigation, filters, etc.)
    Replication and distributed search support
    Lucene Best Practices
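
Because Solr is accessed over HTTP, a query is just a URL. The sketch below builds a request to Solr's `select` handler with faceting turned on, using the standard `q`, `rows`, `wt`, `facet`, and `facet.field` parameters. The helper name `solr_select_url`, the base URL, and the `cat` field are assumptions for illustration; adjust them to match your own server and schema. The code only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

def solr_select_url(base_url, query, facet_fields=(), rows=10):
    """Build a Solr select URL; facet_fields are hypothetical schema field names."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on.
        params.extend(("facet.field", field) for field in facet_fields)
    return base_url.rstrip("/") + "/select?" + urlencode(params)

print(solr_select_url("http://localhost:8983/solr", "ipod", ["cat"]))
# http://localhost:8983/solr/select?q=ipod&rows=10&wt=json&facet=true&facet.field=cat
```

The resulting URL can be fetched from any language or tool that speaks HTTP, which is the point of the slide: no Java code is required on the client side.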
  • A small sampling of Lucene/Solr-Powered Sites
    Buy.com
  • Features and Functionality
  • Quick Solr/Lucene Demo
    Pre-reqs:
    Apache Ant 1.7.x, Subversion (SVN)
    Command Line 1:
    svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
    cd solr-trunk/solr/
    ant example
    cd example
    java -Dsolr.clustering.enabled=true -jar start.jar
    Command Line 2:
    cd exampledocs; java -jar post.jar *.xml
    http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
  • Other Features
    Data Import Handler
    Database, Mail, RSS, etc.
    Rich document support via Apache Tika
    PDF, MS Office, Images, etc.
    Replication for high query volume
    Distributed search for large indexes
    Production systems with 1B+ documents
    Configurable Analysis chain and other extension points
    Total control over tokenization, stemming, etc.
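
The idea of a configurable analysis chain can be sketched as a tokenizer followed by a list of interchangeable filters. Everything below (the function names, the tiny stop word list, and the crude plural-stripping "stemmer") is an assumption made for illustration; in Solr the chain is declared in configuration and the real tokenizers and stemmers are considerably more careful.

```python
import re

STOP_WORDS = frozenset({"a", "an", "and", "of", "the", "to"})

def tokenize(text):
    """Split text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def lowercase(tokens):
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(tokens):
    """Strip one trailing 's' from longer tokens; a stand-in for a real stemmer."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
            for t in tokens]

def analyze(text, filters=(lowercase, remove_stop_words, naive_stem)):
    """Run the tokenizer, then each filter in order; swap filters to change behavior."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

print(analyze("The Players of the iPods"))
# ['player', 'ipod']
```

The same chain is normally applied at both index time and query time, so that a query for "iPods" and a document containing "ipod" end up as the same term.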
  • Intangibles
    Open Source
    Flexible, non-restrictive license
    Apache License v2 – non-viral
    “Do what you want with the software, just don’t claim you wrote it”
    Large community willing to help
    Great place to learn about real world IR systems
    Many books and other documentation
    Lucene in Action by Hatcher, McCandless and Gospodnetic
  • What’s New?
    https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/CHANGES.txt
    https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/CHANGES.txt
    Codecs
    Pluggable Index Formats
    Provide different index compression techniques
    Stats to enable alternate scoring approaches
    BM25, Lang. Modeling, etc. -- More work to be done here
    Faster
    Java Strings are slow; convert to use byte arrays
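
To show why extra index statistics matter for alternate scoring, here is a sketch of the per-term BM25 score in one common formulation. It needs the document frequency, the collection size, the document length, and the average document length, which are exactly the kind of stats mentioned above. The function name and the default values `k1=1.2` and `b=0.75` are conventional choices assumed here, not settings taken from the talk.

```python
import math

def bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Score one query term against one document with BM25 (one common variant).

    tf: occurrences of the term in the document
    df: number of documents containing the term
    """
    if tf == 0 or df == 0:
        return 0.0
    # Non-negative IDF variant: rare terms weigh more.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalisation: long documents need more occurrences to score well.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates instead of growing without bound.
    return idf * tf * (k1 + 1) / (tf + norm)
```

A document's score for a multi-term query is the sum of these per-term scores. Unlike a raw TF-IDF weight, the contribution of a single term is bounded above by `idf * (k1 + 1)` no matter how often the term repeats.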
  • Other New Items
    Many new Analyzers (tokenizers, etc.)
    Richer Language support (Hindi, Indonesian, Arabic, …)
    Richer Geospatial (Local) Search capabilities
    Score, filter, sort by distance
    http://wiki.apache.org/solr/SpatialSearch
    Results Grouping
    Group Related Results
    http://wiki.apache.org/solr/FieldCollapsing
    More Faceting Capabilities
    Pivot
    New underlying algorithms
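
The "filter, sort by distance" part of geospatial search can be sketched with the haversine great-circle formula. The document layout (`id`, `lat`, `lon` keys), the function names, and the sample coordinates are assumptions for illustration; Solr's spatial support is configured through field types and query parameters rather than code like this.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_distance(docs, lat, lon, max_km):
    """Keep documents within max_km of the point, nearest first; returns their ids."""
    hits = []
    for doc in docs:
        distance = haversine_km(lat, lon, doc["lat"], doc["lon"])
        if distance <= max_km:
            hits.append((distance, doc["id"]))
    return [doc_id for _, doc_id in sorted(hits)]

# Approximate coordinates, used only as sample data.
places = [
    {"id": "raleigh", "lat": 35.7796, "lon": -78.6382},
    {"id": "syracuse", "lat": 43.0481, "lon": -76.1474},
    {"id": "chapel-hill", "lat": 35.9132, "lon": -79.0558},
]
print(within_distance(places, 35.9940, -78.8986, max_km=50))
# ['chapel-hill', 'raleigh']
```

Scoring by distance follows the same pattern: instead of discarding the computed distance after sorting, it is folded into the relevance score, for instance by boosting closer results.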
  • How can Lucene/Solr help me?
  • Job Trends
    http://www.indeed.com
  • Other Things that Can Help
    Nutch
    Crawling
    http://nutch.apache.org
    Mahout
    Machine learning (clustering, classification, others)
    http://mahout.apache.org
    OpenNLP
    Part of Speech, Parsers, Named Entity Recognition
    http://incubator.apache.org/opennlp
    Open Relevance Project
    Relevance Judgments
    http://lucene.apache.org/openrelevance
  • Resources
    http://lucene.apache.org
    http://www.lucidimagination.com
    {java-user|solr-user}@lucene.apache.org
    @gsingers
    http://www.slideshare.net/gsingers
    grant@lucidimagination.com