TriHUG: Lucene Solr Hadoop

High level talk with references on using Apache Hadoop with Apache Lucene, Solr, Nutch, etc.

  • 1. Where It All Began
    Using Apache Hadoop for Search with Apache Lucene and Solr
  • 2. Topics
    What is:
    Apache Lucene?
    Apache Nutch?
    Apache Solr?
    Where does Hadoop (ecosystem) fit?
  • 3. Search 101
    Search tools are designed for dealing with fuzzy data
    Works well with structured and unstructured data
    Performs well when dealing with large volumes of data
    Many apps don’t need the limits that databases place on content
    Search fits well alongside a DB too
    Given a user’s information need (query), find and, optionally, score content relevant to that need
    Many different ways to solve this problem, each with tradeoffs
    What does “relevant” mean?
  • 4. Search 101
    Inverted index: maps terms to the documents that contain them
    Conceptually similar to a book index
    At the heart of fast search/retrieve
    Vector Space Model (VSM) for relevance
    Common across many search engines
    Apache Lucene is a highly optimized implementation of the VSM
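
To make the two ideas on this slide concrete, here is a minimal sketch of an inverted index with vector-space scoring. It is an illustration only: the function names (`tokenize`, `build_index`, `idf`, `search`) and the sample documents are hypothetical, and the tf-idf weighting shown is a textbook variant, not Lucene's actual scoring formula.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: whitespace tokenization and lowercasing are enough here.
    return text.lower().split()

def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def idf(index, num_docs, term):
    # Rare terms get a higher weight; unknown terms get zero.
    df = len(index.get(term, {}))
    return math.log(1 + num_docs / df) if df else 0.0

def search(index, num_docs, query):
    """Rank documents by cosine similarity of tf-idf vectors."""
    # Length of each document vector, used for normalization.
    norms = defaultdict(float)
    for term, postings in index.items():
        w = idf(index, num_docs, term)
        for doc_id, tf in postings.items():
            norms[doc_id] += (tf * w) ** 2
    # Dot product of query and document vectors, visiting only the
    # postings lists of the query terms.
    scores = defaultdict(float)
    for term, qtf in Counter(tokenize(query)).items():
        w = idf(index, num_docs, term)
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += (qtf * w) * (tf * w)
    # The query norm is the same for every document, so it is omitted;
    # the ranking is unchanged.
    ranked = [(d, s / math.sqrt(norms[d])) for d, s in scores.items()]
    return sorted(ranked, key=lambda hit: hit[1], reverse=True)

docs = {
    1: "apache lucene search library",
    2: "apache hadoop distributed processing",
    3: "solr search server built on lucene",
}
index = build_index(docs)
results = search(index, len(docs), "lucene search")
```

Only documents that share at least one term with the query are ever touched, which is why the index sits at the heart of fast retrieval.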
  • 5. Lucene
    Lucene is a mature, high-performance Java API that provides search capabilities to applications
    Supports indexing, searching and a number of other commonly used search features (highlighting, spell checking, etc.)
    Not a crawler and doesn’t know anything about Adobe PDF, MS Word, etc.
    Created in 1997 and now part of the Apache Software Foundation
    Important to note that Lucene does not have distributed index (shard) support
  • 6. Nutch
    ASF project aimed at providing large scale crawling, indexing and searching using Lucene and other technologies
    Mike Cafarella and Doug Cutting originally created Hadoop as part of Nutch based on the Google paper by Dean and Ghemawat
    Only much later did it spin out to become the Hadoop that we all know
    In other words, Hadoop was born from the need to scale search crawling and indexing
    Originally used Lucene for search/indexing, now uses Solr
  • 7. Solr
    Solr is the Lucene-based search server providing the infrastructure required for most users to work with Lucene
    Without knowing Java!
    Also provides:
    Easy setup and configuration
    Lucene Best Practices
  • 8. Lucene Basics
    Content is modeled via Documents and Fields
    Content can be text, integers, floats, dates, custom
    Analysis can be employed to alter content before indexing
    Searches are supported through a wide range of Query options
    Wildcards, other
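
As a rough illustration of documents, fields, analysis and a wildcard query, the sketch below uses plain Python rather than the Lucene API. The names `analyze`, `wildcard_match` and `STOPWORDS`, the stopword list, and the sample document are all assumptions made for the example.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical stopword list; real analyzers make this configurable.
STOPWORDS = {"a", "an", "the", "of", "and"}

def analyze(text):
    """Analysis chain: lowercase, split on non-alphanumerics, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def wildcard_match(pattern, terms):
    """Return the indexed terms matching a shell-style wildcard pattern."""
    return [t for t in terms if fnmatchcase(t, pattern)]

# A document is a set of named fields; values can be text, numbers, dates.
doc = {"id": "doc-1", "title": "The Anatomy of a Search Engine", "year": 2010}

# Only the analyzed form of the text field ends up in the index.
indexed_terms = analyze(doc["title"])
matches = wildcard_match("s*", indexed_terms)
```

The same analysis normally has to be applied to query text as well, otherwise query terms will not line up with the indexed terms.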
  • 9. Quick Solr Demo
    Apache Ant 1.7.x
    svn co
    ant example
    cd example
    java -jar start.jar
    cd exampledocs; java -jar post.jar *.xml
  • 10. Anatomy of a Distributed Search System
    [Diagram: input docs, fan in/out, sharding algorithm, coordination layer]
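
The fan-out/fan-in step of such a system can be sketched as a scatter-gather: send the query to every shard, collect each shard's local top hits, and merge them. This is a toy under stated assumptions: shards are in-memory dicts, `search_shard` and `distributed_search` are hypothetical names, and the score is a simple term count.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, query, k):
    """Return one shard's local top-k hits as (score, doc_id) pairs."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in shard.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)

def distributed_search(shards, query, k=3):
    """Fan the query out to all shards, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, query, k), shards))
    # Fan in: the global top-k is contained in the union of local top-k lists.
    return heapq.nlargest(k, (hit for part in partials for hit in part))

shards = [
    {"a": "solr search search"},
    {"b": "hadoop search"},
    {"c": "zookeeper"},
]
top = distributed_search(shards, "search", k=2)
```

Merging like this is only meaningful when every shard scores hits on a comparable scale; statistics that depend on the whole collection need extra care in a real system.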

  • 11. Sharding Algorithm
    Good document distribution across shards is important
    Simple approach:
    hash(id) % numShards
    Fine if the number of shards doesn’t change or if reindexing is easy
    Consistent Hashing
    Also key: how to deal with the shape/size of the cluster changing
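
The difference between the two approaches shows up when the cluster grows. The sketch below compares `hash(id) % numShards` with a simple consistent-hash ring when a fifth shard is added. The class name `HashRing`, the shard names, the use of MD5 and the choice of 100 virtual nodes per shard are all assumptions for illustration.

```python
import bisect
import hashlib

def stable_hash(key):
    # Python's built-in hash() is salted per process, so use a stable digest.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def mod_shard(doc_id, num_shards):
    """Simple approach: hash(id) % numShards."""
    return stable_hash(doc_id) % num_shards

class HashRing:
    """Consistent hashing: each shard owns many points on a ring."""

    def __init__(self, shards, vnodes=100):
        self._points = sorted(
            (stable_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def shard_for(self, doc_id):
        # The first ring point past the document's hash owns it (wrapping).
        i = bisect.bisect(self._hashes, stable_hash(doc_id)) % len(self._hashes)
        return self._points[i][1]

doc_ids = [f"doc-{n}" for n in range(1000)]

# Growing from 4 to 5 shards with modulo hashing reassigns most documents.
moved_mod = sum(mod_shard(d, 4) != mod_shard(d, 5) for d in doc_ids)

# With the ring, only documents claimed by the new shard move.
old_ring = HashRing(["s0", "s1", "s2", "s3"])
new_ring = HashRing(["s0", "s1", "s2", "s3", "s4"])
moved_ring = [d for d in doc_ids if old_ring.shard_for(d) != new_ring.shard_for(d)]

print(f"modulo: {moved_mod} moved, ring: {len(moved_ring)} moved")
```

With modulo hashing a document stays put only if its hash gives the same remainder for both shard counts, so a resize is close to a full reindex. With the ring, every document that moves goes to the newly added shard and the rest stay where they were.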
  • 12. Hadoop and Search
    Much of the Hadoop ecosystem is useful for search related functionality
    Indexing: the process of adding documents to the inverted index to make them searchable
    In most cases, batch-oriented and embarrassingly parallel, so Hadoop Core can help
    Search: querying the index and returning documents and other info (facets, etc.) related to the result set
    Subsecond response time usually required
    ZooKeeper, Avro and others are still useful
  • 13. Indexing (Lucene)
    Hadoop ships with contrib/index
    Almost no documentation, but…
    Good example of map-side indexing
    Mapper does analysis and creates in memory index which is written out to segments
    Indexes merged on the reduce side
    Shard management, distributed search, etc.
    Both give you a large amount of control, but you have to build out all of the search framework around them
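
The map-side indexing pattern described above can be simulated in a few lines. This is not the contrib/index API: `map_build_segment` and `reduce_merge` are hypothetical names, the "segments" are plain in-memory dicts, and the input splits are hard-coded.

```python
from collections import defaultdict

def map_build_segment(split):
    """Mapper: analyze the documents in one input split and build a
    small in-memory index (a 'segment') for just that split."""
    segment = defaultdict(set)
    for doc_id, text in split:
        for term in text.lower().split():
            segment[term].add(doc_id)
    return segment

def reduce_merge(segments):
    """Reducer: merge the per-split segments into one index."""
    merged = defaultdict(set)
    for segment in segments:
        for term, doc_ids in segment.items():
            merged[term] |= doc_ids
    return {term: sorted(doc_ids) for term, doc_ids in merged.items()}

# Each inner list stands in for one input split handled by one mapper.
splits = [
    [(1, "hadoop index"), (2, "lucene index")],
    [(3, "solr index")],
]
segments = [map_build_segment(split) for split in splits]
merged = reduce_merge(segments)
```

Because each mapper works only on its own split, the expensive analysis step is embarrassingly parallel; only the merge needs to see more than one segment.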
  • 14. Indexing (Solr)
    Map side formats
    Reduce-side indexing
    Creates indexes on local file system (outside of HDFS) and copies to default FS (HDFS, etc.)
    Manually install index into a Solr core once built
    Map-side indexing
    Incomplete, but based on Hadoop contrib/index
    Write a distributed Update Handler to handle on the server side
  • 15. Indexing (Nutch to Solr)
    Use Nutch to crawl content, Solr to index and serve
    Doesn’t support indexing to Solr shards just yet
    Need to write/use Solr distributed Update Handler
    Still useful for smaller crawls (< 100M pages)
  • 16. Searching
    Hadoop Core is not all that useful for distributed search
    Exception: Hadoop RPC layer, possibly
    Exception: Log analysis, etc. for search related items
    Other Hadoop ecosystem tools are useful:
    Apache ZooKeeper (more in a moment)
    HDFS – storage of shards (pull down to local disk)
    Avro, Thrift, Protocol Buffers (serialization utilities)
  • 17. ZooKeeper and Search
    ZooKeeper is a centralized service for coordination, configuration, naming and distributed synchronization
    In the context of search, it’s useful for:
    Sharing configuration across nodes
    Maintaining status about shards
    Up/down/latency/rebalancing and more
    Coordinating searches across shards/load balancing
  • 18. ZooKeeper and Search (Practical)
    Katta employs ZooKeeper for search coordination, etc.
    Query distribution, status, etc.
    Solr Cloud
    All the benefits of Solr + ZooKeeper for coordinating distributed capabilities
    Query distribution, configuration sharing, status, etc.
    About to be committed to Solr trunk
  • 19. Other Search Related Tasks
    Log Analysis
    Query analytics
    Related Searches
    Relevance assessments
    Classification and Clustering
    Mahout –
    HBase and other stores for documents
    Avro, Thrift, Protocol Buffers for serialization of objects across the wire
  • 20. Resources