TriHUG: Lucene Solr Hadoop
High level talk with references on using Apache Hadoop with Apache Lucene, Solr, Nutch, etc.

Presentation Transcript

  • Where It All Began
    Using Apache Hadoop for Search with Apache Lucene and Solr
  • Topics
    Search
    What is:
    Apache Lucene?
    Apache Nutch?
    Apache Solr?
    Where does Hadoop (ecosystem) fit?
    Indexing
    Search
    Other
  • Search 101
    Search tools are designed for dealing with fuzzy data
    Work well with both structured and unstructured data
    Perform well with large volumes of data
    Many apps don’t need the rigid structure that databases impose on content
    Search also fits well alongside a DB
    Given a user’s information need (the query), find and, optionally, score content relevant to that need
    Many different ways to solve this problem, each with tradeoffs
    What does “relevant” mean?
  • Search 101
    Indexing
    Maps terms to the documents that contain them
    Conceptually similar to a book index
    At the heart of fast search/retrieval
    Relevance
    Vector Space Model (VSM) for relevance
    Common across many search engines
    Apache Lucene is a highly optimized implementation of the VSM (see the scoring sketch below)
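To make “VSM for relevance” concrete: score a document by the cosine of the angle between tf-idf weighted term vectors for the query and the document. This is the textbook formulation, not Lucene’s exact practical scoring function, which layers field norms, boosts, and a coordination factor on top:

    % tf-idf weight of term t in document d, over a corpus of N documents
    \[ w_{t,d} \;=\; \mathrm{tf}(t,d) \cdot \log\frac{N}{\mathrm{df}(t)} \]
    % relevance score: cosine similarity of the query and document vectors
    \[ \mathrm{score}(q,d) \;=\; \frac{\vec{V}(q)\cdot\vec{V}(d)}{\lVert \vec{V}(q)\rVert \,\lVert \vec{V}(d)\rVert} \]

Here tf(t,d) is how often term t occurs in d, and df(t) is how many of the N documents contain t, so rare terms contribute more to the score than common ones.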
  • Lucene is a mature, high-performance Java library for adding search capabilities to applications
    Supports indexing, searching and a number of other commonly used search features (highlighting, spell checking, etc.)
    Not a crawler and doesn’t know anything about Adobe PDF, MS Word, etc.
    Created in 1997 and now part of the Apache Software Foundation
    Important to note that Lucene does not have distributed index (shard) support
  • Nutch
    ASF project aimed at providing large scale crawling, indexing and searching using Lucene and other technologies
    Mike Cafarella and Doug Cutting originally created Hadoop as part of Nutch, based on the Google MapReduce paper by Dean and Ghemawat
    http://labs.google.com/papers/mapreduce.html
    Only much later did it spin out to become the Hadoop that we all know
    In other words, Hadoop was born from the need to scale search crawling and indexing
    Originally used Lucene for search/indexing, now uses Solr
  • Solr
    Solr is the Lucene-based search server that provides the infrastructure most users need to work with Lucene
    Without knowing Java!
    Also provides:
    Easy setup and configuration
    Faceting
    Highlighting
    Replication/Sharding
    Lucene Best Practices
    http://search.lucidimagination.com
  • Lucene Basics
    Content is modeled via Documents and Fields
    Content can be text, integers, floats, dates, custom
    Analysis can be employed to alter content before indexing
    Searches are supported through a wide range of Query options
    Keyword
    Terms
    Phrases
    Wildcards, and more (a minimal indexing/search sketch follows)
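To make the Document/Field/Query flow above concrete, a minimal indexing-and-search sketch against the Lucene 3.x-era API (the RAMDirectory, field names, and query term are illustrative; later Lucene releases reorganized some of these classes):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class LuceneBasics {
      public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();  // in-memory index, demo only

        // Indexing: content is modeled as Documents made of Fields; the
        // analyzer tokenizes and lowercases the text before indexing.
        IndexWriter writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_30),
            IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("title", "Using Apache Hadoop for Search",
            Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Searching: a single-term query against the "title" field.
        IndexSearcher searcher = new IndexSearcher(dir);
        TopDocs hits = searcher.search(
            new TermQuery(new Term("title", "hadoop")), 10);
        for (ScoreDoc sd : hits.scoreDocs) {
          System.out.println(searcher.doc(sd.doc).get("title"));
        }
        searcher.close();
      }
    }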
  • Quick Solr Demo
    Pre-reqs:
    Apache Ant 1.7.x
    SVN
    svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
    cd solr-trunk/solr/
    ant example
    cd example
    java -jar start.jar
    (in a second terminal, from solr-trunk/solr/example/) cd exampledocs; java -jar post.jar *.xml
    http://localhost:8983/solr/browse
  • Anatomy of a Distributed Search System
    [Architecture diagram: users and input docs enter through the application; a sharding algorithm routes documents to the indexers for Shard[0]..Shard[n]; a coordination layer fans queries out to the searchers over those same shards and fans the results back in]
    A minimal fan-out/fan-in sketch of the query path follows.
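The Hit and Shard types in this sketch are hypothetical stand-ins; a real system would put an RPC client (Hadoop RPC, Thrift, etc.) behind the Shard interface and add timeouts and retries:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class FanOutSearcher {

      // Hypothetical hit and shard abstractions.
      public static class Hit {
        final String docId;
        final float score;
        Hit(String docId, float score) { this.docId = docId; this.score = score; }
      }

      public interface Shard {
        List<Hit> search(String query, int topN) throws Exception;
      }

      private final List<Shard> shards;
      private final ExecutorService pool;

      public FanOutSearcher(List<Shard> shards) {
        this.shards = shards;
        this.pool = Executors.newFixedThreadPool(shards.size());
      }

      // Fan out: send the query to every shard in parallel.
      // Fan in: merge the per-shard top-N lists into a global top N by score.
      public List<Hit> search(String query, int topN) throws Exception {
        List<Future<List<Hit>>> futures = new ArrayList<>();
        for (Shard shard : shards) {
          futures.add(pool.submit(() -> shard.search(query, topN)));
        }
        List<Hit> merged = new ArrayList<>();
        for (Future<List<Hit>> f : futures) {
          merged.addAll(f.get());  // a production system would add timeouts here
        }
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return merged.subList(0, Math.min(topN, merged.size()));
      }
    }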
  • Sharding Algorithm
    Good document distribution across shards is important
    Simple approach:
    hash(id) % numShards
    Fine if the number of shards never changes, or if reindexing is cheap
    Better:
    Consistent Hashing
    http://en.wikipedia.org/wiki/Consistent_hashing
    Also key: handling changes in the shape/size of the cluster (a minimal consistent-hashing sketch follows)
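A minimal consistent-hashing sketch. The MD5-based hash and the virtual-node count are illustrative choices; the point is that adding or removing a shard only remaps the keys on the affected arcs of the ring, where hash(id) % numShards remaps almost everything:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class ConsistentHashRing {
      private final SortedMap<Long, String> ring = new TreeMap<>();
      private final int virtualNodes;  // replicas per shard smooth the distribution

      public ConsistentHashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

      public void addShard(String shard) {
        for (int i = 0; i < virtualNodes; i++) ring.put(hash(shard + "#" + i), shard);
      }

      public void removeShard(String shard) {
        for (int i = 0; i < virtualNodes; i++) ring.remove(hash(shard + "#" + i));
      }

      // Route a document id to the first shard clockwise from its hash point.
      public String shardFor(String docId) {
        if (ring.isEmpty()) throw new IllegalStateException("no shards");
        SortedMap<Long, String> tail = ring.tailMap(hash(docId));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
      }

      // First 8 bytes of an MD5 digest as a signed long; any well-mixed
      // hash works as long as it is used consistently.
      private static long hash(String key) {
        try {
          byte[] d = MessageDigest.getInstance("MD5")
              .digest(key.getBytes(StandardCharsets.UTF_8));
          long h = 0;
          for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);
          return h;
        } catch (Exception e) { throw new RuntimeException(e); }
      }
    }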
  • Hadoop and Search
    Much of the Hadoop ecosystem is useful for search related functionality
    Indexing
    Process of adding documents to inverted index to make them searchable
    In most cases, batch-oriented and embarrassingly parallel, so Hadoop Core can help
    Search
    Query the index and return documents and other info (facets, etc.) related to the result set
    Subsecond response time usually required
    ZooKeeper, Avro and others are still useful
  • Indexing (Lucene)
    Hadoop ships with contrib/index
    Almost no documentation, but…
    Good example of map-side indexing
    The mapper does analysis and builds an in-memory index, which is written out as segments
    Indexes are merged on the reduce side
    Katta
    http://katta.sourceforge.net
    Shard management, distributed search, etc.
    Both give you a large amount of control, but you have to build the search framework around them (a map-side sketch follows)
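A minimal sketch of the map-side pattern, written against the org.apache.hadoop.mapreduce API and Lucene 3.x-era classes. The shards/ output layout and one-index-per-map-task scheme are illustrative assumptions, not contrib/index’s actual code:

    import java.io.File;
    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    // Each map task analyzes its input split into a local Lucene index,
    // then copies the finished segment up to the job's file system (HDFS)
    // where a later step can merge per-shard indexes.
    public class IndexingMapper extends Mapper<LongWritable, Text, Text, Text> {
      private File localIndex;
      private IndexWriter writer;

      @Override
      protected void setup(Context ctx) throws IOException {
        localIndex = new File("index-" + ctx.getTaskAttemptID().getTaskID());
        writer = new IndexWriter(FSDirectory.open(localIndex),
            new StandardAnalyzer(Version.LUCENE_30),
            IndexWriter.MaxFieldLength.UNLIMITED);
      }

      @Override
      protected void map(LongWritable key, Text line, Context ctx) throws IOException {
        // Analysis happens here in the mapper, in parallel across the cluster.
        Document doc = new Document();
        doc.add(new Field("body", line.toString(),
            Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
      }

      @Override
      protected void cleanup(Context ctx) throws IOException {
        writer.close();
        // Ship the local segment to a hypothetical shards/ directory on HDFS.
        FileSystem fs = FileSystem.get(ctx.getConfiguration());
        fs.copyFromLocalFile(new Path(localIndex.getPath()),
            new Path("shards/" + localIndex.getName()));
      }
    }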
  • Indexing (Solr)
    https://issues.apache.org/jira/browse/SOLR-1301
    Map-side formats
    Reduce-side indexing
    Creates indexes on the local file system (outside HDFS) and copies them to the default FS (HDFS, etc.)
    Manually install the index into a Solr core once it is built
    https://issues.apache.org/jira/browse/SOLR-1045
    Map-side indexing
    Incomplete, but based on Hadoop contrib/index
    Write a distributed Update Handler to handle distribution on the server side (a reduce-side merge sketch follows)
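To illustrate what the reduce side has to do, a minimal Lucene-level sketch (3.x-era API) of merging per-mapper segment indexes into one shard index on local disk. SOLR-1301 itself builds indexes through Solr rather than raw Lucene, so treat this as the pattern, not the patch:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class ShardMerger {
      // Merge the segment indexes produced by the mappers for one shard
      // into a single local index, then optimize it for query speed.
      public static void merge(File shardDir, File[] segmentDirs) throws Exception {
        IndexWriter writer = new IndexWriter(FSDirectory.open(shardDir),
            new StandardAnalyzer(Version.LUCENE_30),
            IndexWriter.MaxFieldLength.UNLIMITED);
        Directory[] segments = new Directory[segmentDirs.length];
        for (int i = 0; i < segmentDirs.length; i++) {
          segments[i] = FSDirectory.open(segmentDirs[i]);
        }
        writer.addIndexesNoOptimize(segments);  // merge without rewriting everything
        writer.optimize();                      // collapse to fewer segments
        writer.close();
      }
    }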
  • Indexing (Nutch to Solr)
    Use Nutch to crawl content, Solr to index and serve
    Doesn’t support indexing to Solr shards just yet
    Need to write/use Solr distributed Update Handler
    Still useful for smaller crawls (< 100M pages)
    http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
  • Searching
    Hadoop Core is not all that useful for distributed search
    Exception: Hadoop RPC layer, possibly
    Exception: Log analysis, etc. for search related items
    Other Hadoop ecosystem tools are useful:
    Apache ZooKeeper (more in a moment)
    HDFS: storage of shards (pull down to local disk; see the sketch below)
    Avro, Thrift, Protocol Buffers (serialization utilities)
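A minimal sketch of pulling a shard from HDFS to local disk before opening it, since Lucene wants the random access that local disk provides (the paths and the 3.x-era IndexSearcher constructor are illustrative):

    import java.io.File;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class ShardLoader {
      // Copy a shard index out of HDFS and open it for local searching.
      public static IndexSearcher open(String hdfsShard, String localDir)
          throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyToLocalFile(new Path(hdfsShard), new Path(localDir));
        return new IndexSearcher(FSDirectory.open(new File(localDir)));
      }
    }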
  • ZooKeeper and Search
    ZooKeeper is a centralized service for coordination, configuration, naming and distributed synchronization
    In the context of search, it’s useful for:
    Sharing configuration across nodes
    Maintaining status about shards
    Up/down/latency/rebalancing and more
    Coordinating searches across shards / load balancing (a registration sketch follows)
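A minimal sketch of shard status tracking with the ZooKeeper client API: each searcher registers an ephemeral znode that disappears if its session dies, so the cluster’s view of live shards stays current. The /shards path layout is a hypothetical choice, and its parent node must already exist:

    import java.util.List;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ShardRegistry implements Watcher {
      private final ZooKeeper zk;

      public ShardRegistry(String zkHosts) throws Exception {
        zk = new ZooKeeper(zkHosts, 3000, this);
      }

      // Ephemeral znode: removed automatically when this node's session
      // dies, so no separate heartbeat code is needed.
      public void register(String shardName, String hostPort) throws Exception {
        zk.create("/shards/" + shardName, hostPort.getBytes(),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      }

      // Shards that are currently up; the boolean re-arms the watch so
      // process() fires when a shard joins or leaves.
      public List<String> liveShards() throws Exception {
        return zk.getChildren("/shards", true);
      }

      @Override
      public void process(WatchedEvent event) {
        // React to shard up/down events here: rebalance, update routing, etc.
      }
    }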
  • ZooKeeper and Search (Practical)
    Katta employs ZooKeeper for search coordination, etc.
    Query distribution, status, etc.
    Solr Cloud
    All the benefits of Solr + ZooKeeper for coordinating distributed capabilities
    Query distribution, configuration sharing, status, etc.
    About to be committed to Solr trunk
    http://wiki.apache.org/solr/SolrCloud
  • Other Search Related Tasks
    Log Analysis
    Query analytics
    Related Searches
    Relevance assessments
    Classification and Clustering
    Mahout – http://mahout.apache.org
    HBase and other stores for documents
    Avro, Thrift, Protocol Buffers for serialization of objects across the wire
  • Resources
    http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
    http://hadoop.apache.org
    http://nutch.apache.org
    http://lucene.apache.org
    http://www.lucidimagination.com