TriHUG: Lucene Solr Hadoop

High-level talk, with references, on using Apache Hadoop with Apache Lucene, Solr, Nutch, etc.

Published in: Technology

  • Transcript

    • 1. Where It All Began
      Using Apache Hadoop for Search with Apache Lucene and Solr
    • 2. Topics
      Search
      What is:
      Apache Lucene?
      Apache Nutch?
      Apache Solr?
      Where does Hadoop (ecosystem) fit?
      Indexing
      Search
      Other
    • 3. Search 101
      Search tools are designed for dealing with fuzzy data
      Works well with structured and unstructured data
      Performs well when dealing with large volumes of data
      Many apps don’t need the limits that databases place on content
      Search fits well alongside a DB too
      Given a user’s information need (the query), find and, optionally, score content relevant to that need
      Many different ways to solve this problem, each with tradeoffs
      What’s “relevant” mean?
    • 4. Search 101
      Relevance
      Indexing
      Finds and maps terms and documents
      Conceptually similar to a book index
      At the heart of fast search/retrieve
      Vector Space Model (VSM) for relevance
      Common across many search engines
      Apache Lucene is a highly optimized implementation of the VSM
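The two ideas on this slide, an index that maps terms to documents and a vector-space score for relevance, can be sketched in a few lines. This is a minimal illustration using textbook tf-idf weights and cosine similarity; it is not Lucene's actual scoring formula, and the function names and sample documents are made up for the example.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real analysis)."""
    return text.lower().split()

def build_index(docs):
    """Map each term to {doc_id: term frequency}, much like a book index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def idf(index, term, num_docs):
    """Inverse document frequency; rarer terms get a higher weight."""
    df = len(index.get(term, {}))
    return math.log(num_docs / df) + 1.0 if df else 0.0

def search(index, docs, query):
    """Rank documents by cosine similarity between tf-idf vectors."""
    n = len(docs)
    q_vec = {t: tf * idf(index, t, n) for t, tf in Counter(tokenize(query)).items()}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    dots = defaultdict(float)
    for term, q_weight in q_vec.items():
        # Only documents listed under the term are touched: this is the fast lookup.
        for doc_id, tf in index.get(term, {}).items():
            dots[doc_id] += q_weight * tf * idf(index, term, n)
    results = []
    for doc_id, dot in dots.items():
        d_vec = Counter(tokenize(docs[doc_id]))
        d_norm = math.sqrt(sum((tf * idf(index, t, n)) ** 2 for t, tf in d_vec.items()))
        results.append((doc_id, dot / (q_norm * d_norm)))
    return sorted(results, key=lambda r: (-r[1], r[0]))

# Hypothetical sample documents, keyed by id.
docs = {
    1: "hadoop scales search indexing",
    2: "solr is a search server",
    3: "zookeeper coordinates nodes",
}
index = build_index(docs)
hits = search(index, docs, "search server")
```

With these sample documents, the query "search server" ranks document 2 above document 1, and document 3 is never scored because none of its terms appear in the query.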
    • 5. Lucene
      Lucene is a mature, high-performance Java API that provides search capabilities to applications
      Supports indexing, searching and a number of other commonly used search features (highlighting, spell checking, etc.)
      Not a crawler and doesn’t know anything about Adobe PDF, MS Word, etc.
      Created in 1997 and now part of the Apache Software Foundation
      Important to note that Lucene does not have distributed index (shard) support
    • 6. Nutch
      ASF project aimed at providing large scale crawling, indexing and searching using Lucene and other technologies
      Mike Cafarella and Doug Cutting originally created Hadoop as part of Nutch based on the Google paper by Dean and Ghemawat
      http://labs.google.com/papers/mapreduce.html
      Only much later did it spin out to become the Hadoop that we all know
      In other words, Hadoop was born from the need to scale search crawling and indexing
      Originally used Lucene for search/indexing, now uses Solr
    • 7. Solr
      Solr is the Lucene based search server providing the infrastructure required for most users to work with Lucene
      Without knowing Java!
      Also provides:
      Easy setup and configuration
      Faceting
      Highlighting
      Replication/Sharding
      Lucene Best Practices
      http://search.lucidimagination.com
    • 8. Lucene Basics
      Content is modeled via Documents and Fields
      Content can be text, integers, floats, dates, custom
      Analysis can be employed to alter content before indexing
      Searches are supported through a wide range of Query options
      Keyword
      Terms
      Phrases
      Wildcards, other
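To make the terms on this slide concrete, here is a small sketch of a document with fields, an analysis step that alters content before indexing, and the term, phrase and wildcard query styles. It is plain Python and does not use the Lucene API; the stop-word list, field names and helper functions are assumptions made for the example.

```python
import fnmatch
import re

STOP_WORDS = {"a", "an", "the", "of", "and"}  # hypothetical stop list

def analyze(text):
    """Toy analysis chain: split on non-alphanumerics, lowercase, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def term_match(tokens, term):
    """Term query: the exact (lowercased) term occurs in the field."""
    return term.lower() in tokens

def phrase_match(tokens, phrase):
    """Phrase query: the analyzed words occur next to each other, in order."""
    words = analyze(phrase)
    n = len(words)
    return n > 0 and any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

def wildcard_match(tokens, pattern):
    """Wildcard query: '*' matches any run of characters, '?' exactly one."""
    return any(fnmatch.fnmatchcase(t, pattern.lower()) for t in tokens)

# A document is a set of named fields; values can be text, numbers, dates, etc.
doc = {"id": "1", "title": "The Anatomy of a Distributed Search System", "year": 2010}
title_tokens = analyze(doc["title"])
```

Here `title_tokens` ends up as `["anatomy", "distributed", "search", "system"]`, so a term query for "Search", the phrase "distributed search" and the wildcard "dist*" all match, while the phrase "search distributed" does not.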
    • 9. Quick Solr Demo
      Pre-reqs:
      Apache Ant 1.7.x
      SVN
      svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
      cd solr-trunk/solr/
      ant example
      cd example
      java -jar start.jar
      cd exampledocs; java -jar post.jar *.xml
      http://localhost:8983/solr/browse
    • 10. Anatomy of a Distributed Search System
      [Architecture diagram. Labels: Users, Input Docs, Application, Fan In/Out, Sharding Alg., Coordination Layer, Searchers, Indexers, and two groups of Shard[0] ... Shard[n]]
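The fan-out/fan-in step in this picture can be sketched as follows: the query is sent to every shard in parallel, each shard returns its own top hits, and the partial lists are merged into one global top-k. This is a simplified illustration, not how Solr or Katta implement it; the shards are plain dictionaries and the score is just the raw term count, both assumptions made for the example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, query, k):
    """Return one shard's top-k (score, doc_id) pairs; score = count of the query term."""
    term = query.lower()
    hits = [(text.lower().split().count(term), doc_id) for doc_id, text in shard.items()]
    return heapq.nlargest(k, [h for h in hits if h[0] > 0])

def distributed_search(shards, query, k=3):
    """Fan the query out to every shard, then fan the partial results back in."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, query, k), shards)
        merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(k, merged)

# Two hypothetical shards, each mapping doc_id -> text.
shards = [
    {"a": "solr search", "b": "hadoop"},
    {"c": "search search lucene", "d": "search"},
]
top = distributed_search(shards, "search", k=2)
```

Each shard only needs to return k hits, because no document outside a shard's local top-k can appear in the global top-k. A real system also has to make scores comparable across shards, which this sketch sidesteps by using raw counts.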
    • 11. Sharding Algorithm
      Good document distribution across shards is important
      Simple approach:
      hash(id) % numShards
      Fine if the number of shards doesn’t change, or if it is easy to reindex
      Better:
      Consistent Hashing
      http://en.wikipedia.org/wiki/Consistent_hashing
      Also key: how to deal with the shape/size of the cluster changing
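The difference between the two approaches shows up when the cluster grows. The sketch below measures how many documents change shards when going from four shards to five, first with `hash(id) % numShards` and then with a simple consistent-hash ring. The ring implementation, the use of MD5 as the hash, the virtual-node count and the shard names are all assumptions for illustration.

```python
import bisect
import hashlib

def stable_hash(key):
    """Deterministic hash (Python's built-in hash() of a str varies between runs)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def mod_shard(doc_id, num_shards):
    """Simple approach: hash(id) % numShards."""
    return stable_hash(doc_id) % num_shards

class HashRing:
    """Consistent hashing: each shard owns many points on a ring of hash values."""

    def __init__(self, shards, vnodes=100):
        self._ring = sorted(
            (stable_hash(f"{shard}#{v}"), shard) for shard in shards for v in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, doc_id):
        # The first ring point clockwise from the document's hash owns the document.
        i = bisect.bisect(self._points, stable_hash(doc_id)) % len(self._ring)
        return self._ring[i][1]

def fraction_moved(before, after):
    """Share of documents whose shard assignment changed."""
    return sum(1 for b, a in zip(before, after) if b != a) / len(before)

ids = [f"doc-{i}" for i in range(10000)]

# Growing from 4 to 5 shards with modulo hashing.
mod_moved = fraction_moved(
    [mod_shard(d, 4) for d in ids], [mod_shard(d, 5) for d in ids]
)

# The same change with a consistent-hash ring.
old_ring = HashRing(["s0", "s1", "s2", "s3"])
new_ring = HashRing(["s0", "s1", "s2", "s3", "s4"])
ring_before = [old_ring.shard_for(d) for d in ids]
ring_after = [new_ring.shard_for(d) for d in ids]
ring_moved = fraction_moved(ring_before, ring_after)
```

With modulo hashing roughly four out of five documents land on a different shard, which in practice means reindexing nearly everything. With the ring, only about one in five moves, and every document that moves goes to the new shard.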
    • 12. Hadoop and Search
      Much of the Hadoop ecosystem is useful for search related functionality
      Indexing
      Process of adding documents to inverted index to make them searchable
      In most cases, batch-oriented and embarrassingly parallel, so Hadoop Core can help
      Search
      Query the index and return documents and other info (facets, etc.) related to the result set
      Subsecond response time usually required
      ZooKeeper, Avro and others are still useful
    • 13. Indexing (Lucene)
      Hadoop ships with contrib/index
      Almost no documentation, but…
      Good example of map-side indexing
      Mapper does analysis and creates an in-memory index, which is written out to segments
      Indexes merged on the reduce side
      Katta
      http://katta.sourceforge.net
      Shard management, distributed search, etc.
      Both give you a large amount of control, but you have to build out the whole search framework around them
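The map-side pattern described above can be sketched without Hadoop: each mapper analyzes its own input split and builds a small in-memory index segment, and the reduce step merges the segments. This is not the contrib/index code; it is a plain-Python illustration with hypothetical function names and sample data.

```python
from collections import defaultdict

def map_build_segment(split):
    """Mapper: analyze one input split and build an in-memory index segment."""
    segment = defaultdict(set)
    for doc_id, text in split:
        for term in text.lower().split():
            segment[term].add(doc_id)
    return segment

def reduce_merge(segments):
    """Reducer: merge the segments into one index with sorted posting lists."""
    merged = defaultdict(set)
    for segment in segments:
        for term, postings in segment.items():
            merged[term] |= postings
    return {term: sorted(doc_ids) for term, doc_ids in merged.items()}

# Two hypothetical input splits of (doc_id, text) records.
splits = [
    [(1, "hadoop indexing"), (2, "lucene indexing")],
    [(3, "solr search"), (4, "hadoop search")],
]
# On a cluster the mappers would run in parallel, one per split.
segments = [map_build_segment(split) for split in splits]
index = reduce_merge(segments)
```

Because the splits are independent, the expensive analysis work parallelizes cleanly, and only the comparatively cheap merge happens on the reduce side.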
    • 14. Indexing (Solr)
      https://issues.apache.org/jira/browse/SOLR-1301
      Map side formats
      Reduce-side indexing
      Creates indexes on local file system (outside of HDFS) and copies to default FS (HDFS, etc.)
      Manually install index into a Solr core once built
      https://issues.apache.org/jira/browse/SOLR-1045
      Map-side indexing
      Incomplete, but based on Hadoop contrib/index
      Write a distributed Update Handler to handle on the server side
    • 15. Indexing (Nutch to Solr)
      Use Nutch to crawl content, Solr to index and serve
      Doesn’t support indexing to Solr shards just yet
      Need to write/use Solr distributed Update Handler
      Still useful for smaller crawls (< 100M pages)
      http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
    • 16. Searching
      Hadoop Core is not all that useful for distributed search
      Exception: Hadoop RPC layer, possibly
      Exception: Log analysis, etc. for search related items
      Other Hadoop ecosystem tools are useful:
      Apache ZooKeeper (more in a moment)
      HDFS – storage of shards (pull down to local disk)
      Avro, Thrift, Protocol Buffers (serialization utilities)
    • 17. ZooKeeper and Search
      ZooKeeper is a centralized service for coordination, configuration, naming and distributed synchronization
      In the context of search, it’s useful for:
      Sharing configuration across nodes
      Maintaining status about shards
      Up/down/latency/rebalancing and more
      Coordinating searches across shards/load balancing
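As a rough picture of the bookkeeping involved, the toy class below keeps shared configuration and per-shard node status in memory and picks a live node for a query. It is only a stand-in to show the kind of state a coordination service holds; it is not the ZooKeeper API, and every name in it is hypothetical.

```python
class ToyCoordinator:
    """In-memory stand-in for a coordination service (NOT the ZooKeeper API)."""

    def __init__(self):
        self.config = {}   # configuration shared by all nodes
        self._live = {}    # shard name -> {node name: last reported latency in ms}

    def set_config(self, key, value):
        """Publish a configuration value that every node can read."""
        self.config[key] = value

    def register(self, shard, node, latency_ms):
        """A node announces that it is up and serving a shard."""
        self._live.setdefault(shard, {})[node] = latency_ms

    def unregister(self, node):
        """A node goes down; drop it from every shard it was serving."""
        for nodes in self._live.values():
            nodes.pop(node, None)

    def pick_node(self, shard):
        """Simple load balancing: the live node with the lowest latency, or None."""
        nodes = self._live.get(shard, {})
        return min(nodes, key=nodes.get) if nodes else None

coord = ToyCoordinator()
coord.set_config("schema_version", 3)
coord.register("shard0", "node-a", latency_ms=12)
coord.register("shard0", "node-b", latency_ms=7)
first_choice = coord.pick_node("shard0")   # node-b has the lower latency
coord.unregister("node-b")                 # node-b goes down
fallback = coord.pick_node("shard0")
```

In a real deployment this state would live in ZooKeeper so that it survives the failure of any single search node and all nodes see the same view.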
    • 18. ZooKeeper and Search (Practical)
      Katta employs ZooKeeper for search coordination, etc.
      Query distribution, status, etc.
      Solr Cloud
      All the benefits of Solr + ZooKeeper for coordinating distributed capabilities
      Query distribution, configuration sharing, status, etc.
      About to be committed to Solr trunk
      http://wiki.apache.org/solr/SolrCloud
    • 19. Other Search Related Tasks
      Log Analysis
      Query analytics
      Related Searches
      Relevance assessments
      Classification and Clustering
      Mahout – http://mahout.apache.org
      HBase and other stores for documents
      Avro, Thrift, Protocol Buffers for serialization of objects across the wire
    • 20. Resources
      http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
      http://hadoop.apache.org
      http://nutch.apache.org
      http://lucene.apache.org
      http://www.lucidimagination.com
