TriHUG: Lucene Solr Hadoop


High-level talk, with references, on using Apache Hadoop with Apache Lucene, Solr, Nutch, etc.

Published in: Technology

    1. Where It All Began
       Using Apache Hadoop for Search with Apache Lucene and Solr
    2. Topics
       • Search
       • What is:
         • Apache Lucene?
         • Apache Nutch?
         • Apache Solr?
       • Where does Hadoop (ecosystem) fit?
         • Indexing
         • Search
         • Other
    3. Search 101
       • Search tools are designed for dealing with fuzzy data
       • They work well with structured and unstructured data
       • They perform well when dealing with large volumes of data
       • Many apps don’t need the limits that databases place on content
       • Search fits well alongside a DB too
       • Given a user’s information need (a query), find and, optionally, score content relevant to that need
       • Many different ways to solve this problem, each with tradeoffs
       • What does “relevant” mean?
    4. Search 101
       • Relevance
       • Indexing
         • Finds and maps terms and documents
         • Conceptually similar to a book index
         • At the heart of fast search/retrieve
       • Vector Space Model (VSM) for relevance
         • Common across many search engines
         • Apache Lucene is a highly optimized implementation of the VSM
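The two ideas on this slide, an index that maps terms to documents and vector-space scoring, can be sketched in a few lines. This is a toy illustration, not Lucene's implementation: the function names, the whitespace tokenizer, and the particular TF-IDF weighting are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Naive analysis (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}, like a book index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def idf(index, num_docs, term):
    """Inverse document frequency; rarer terms get a larger weight."""
    df = len(index.get(term, {}))
    return math.log(num_docs / df) + 1.0 if df else 0.0


def search(index, num_docs, query):
    """Rank documents by cosine similarity between TF-IDF vectors."""
    # Length of every document vector, computed over all indexed terms.
    norms = defaultdict(float)
    for term, postings in index.items():
        weight = idf(index, num_docs, term)
        for doc_id, tf in postings.items():
            norms[doc_id] += (tf * weight) ** 2

    scores = defaultdict(float)
    query_norm = 0.0
    for term, qtf in Counter(tokenize(query)).items():
        weight = idf(index, num_docs, term)
        query_norm += (qtf * weight) ** 2
        # Only documents listed in the postings for this term are touched.
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += (qtf * weight) * (tf * weight)

    ranked = [
        (doc_id, dot / (math.sqrt(norms[doc_id]) * math.sqrt(query_norm)))
        for doc_id, dot in scores.items()
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


docs = {
    1: "hadoop scales indexing",
    2: "solr is a search server",
    3: "lucene search library for search",
}
index = build_index(docs)
for doc_id, score in search(index, len(docs), "search"):
    print(doc_id, round(score, 3))
```

The point of the inverted index is visible in `search`: scoring only visits documents that appear in the postings of the query terms, rather than scanning every document.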
    5. Lucene is a mature, high-performance Java API to provide search capabilities to applications
       • Supports indexing, searching and a number of other commonly used search features (highlighting, spell checking, etc.)
       • Not a crawler and doesn’t know anything about Adobe PDF, MS Word, etc.
       • Created in 1997 and now part of the Apache Software Foundation
       • Important to note that Lucene does not have distributed index (shard) support
    6. Nutch
       • ASF project aimed at providing large scale crawling, indexing and searching using Lucene and other technologies
       • Mike Cafarella and Doug Cutting originally created Hadoop as part of Nutch based on the Google paper by Dean and Ghemawat
       • Only much later did it spin out to become the Hadoop that we all know
       • In other words, Hadoop was born from the need to scale search crawling and indexing
       • Originally used Lucene for search/indexing, now uses Solr
    7. Solr
       • Solr is the Lucene-based search server providing the infrastructure required for most users to work with Lucene
         • Without knowing Java!
       • Also provides:
         • Easy setup and configuration
         • Faceting
         • Highlighting
         • Replication/Sharding
         • Lucene Best Practices
    8. Lucene Basics
       • Content is modeled via Documents and Fields
         • Content can be text, integers, floats, dates, custom
         • Analysis can be employed to alter content before indexing
       • Searches are supported through a wide range of Query options
         • Keyword
         • Terms
         • Phrases
         • Wildcards, other
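To make the roles of analysis and the different query types concrete, here is a small sketch of a single-field index. It is not Lucene's API: `analyze`, `FieldIndex`, and the stop-word list are hypothetical names chosen for illustration, and the analysis chain (lowercasing, splitting on non-alphanumerics, dropping stop words) is just one plausible choice.

```python
import fnmatch
import re
from collections import defaultdict

# Hypothetical stop-word list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "and"}


def analyze(text):
    """Analysis step: lowercase, split on non-alphanumerics, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


class FieldIndex:
    """Positional inverted index over one text field of each document."""

    def __init__(self):
        # term -> doc_id -> list of positions (counted after analysis)
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, text):
        for position, term in enumerate(analyze(text)):
            self.postings[term][doc_id].append(position)

    def term(self, word):
        """Term query: documents containing the analyzed word."""
        terms = analyze(word)
        return set(self.postings.get(terms[0], {})) if terms else set()

    def phrase(self, text):
        """Phrase query: analyzed terms must appear at consecutive positions."""
        terms = analyze(text)
        if not terms:
            return set()
        hits = set()
        for doc_id, starts in self.postings.get(terms[0], {}).items():
            for start in starts:
                if all(
                    start + offset in self.postings.get(t, {}).get(doc_id, [])
                    for offset, t in enumerate(terms)
                ):
                    hits.add(doc_id)
                    break
        return hits

    def wildcard(self, pattern):
        """Wildcard query: shell-style pattern matched against indexed terms."""
        hits = set()
        for term in list(self.postings):
            if fnmatch.fnmatchcase(term, pattern.lower()):
                hits.update(self.postings[term])
        return hits


idx = FieldIndex()
idx.add(1, "Apache Solr is a search server")
idx.add(2, "Apache Lucene is a search library")
print(idx.term("Search"), idx.phrase("search server"), idx.wildcard("luc*"))
```

Because the same `analyze` function runs at index time and at query time, a query for "Search" matches text that was indexed as "search".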
    9. Quick Solr Demo
       • Pre-reqs:
         • Apache Ant 1.7.x
         • SVN
       • Steps:
         svn co
         cd solr-trunk/solr/
         ant example
         cd example
         java -jar start.jar
         cd exampledocs; java -jar post.jar *.xml   (in a second terminal, from the example directory)
       • Then browse to http://localhost:8983/solr/browse
    10. Anatomy of a Distributed Search System
        [Diagram. Labels: Users, Input Docs, Application, Fan In/Out, Sharding Alg., Coordination Layer, Searchers (Shard[0] … Shard[n]), Indexers (Shard[0] … Shard[n])]
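The "Fan In/Out" box in this diagram is usually a scatter-gather step: the query is sent to every searcher shard, each shard returns its own top hits, and the results are merged. A minimal sketch follows, assuming each shard is just an in-memory dict of precomputed scores. `search_shard` and `distributed_search` are hypothetical names, and a real system would issue network requests instead of calling a local function.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term, k):
    """Hypothetical per-shard searcher.

    `shard` maps doc_id -> {term: score}; returns the shard's top-k
    (score, doc_id) pairs for the term.
    """
    hits = ((scores[term], doc_id) for doc_id, scores in shard.items() if term in scores)
    return heapq.nlargest(k, hits)


def distributed_search(shards, term, k=3):
    """Fan the query out to all shards in parallel, then merge (fan in)."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = pool.map(lambda shard: search_shard(shard, term, k), shards)
        all_hits = [hit for partial in partials for hit in partial]
    # Each shard already returned at most k hits, so the merge stays small.
    return heapq.nlargest(k, all_hits)


shards = [
    {"a": {"solr": 0.9}, "b": {"solr": 0.2}},
    {"c": {"solr": 0.5, "hadoop": 0.7}},
]
print(distributed_search(shards, "solr", k=2))
```

Asking every shard for its own top k is what keeps the merge correct: the global top k is always contained in the union of the per-shard top k lists, provided scores are comparable across shards.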
    11. Sharding Algorithm
        • Good document distribution across shards is important
        • Simple approach:
          • hash(id) % numShards
          • Fine if the number of shards doesn’t change or if it is easy to reindex
        • Better:
          • Consistent Hashing
        • Also key: how to deal with the shape/size of the cluster changing
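The difference between the two approaches shows up when the cluster grows. The sketch below compares `hash(id) % numShards` with a basic consistent-hash ring when going from 4 to 5 shards. It is an illustration under assumptions: the names `stable_hash`, `modulo_shard`, and `ConsistentHashRing`, the use of MD5 as the hash, and the 100 virtual nodes per shard are all choices made for this example, not something the slides prescribe.

```python
import bisect
import hashlib


def stable_hash(key):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


def modulo_shard(doc_id, num_shards):
    """The simple approach: hash(id) % numShards."""
    return stable_hash(doc_id) % num_shards


class ConsistentHashRing:
    """Basic consistent hashing with virtual nodes (vnodes) per shard."""

    def __init__(self, shards, vnodes=100):
        # Each shard owns many points on the ring to even out the distribution.
        self._ring = sorted(
            (stable_hash(f"{shard}#{v}"), shard)
            for shard in shards
            for v in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, doc_id):
        # The first ring point clockwise from the document's hash owns it.
        i = bisect.bisect_right(self._points, stable_hash(doc_id)) % len(self._ring)
        return self._ring[i][1]


doc_ids = [f"doc-{i}" for i in range(2000)]

moved_modulo = sum(modulo_shard(d, 4) != modulo_shard(d, 5) for d in doc_ids)

old_ring = ConsistentHashRing([f"shard{i}" for i in range(4)])
new_ring = ConsistentHashRing([f"shard{i}" for i in range(5)])
moved_ring = sum(old_ring.shard_for(d) != new_ring.shard_for(d) for d in doc_ids)

print("moved with modulo:", moved_modulo, "of", len(doc_ids))
print("moved with consistent hashing:", moved_ring, "of", len(doc_ids))
```

With modulo placement, a document keeps its shard only when its hash gives the same remainder for 4 and 5, which happens for about one hash in five, so most documents would need to be reindexed elsewhere. With the ring, the only documents that move are those claimed by the new shard.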
    12. Hadoop and Search
        • Much of the Hadoop ecosystem is useful for search related functionality
        • Indexing
          • Process of adding documents to inverted index to make them searchable
          • In most cases, batch-oriented and embarrassingly parallel, so Hadoop Core can help
        • Search
          • Query the index and return documents and other info (facets, etc.) related to the result set
          • Subsecond response time usually required
          • ZooKeeper, Avro and others are still useful
    13. Indexing (Lucene)
        • Hadoop ships with contrib/index
          • Almost no documentation, but…
          • Good example of map-side indexing
          • Mapper does analysis and creates in memory index which is written out to segments
          • Indexes merged on the reduce side
        • Katta
          • Shard management, distributed search, etc.
        • Both give you large amount of control, but you have to build out all the search framework around it
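The map-side indexing flow described here (mappers build small in-memory indexes, reducers merge them) can be simulated without Hadoop. The sketch below is a plain-Python stand-in for that data flow. It does not use the Hadoop or contrib/index APIs, and `map_build_segment` and `reduce_merge_segments` are hypothetical names.

```python
from collections import defaultdict


def map_build_segment(split):
    """'Mapper': analyze one input split and build an in-memory segment.

    `split` is a list of (doc_id, text) pairs; the segment maps
    term -> set of doc_ids. Analysis here is just lowercase + whitespace split.
    """
    segment = defaultdict(set)
    for doc_id, text in split:
        for term in text.lower().split():
            segment[term].add(doc_id)
    return dict(segment)


def reduce_merge_segments(segments):
    """'Reducer': merge the segments for one shard into a single index."""
    merged = defaultdict(set)
    for segment in segments:
        for term, doc_ids in segment.items():
            merged[term] |= doc_ids
    return {term: sorted(doc_ids) for term, doc_ids in merged.items()}


# Two input splits, as if handled by two independent map tasks.
splits = [
    [(1, "hadoop index"), (2, "solr search")],
    [(3, "lucene index search")],
]
segments = [map_build_segment(split) for split in splits]
index = reduce_merge_segments(segments)
print(index["index"], index["search"])
```

The expensive work (analysis and segment building) happens independently per split, which is why the slide calls indexing embarrassingly parallel; the reduce step only unions postings.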
    14. Indexing (Solr)
        • Map side formats
        • Reduce-side indexing
        • Creates indexes on local file system (outside of HDFS) and copies to default FS (HDFS, etc.)
        • Manually install index into a Solr core once built
        • Map-side indexing
        • Incomplete, but based on Hadoop contrib/index
        • Write a distributed Update Handler to handle on the server side
    15. Indexing (Nutch to Solr)
        • Use Nutch to crawl content, Solr to index and serve
        • Doesn’t support indexing to Solr shards just yet
          • Need to write/use Solr distributed Update Handler
        • Still useful for smaller crawls (< 100M pages)
    16. Searching
        • Hadoop Core is not all that useful for distributed search
          • Exception: Hadoop RPC layer, possibly
          • Exception: Log analysis, etc. for search related items
        • Other Hadoop ecosystem tools are useful:
          • Apache ZooKeeper (more in a moment)
          • HDFS – storage of shards (pull down to local disk)
          • Avro, Thrift, Protocol Buffers (serialization utilities)
    17. ZooKeeper and Search
        • ZooKeeper is a centralized service for coordination, configuration, naming and distributed synchronization
        • In the context of search, it’s useful for:
          • Sharing configuration across nodes
          • Maintaining status about shards
            • Up/down/latency/rebalancing and more
          • Coordinating searches across shards/load balancing
    18. ZooKeeper and Search (Practical)
        • Katta employs ZooKeeper for search coordination, etc.
          • Query distribution, status, etc.
        • Solr Cloud
          • All the benefits of Solr + ZooKeeper for coordinating distributed capabilities
          • Query distribution, configuration sharing, status, etc.
          • About to be committed to Solr trunk
    19. Other Search Related Tasks
        • Log Analysis
          • Query analytics
          • Related Searches
          • Relevance assessments
        • Classification and Clustering
          • Mahout
        • HBase and other stores for documents
        • Avro, Thrift, Protocol Buffers for serialization of objects across the wire
    20. Resources