TriHUG: Lucene Solr Hadoop

Where It All Began
Using Apache Hadoop for Search with Apache Lucene and Solr

Topics
- Search
- What is:
  - Apache Lucene?
  - Apache Nutch?
  - Apache Solr?
- Where does Hadoop (ecosystem) fit?
  - Indexing
  - Search
  - Other

Search 101
- Search tools are designed for dealing with fuzzy data
  - They work well with structured and unstructured data
  - They perform well when dealing with large volumes of data
  - Many apps don’t need the limits that databases place on content
  - Search fits well alongside a DB too
- Given a user’s information need (query), find and, optionally, score content relevant to that need
  - Many different ways to solve this problem, each with tradeoffs
- What does “relevant” mean?

Search 101
- Relevance
- Indexing
  - Finds and maps terms and documents
  - Conceptually similar to a book index
  - At the heart of fast search/retrieve
- Vector Space Model (VSM) for relevance
  - Common across many search engines
  - Apache Lucene is a highly optimized implementation of the VSM
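To make the book-index analogy and the VSM concrete, here is a minimal sketch in plain Python. It is not Lucene's API or Lucene's actual scoring formula: the `TinyIndex` class, the whitespace tokenizer, and the particular TF-IDF weighting are all assumptions chosen for brevity. It shows only the general idea of mapping terms to documents and ranking by cosine similarity.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real analysis)."""
    return text.lower().split()


class TinyIndex:
    """Hypothetical in-memory inverted index with TF-IDF cosine scoring."""

    def __init__(self, docs):
        self.num_docs = len(docs)
        # term -> {doc_id: term frequency}, like the entries of a book index
        self.postings = defaultdict(dict)
        for doc_id, text in docs.items():
            for term, tf in Counter(tokenize(text)).items():
                self.postings[term][doc_id] = tf
        # Squared length of each document vector, used for normalization
        self.doc_norms = defaultdict(float)
        for term, posting in self.postings.items():
            for doc_id, tf in posting.items():
                self.doc_norms[doc_id] += (tf * self.idf(term)) ** 2

    def idf(self, term):
        """Rarer terms get a higher weight (one of many IDF variants)."""
        return math.log(1 + self.num_docs / len(self.postings[term]))

    def search(self, query):
        """Return (doc_id, cosine score) pairs, best match first."""
        dots = defaultdict(float)
        q_norm_sq = 0.0
        for term, q_tf in Counter(tokenize(query)).items():
            if term not in self.postings:
                continue  # unknown terms cannot match anything
            q_weight = q_tf * self.idf(term)
            q_norm_sq += q_weight ** 2
            for doc_id, tf in self.postings[term].items():
                dots[doc_id] += q_weight * tf * self.idf(term)
        results = [
            (doc_id, dot / math.sqrt(q_norm_sq * self.doc_norms[doc_id]))
            for doc_id, dot in dots.items()
        ]
        return sorted(results, key=lambda r: (-r[1], r[0]))


docs = {
    "d1": "apache lucene search library",
    "d2": "apache hadoop batch processing",
    "d3": "solr search server built on lucene",
}
index = TinyIndex(docs)
results = index.search("lucene search")
```

Only the postings for the query terms are touched at search time, which is why an inverted index sits at the heart of fast retrieval. Both d1 and d3 contain the two query terms, but the shorter d1 ranks first because its vector points more closely in the query's direction.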

Lucene
- Lucene is a mature, high-performance Java API that provides search capabilities to applications
- Supports indexing, searching and a number of other commonly used search features (highlighting, spell checking, etc.)
- Not a crawler, and doesn’t know anything about Adobe PDF, MS Word, etc.
- Created in 1997 and now part of the Apache Software Foundation
- Important to note that Lucene does not have distributed index (shard) support

Nutch
- ASF project aimed at providing large-scale crawling, indexing and searching using Lucene and other technologies
- Mike Cafarella and Doug Cutting originally created Hadoop as part of Nutch, based on the Google paper by Dean and Ghemawat
  - http://labs.google.com/papers/mapreduce.html
- Only much later did it spin out to become the Hadoop that we all know
- In other words, Hadoop was born from the need to scale search crawling and indexing
- Originally used Lucene for search/indexing, now uses Solr

Solr
- Solr is the Lucene-based search server providing the infrastructure required for most users to work with Lucene
  - Without knowing Java!
- Also provides:
  - Easy setup and configuration
  - Faceting
  - Highlighting
  - Replication/Sharding
  - Lucene Best Practices
- http://search.lucidimagination.com

Lucene Basics
- Content is modeled via Documents and Fields
  - Content can be text, integers, floats, dates, custom
  - Analysis can be employed to alter content before indexing
- Searches are supported through a wide range of Query options
  - Keyword Terms
  - Phrases
  - Wildcards, other
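The ideas on this slide (documents made of fields, analysis before indexing, and term, phrase and wildcard queries) can be sketched in a few lines of plain Python. This is an illustration of the concepts only, not Lucene's API: `analyze`, `FieldIndex`, the stopword list and the sample documents are all hypothetical. One known simplification: positions are counted after stopwords are removed, whereas a real analyzer would normally track the gaps.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase

STOPWORDS = {"a", "an", "and", "of", "the"}  # illustrative list only


def analyze(text):
    """Analysis: lowercase, split on non-alphanumerics, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


class FieldIndex:
    """Hypothetical positional index for one field."""

    def __init__(self):
        # term -> doc_id -> list of token positions
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, text):
        for pos, term in enumerate(analyze(text)):
            self.postings[term][doc_id].append(pos)

    def term(self, word):
        """Keyword query: documents containing the analyzed term."""
        terms = analyze(word)
        if not terms:
            return set()
        return set(self.postings.get(terms[0], {}))

    def phrase(self, text):
        """Phrase query: terms must appear at consecutive positions."""
        terms = analyze(text)
        if not terms or any(t not in self.postings for t in terms):
            return set()
        hits = set()
        for doc_id, starts in self.postings[terms[0]].items():
            for start in starts:
                if all(start + i in self.postings[t].get(doc_id, ())
                       for i, t in enumerate(terms)):
                    hits.add(doc_id)
                    break
        return hits

    def wildcard(self, pattern):
        """Wildcard query: match the pattern against the term dictionary."""
        hits = set()
        for term, docs in self.postings.items():
            if fnmatchcase(term, pattern.lower()):
                hits.update(docs)
        return hits


# Documents are modeled as fields; each field gets its own index.
documents = [
    {"id": "1", "title": "Lucene in Action",
     "body": "Indexing and searching with the Lucene library"},
    {"id": "2", "title": "Solr Cookbook",
     "body": "Searching the index with Solr, a search server"},
]
indexes = {"title": FieldIndex(), "body": FieldIndex()}
for doc in documents:
    for field, field_index in indexes.items():
        field_index.add(doc["id"], doc[field])
```

Because queries go through the same `analyze` step as the content, a search for "Lucene" and one for "lucene" hit the same postings, which is the main reason analysis has to be consistent between index time and query time.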

Quick Solr Demo
- Pre-reqs:
  - Apache Ant 1.7.x
  - SVN
- svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
- cd solr-trunk/solr/
- ant example
- cd example
- java -jar start.jar
- cd exampledocs; java -jar post.jar *.xml
- http://localhost:8983/solr/browse

Anatomy of a Distributed Search System
[Diagram; labels: Users, Input Docs, Application, Fan In/Out, Sharding Alg., Coordination Layer, Indexers (Shard[0] … Shard[n]), Searchers (Shard[0] … Shard[n])]
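The fan in/out step for searchers can be sketched as a scatter-gather: send the query to every shard, take each shard's local top k, and merge them into a global top k. The following is a rough Python illustration under simplifying assumptions: shards are plain dictionaries, the score is a raw count of query-term occurrences, and `search_shard` and `distributed_search` are hypothetical names. A real system also has to make scores comparable across shards.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query, k):
    """Searcher: return this shard's local top-k as (score, doc_id) pairs."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in shard.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)  # toy scoring
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def distributed_search(shards, query, k=10):
    """Fan out to all shards in parallel, then fan in and merge."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, query, k), shards))
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [(doc_id, score) for score, doc_id in merged]


shards = [
    {"a": "solr search server", "b": "hadoop batch jobs"},
    {"c": "search search search", "d": "lucene search library for search"},
]
top = distributed_search(shards, "search", k=2)
```

Each shard only needs to return k hits, since no document outside a shard's local top k can appear in the global top k.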

Sharding Algorithm
- Good document distribution across shards is important
- Simple approach: hash(id) % numShards
  - Fine if the number of shards doesn’t change or it is easy to reindex
- Better: Consistent Hashing
  - http://en.wikipedia.org/wiki/Consistent_hashing
- Also key: how to deal with the shape/size of the cluster changing
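The difference between the two approaches shows up when a shard is added. The sketch below is one common way to build a consistent-hash ring (virtual nodes placed on a sorted ring), not the scheme of any particular search system. The names `stable_hash`, `modulo_shard` and `ConsistentHashRing`, the choice of MD5, and the 100 virtual nodes per shard are assumptions for illustration.

```python
import bisect
import hashlib


def stable_hash(key):
    """Hash that is stable across processes (unlike Python's built-in hash)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def modulo_shard(doc_id, num_shards):
    """Simple approach: hash(id) % numShards."""
    return stable_hash(doc_id) % num_shards


class ConsistentHashRing:
    """Each shard owns many points on a ring; a document belongs to the
    first shard point found clockwise from the document's hash."""

    def __init__(self, shards, vnodes=100):
        self._ring = sorted(
            (stable_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, doc_id):
        idx = bisect.bisect(self._points, stable_hash(doc_id))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


def moved_ids(ids, before, after):
    """Document ids whose shard assignment differs between two mappings."""
    return [d for d in ids if before(d) != after(d)]


ids = [f"doc-{i}" for i in range(2000)]
old_ring = ConsistentHashRing([f"shard{i}" for i in range(4)])
new_ring = ConsistentHashRing([f"shard{i}" for i in range(5)])

moved_ring = moved_ids(ids, old_ring.shard_for, new_ring.shard_for)
moved_mod = moved_ids(ids, lambda d: modulo_shard(d, 4),
                      lambda d: modulo_shard(d, 5))
```

Going from four to five shards, the modulo scheme reassigns most documents, which effectively means a full reindex. On the ring, the only documents that move are those the new shard takes over, so the rest of the cluster is left alone.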

Hadoop and Search
- Much of the Hadoop ecosystem is useful for search-related functionality
- Indexing
  - Process of adding documents to an inverted index to make them searchable
  - In most cases, batch-oriented and embarrassingly parallel, so Hadoop Core can help
- Search
  - Query the index and return documents and other info (facets, etc.) related to the result set
  - Subsecond response time usually required
  - ZooKeeper, Avro and others are still useful

Indexing (Lucene)
- Hadoop ships with contrib/index
  - Almost no documentation, but…
  - Good example of map-side indexing
  - Mapper does analysis and creates an in-memory index, which is written out to segments
  - Indexes are merged on the reduce side
- Katta
  - http://katta.sourceforge.net
  - Shard management, distributed search, etc.
- Both give you a large amount of control, but you have to build out all the search framework around it
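The map-side indexing pattern described above can be mimicked without Hadoop to show the data flow: each mapper analyzes its own input split and builds a small in-memory segment, and the reduce step merges the segments into one index. This is a conceptual Python sketch, not the contrib/index code; `map_build_segment`, `reduce_merge` and the sample splits are hypothetical, and a segment here is just a term-to-document-ids mapping.

```python
from collections import defaultdict


def map_build_segment(split):
    """Mapper: analyze the documents of one input split and build an
    in-memory index segment (term -> set of doc ids)."""
    segment = defaultdict(set)
    for doc_id, text in split:
        for term in text.lower().split():  # stand-in for real analysis
            segment[term].add(doc_id)
    return segment


def reduce_merge(segments):
    """Reducer: merge the segments produced by the mappers."""
    merged = defaultdict(set)
    for segment in segments:
        for term, doc_ids in segment.items():
            merged[term] |= doc_ids
    return {term: sorted(doc_ids) for term, doc_ids in merged.items()}


# Two input splits, as if they were handed to two map tasks.
splits = [
    [("d1", "apache lucene"), ("d2", "apache solr")],
    [("d3", "apache hadoop"), ("d4", "solr cloud")],
]
index = reduce_merge(map(map_build_segment, splits))
```

The mappers never need to talk to each other, which is what makes indexing embarrassingly parallel; the only shared step is the merge.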

Indexing (Solr)
- https://issues.apache.org/jira/browse/SOLR-1301
  - Map side formats
  - Reduce-side indexing
  - Creates indexes on local file system (outside of HDFS) and copies to default FS (HDFS, etc.)
  - Manually install index into a Solr core once built
- https://issues.apache.org/jira/browse/SOLR-1045
  - Map-side indexing
  - Incomplete, but based on Hadoop contrib/index
- Write a distributed Update Handler to handle on the server side

Indexing (Nutch to Solr)
- Use Nutch to crawl content, Solr to index and serve
- Doesn’t support indexing to Solr shards just yet
  - Need to write/use Solr distributed Update Handler
- Still useful for smaller crawls (< 100M pages)
- http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/

Searching
- Hadoop Core is not all that useful for distributed search
  - Exception: Hadoop RPC layer, possibly
  - Exception: Log analysis, etc. for search-related items
- Other Hadoop ecosystem tools are useful:
  - Apache ZooKeeper (more in a moment)
  - HDFS – storage of shards (pull down to local disk)
  - Avro, Thrift, Protocol Buffers (serialization utilities)

ZooKeeper and Search
- ZooKeeper is a centralized service for coordination, configuration, naming and distributed synchronization
- In the context of search, it’s useful for:
  - Sharing configuration across nodes
  - Maintaining status about shards
    - Up/down/latency/rebalancing and more
  - Coordinating searches across shards/load balancing

ZooKeeper and Search (Practical)
- Katta employs ZooKeeper for search coordination, etc.
  - Query distribution, status, etc.
- Solr Cloud
  - All the benefits of Solr + ZooKeeper for coordinating distributed capabilities
  - Query distribution, configuration sharing, status, etc.
  - About to be committed to Solr trunk
  - http://wiki.apache.org/solr/SolrCloud

Other Search Related Tasks
- Log Analysis
  - Query analytics
  - Related Searches
  - Relevance assessments
- Classification and Clustering
  - Mahout – http://mahout.apache.org
- HBase and other stores for documents
- Avro, Thrift, Protocol Buffers for serialization of objects across the wire
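Query analytics over search logs is a natural fit for batch processing. As a small illustration of what such a job computes, the Python sketch below counts how often each query was issued and which queries returned no results. The tab-separated log format (timestamp, query, hit count) and the function name `analyze_query_log` are assumptions made for this example; on a cluster the same counting would be expressed as a map step emitting (query, 1) and a reduce step summing.

```python
from collections import Counter


def analyze_query_log(lines):
    """Return (query counts, zero-result query counts) from log lines of
    the assumed form: timestamp <TAB> query <TAB> number of hits."""
    counts = Counter()
    zero_hits = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines
        _timestamp, query, num_hits = parts
        query = " ".join(query.lower().split())  # normalize case and spacing
        counts[query] += 1
        if num_hits == "0":
            zero_hits[query] += 1
    return counts, zero_hits


log = [
    "2010-09-10T10:00:00\tSolr\t42\n",
    "2010-09-10T10:00:05\thadoop  indexing\t0\n",
    "2010-09-10T10:00:09\tsolr\t40\n",
    "this line is malformed\n",
]
counts, zero_hits = analyze_query_log(log)
```

Frequent queries are candidates for caching and relevance assessment, while zero-result queries point at missing content or analysis problems.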

Resources
- http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
- http://hadoop.apache.org
- http://nutch.apache.org
- http://lucene.apache.org
- http://www.lucidimagination.com
