TriHUG: Lucene Solr Hadoop

High level talk with references on using Apache Hadoop with Apache Lucene, Solr, Nutch, etc.

Published in: Technology
  • Transcript

    • 1. Where It All Began
        • Using Apache Hadoop for Search with Apache Lucene and Solr
    • 2. Topics
        • Search
        • What is:
            • Apache Lucene?
            • Apache Nutch?
            • Apache Solr?
        • Where does Hadoop (ecosystem) fit?
            • Indexing
            • Search
            • Other
    • 3. Search 101
        • Search tools are designed for dealing with fuzzy data
            • Works well with structured and unstructured data
            • Performs well when dealing with large volumes of data
        • Many apps don’t need the limits that databases place on content
        • Search fits well alongside a DB too
        • Given a user’s information need (the query), find and, optionally, score content relevant to that need
            • Many different ways to solve this problem, each with tradeoffs
        • What does “relevant” mean?
    • 4. Search 101
        • Relevance
        • Indexing
            • Finds and maps terms and documents
            • Conceptually similar to a book index
            • At the heart of fast search/retrieve
        • Vector Space Model (VSM) for relevance
            • Common across many search engines
            • Apache Lucene is a highly optimized implementation of the VSM
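The two ideas on this slide, an inverted index and Vector Space Model scoring, can be sketched in a few lines. This is a toy illustration only: the sample documents, the whitespace tokenizer, and the smoothed TF-IDF weighting are all assumptions made for the example, and Lucene's actual scoring formula differs in detail.

```python
import math
from collections import Counter, defaultdict

# Hypothetical sample corpus, not taken from the talk.
docs = {
    "d1": "hadoop scales batch indexing",
    "d2": "lucene scores documents for search",
    "d3": "solr is a search server built on lucene",
}

def tokenize(text):
    # Naive analysis: lowercase and split on whitespace.
    return text.lower().split()

# Inverted index: term -> {doc_id: term frequency}, like a book index.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(tokenize(text)).items():
        index[term][doc_id] = tf

def idf(term):
    # Smoothed inverse document frequency: rarer terms weigh more.
    return math.log(len(docs) / (1 + len(index.get(term, {})))) + 1.0

def vector(terms):
    # TF-IDF weighted term vector.
    return {t: tf * idf(t) for t, tf in Counter(terms).items()}

doc_vectors = {d: vector(tokenize(t)) for d, t in docs.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query):
    q_terms = tokenize(query)
    q_vec = vector(q_terms)
    # The inverted index narrows scoring to documents sharing a query term.
    candidates = set()
    for term in q_terms:
        candidates.update(index.get(term, {}))
    scored = [(cosine(q_vec, doc_vectors[d]), d) for d in candidates]
    return sorted(scored, reverse=True)

if __name__ == "__main__":
    for score, doc_id in search("lucene search"):
        print(f"{doc_id}\t{score:.3f}")
```

The index lookup finds candidate documents quickly; the cosine between the query vector and each candidate vector supplies the relevance ordering.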
    • 5. Lucene is a mature, high-performance Java API that provides search capabilities to applications
        • Supports indexing, searching and a number of other commonly used search features (highlighting, spell checking, etc.)
        • Not a crawler, and doesn’t know anything about Adobe PDF, MS Word, etc.
        • Created in 1997 and now part of the Apache Software Foundation
        • Important to note that Lucene does not have distributed index (shard) support
    • 6. Nutch
        • ASF project aimed at providing large scale crawling, indexing and searching using Lucene and other technologies
        • Mike Cafarella and Doug Cutting originally created Hadoop as part of Nutch, based on the Google paper by Dean and Ghemawat
            • http://labs.google.com/papers/mapreduce.html
        • Only much later did it spin out to become the Hadoop that we all know
        • In other words, Hadoop was born from the need to scale search crawling and indexing
        • Originally used Lucene for search/indexing, now uses Solr
    • 7. Solr
        • Solr is the Lucene-based search server providing the infrastructure required for most users to work with Lucene
            • Without knowing Java!
        • Also provides:
            • Easy setup and configuration
            • Faceting
            • Highlighting
            • Replication/Sharding
            • Lucene Best Practices
        • http://search.lucidimagination.com
    • 8. Lucene Basics
        • Content is modeled via Documents and Fields
            • Content can be text, integers, floats, dates, custom
            • Analysis can be employed to alter content before indexing
        • Searches are supported through a wide range of Query options
            • Keyword
            • Terms
            • Phrases
            • Wildcards, other
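To make the Document/Field model, analysis, and the query types above concrete, here is a minimal plain-Python sketch. It does not use the Lucene API; the stopword list, the sample documents, and the function names (`analyze`, `term_query`, `phrase_query`, `wildcard_query`) are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase

STOPWORDS = {"a", "an", "the", "of", "and"}  # assumed, tiny stopword list

def analyze(text):
    # Analysis alters content before indexing: lowercase, strip
    # punctuation, drop stopwords.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

# Documents are collections of named fields.
documents = {
    1: {"title": "Lucene in Action", "body": "Indexing and searching with Lucene."},
    2: {"title": "Solr Basics", "body": "Solr is a search server built on Lucene."},
}

# positions[field][term][doc_id] -> token positions within that field
positions = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
for doc_id, fields in documents.items():
    for field, text in fields.items():
        for pos, term in enumerate(analyze(text)):
            positions[field][term][doc_id].append(pos)

def term_query(field, term):
    # Documents whose field contains the exact analyzed term.
    return set(positions[field].get(term, {}))

def phrase_query(field, phrase):
    # Documents where the analyzed phrase terms occur at consecutive positions.
    terms = analyze(phrase)
    if not terms:
        return set()
    hits = set()
    for doc_id, starts in positions[field].get(terms[0], {}).items():
        for start in starts:
            if all(start + i in positions[field].get(t, {}).get(doc_id, [])
                   for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

def wildcard_query(field, pattern):
    # Shell-style wildcard matched against every indexed term in the field.
    hits = set()
    for term, postings in positions[field].items():
        if fnmatchcase(term, pattern):
            hits.update(postings)
    return hits

if __name__ == "__main__":
    print(term_query("body", "lucene"))
    print(phrase_query("body", "search server"))
    print(wildcard_query("body", "search*"))
```

Because the same `analyze` function runs at index time and at query time, a query for "Search Server" still matches the lowercased, punctuation-free tokens stored in the index.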
    • 9. Quick Solr Demo
        • Pre-reqs:
            • Apache Ant 1.7.x
            • SVN
        • svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
        • cd solr-trunk/solr/
        • ant example
        • cd example
        • java -jar start.jar
        • cd exampledocs; java -jar post.jar *.xml
        • http://localhost:8983/solr/browse
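Once the example server from the demo is running, it can also be queried over HTTP from code. The sketch below assumes the example configuration exposes Solr's standard `/select` handler under the same base URL as the demo and that the JSON response writer (`wt=json`) is enabled; the helper names `select_url` and `search` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_BASE = "http://localhost:8983/solr"  # base URL used in the demo

def select_url(query, rows=10):
    # Build a query URL; assumes the /select handler is available.
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/select?{params}"

def search(query, rows=10):
    # Needs a running Solr instance; returns the matching documents.
    with urlopen(select_url(query, rows)) as resp:
        return json.load(resp)["response"]["docs"]

if __name__ == "__main__":
    print(select_url("ipod", rows=5))
```

Building the URL separately from fetching it keeps the request easy to inspect or paste into a browser while the demo server is up.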
    • 10. Anatomy of a Distributed Search System
        • [Diagram; labels: Users, Application, Fan In/Out, Searchers, Shard[0] … Shard[n], Coordination Layer, Input Docs, Sharding Alg., Indexers, Shard[0] … Shard[n]]
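The fan in/out step in this diagram can be sketched as follows: the query is sent to every shard in parallel, each shard returns its own top hits, and the partial results are merged into one global top-k list. The in-memory shards and the term-count scoring are assumptions made for the example; a real system would call remote searchers and use a proper relevance score.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, query, k):
    # Score each document in one shard by counting query-term occurrences.
    terms = query.lower().split()
    hits = []
    for doc_id, text in shard.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)

def distributed_search(shards, query, k=3):
    # Fan out: query all shards concurrently.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, query, k), shards))
    # Fan in: merge the per-shard top-k lists into a global top-k.
    return heapq.nlargest(k, (hit for partial in partials for hit in partial))

if __name__ == "__main__":
    shards = [
        {"a": "hadoop hadoop search", "b": "solr search"},
        {"c": "hadoop lucene", "d": "zookeeper"},
    ]
    print(distributed_search(shards, "hadoop", k=2))
```

Each shard only needs to return k hits, since no document outside a shard's local top k can appear in the global top k.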
    • 11. Sharding Algorithm
        • Good document distribution across shards is important
        • Simple approach:
            • hash(id) % numShards
            • Fine if the number of shards doesn’t change or it is easy to reindex
        • Better:
            • Consistent Hashing
            • http://en.wikipedia.org/wiki/Consistent_hashing
        • Also key: how to deal with the shape/size of the cluster changing
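Both approaches on this slide fit in a short sketch. The MD5-based `stable_hash`, the shard names, and the choice of 100 virtual nodes per shard are assumptions for illustration (Python's built-in `hash()` is avoided because it is randomized per process for strings).

```python
import bisect
import hashlib

def stable_hash(key):
    # Deterministic across processes, unlike the built-in hash() for str.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def modulo_shard(doc_id, num_shards):
    # Simple approach: hash(id) % numShards.
    return stable_hash(doc_id) % num_shards

class ConsistentHashRing:
    """Consistent hashing with virtual nodes (replicas) per shard."""

    def __init__(self, shards, replicas=100):
        self._ring = sorted(
            (stable_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(replicas)
        )
        self._keys = [h for h, _ in self._ring]

    def shard_for(self, doc_id):
        # First ring point clockwise from the document's hash, wrapping around.
        idx = bisect.bisect(self._keys, stable_hash(doc_id)) % len(self._ring)
        return self._ring[idx][1]

if __name__ == "__main__":
    ids = [f"doc-{i}" for i in range(10000)]
    names = ["shard0", "shard1", "shard2", "shard3"]

    moved_mod = sum(modulo_shard(d, 4) != modulo_shard(d, 5) for d in ids)
    before = ConsistentHashRing(names)
    after = ConsistentHashRing(names + ["shard4"])
    moved_ring = sum(before.shard_for(d) != after.shard_for(d) for d in ids)

    print("moved with modulo:", moved_mod)
    print("moved with consistent hashing:", moved_ring)
```

With modulo sharding, changing the shard count reassigns most documents. With the ring, adding a shard only takes over the ranges next to its own points, so the documents that move all land on the new shard and the rest stay where they were.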
    • 12. Hadoop and Search
        • Much of the Hadoop ecosystem is useful for search-related functionality
        • Indexing
            • Process of adding documents to an inverted index to make them searchable
            • In most cases, batch-oriented and embarrassingly parallel, so Hadoop Core can help
        • Search
            • Query the index and return documents and other info (facets, etc.) related to the result set
            • Subsecond response time usually required
            • ZooKeeper, Avro and others are still useful
    • 13. Indexing (Lucene)
        • Hadoop ships with contrib/index
            • Almost no documentation, but…
            • Good example of map-side indexing
            • Mapper does analysis and creates an in-memory index, which is written out to segments
            • Indexes are merged on the reduce side
        • Katta
            • http://katta.sourceforge.net
            • Shard management, distributed search, etc.
        • Both give you a large amount of control, but you have to build out all the search framework around them
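The map-side indexing flow described for contrib/index (each mapper builds an in-memory index for its input split, and the reduce side merges those segments) can be simulated in plain Python. This is not the Hadoop or contrib/index API; the function names and the sample splits are hypothetical.

```python
from collections import defaultdict

def map_build_segment(split):
    # Mapper: analyze each document in the split and build an in-memory
    # index segment (term -> sorted list of doc ids).
    segment = defaultdict(set)
    for doc_id, text in split:
        for term in text.lower().split():
            segment[term].add(doc_id)
    return {term: sorted(ids) for term, ids in segment.items()}

def reduce_merge_segments(segments):
    # Reducer: merge the per-mapper segments into one index.
    merged = defaultdict(set)
    for segment in segments:
        for term, ids in segment.items():
            merged[term].update(ids)
    return {term: sorted(ids) for term, ids in merged.items()}

if __name__ == "__main__":
    splits = [
        [(1, "hadoop index"), (2, "lucene index")],  # input split for mapper 0
        [(3, "hadoop search")],                      # input split for mapper 1
    ]
    segments = [map_build_segment(s) for s in splits]
    print(reduce_merge_segments(segments))
```

The expensive analysis work happens in the mappers, which run in parallel; the reducer only unions postings lists.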
    • 14. Indexing (Solr)
        • https://issues.apache.org/jira/browse/SOLR-1301
            • Map side formats
            • Reduce-side indexing
            • Creates indexes on local file system (outside of HDFS) and copies to default FS (HDFS, etc.)
            • Manually install index into a Solr core once built
        • https://issues.apache.org/jira/browse/SOLR-1045
            • Map-side indexing
            • Incomplete, but based on Hadoop contrib/index
        • Write a distributed Update Handler to handle on the server side
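For contrast with map-side indexing, the reduce-side pattern can be sketched like this: mappers only format raw records into documents, a partitioner routes each document to a reducer, and each reducer builds one index. This is a plain-Python simulation under assumed details (tab-separated input records, CRC32 partitioning, one index per reducer); it is not the code from SOLR-1301.

```python
import zlib
from collections import defaultdict

def map_format(record):
    # Mapper: turn a raw "id<TAB>text" record into a document; no indexing yet.
    doc_id, text = record.split("\t", 1)
    return doc_id, {"id": doc_id, "text": text}

def partition(doc_id, num_reducers):
    # Route each document to one reducer (assumed CRC32-based partitioner).
    return zlib.crc32(doc_id.encode("utf-8")) % num_reducers

def reduce_build_index(docs):
    # Reducer: build the index for all documents routed to it.
    index = defaultdict(list)
    for doc in docs:
        for term in sorted(set(doc["text"].lower().split())):
            index[term].append(doc["id"])
    return dict(index)

def run_job(records, num_reducers):
    # Simulated shuffle: group mapped documents by reducer number.
    buckets = defaultdict(list)
    for record in records:
        doc_id, doc = map_format(record)
        buckets[partition(doc_id, num_reducers)].append(doc)
    return {r: reduce_build_index(docs) for r, docs in buckets.items()}

if __name__ == "__main__":
    records = ["1\thadoop index", "2\tsolr index", "3\thadoop search"]
    for reducer, index in sorted(run_job(records, 2).items()):
        print(reducer, index)
```

Each reducer's output corresponds to one finished index, which would then be copied to the target file system and installed into a Solr core as the slide describes.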
    • 15. Indexing (Nutch to Solr)
        • Use Nutch to crawl content, Solr to index and serve
        • Doesn’t support indexing to Solr shards just yet
            • Need to write/use Solr distributed Update Handler
        • Still useful for smaller crawls (< 100M pages)
        • http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
    • 16. Searching
        • Hadoop Core is not all that useful for distributed search
            • Exception: Hadoop RPC layer, possibly
            • Exception: Log analysis, etc. for search-related items
        • Other Hadoop ecosystem tools are useful:
            • Apache ZooKeeper (more in a moment)
            • HDFS: storage of shards (pull down to local disk)
            • Avro, Thrift, Protocol Buffers (serialization utilities)
    • 17. ZooKeeper and Search
        • ZooKeeper is a centralized service for coordination, configuration, naming and distributed synchronization
        • In the context of search, it’s useful for:
            • Sharing configuration across nodes
            • Maintaining status about shards
                • Up/down/latency/rebalancing and more
            • Coordinating searches across shards/load balancing
    • 18. ZooKeeper and Search (Practical)
        • Katta employs ZooKeeper for search coordination, etc.
            • Query distribution, status, etc.
        • Solr Cloud
            • All the benefits of Solr + ZooKeeper for coordinating distributed capabilities
            • Query distribution, configuration sharing, status, etc.
            • About to be committed to Solr trunk
            • http://wiki.apache.org/solr/SolrCloud
    • 19. Other Search Related Tasks
        • Log Analysis
            • Query analytics
            • Related Searches
            • Relevance assessments
        • Classification and Clustering
            • Mahout: http://mahout.apache.org
        • HBase and other stores for documents
        • Avro, Thrift, Protocol Buffers for serialization of objects across the wire
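As a small illustration of query analytics over search logs, the sketch below counts the most frequent queries and the queries that returned no results. The tab-separated log format (timestamp, query, hit count) and the sample lines are invented for the example; at scale the same counting would be a natural fit for a Hadoop job.

```python
from collections import Counter

# Hypothetical log lines: timestamp <TAB> query <TAB> number of hits.
LOG_LINES = [
    "2010-10-01T10:00:00\thadoop\t12",
    "2010-10-01T10:00:05\tsolr sharding\t0",
    "2010-10-01T10:01:00\thadoop\t12",
    "2010-10-01T10:02:00\tHadoop \t12",
    "2010-10-01T10:03:00\tsolr sharding\t0",
    "2010-10-01T10:04:00\tlucene\t7",
]

def parse(line):
    # Normalize the query so "Hadoop " and "hadoop" count as the same query.
    _, query, hits = line.split("\t")
    return query.strip().lower(), int(hits)

def query_stats(lines):
    counts = Counter()        # how often each query was issued
    zero_results = Counter()  # queries that returned nothing
    for line in lines:
        query, hits = parse(line)
        counts[query] += 1
        if hits == 0:
            zero_results[query] += 1
    return counts, zero_results

if __name__ == "__main__":
    counts, zero_results = query_stats(LOG_LINES)
    print("top queries:", counts.most_common(2))
    print("zero-result queries:", zero_results.most_common())
```

Frequent zero-result queries are a common starting point for relevance work, since they point at missing content or missing synonyms.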
    • 20. Resources
        • http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
        • http://hadoop.apache.org
        • http://nutch.apache.org
        • http://lucene.apache.org
        • http://www.lucidimagination.com
