Lucene InputFormat (lightning talk) - TriHUG December 10, 2013

A 7-minute overview of some work I did to build a Hadoop InputFormat for Lucene indexes on HDFS. It includes a Pig LoadFunc and some Avro schema reflection to make things smoother.


  1. Lucene InputFormat and more!
  2. Lookups on HDFS
     SequenceFile is great for fast sequential access, but how do you do lookups? MapFile, BloomMapFile, HBase, Cassandra, et al. each provide a single primary-key index over the data. But what if you want to index all of your fields (or at least many of them)? What if you want search?
  3. Lucene to the rescue!
     Lucene is, among many things, a file format. The stored fields file (fdt) has fast sequential access, so it acts as our “sequence file” of key/values. On top of that, you get the power of the inverted index and the search capabilities of Lucene.
  4. Solr HDFSDirectory
     ● Start with SOLR-4916 (HDFS support)
     ● Pull out the Solr-specific bits so we can use it with vanilla Lucene
     ● Backport to Hadoop 1.x
  5. Lucene InputFormat
     ● Glob HDFS for Lucene instance directories
     ● Read SegmentInfos and create a split per segment
     ● Use a MatchAllDocsQuery to quickly iterate through the doc set
     ● RecordReader returns docs from DocIdSetIterator
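The split-planning step on this slide can be sketched in plain Java, without Hadoop or Lucene on the classpath. This is only an illustration under stated assumptions: `SegmentSplitPlanner`, `SegmentSplit`, `planSplits`, and the directory-to-segment map are hypothetical stand-ins, not the classes from the talk. A real implementation would presumably read `SegmentInfos` from each globbed index directory and return Hadoop `InputSplit` objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of "one split per segment". All names here are hypothetical.
final class SegmentSplitPlanner {

    // Stand-in for an InputSplit: identifies one segment of one index directory.
    static final class SegmentSplit {
        final String indexDir;
        final String segmentName;
        final int docCount;

        SegmentSplit(String indexDir, String segmentName, int docCount) {
            this.indexDir = indexDir;
            this.segmentName = segmentName;
            this.docCount = docCount;
        }

        @Override
        public String toString() {
            return indexDir + "/" + segmentName + " (" + docCount + " docs)";
        }
    }

    // "indexes" maps each index directory to its segments (name -> doc count),
    // standing in for what SegmentInfos would report for that directory.
    static List<SegmentSplit> planSplits(Map<String, Map<String, Integer>> indexes) {
        List<SegmentSplit> splits = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> index : indexes.entrySet()) {
            for (Map.Entry<String, Integer> segment : index.getValue().entrySet()) {
                splits.add(new SegmentSplit(index.getKey(), segment.getKey(), segment.getValue()));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> indexes = new LinkedHashMap<>();
        Map<String, Integer> first = new LinkedHashMap<>();
        first.put("_0", 1000);
        first.put("_1", 250);
        indexes.put("/tmp/lucene/index-a", first);
        Map<String, Integer> second = new LinkedHashMap<>();
        second.put("_0", 40);
        indexes.put("/tmp/lucene/index-b", second);

        // Each printed line corresponds to one map task's input.
        for (SegmentSplit split : planSplits(indexes)) {
            System.out.println(split);
        }
    }
}
```

Splitting per segment keeps each map task on a self-contained unit of the index, at the cost of uneven task sizes when segment sizes differ a lot.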
  6. Lucene InputFormat cont.
     ● Gives back a Document with the stored fields
     ● The time spent searching is negligible compared to iterating through docs
     ● Think of it as a key/value storage format plus an efficient inverted index
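The "key/value storage format plus an inverted index" idea can be illustrated with a toy in plain Java. This is a sketch under loud assumptions: `ToyIndex` and its methods are hypothetical, the tokenizer is a naive split on non-word characters, and nothing here reflects Lucene's actual on-disk layout. It only shows the two access paths side by side: a sequential scan over stored documents and a term lookup that returns doc ids.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for "stored fields + inverted index". Hypothetical, not Lucene.
final class ToyIndex {
    // "Stored fields": documents in insertion order, addressed by doc id.
    private final List<Map<String, String>> storedFields = new ArrayList<>();
    // "Inverted index": "field:term" -> ascending list of doc ids containing it.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    int add(Map<String, String> doc) {
        int docId = storedFields.size();
        storedFields.add(doc);
        for (Map.Entry<String, String> field : doc.entrySet()) {
            for (String term : field.getValue().toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                List<Integer> ids =
                    postings.computeIfAbsent(field.getKey() + ":" + term, k -> new ArrayList<>());
                // Record each doc id once per term, even if the term repeats.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) ids.add(docId);
            }
        }
        return docId;
    }

    // Sequential access: every stored document, in doc id order.
    List<Map<String, String>> scan() {
        return storedFields;
    }

    // Lookup: ids of documents whose field contains the term.
    List<Integer> lookup(String field, String term) {
        List<Integer> ids = postings.get(field + ":" + term.toLowerCase());
        return ids == null ? Collections.<Integer>emptyList() : ids;
    }

    Map<String, String> document(int docId) {
        return storedFields.get(docId);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> doc = new HashMap<>();
        doc.put("title", "Anarchism");
        doc.put("body", "Anarchy is a political philosophy");
        index.add(doc);
        for (int id : index.lookup("body", "anarchy")) {
            System.out.println(index.document(id).get("title"));
        }
    }
}
```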
  7. Adding a query
     Add a simple TermQuery like “key:value” and specify which fields to return
       LIF.setLuceneQuery(job, "body:anarchy");
       LIF.setLuceneFields(job, "title", "body");
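What a "key:value" query plus a field list amounts to can be sketched in plain Java. The class and method names below (`TermQueryProjection`, `parseTermQuery`, `matches`, `search`) are hypothetical and are not the `LIF` API from the slide. The sketch also scans every document instead of consulting an inverted index, so it shows only the semantics (filter by one term, then project the requested fields), not how Lucene evaluates the query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy semantics of a "field:value" term query with field projection. Hypothetical names.
final class TermQueryProjection {

    // Splits "field:value" at the first colon and returns {field, value}.
    static String[] parseTermQuery(String query) {
        int colon = query.indexOf(':');
        if (colon <= 0 || colon == query.length() - 1) {
            throw new IllegalArgumentException("expected field:value, got " + query);
        }
        return new String[] { query.substring(0, colon), query.substring(colon + 1) };
    }

    // True if the named field of the document contains the term as a whole token.
    static boolean matches(Map<String, String> doc, String field, String term) {
        String value = doc.get(field);
        if (value == null) return false;
        return Arrays.asList(value.toLowerCase().split("\\W+")).contains(term.toLowerCase());
    }

    // Filters docs by the query and keeps only the requested fields of each hit.
    static List<Map<String, String>> search(
            List<Map<String, String>> docs, String query, String... fields) {
        String[] parsed = parseTermQuery(query);
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (!matches(doc, parsed[0], parsed[1])) continue;
            Map<String, String> projected = new LinkedHashMap<>();
            for (String field : fields) {
                if (doc.containsKey(field)) projected.put(field, doc.get(field));
            }
            hits.add(projected);
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "Anarchism");
        doc.put("date", "1355654644000");
        doc.put("body", "Anarchy is a political philosophy");
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc);
        System.out.println(search(docs, "body:anarchy", "title", "body"));
    }
}
```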
  8. More complex queries?
     Use JavaScript to dynamically set more complicated queries
       var clause1 = new TermQuery("body", "anarchy");
       var clause2 = new TermQuery("title", "revolution");
       var query = new BooleanQuery();
       query.add(clause1, BooleanClause.Occur.MUST);
       query.add(clause2, BooleanClause.Occur.MUST);
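Two MUST clauses select only the documents that match both terms, which is conceptually an intersection of the two terms' doc id lists. A minimal plain-Java sketch of that intersection follows. `MustIntersection` and `intersect` are hypothetical names, and the two-pointer merge is a textbook technique used here for illustration. Lucene's own conjunction handling is more elaborate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conjunction of two MUST clauses as a sorted-list intersection. Hypothetical names.
final class MustIntersection {

    // Intersects two ascending doc id lists with a two-pointer merge.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i);
            int y = b.get(j);
            if (x == y) {
                out.add(x);   // doc appears in both lists, so it satisfies both clauses
                i++;
                j++;
            } else if (x < y) {
                i++;          // advance whichever list is behind
            } else {
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> bodyAnarchy = Arrays.asList(1, 3, 5, 8);      // docs matching clause1
        List<Integer> titleRevolution = Arrays.asList(2, 3, 8, 9);  // docs matching clause2
        System.out.println(intersect(bodyAnarchy, titleRevolution));
    }
}
```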
  9. Adding Pig LoadFunc
       X = LOAD 'hdfs://localhost:50001/tmp/lucene/*'
           USING DefaultLuceneLoadFunc('body:anarchy')
           AS (title:chararray, date:long, body:chararray);
       Y = FOREACH X GENERATE title, date;

       (Anarchism,1355654644000)
       (Abraham Lincoln,1357087785000)
       (Art,1357159249000)
       (Anarcho-capitalism,1356671677000)
  10. Demo!
  11. Adding some schema
     ● Schema is hard-coded in previous examples
     ● InputFormat gives back Lucene Document
     ● Use Avro to reflect a schema onto the Lucene docs when reading/writing
     ● Similarly, use Avro to reflect a Pig schema
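The idea of reflecting a schema onto a document can be sketched with plain `java.lang.reflect`. This is a swapped-in technique, not Avro: the talk uses Avro's reflection, whereas the code below only copies stored field values onto same-named fields of a class. `ReflectMapping`, `Article`, and `fromStoredFields` are hypothetical names. The `Article` fields mirror the Pig schema shown on the earlier slide. Only `String` and `long` fields are handled.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Plain Java reflection as a stand-in for Avro schema reflection. Hypothetical names.
final class ReflectMapping {

    // Hypothetical record class; field names follow the Pig schema in the slides.
    static final class Article {
        String title;
        long date;
        String body;
    }

    // Copies stored field values onto same-named fields of a new instance of "type".
    // Fields with no stored value keep their defaults; other field types are skipped.
    static <T> T fromStoredFields(Class<T> type, Map<String, String> stored) {
        try {
            Constructor<T> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            T instance = ctor.newInstance();
            for (Field field : type.getDeclaredFields()) {
                String raw = stored.get(field.getName());
                if (raw == null) continue;
                field.setAccessible(true);
                if (field.getType() == long.class) {
                    field.setLong(instance, Long.parseLong(raw));
                } else if (field.getType() == String.class) {
                    field.set(instance, raw);
                }
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot map stored fields onto " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> stored = new HashMap<>();
        stored.put("title", "Art");
        stored.put("date", "1357159249000");
        stored.put("body", "Art is a diverse range of human activities");
        Article article = fromStoredFields(Article.class, stored);
        System.out.println(article.title + " " + article.date);
    }
}
```

Deriving the field list from a class means the schema lives in one place, instead of being repeated in the InputFormat setup and again in the Pig `AS (...)` clause.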
  12. Avro-ified IF and LoadFunc
       X = LOAD 'hdfs://localhost:50001/tmp/lucene/*'
           USING AvroLuceneLoadFunc(
             'com.lucid.MyAvroClass',
             'body:anarchy'
           );
       Y = FOREACH X GENERATE title, date;
  13. That’s it!
     David Arthur
     http://mumrah.github.io/
  14. Bonus Slide - Kafka 0.8
     Kafka 0.8.0 was released last week! Now with 100% more logo: Apache Kafka
