A presentation on using Accumulo Intersecting Iterators to build a Text Search application

  TEXT INDEXING WITH ACCUMULO
Efficient searching in a big data world
Tomer Kishoni
March 21, 2012
  Agenda
•  Problem Statement
•  Term-Based Inverted Index
•  Term-Based Inverted Index and Accumulo
•  Document Partitioned Index
•  Document Partitioned Index and Accumulo
  Problem
•  How can we efficiently search for information in a big data world?
    •  Processing time
    •  Network bandwidth
•  How can we leverage Accumulo's feature set to create efficient search patterns?
  Focus on Indexing
•  Indexing your data is a great place to start
•  Let's focus on:
    •  Term-based inverted index
        •  Great for single term search
    •  Document partitioned index
        •  Great for multiple term search
  Example Dataset
Document ID            | Column  | Value
Learning Python        | Author  | Lutz
Learning Python        | Summary | Extensive book on …
Programming Pearls     | Author  | Bentley
Programming Pearls     | Summary | Classic techniques to …
Computational Geometry | Author  | Martin
Computational Geometry | Summary | Want to know how to …

•  Dataset of books
    •  Author
    •  Book summary
•  Reference the data using the document id
  Term-Based Inverted Index
Value                   | Column  | Document ID
Lutz                    | Author  | Learning Python
Extensive book on …     | Summary | Learning Python
Bentley                 | Author  | Programming Pearls
Classic techniques to … | Summary | Programming Pearls
Martin                  | Author  | Computational Geometry
Want to know how to …   | Summary | Computational Geometry

•  Reference the document id using the value
•  Can split up unstructured text to search for specific terms
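
As a rough illustration of the idea above, the following sketch builds an in-memory term-based inverted index. It is not Accumulo code: the class name InvertedIndexSketch, its methods, and the tokenization rule (lower-case, split on non-alphanumeric characters) are assumptions made for this example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative in-memory inverted index; a hypothetical stand-in, not Accumulo code.
class InvertedIndexSketch {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, Set<String>> postings = new TreeMap<>();

    // Lower-case the text, split it on non-alphanumeric characters, index each term.
    void addDocument(String docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return the ids of all documents containing the term (empty set if none).
    Set<String> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        // Only the visible words of the example summaries are used here.
        index.addDocument("Learning Python", "Extensive book on");
        index.addDocument("Programming Pearls", "Classic techniques to");
        System.out.println(index.lookup("book"));
    }
}
```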
  Term-Based Index and Accumulo
•  Accumulo partitions data primarily on the row id
    •  Lexicographic sorting
    •  Sorted keys make term lookups and range scans fast
•  Accumulo provides multidimensional storage
    •  Row id → term
    •  Column family → column name
    •  Column qualifier → document id
•  Can normalize the data if needed
    •  E.g., lower case terms
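
The mapping above can be sketched in plain Java, assuming a hypothetical IndexKey class that stands in for an Accumulo key. The sketch lower-cases each term, emits one (row id, column family, column qualifier) triple per term, and keeps the triples sorted. String comparison here only approximates Accumulo's byte-wise ordering; the two agree for ASCII text.

```java
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

class TermIndexKeys {
    // Hypothetical stand-in for an Accumulo key: row id, column family, column qualifier.
    static final class IndexKey implements Comparable<IndexKey> {
        final String row;
        final String family;
        final String qualifier;

        IndexKey(String row, String family, String qualifier) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
        }

        // Order by row id first, then column family, then column qualifier.
        @Override
        public int compareTo(IndexKey other) {
            int c = row.compareTo(other.row);
            if (c == 0) c = family.compareTo(other.family);
            if (c == 0) c = qualifier.compareTo(other.qualifier);
            return c;
        }

        @Override
        public String toString() {
            return row + " | " + family + " | " + qualifier;
        }
    }

    // Emit one index key per normalized term: row = term, family = column, qualifier = doc id.
    static void addField(SortedSet<IndexKey> index, String docId, String column, String value) {
        for (String term : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                index.add(new IndexKey(term, column, docId));
            }
        }
    }

    public static void main(String[] args) {
        SortedSet<IndexKey> index = new TreeSet<>();
        addField(index, "Learning Python", "Author", "Lutz");
        addField(index, "Learning Python", "Summary", "Extensive book on");
        addField(index, "Programming Pearls", "Author", "Bentley");
        for (IndexKey key : index) {
            System.out.println(key);
        }
    }
}
```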
  Term-Based Index and Accumulo
Row ID     | Column Family | Column Qualifier
bentley    | Author        | Programming Pearls
book       | Summary       | Learning Python
classic    | Summary       | Programming Pearls
extensive  | Summary       | Learning Python
how        | Summary       | Computational Geometry
know       | Summary       | Computational Geometry
lutz       | Author        | Learning Python
martin     | Author        | Computational Geometry
on         | Summary       | Learning Python
techniques | Summary       | Programming Pearls
to         | Summary       | Computational Geometry
to         | Summary       | Programming Pearls
want       | Summary       | Computational Geometry
  Term-Based Index and Accumulo
•  Utilize Accumulo's Scanners to search for terms

// Create the scanner object
Scanner indexScanner = ...
// Restrict the scan to the row of the term we want to search
indexScanner.setRange(new Range("book"));
// Only return entries indexed from the Summary column
indexScanner.fetchColumnFamily(new Text("Summary"));
// Get the index results
for(Entry<Key, Value> entry : indexScanner) {
  Text docId = entry.getKey().getColumnQualifier();
  ...
}
  Term-Based Index and Accumulo
•  Can make this even better using locality groups
    •  Column families are grouped, and each group is stored separately
    •  Don't need to skip over unnecessary columns
    •  Scan data sequentially

Row ID    | Column Family | Column Qualifier
bentley   | Author        | Programming Pearls
lutz      | Author        | Learning Python
martin    | Author        | Computational Geometry
book      | Summary       | Learning Python
classic   | Summary       | Programming Pearls
extensive | Summary       | Learning Python
…         | …             | …
  Problems with Term-Based Indexing
•  Term-based indexes are great for single term queries
•  Inefficient at multi-term search
    •  The terms of a single document could be split over multiple tablets being served by multiple tablet servers
    •  Need to do set operations on the client
    •  Inefficient use of computer resources and network bandwidth
  Problems with Term-Based Indexing
•  Inefficient at multi-term search

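Search: code book
Result after client-side intersection: doc1
Returned for "book": doc1, doc2
Returned for "code": doc1

Row     | CF      | CQ
book    | summary | doc1
book    | summary | doc2
classic | summary | doc3

Row  | CF      | CQ
code | summary | doc1
left | summary | doc2
up   | summary | doc3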

•  Wasteful to bring doc2 back
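
The cost of the client-side set operation can be sketched as follows. This is a simplified model rather than Accumulo code: the class ClientSideIntersection and its methods are hypothetical, and each posting list stands for the document ids returned by one term scan.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class ClientSideIntersection {
    // Intersect the posting lists that the client fetched, one list per search term.
    static Set<String> intersect(List<Set<String>> postingLists) {
        Set<String> result = new TreeSet<>();
        if (postingLists.isEmpty()) {
            return result;
        }
        result.addAll(postingLists.get(0));
        for (Set<String> postings : postingLists.subList(1, postingLists.size())) {
            result.retainAll(postings);
        }
        return result;
    }

    // Count every document id that had to be sent to the client before intersecting.
    static int idsTransferred(List<Set<String>> postingLists) {
        int total = 0;
        for (Set<String> postings : postingLists) {
            total += postings.size();
        }
        return total;
    }

    public static void main(String[] args) {
        // "book" matches doc1 and doc2, "code" matches doc1, as in the example above.
        List<Set<String>> fetched = List.of(Set.of("doc1", "doc2"), Set.of("doc1"));
        System.out.println("matches: " + intersect(fetched));
        System.out.println("ids transferred: " + idsTransferred(fetched));
    }
}
```

In this model three ids cross the network to produce a single match; the gap grows with the size of the posting lists.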
  Document Partitioned Index
•  Distribute the index by document rather than by term
•  All terms for a document are binned together
•  Because a document's terms are stored together, set operations can run on the servers
  Document Partitioned Index and Accumulo
•  Accumulo never splits a row across tablets, so all entries with the same row id are stored on the same tablet
    •  Allows us to easily bin a document's terms
•  Accumulo iterators allow us to perform server-side processing
    •  Allows us to easily perform set operations
    •  IntersectingIterator
  Document Partitioned Index and Accumulo
Row ID | Column Family      | Column Qualifier
bin1   | Author=bentley     | Programming Pearls
bin1   | Author=lutz        | Learning Python
bin1   | Summary=book       | Learning Python
bin1   | Summary=classic    | Programming Pearls
bin1   | Summary=extensive  | Learning Python
bin1   | Summary=on         | Learning Python
bin1   | Summary=techniques | Programming Pearls
bin1   | Summary=to         | Programming Pearls
bin2   | Author=martin      | Computational Geometry
bin2   | Summary=to         | Computational Geometry
bin2   | Summary=want       | Computational Geometry
bin2   | Summary=how        | Computational Geometry
bin2   | Summary=know       | Computational Geometry
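
One way to produce keys shaped like the table above is sketched below. The slides do not say how a document is assigned to a bin, so the hash-based binFor method is an assumption, as are the class and method names. What matters is that every term of a document ends up under the same row id.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class DocumentPartitionedKeys {
    // Assumption: choose the bin from a hash of the document id (bins are numbered from 1).
    static String binFor(String docId, int numBins) {
        return "bin" + (Math.floorMod(docId.hashCode(), numBins) + 1);
    }

    // Build {row id, column family, column qualifier} triples for one field of a document:
    // row = bin id, family = "<column>=<term>", qualifier = document id.
    static List<String[]> keysFor(String docId, String column, String value, int numBins) {
        List<String[]> keys = new ArrayList<>();
        String bin = binFor(docId, numBins);
        for (String term : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                keys.add(new String[] {bin, column + "=" + term, docId});
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        for (String[] key : keysFor("Learning Python", "Summary", "Extensive book on", 2)) {
            System.out.println(String.join(" | ", key));
        }
    }
}
```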
  Multi-Term Search with Document Partitioned Indexes and Accumulo
•  Each tablet server returns only the documents that match every search term

Search: code book
Result: doc1
Returned from bin1: doc1
Returned from bin2: <none>

Row  | CF           | CQ
bin1 | summary=book | doc1
bin1 | summary=code | doc1

Row  | CF              | CQ
bin2 | summary=book    | doc2
bin2 | summary=classic | doc3
bin2 | summary=left    | doc2
bin2 | summary=up      | doc3
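
The per-bin behavior shown above can be modeled with a small simulation. This is not how the IntersectingIterator is implemented; it is a sketch, with hypothetical class and method names, of the result it produces: each bin intersects its own posting lists and returns only the documents that contain every term.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class BinIntersectionSketch {
    // Within one bin (column family -> doc ids), keep only docs that appear under every term.
    static Set<String> searchBin(Map<String, Set<String>> bin, List<String> terms) {
        Set<String> result = null;
        for (String term : terms) {
            Set<String> docs = bin.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
            if (result.isEmpty()) {
                break; // no document in this bin can match all terms
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // Run the same search in every bin; each bin answers independently of the others.
    static Map<String, Set<String>> search(Map<String, Map<String, Set<String>>> bins,
                                           List<String> terms) {
        Map<String, Set<String>> perBin = new TreeMap<>();
        for (Map.Entry<String, Map<String, Set<String>>> entry : bins.entrySet()) {
            perBin.put(entry.getKey(), searchBin(entry.getValue(), terms));
        }
        return perBin;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Set<String>>> bins = new TreeMap<>();
        bins.put("bin1", Map.of("summary=book", Set.of("doc1"), "summary=code", Set.of("doc1")));
        bins.put("bin2", Map.of("summary=book", Set.of("doc2"), "summary=classic", Set.of("doc3"),
                                "summary=left", Set.of("doc2"), "summary=up", Set.of("doc3")));
        System.out.println(search(bins, List.of("summary=code", "summary=book")));
    }
}
```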
  Document Partitioned Index and Accumulo with IntersectingIterators
•  IntersectingIterators will check the column families for the specified terms

// Create the scanner object
BatchScanner indexScanner = ...
// Create the term array
Text[] terms = {new Text("summary=code"), new Text("summary=book")};
// Set the intersecting iterator
indexScanner.setScanIterators(20, IntersectingIterator.class.getName(), "ii");
// Set the iterator option that carries the search terms
indexScanner.setScanIteratorOption("ii", IntersectingIterator.columnFamiliesOptionName, IntersectingIterator.encodeColumns(terms));
  Document Partitioned Index and Accumulo with IntersectingIterators
•  For a basic document partitioned index we want to scan the entire index table

// Set the range to scan everything
indexScanner.setRanges(Collections.singleton(new Range()));
// Only documents that contain every term are returned
for(Entry<Key, Value> entry : indexScanner) {
  Text docId = entry.getKey().getColumnQualifier();
  ...
}
  Document Partitioned Index and Accumulo (Bonus)
•  Bin id can include space, time, etc.
    •  Use the dynamic schema of Accumulo to your advantage
    •  Instead of:
        •  bin1, bin2, bin3
    •  Try out:
        •  2012Q4_book_1, 2012Q4_article_1, 2010Q1_tv_2
        •  This includes time and categories
    •  Set the BatchScanner's ranges accordingly
•  Avoid using two scanners to query the index table and then the record table
    •  Store both the index and record data in the same table
    •  Need to correctly format the data and use the FamilyIntersectingIterator
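
The structured bin ids suggested above can be sketched as follows. The format year, quarter, category, shard is read off the example ids, and the binId and binsWithPrefix helpers are hypothetical. Because row ids are sorted, all bins for a given quarter and category form one contiguous range, which is the range a BatchScanner would be given.

```java
import java.util.SortedSet;
import java.util.TreeSet;

class BinIdSketch {
    // Compose a bin id such as 2012Q4_book_1 from time, category, and shard number.
    static String binId(int year, int quarter, String category, int shard) {
        return year + "Q" + quarter + "_" + category + "_" + shard;
    }

    // Select every bin whose id starts with the prefix by taking a range of the sorted ids.
    static SortedSet<String> binsWithPrefix(SortedSet<String> bins, String prefix) {
        // Character.MAX_VALUE sorts after any character that can follow the prefix.
        return bins.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        SortedSet<String> bins = new TreeSet<>();
        bins.add(binId(2012, 4, "book", 1));
        bins.add(binId(2012, 4, "article", 1));
        bins.add(binId(2010, 1, "tv", 2));
        System.out.println(binsWithPrefix(bins, "2012Q4_"));
    }
}
```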
  Summary
•  Term-based inverted index
    •  Take the value from the record table and make it the row id in the index table
    •  Great at single term queries
    •  Bad at multi-term queries
        •  Network bandwidth
        •  Resources
•  Document Partitioned Index
    •  Distributing the index by the document will ensure that all terms for a record are served by a single Tablet Server
    •  Leverage Iterators to do all the work server-side
    •  Great at multi-term queries