Text Indexing in Accumulo


A presentation on using Accumulo Intersecting Iterators to build a Text Search application

  1. TEXT INDEXING WITH ACCUMULO
     Efficient searching in a big data world
     Tomer Kishoni
     March 21, 2012
  2. Agenda
     •  Problem Statement
     •  Term-Based Inverted Index
     •  Term-Based Inverted Index and Accumulo
     •  Document Partitioned Index
     •  Document Partitioned Index and Accumulo
  3. Problem
     •  How can we efficiently search for information in a big data world?
        •  Processing time
        •  Network bandwidth
     •  How can we leverage Accumulo’s feature set to create efficient search patterns?
  4. Focus on Indexing
     •  Indexing your data is a great place to start
     •  Let’s focus on:
        •  Term-based inverted index
           •  Great for single term search
        •  Document partitioned index
           •  Great for multiple term search
  5. Example Dataset

     Document ID             Column   Value
     Learning Python         Author   Lutz
     Learning Python         Summary  Extensive book on …
     Programming Pearls      Author   Bentley
     Programming Pearls      Summary  Classic techniques to …
     Computational Geometry  Author   Martin
     Computational Geometry  Summary  Want to know how to …

     •  Dataset of books
        •  Author
        •  Book summary
     •  Reference the data using the document id
  6. Term-Based Inverted Index

     Value                    Column   Document ID
     Lutz                     Author   Learning Python
     Extensive book on …      Summary  Learning Python
     Bentley                  Author   Programming Pearls
     Classic techniques to …  Summary  Programming Pearls
     Martin                   Author   Computational Geometry
     Want to know how to …    Summary  Computational Geometry

     •  Reference the document id using the value
     •  Can split up unstructured text to search for specific terms
  7. Term-Based Index and Accumulo
     •  Accumulo partitions data primarily on the row id
        •  Lexicographic sorting
        •  Sorting provides a much friendlier way to search data
     •  Accumulo provides multidimensional storage
        •  Row id → term
        •  Column family → column name
        •  Column qualifier → document id
     •  Can normalize the data if needed
        •  E.g., lower case terms
  8. Term-Based Index and Accumulo

     Row ID      Column Family  Column Qualifier
     bentley     Author         Programming Pearls
     book        Summary        Learning Python
     classic     Summary        Programming Pearls
     extensive   Summary        Learning Python
     how         Summary        Computational Geometry
     know        Summary        Computational Geometry
     lutz        Author         Learning Python
     martin      Author         Computational Geometry
     on          Summary        Learning Python
     techniques  Summary        Programming Pearls
     to          Summary        Computational Geometry
     to          Summary        Programming Pearls
     want        Summary        Computational Geometry
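
The mapping above (term as row id, column name as column family, document id as column qualifier) can be sketched without Accumulo by using sorted in-memory maps. This is only an illustration under stated assumptions, not the presenter's code: `TermIndexSketch`, `indexField`, `lookup`, and `sampleIndex` are hypothetical names, the tokenizer is a plain split on non-word characters, and only the summary words visible on the slides are indexed because the summaries are truncated there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for a term-based index table.
// Layout mirrors the slides: term -> column name -> document ids, all kept sorted.
class TermIndexSketch {

    // Splits a field value into lower-cased terms and records one index entry per term.
    static void indexField(TreeMap<String, TreeMap<String, TreeSet<String>>> index,
                           String docId, String column, String value) {
        for (String term : value.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            index.computeIfAbsent(term, t -> new TreeMap<>())
                 .computeIfAbsent(column, c -> new TreeSet<>())
                 .add(docId);
        }
    }

    // Single-term search: one lookup on the sorted "row id" yields the document ids.
    static List<String> lookup(TreeMap<String, TreeMap<String, TreeSet<String>>> index,
                               String term, String column) {
        List<String> docIds = new ArrayList<String>();
        TreeMap<String, TreeSet<String>> columns = index.get(term);
        if (columns != null && columns.get(column) != null) {
            docIds.addAll(columns.get(column));
        }
        return docIds;
    }

    // The example dataset from the slides (summaries limited to the words shown there).
    static TreeMap<String, TreeMap<String, TreeSet<String>>> sampleIndex() {
        TreeMap<String, TreeMap<String, TreeSet<String>>> index = new TreeMap<>();
        indexField(index, "Learning Python", "Author", "Lutz");
        indexField(index, "Learning Python", "Summary", "Extensive book on");
        indexField(index, "Programming Pearls", "Author", "Bentley");
        indexField(index, "Programming Pearls", "Summary", "Classic techniques to");
        indexField(index, "Computational Geometry", "Author", "Martin");
        indexField(index, "Computational Geometry", "Summary", "Want to know how to");
        return index;
    }

    public static void main(String[] args) {
        TreeMap<String, TreeMap<String, TreeSet<String>>> index = sampleIndex();
        System.out.println(index.keySet());                 // terms in lexicographic order
        System.out.println(lookup(index, "to", "Summary")); // documents whose summary contains "to"
    }
}
```

Because the outer map is sorted by term, a single-term query touches one contiguous region of the index, which is the property the slides rely on when they set a scanner range to one row.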
  9. Term-Based Index and Accumulo
     •  Utilize Accumulo’s Scanners to search for terms

        // Create the scanner object
        Scanner indexScanner = ...
        // Restrict the scan to the row of the term we want to search
        indexScanner.setRange(new Range(new Text("book")));
        indexScanner.fetchColumnFamily(new Text("Summary"));
        // Get the index results; the column qualifier holds the document id
        for (Entry<Key, Value> entry : indexScanner) {
            Text docId = entry.getKey().getColumnQualifier();
            ...
        }
  10. Term-Based Index and Accumulo
      •  Can make this even better using locality groups
         •  Data partitioned by certain column families
         •  Don’t need to skip over unnecessary columns
         •  Scan data sequentially

      Row ID     Column Family  Column Qualifier
      bentley    Author         Programming Pearls
      lutz       Author         Learning Python
      martin     Author         Computational Geometry
      book       Summary        Learning Python
      classic    Summary        Programming Pearls
      extensive  Summary        Learning Python
      …          …              …
  11. Problems with Term-Based Indexing
      •  Term-based indexes are great for single term queries
      •  Inefficient at multi-term search
         •  The terms of a single document could be split over multiple tablets being served by multiple tablet servers
         •  Need to do set operations on the client
         •  Inefficient use of computer resources and network bandwidth
  12. Problems with Term-Based Indexing
      •  Inefficient at multi-term search

      Search: code book
      Client result: doc1

      Tablet 1 returns: doc1, doc2      Tablet 2 returns: doc1
      Row      CF       CQ              Row   CF       CQ
      book     summary  doc1            code  summary  doc1
      book     summary  doc2            left  summary  doc2
      classic  summary  doc3            up    summary  doc3

      •  Wasteful to bring doc2 back
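
The cost described above can be made concrete with a small simulation of the client-side set operation. This is a sketch under assumptions, not Accumulo code: `ClientSideIntersectionSketch`, `Result`, and `search` are hypothetical names, and each map lookup stands in for one scan of a term's row, with every returned entry counted as data shipped to the client.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Simulates multi-term search against a term-based index, where the
// intersection has to happen on the client.
class ClientSideIntersectionSketch {

    // Documents matching every term, plus how many index entries were transferred.
    static final class Result {
        final Set<String> matches;
        final int entriesTransferred;

        Result(Set<String> matches, int entriesTransferred) {
            this.matches = matches;
            this.entriesTransferred = entriesTransferred;
        }
    }

    // The postings from the slide: term -> document ids (spread over two tablets there).
    static Map<String, List<String>> samplePostings() {
        Map<String, List<String>> postings = new TreeMap<>();
        postings.put("book", Arrays.asList("doc1", "doc2"));
        postings.put("classic", Arrays.asList("doc3"));
        postings.put("code", Arrays.asList("doc1"));
        postings.put("left", Arrays.asList("doc2"));
        postings.put("up", Arrays.asList("doc3"));
        return postings;
    }

    // One "scan" per term; every returned entry crosses the network before intersecting.
    static Result search(Map<String, List<String>> postings, String... terms) {
        Set<String> matches = null;
        int transferred = 0;
        for (String term : terms) {
            List<String> docIds = postings.getOrDefault(term, Collections.<String>emptyList());
            transferred += docIds.size();
            if (matches == null) {
                matches = new TreeSet<String>(docIds);
            } else {
                matches.retainAll(docIds);
            }
        }
        if (matches == null) {
            matches = new TreeSet<String>();
        }
        return new Result(matches, transferred);
    }

    public static void main(String[] args) {
        Result result = search(samplePostings(), "code", "book");
        System.out.println(result.matches + " after transferring "
                + result.entriesTransferred + " index entries");
    }
}
```

For the query "code book", three entries reach the client but only doc1 survives the intersection; the entry for doc2 is the wasted transfer the slide points out.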
  13. Document Partitioned Index
      •  Distributing the index by the document rather than the term
      •  All terms for a document are binned together
      •  Since all the terms are binned together we can perform set operations on the servers
  14. Document Partitioned Index and Accumulo
      •  Accumulo stores all data on the same tablet if the key has the same row id
         •  Allows us to easily bin a document’s terms
      •  Accumulo iterators allow us to perform server-side processing
         •  Allows us to easily perform set operations
         •  IntersectingIterator
  15. Document Partitioned Index and Accumulo

      Row ID  Column Family       Column Qualifier
      bin1    Author=bentley      Programming Pearls
      bin1    Author=lutz         Learning Python
      bin1    Summary=book        Learning Python
      bin1    Summary=classic     Programming Pearls
      bin1    Summary=extensive   Learning Python
      bin1    Summary=on          Learning Python
      bin1    Summary=techniques  Programming Pearls
      bin1    Summary=to          Programming Pearls
      bin2    Author=martin       Computational Geometry
      bin2    Summary=how         Computational Geometry
      bin2    Summary=know        Computational Geometry
      bin2    Summary=to          Computational Geometry
      bin2    Summary=want        Computational Geometry
  16. Multi-Term Search with Document Partitioned Indexes and Accumulo
      •  Tablet server only returns fully qualified documents

      Search: code book
      Client result: doc1

      Tablet 1 returns: doc1            Tablet 2 returns: <none>
      Row   CF            CQ            Row   CF               CQ
      bin1  summary=book  doc1          bin2  summary=book     doc2
      bin1  summary=code  doc1          bin2  summary=classic  doc3
                                        bin2  summary=left     doc2
                                        bin2  summary=up       doc3
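
The same query can be simulated against a document-partitioned layout to show why nothing wasteful leaves the server. This is a sketch of the idea only, under assumptions: `DocumentPartitionedSketch`, `add`, `intersectWithinBin`, and `search` are hypothetical names, and the simple set intersection here is not a description of how Accumulo's `IntersectingIterator` is implemented internally.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Simulates a document-partitioned index: bin (row id) -> term -> document ids.
// All terms of a document live in one bin, so each bin can intersect locally.
class DocumentPartitionedSketch {

    static void add(Map<String, Map<String, Set<String>>> index,
                    String bin, String term, String docId) {
        index.computeIfAbsent(bin, b -> new TreeMap<>())
             .computeIfAbsent(term, t -> new TreeSet<>())
             .add(docId);
    }

    // The two bins shown on the slide.
    static Map<String, Map<String, Set<String>>> sampleIndex() {
        Map<String, Map<String, Set<String>>> index = new TreeMap<>();
        add(index, "bin1", "summary=book", "doc1");
        add(index, "bin1", "summary=code", "doc1");
        add(index, "bin2", "summary=book", "doc2");
        add(index, "bin2", "summary=classic", "doc3");
        add(index, "bin2", "summary=left", "doc2");
        add(index, "bin2", "summary=up", "doc3");
        return index;
    }

    // Stand-in for the server-side work on one bin: keep only documents
    // that appear under every requested term.
    static Set<String> intersectWithinBin(Map<String, Set<String>> bin, String... terms) {
        Set<String> matches = null;
        for (String term : terms) {
            Set<String> docIds = bin.getOrDefault(term, Collections.<String>emptySet());
            if (matches == null) {
                matches = new TreeSet<String>(docIds);
            } else {
                matches.retainAll(docIds);
            }
        }
        return matches == null ? new TreeSet<String>() : matches;
    }

    // Runs the intersection in every bin and collects what each bin would return.
    static Map<String, Set<String>> search(Map<String, Map<String, Set<String>>> index,
                                           String... terms) {
        Map<String, Set<String>> perBin = new TreeMap<>();
        for (Map.Entry<String, Map<String, Set<String>>> bin : index.entrySet()) {
            perBin.put(bin.getKey(), intersectWithinBin(bin.getValue(), terms));
        }
        return perBin;
    }

    public static void main(String[] args) {
        System.out.println(search(sampleIndex(), "summary=code", "summary=book"));
    }
}
```

Here bin1 contributes doc1 and bin2 contributes nothing, so doc2 never crosses the network, in contrast to the term-based layout.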
  17. Document Partitioned Index and Accumulo with IntersectingIterators
      •  IntersectingIterators will check the column families for the specified terms

         // Create the scanner object
         BatchScanner indexScanner = ...
         // Create the term array
         Text[] terms = {new Text("summary=code"), new Text("summary=book")};
         // Set the intersecting iterator
         indexScanner.setScanIterators(20, IntersectingIterator.class.getName(), "ii");
         // Set the iterator options
         indexScanner.setScanIteratorOptions("ii",
             IntersectingIterator.columnFamiliesOptionName,
             IntersectingIterator.encodeColumns(terms));
  18. Document Partitioned Index and Accumulo with IntersectingIterators
      •  For a basic document partitioned index we want to scan the entire index table

         // Set the range to scan everything
         indexScanner.setRanges(Collections.singleton(new Range()));
         // Only fully qualified documents will return
         for (Entry<Key, Value> entry : indexScanner) {
             Text docId = entry.getKey().getColumnQualifier();
             ...
         }
  19. Document Partitioned Index and Accumulo (Bonus)
      •  Bin id can include space, time, etc.
         •  Use the dynamic schema of Accumulo to your advantage
         •  Instead of:
            •  bin1, bin2, bin3
         •  Try out:
            •  2012Q4_book_1, 2012Q4_article_1, 2010Q1_tv_2
            •  This includes time and categories
         •  Set the BatchScanner’s ranges accordingly
      •  Avoid using two scanners to query the index table and then the record table
         •  Store both the index and record data in the same table
         •  Need to correctly format the data and use the FamilyIntersectingIterator
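
One way to picture structured bin ids is as a sortable prefix plus a shard number, so that a time-and-category query becomes a contiguous range of row ids. The sketch below rests on assumptions the slides do not state: `BinIdSketch`, `binId`, and `binsFor` are hypothetical names, and assigning the trailing number by hashing the document id is only one plausible scheme, not something the presentation specifies.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helpers for bin ids of the form <year>Q<quarter>_<category>_<shard>.
class BinIdSketch {

    // Assumption: the trailing shard number is derived from a hash of the document id.
    static String binId(int year, int quarter, String category, String docId, int shards) {
        int shard = Math.floorMod(docId.hashCode(), shards) + 1;
        return year + "Q" + quarter + "_" + category + "_" + shard;
    }

    // Because row ids sort lexicographically, every bin for one quarter and
    // category sits in a single contiguous range that starts with the same prefix.
    static SortedSet<String> binsFor(TreeSet<String> allBins, int year, int quarter,
                                     String category) {
        String prefix = year + "Q" + quarter + "_" + category + "_";
        return allBins.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        TreeSet<String> bins = new TreeSet<String>(
                Arrays.asList("2012Q4_book_1", "2012Q4_article_1", "2010Q1_tv_2"));
        System.out.println(binsFor(bins, 2012, 4, "book"));
        System.out.println(binId(2012, 4, "book", "Learning Python", 4));
    }
}
```

With real tables, the equivalent step would be to hand the scanner only the ranges that cover the selected prefix rather than the whole index table.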
  20. Summary
      •  Term-based inverted index
         •  Take the value from the record table and make it the row id in the index table
         •  Great at single term queries
         •  Bad at multi-term queries
            •  Network bandwidth
            •  Resources
      •  Document Partitioned Index
         •  Distributing the index by the document will ensure that all terms for a record are served by a single Tablet Server
         •  Leverage Iterators to do all the work server-side
         •  Great at multi-term queries