Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

  1. Culvert: A secondary indexing framework for BigTable-style databases with HIVE integration. Ed Kohlwey, Cloud Computing Team
  2. Session Agenda • Secondary Indexing • The Solution: Culvert • Culvert Design & Architecture • How It Works • API Examples • Where to Get It & Credits
  3. Secondary Indexing • General design pattern for inverted index – Maintain a map from value to location of records/documents that contain them • Lots of different variations – Term partitioned index – Document partitioned index • Solves problem of BigTable-style databases only having one primary key for records
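  4. Sample Inventory Application • Foo Table (columns: RowID, contact:city, contact:phone, inventory:count, order:Apples) – Apples: inventory:count = 5 – John: contact:city = Springfield, contact:phone = (999)-888-7777, order:Apples = 3 – Pears: inventory:count = 10 • Sample Term-Partitioned Index Table (order:Apples Index, value -> RowID) – 3 -> Dave – 3 -> John – 17 -> Paul – 20 -> Sue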
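A term-partitioned index like the one on this slide can be sketched in memory as a sorted map from indexed value to row IDs. This is only an illustration under assumptions: `TermIndexSketch`, `put`, and `range` are hypothetical names, not part of the Culvert API, and a real index would live in its own BigTable-style table rather than a `TreeMap`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for a term-partitioned index table.
class TermIndexSketch {
    // Indexed value (e.g. the order:Apples count) -> sorted row IDs holding that value.
    private final TreeMap<Integer, TreeSet<String>> index = new TreeMap<>();

    // Record that the row with this ID holds the given value in the indexed column.
    void put(String rowId, int value) {
        index.computeIfAbsent(value, v -> new TreeSet<>()).add(rowId);
    }

    // Row IDs whose indexed value lies in [low, high], in index order.
    List<String> range(int low, int high) {
        List<String> rows = new ArrayList<>();
        for (TreeSet<String> ids : index.subMap(low, true, high, true).values()) {
            rows.addAll(ids);
        }
        return rows;
    }
}
```

With the slide's four index entries loaded, a range lookup over 3 to 17 would walk the entries for 3 and 17 and skip the entry for 20, without touching the primary table.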
  5. Sample Inventory Application • Foo Table (columns: RowID, contact:comments) – John: "John likes apples." – Sue: "Sue likes pears." • Sample Document-Partitioned Index Table (contact:comments Index) – Columns: apples:John, john:John, likes:John, likes:Sue, pears:Sue, sue:Sue – RowIDs: 0x178df, 0x32da4 (cell values shown as "-")
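The `term:RowID` keys in a document-partitioned index could be derived roughly as follows. This is a sketch under assumptions: the lowercase, split-on-non-alphanumerics tokenizer and the `DocumentIndexSketch` class are illustrative choices, not Culvert's implementation, and the assignment of keys to partition rows (the hexadecimal RowIDs on the slide) is left out.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper that derives "term:rowId" index keys from a text column.
class DocumentIndexSketch {
    // comments maps a row ID to the text of its indexed field.
    static TreeSet<String> entries(Map<String, String> comments) {
        TreeSet<String> keys = new TreeSet<>();
        for (Map.Entry<String, String> row : comments.entrySet()) {
            // Assumed tokenizer: lowercase, then split on runs of non-alphanumerics.
            String text = row.getValue().toLowerCase(Locale.ROOT);
            for (String term : text.split("[^a-z0-9]+")) {
                if (!term.isEmpty()) {
                    keys.add(term + ":" + row.getKey());
                }
            }
        }
        return keys;
    }
}
```

Because the keys are kept in sorted order, all entries for one term sit next to each other, which is what makes a term lookup a short scan.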
  6. We found ourselves implementing these ideas over and over for clients. Why not make a library?
  7. Solution: Culvert
  8. Requirements • Support secondary indexing • Support an analyst query environment • Database Extensibility – There are actually a lot of BigTable implementations out there (HBase, Cassandra, proprietary) • Internal Extensibility – There are lots of ways to index records – There are lots of ways to retrieve records – Separate retrieval operations from index implementation
  9. What Culvert Does • Indexing • Interface for queries (Java and HIVE) • Abstraction mechanism for multiple underlying databases
  10. Culvert Design & Architecture • Use sorted iterators to retrieve values – Lots of algorithms can be expressed as sorting (like people tend to do in Map/Reduce) – Optional “dumping” feature can provide parallelism • Decorator design pattern is intuitive to interact with • Allows streaming of results as they become available • Uses Coprocessors to implement parallel operations
  11. Architecture Diagram Java API Hive Culvert Client-Side Operation TableAdapter Constraint Client Culvert Region-Side Operation Culvert Region-Side Operation LocalTableAdapter RemoteOp LocalTableAdapter RemoteOp
  12. Constraint Architecture • Used to express query predicate operations – projection and selection (SELECT) – set operations (AND/OR) – joins • Decoupled from Indices – Currently focused on term-partitioned indices – Future work includes expanding document-partitioned index functionality
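As an illustration of a set operation expressed as a decorator over sorted results, the sketch below streams the union (OR) of two sorted row-ID iterators. `OrSketch` is a hypothetical class, not Culvert's constraint implementation; it assumes each input is sorted, free of duplicates, and contains no nulls.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical decorator: wraps two sorted row-ID iterators and streams their union.
class OrSketch implements Iterator<String> {
    private final Iterator<String> left;
    private final Iterator<String> right;
    private String leftHead;   // next unread row ID from the left input, or null
    private String rightHead;  // next unread row ID from the right input, or null

    OrSketch(Iterator<String> left, Iterator<String> right) {
        this.left = left;
        this.right = right;
        this.leftHead = left.hasNext() ? left.next() : null;
        this.rightHead = right.hasNext() ? right.next() : null;
    }

    @Override
    public boolean hasNext() {
        return leftHead != null || rightHead != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        // An exhausted side always loses the comparison.
        int cmp = leftHead == null ? 1
                : rightHead == null ? -1
                : leftHead.compareTo(rightHead);
        String result = cmp <= 0 ? leftHead : rightHead;
        // Advance whichever side(s) supplied the result; equal heads are emitted once.
        if (cmp <= 0) {
            leftHead = left.hasNext() ? left.next() : null;
        }
        if (cmp >= 0) {
            rightHead = right.hasNext() ? right.next() : null;
        }
        return result;
    }
}
```

Since the wrapper is itself an iterator, it can be nested inside another operation, and results are available to the caller as soon as the first heads are compared.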
  13. Index Architecture • Index is an abstract type – Defines how to store and use the index • One index per column – Didn’t see a performance reason to index over multiple columns – Multiple indices complicate framework code – Map of “logical fields” was more easily maintained in the application – May evolve in the future
  14. Index Architecture (cont.) • One index table per index – Allows Index implementations to assume they don’t share the index table – Don’t need to worry about other Indices clobbering their table structure – Tables are assumed to be cheap
  15. Table Adapters • TableAdapter and LocalTableAdapter are abstraction mechanisms, roughly equivalent to HTable and HRegion • RemoteOp is roughly equivalent to CoprocessorProtocol and is handled by TableAdapter and LocalTableAdapter • Gives implementers fine-grained control over parallelism + table operations
  16. Using Culvert With HIVE • Why HIVE? – Already very popular – Take advantage of upstream advances – Good framework to “optimize later” • Culvert implements a HIVE StorageHandler and PredicateHandler • Facilitates analyst interaction with database • Reduces the “SQL Gap”
  17. HIVE Culvert Input Format • Handles AND, >, < query predicates based on indices • Each index can be broken up into fragments based on region start and end keys – We take the cross-product of each index's regions to create input splits for AND
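The cross-product of index fragments can be sketched generically as below. This is an assumption-laden illustration, not Culvert's input format code: `SplitSketch` and `crossProduct` are hypothetical names, and a fragment is represented by any value (for example a string describing a region's start and end keys).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: one candidate split per combination of fragments,
// taking exactly one fragment from each index involved in the AND.
class SplitSketch {
    static <T> List<List<T>> crossProduct(List<List<T>> fragmentsPerIndex) {
        List<List<T>> splits = new ArrayList<>();
        splits.add(new ArrayList<>()); // start from a single empty combination
        for (List<T> fragments : fragmentsPerIndex) {
            List<List<T>> next = new ArrayList<>();
            for (List<T> partial : splits) {
                for (T fragment : fragments) {
                    List<T> extended = new ArrayList<>(partial);
                    extended.add(fragment);
                    next.add(extended);
                }
            }
            splits = next;
        }
        return splits;
    }
}
```

An index with two regions combined with an index with three regions would give six splits. The count grows multiplicatively with each additional index, which is worth keeping in mind when many indexed predicates are ANDed together.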
  18. How It Works Overview of Indexing Operations
  19. Indexing • Indices are built via insertion operations on the client (i.e. Client.put(…)) • Whether a field is indexed is controlled by a configuration file • In the future, will support indexing of arbitrary columns via Map/Reduce
  20. Retrieval • Query API is exposed via HIVE and Java – HIVE API delegates to Java API – Java API is based on subclasses of Constraint • Focused on providing parallel, real-time query execution
  21. Walkthrough of Logical Operations on Indices
  22. Logical Operations on Indices • Logical operations can be represented as a merge sort if we return the keys from the original table in sorted order • Example: AND – orders:Apples Index: 1 -> Dean, 3 -> Susan, 4 -> John, 8 -> Paul, 14 -> Renee, 33 -> Sheryl – orders:Oranges Index: 4 -> Dean, 5 -> Susan, 5 -> Paul, 6 -> George, 12 -> Karen, 19 -> Tom
  23. Apples < 3 AND Oranges > 5 • First query each index – orders:Apples Index: 1 -> Dean, 3 -> Susan, 4 -> John, 8 -> Paul, 14 -> Renee, 33 -> Sheryl – orders:Oranges Index: 4 -> Dean, 5 -> Susan, 5 -> Paul, 6 -> George, 12 -> Karen, 19 -> Tom
  24. Apples < 3 AND Oranges > 5 • Then order results for each index • Happens on the region servers – orders:Apples results: 1 -> Dean, 3 -> Susan – orders:Oranges results: 5 -> Susan, 5 -> Paul, 6 -> George, 12 -> Karen, 19 -> Tom
  25. Apples < 3 AND Oranges > 5 • Then order results for each index • Happens on the region servers – Apples row IDs: Dean, Susan – Oranges row IDs: Susan, Paul, George, Karen, Tom
  26. Apples < 3 AND Oranges > 5 • Then order results for each index • Notice this happens on the region servers* – Apples row IDs (done): Dean, Susan – Oranges row IDs: Susan, Paul, George, Karen, Tom
  27. Apples < 3 AND Oranges > 5 • Then order results for each index • Notice this happens on the region servers* – Apples row IDs (done): Dean, Susan – Oranges row IDs (done): George, Karen, Paul, Susan, Tom
  28. Apples < 3 AND Oranges > 5 • Then merge the sorted results on the client – Apples queue: Dean, Susan – Oranges queue: George, Karen, Paul, Susan, Tom
  29. Apples < 3 AND Oranges > 5 • Dean is lowest, Dean is not on the head of all the queues, discard – Apples queue: Dean, Susan – Oranges queue: George, Karen, Paul, Susan, Tom
  30. Apples < 3 AND Oranges > 5 • George is lowest, George is not on the head of all queues, discard – Apples queue: Dean, Susan – Oranges queue: George, Karen, Paul, Susan, Tom
  31. Apples < 3 AND Oranges > 5 • Continue… – Apples queue: Dean, Susan – Oranges queue: George, Karen, Paul, Susan, Tom
  32. Apples < 3 AND Oranges > 5 • Susan is on the head of all the queues, return Susan – Apples queue: Dean, Susan ✔ – Oranges queue: George, Karen, Paul, Susan ✔, Tom
  33. Apples < 3 AND Oranges > 5 • Tom is discarded, now we’re finished – Apples queue: Dean, Susan ✔ – Oranges queue: George, Karen, Paul, Susan ✔, Tom
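The client-side merge walked through above can be sketched as an intersection of sorted queues. This is a minimal illustration, not Culvert's code: `SortedAndSketch` is a hypothetical name, and each input list is assumed to be sorted and free of duplicates.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of AND as a merge over sorted row-ID queues.
class SortedAndSketch {
    static List<String> and(List<List<String>> sortedRowIds) {
        List<Deque<String>> queues = new ArrayList<>();
        for (List<String> ids : sortedRowIds) {
            queues.add(new ArrayDeque<>(ids));
        }
        List<String> matches = new ArrayList<>();
        if (queues.isEmpty()) {
            return matches;
        }
        while (true) {
            // Find the lowest row ID among the queue heads.
            String lowest = null;
            for (Deque<String> queue : queues) {
                if (queue.isEmpty()) {
                    // One input is exhausted, so nothing left can be on every queue.
                    return matches;
                }
                if (lowest == null || queue.peek().compareTo(lowest) < 0) {
                    lowest = queue.peek();
                }
            }
            // Pop it from every queue that has it at the head; it is a match
            // only if it was at the head of all of them.
            boolean onAllHeads = true;
            for (Deque<String> queue : queues) {
                if (queue.peek().equals(lowest)) {
                    queue.poll();
                } else {
                    onAllHeads = false;
                }
            }
            if (onAllHeads) {
                matches.add(lowest);
            }
        }
    }
}
```

Unlike the slides, this sketch stops as soon as any queue runs out rather than draining the remaining entries (Tom in the example), since none of them can still match.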
  34. Joins • Numerous methods possible • A few examples – Use sub-queries to fetch related records – Use merge sorting to simultaneously fetch records satisfying both sides of the join, filter those that don’t match • Presently, Culvert has only one join (sub-queries method)
  35. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges) User performs joins with a JoinConstraint constraint (decorator design pattern)
  36. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges) • Constraint receives row IDs from a left sub-constraint. – Diagram: Left SubConstraint passes row IDs (…, John, …) to JoinConstraint
  37. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges) • Constraint looks up field values for the left side (if not already present in the results) – Diagram: JoinConstraint holds row IDs (…, John, …) from Left SubConstraint; order:Apples lookup gives John -> 5
  38. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges) • For each record in the left result set, the constraint creates a new right-side constraint to fetch indexed items matching the right side of the constraint. – Diagram: order:Apples has John -> 5; order:Oranges has George -> 5, Jane -> 5, John -> 5
  39. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges) • Finally, the joined records are returned. – Diagram: JoinConstraint returns (John, 5, George), (John, 5, Jane), (John, 5, John); order:Apples has John -> 5; order:Oranges has George -> 5, Jane -> 5, John -> 5
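The sub-query join steps above can be sketched with plain maps standing in for the table and the right-side index. This is illustrative only: `SubQueryJoinSketch` and its parameters are hypothetical, and the real `JoinConstraint` operates on Culvert results rather than strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the sub-queries join method.
class SubQueryJoinSketch {
    // leftRowIds: row IDs produced by the left sub-constraint.
    // leftValues: row ID -> left field value (e.g. order:Apples).
    // rightIndex: right field value (e.g. order:Oranges) -> row IDs holding it.
    static List<String> join(List<String> leftRowIds,
                             Map<String, Integer> leftValues,
                             Map<Integer, List<String>> rightIndex) {
        List<String> joined = new ArrayList<>();
        for (String left : leftRowIds) {
            // Look up the left-side field value for this row.
            Integer value = leftValues.get(left);
            if (value == null) {
                continue; // the row has no value in the join column
            }
            // Sub-query: fetch right-side rows whose indexed value matches.
            for (String right : rightIndex.getOrDefault(value, List.of())) {
                joined.add(left + " " + value + " " + right);
            }
        }
        return joined;
    }
}
```

This issues one right-side lookup per left record, so its cost scales with the size of the left result set. That is the trade-off against the merge-sort join mentioned on the Joins slide.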
  40. Culvert Java API Examples • Goal: to be intuitive and easy to interact with • Provide a simple relational API without forcing a developer to use SQL
  41. Culvert API Example: Insertion
        Configuration culvertConf = CConfiguration.getDefault();
        // index definitions are loaded implicitly from the configuration
        Client client = new Client(culvertConf);
        List<CKeyValue> valuesToPut = Lists.newArrayList();
        valuesToPut.add(new CKeyValue(
            "foo".getBytes(), "bar".getBytes(), "baz".getBytes()));
        Put put = new Put(valuesToPut);
        client.put("tableName", put);
  42. Culvert API Example: Retrieval
        Configuration culvertConf = CConfiguration.getDefault();
        // index definitions are loaded implicitly from the configuration
        Client client = new Client(culvertConf);
        Index c1Index = client.getIndexByName("index1");
        Constraint c1Constraint = new IndexRangeConstraint(
            c1Index, new CRange("abba".getBytes(), "cadabra".getBytes()));
        Index[] c2Indices = client.getIndicesForColumn(
            "rabbit".getBytes(), "hat".getBytes());
        Constraint c2Constraint = new IndexRangeConstraint(
            c2Indices[0], new CRange("bar".getBytes(), "foo".getBytes()));
        Constraint and = new And(c1Constraint, c2Constraint);
        Iterator<Result> results = client.query("tablename", and);
  43. Future Work • (Re)Building Indices via Map/Reduce • More index types – Document-partitioned – Others? • More retrieval operations • Profiling + tuning • Storing configuration details in a table or in Zookeeper
  44. Where to Get It* http://github.com/booz-allen-hamilton/culvert Where to Tweet It #culvert *Available 6/29/2011
  45. Culvert Team • Ed Kohlwey (@ekohlwey) • Jesse Yates (@jesse_yates) • Jeremy Walsh • Tomer Kishoni (@tokbot) • Jason Trost (@jason_trost)
  46. Questions?

Editor's Notes

  1. Just say the bullet points,