
Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Ed Kohlwey's presentation at 2011 Hadoop Summit.

Secondary indexing is a common design pattern in BigTable-like databases that allows users to index one or more columns in a table. This technique enables fast search of records in a database based on a particular column instead of the row id, thus enabling relational-style semantics in a NoSQL environment. This is accomplished by representing the index either in a reserved namespace in the table or another index table. Despite the fact that this is a common design pattern in BigTable-based applications, most implementations of this practice to date have been tightly coupled with a particular application. As a result, few general-purpose frameworks for secondary indexing on BigTable-like databases exist, and those that do are tied to a particular implementation of the BigTable model.

We developed a solution to this problem called Culvert that supports online index updates as well as a variation of the HIVE query language. In designing Culvert, we sought to make the solution pluggable so that it can be used on any of the many BigTable-like databases (HBase, Cassandra, etc.). We will discuss our experiences implementing secondary indexing solutions over multiple underlying data stores, and how these experiences drove design decisions in creating the Culvert framework. We will also discuss our efforts to integrate HIVE on top of multiple indexing solutions and databases, and how we implemented a subset of HIVE's query language on Culvert.


    1. Culvert: A secondary indexing framework for BigTable-style databases with HIVE integration
       Ed Kohlwey, Cloud Computing Team
    2. Session Agenda
       • Secondary Indexing
       • The Solution: Culvert
       • Culvert Design & Architecture
       • How It Works
       • API Examples
       • Where to Get It & Credits
    3. Secondary Indexing
       • General design pattern for an inverted index
         – Maintain a map from each value to the locations of the records/documents that contain it
       • Lots of different variations
         – Term-partitioned index
         – Document-partitioned index
       • Solves the problem of BigTable-style databases having only one primary key per record
    4. Sample Inventory Application

       Foo Table
       RowID  | contact:city | contact:phone  | inventory:count | order:Apples
       Apples |              |                | 5               |
       John   | Springfield  | (999)-888-7777 |                 | 3
       Pears  |              |                | 10              |

       Sample Term-Partitioned Index Table (order:Apples Index)
       RowID
       3 -> Dave
       3 -> John
       17 -> Paul
       20 -> Sue
    5. Sample Inventory Application

       Foo Table
       RowID | contact:comments
       John  | John likes apples.
       Sue   | Sue likes pears.

       Sample Document-Partitioned Index Table (contact:comments Index)
       RowID   | apples:john | john:John | likes:John | likes:Sue | pears:Sue | sue:Sue
       0x178df | - - -
       0x32da4 | - - -
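The term-partitioned index table above maps a column value to the row IDs that hold it. A minimal in-memory sketch of that idea follows. It assumes a sorted map as a stand-in for the sorted index table; `TermIndexSketch` and its methods are hypothetical names for illustration and are not part of Culvert.

```java
import java.util.NavigableMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for a term-partitioned index table (not Culvert's API).
class TermIndexSketch {
    // indexed value -> sorted row IDs of the records that contain it
    private final NavigableMap<Integer, SortedSet<String>> entries = new TreeMap<>();

    // Record that row `rowId` holds `value` in the indexed column.
    void add(int value, String rowId) {
        entries.computeIfAbsent(value, v -> new TreeSet<>()).add(rowId);
    }

    // Row IDs whose indexed value equals `value` (empty if none).
    SortedSet<String> lookup(int value) {
        SortedSet<String> rows = entries.get(value);
        return rows == null ? new TreeSet<String>() : rows;
    }

    // Row IDs whose indexed value lies in [low, high], in row-ID order.
    SortedSet<String> range(int low, int high) {
        SortedSet<String> rows = new TreeSet<>();
        for (SortedSet<String> ids : entries.subMap(low, true, high, true).values()) {
            rows.addAll(ids);
        }
        return rows;
    }

    public static void main(String[] args) {
        TermIndexSketch index = new TermIndexSketch();
        index.add(3, "Dave");
        index.add(3, "John");
        index.add(17, "Paul");
        index.add(20, "Sue");
        System.out.println(index.lookup(3)); // prints [Dave, John]
    }
}
```

Because the map is sorted by value, a range predicate becomes a contiguous scan, which is the property a BigTable-style store also gives an index table keyed by value.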
    6. We found ourselves implementing these ideas over and over for clients.
       Why not make a library?
    7. Solution: Culvert
    8. Requirements
       • Support secondary indexing
       • Support an analyst query environment
       • Database extensibility
         – There are actually a lot of BigTable implementations out there (HBase, Cassandra, proprietary)
       • Internal extensibility
         – There are lots of ways to index records
         – There are lots of ways to retrieve records
         – Separate retrieval operations from index implementation
    9. What Culvert Does
       • Indexing
       • Interface for queries (Java and HIVE)
       • Abstraction mechanism for multiple underlying databases
    10. Culvert Design & Architecture
        • Use sorted iterators to retrieve values
          – Lots of algorithms can be expressed as sorting (as people tend to do in Map/Reduce)
          – Optional “dumping” feature can provide parallelism
        • Decorator design pattern is intuitive to interact with
        • Allows streaming of results as they become available
        • Uses Coprocessors to implement parallel operations
    11. Architecture Diagram
        [Diagram components: Java API; Hive; Culvert Client-Side Operation (Client, Constraint, TableAdapter); two Culvert Region-Side Operations, each with a LocalTableAdapter and a RemoteOp]
    12. Constraint Architecture
        • Used to express query predicate operations
          – projection and selection (SELECT)
          – set operations (AND/OR)
          – joins
        • Decoupled from Indices
          – Currently focused on term-partitioned indices
          – Future work includes expanding document-partitioned index functionality
    13. Index Architecture
        • Index is an abstract type
          – Defines how to store and use the index
        • One index per column
          – Didn’t see a performance reason to index over multiple columns
          – Multiple indices complicate framework code
          – Map of “logical fields” was more easily maintained in the application
          – May evolve in the future
    14. Index Architecture (cont.)
        • One index table per index
          – Allows Index implementations to assume they don’t share the index table
          – Don’t need to worry about other Indices clobbering their table structure
          – Tables are assumed to be cheap
    15. Table Adapters
        • TableAdapter and LocalTableAdapter are abstraction mechanisms, roughly equivalent to HTable and HRegion
        • RemoteOp is roughly equivalent to CoprocessorProtocol and is handled by TableAdapter and LocalTableAdapter
        • Gives implementers fine-grained control over parallelism and table operations
    16. Using Culvert With HIVE
        • Why HIVE?
          – Already very popular
          – Take advantage of upstream advances
          – Good framework to “optimize later”
        • Culvert implements a HIVE StorageHandler and PredicateHandler
        • Facilitates analyst interaction with database
        • Reduces the “SQL Gap”
    17. HIVE Culvert Input Format
        • Handles AND, >, < query predicates based on indices
        • Each index can be broken up into fragments based on region start and end keys
          – We take the cross-product of each index’s regions to create input splits for AND
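The cross-product step for AND can be sketched as a plain Cartesian product: one input split per way of choosing one region fragment from each index. This is an illustration only, assuming each fragment can be represented by an opaque value; `SplitSketch` and `crossProduct` are hypothetical names, not Culvert or HIVE classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of building AND input splits (not Culvert's code).
class SplitSketch {
    // Returns every combination that takes exactly one fragment from each index.
    static <T> List<List<T>> crossProduct(List<List<T>> fragmentsPerIndex) {
        List<List<T>> combos = new ArrayList<>();
        combos.add(new ArrayList<T>()); // start from the single empty combination
        for (List<T> fragments : fragmentsPerIndex) {
            List<List<T>> next = new ArrayList<>();
            for (List<T> partial : combos) {
                for (T fragment : fragments) {
                    // Extend each partial combination with one fragment of this index.
                    List<T> extended = new ArrayList<>(partial);
                    extended.add(fragment);
                    next.add(extended);
                }
            }
            combos = next;
        }
        return combos;
    }
}
```

With two indices split into 2 and 3 region fragments, this yields 6 splits, so the number of splits grows multiplicatively with the number of indices in the predicate.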
    18. How It Works: Overview of Indexing Operations
    19. Indexing
        • Indices are built via insertion operations on the client (i.e. Client.put(…))
        • Whether a field is indexed is controlled by a configuration file
        • In the future, will support indexing of arbitrary columns via Map/Reduce
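Client-side index maintenance on insertion can be pictured roughly as follows. This sketch assumes in-memory maps in place of real tables and a set of column names in place of the configuration file; `IndexingPutSketch` is a hypothetical class and does not reflect how Culvert's `Client.put` is actually implemented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of indexing during a client-side put (not Culvert's code).
class IndexingPutSketch {
    // Stand-in for the configuration file that lists indexed columns.
    private final Set<String> indexedColumns;
    // Primary table: row ID -> column -> value.
    final Map<String, Map<String, String>> primary = new TreeMap<>();
    // One index table per indexed column: column -> value -> row IDs.
    final Map<String, TreeMap<String, TreeSet<String>>> indexTables = new HashMap<>();

    IndexingPutSketch(Set<String> indexedColumns) {
        this.indexedColumns = indexedColumns;
    }

    // Writes the cell, then adds an index entry if the column is configured as indexed.
    // Simplification: overwriting a cell does not remove the stale index entry.
    void put(String rowId, String column, String value) {
        primary.computeIfAbsent(rowId, r -> new TreeMap<>()).put(column, value);
        if (indexedColumns.contains(column)) {
            indexTables.computeIfAbsent(column, c -> new TreeMap<>())
                       .computeIfAbsent(value, v -> new TreeSet<>())
                       .add(rowId);
        }
    }
}
```

Keeping a separate map per indexed column mirrors the "one index table per index" choice described earlier in the deck.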
    20. Retrieval
        • Query API is exposed via HIVE and Java
          – HIVE API delegates to Java API
          – Java API is based on subclasses of Constraint
        • Focused on providing parallel, real-time query execution
    21. Walkthrough of Logical Operations on Indices
    22. Logical Operations on Indices
        • Logical operations can be represented as a merge sort if we return the keys from the original table in sorted order
        • Example: AND

        orders:Apples Index | orders:Oranges Index
        1 -> Dean           | 4 -> Dean
        3 -> Susan          | 5 -> Susan
        4 -> John           | 5 -> Paul
        8 -> Paul           | 6 -> George
        14 -> Renee         | 12 -> Karen
        33 -> Sheryl        | 19 -> Tom
    23. Apples <= 3 AND Oranges >= 5
        • First query each index

        orders:Apples Index | orders:Oranges Index
        1 -> Dean           | 4 -> Dean
        3 -> Susan          | 5 -> Susan
        4 -> John           | 5 -> Paul
        8 -> Paul           | 6 -> George
        14 -> Renee         | 12 -> Karen
        33 -> Sheryl        | 19 -> Tom
    24. Apples <= 3 AND Oranges >= 5
        • Then order results for each index
        • Happens on the region servers

        orders:Apples results:  1 -> Dean, 3 -> Susan
        orders:Oranges results: 5 -> Susan, 5 -> Paul, 6 -> George, 12 -> Karen, 19 -> Tom
    25. Apples <= 3 AND Oranges >= 5
        • Then order results for each index
        • Happens on the region servers

        orders:Apples results:  Dean, Susan
        orders:Oranges results: Susan, Paul, George, Karen, Tom
    26. Apples <= 3 AND Oranges >= 5
        • Then order results for each index
        • Notice this happens on the region servers*

        orders:Apples results (done): Dean, Susan
        orders:Oranges results:       Susan, Paul, George, Karen, Tom
    27. Apples <= 3 AND Oranges >= 5
        • Then order results for each index
        • Notice this happens on the region servers*

        orders:Apples results (done):  Dean, Susan
        orders:Oranges results (done): George, Karen, Paul, Susan, Tom
    28. Apples <= 3 AND Oranges >= 5
        • Then merge the sorted results on the client

        orders:Apples queue:  Dean, Susan
        orders:Oranges queue: George, Karen, Paul, Susan, Tom
    29. Apples <= 3 AND Oranges >= 5
        • Dean is lowest, Dean is not on the head of all the queues, discard

        orders:Apples queue:  Dean, Susan
        orders:Oranges queue: George, Karen, Paul, Susan, Tom
    30. Apples <= 3 AND Oranges >= 5
        • George is lowest, George is not on the head of all the queues, discard

        orders:Apples queue:  Dean, Susan
        orders:Oranges queue: George, Karen, Paul, Susan, Tom
    31. Apples <= 3 AND Oranges >= 5
        • Continue…

        orders:Apples queue:  Dean, Susan
        orders:Oranges queue: George, Karen, Paul, Susan, Tom
    32. Apples <= 3 AND Oranges >= 5
        • Susan is on the head of all the queues, return Susan

        orders:Apples queue:  Dean, ✔ Susan
        orders:Oranges queue: George, Karen, Paul, ✔ Susan, Tom
    33. Apples <= 3 AND Oranges >= 5
        • Tom is discarded, now we’re finished

        orders:Apples queue:  Dean, ✔ Susan
        orders:Oranges queue: George, Karen, Paul, ✔ Susan, Tom
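The client-side merge in the walkthrough above can be sketched as an intersection of sorted queues: repeatedly find the lowest head, return it only if it is at the head of every queue, then discard it. This is a minimal sketch under the assumption that each queue is sorted ascending by row ID and contains no duplicates; `AndMergeSketch` is a hypothetical name and not Culvert's `And` constraint.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of AND as a merge over sorted row-ID queues (not Culvert's code).
class AndMergeSketch {
    // Consumes the queues and returns the row IDs present in all of them.
    static List<String> and(List<Deque<String>> queues) {
        List<String> matches = new ArrayList<>();
        if (queues.isEmpty()) {
            return matches;
        }
        while (true) {
            String lowest = null;
            boolean onAllHeads = true;
            for (Deque<String> queue : queues) {
                String head = queue.peekFirst();
                if (head == null) {
                    return matches; // one queue is exhausted, so no further match is possible
                }
                if (lowest == null) {
                    lowest = head;
                } else if (!head.equals(lowest)) {
                    onAllHeads = false;
                    if (head.compareTo(lowest) < 0) {
                        lowest = head;
                    }
                }
            }
            if (onAllHeads) {
                matches.add(lowest); // on the head of every queue: part of the result
            }
            // Discard the lowest row ID from every queue that currently starts with it.
            for (Deque<String> queue : queues) {
                if (lowest.equals(queue.peekFirst())) {
                    queue.pollFirst();
                }
            }
        }
    }

    public static void main(String[] args) {
        Deque<String> apples = new ArrayDeque<>(Arrays.asList("Dean", "Susan"));
        Deque<String> oranges =
            new ArrayDeque<>(Arrays.asList("George", "Karen", "Paul", "Susan", "Tom"));
        System.out.println(and(Arrays.asList(apples, oranges))); // prints [Susan]
    }
}
```

One difference from the slides: the sketch stops as soon as any queue is empty, so Tom is never examined rather than explicitly discarded. The result is the same.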
    34. Joins
        • Numerous methods possible
        • A few examples
          – Use sub-queries to fetch related records
          – Use merge sorting to simultaneously fetch records satisfying both sides of the join, filter those that don’t match
        • Presently, Culvert has only one join (sub-queries method)
    35. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges)
        User performs joins with a JoinConstraint constraint (decorator design pattern)
    36. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges)
        Constraint receives row IDs from a left sub-constraint.

        JoinConstraint <- Left SubConstraint: …, John, …
    37. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges)
        Constraint looks up field values for the left side (if not already present in the results)

        JoinConstraint <- Left SubConstraint: …, John, …

        RowID | order:Apples
        …     | …
        John  | 5
        …     | …
    38. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges)
        For each record in the left result set, the constraint creates a new right-side constraint to fetch indexed items matching the right side of the constraint.

        JoinConstraint <- Left SubConstraint: …, John, …

        RowID | order:Apples        RowID  | order:Oranges
        …     | …                   …      | …
        John  | 5                   George | 5
        …     | …                   Jane   | 5
                                    …      | …
    39. Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges)
        Finally, the joined records are returned.

        Joined records
        …    | … | …
        John | 5 | George
        John | 5 | Jane
        …    | … | …
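The sub-query join steps above can be sketched as a nested lookup: for each left row ID, fetch the left field value, then query the right-side index for rows with the same value. The sketch assumes plain maps in place of the left table and the right index; `SubQueryJoinSketch` is a hypothetical name and is not Culvert's `JoinConstraint`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a join by sub-queries (not Culvert's code).
class SubQueryJoinSketch {
    // leftValues: row ID -> order:Apples value
    // rightIndex: order:Oranges value -> row IDs holding that value
    // Returns one {leftRow, value, rightRow} triple per joined record.
    static List<String[]> join(List<String> leftRowIds,
                               Map<String, Integer> leftValues,
                               Map<Integer, List<String>> rightIndex) {
        List<String[]> joined = new ArrayList<>();
        for (String leftRow : leftRowIds) {
            Integer value = leftValues.get(leftRow); // look up the left-side field value
            if (value == null) {
                continue; // the row has no value for the join column
            }
            // "Sub-query": fetch right-side rows whose indexed value matches.
            List<String> rightRows =
                rightIndex.getOrDefault(value, Collections.<String>emptyList());
            for (String rightRow : rightRows) {
                joined.add(new String[] {leftRow, String.valueOf(value), rightRow});
            }
        }
        return joined;
    }
}
```

For the slide's data (John has `order:Apples` = 5; George and Jane have `order:Oranges` = 5), this returns the two joined records John/5/George and John/5/Jane. The cost is one right-side lookup per left record, which is the trade-off of the sub-queries method compared with a merge-sort join.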
    40. Culvert Java API Examples
        • Goal: to be intuitive and easy to interact with
        • Provide a simple relational API without forcing a developer to use SQL
    41. Culvert API Example: Insertion

            Configuration culvertConf = CConfiguration.getDefault();
            // index definitions are loaded implicitly from the configuration
            Client client = new Client(culvertConf);
            List<CKeyValue> valuesToPut = Lists.newArrayList();
            valuesToPut.add(new CKeyValue(
                "foo".getBytes(), "bar".getBytes(), "baz".getBytes()));
            Put put = new Put(valuesToPut);
            client.put("tableName", put);
    42. Culvert API Example: Retrieval

            Configuration culvertConf = CConfiguration.getDefault();
            // index definitions are loaded implicitly from the configuration
            Client client = new Client(culvertConf);
            Index c1Index = client.getIndexByName("index1");
            Constraint c1Constraint = new IndexRangeConstraint(
                c1Index, new CRange("abba".getBytes(), "cadabra".getBytes()));
            Index[] c2Indices = client.getIndicesForColumn(
                "rabbit".getBytes(), "hat".getBytes());
            Constraint c2Constraint = new IndexRangeConstraint(
                c2Indices[0], new CRange("bar".getBytes(), "foo".getBytes()));
            // combine both index range constraints with a logical AND
            Constraint and = new And(c1Constraint, c2Constraint);
            Iterator<Result> results = client.query("tablename", and);
    43. Future Work
        • (Re)Building Indices via Map/Reduce
        • More index types
          – Document-partitioned
          – Others?
        • More retrieval operations
        • Profiling and tuning
        • Storing configuration details in a table or in Zookeeper
    44. Where to Get It*
        http://github.com/booz-allen-hamilton/culvert
        Where to Tweet It
        #culvert
        *Available 6/29/2011
    45. Culvert Team
        • Ed Kohlwey (@ekohlwey)
        • Jesse Yates (@jesse_yates)
        • Jeremy Walsh
        • Tomer Kishoni (@tokbot)
        • Jason Trost (@jason_trost)
    46. Questions?
