HBase Coprocessor Introduction



An introduction to HBase coprocessors, along with some thoughts on where they could go next.

Published in: Technology

  1. 1. HBase Coprocessor Intro. Anty Rao, Schubert Zhang Aug. 29 2012
  2. 2. Motivation
     • Distributed and parallel computation over data stored within HBase/Bigtable.
     • Architecture: {HBase + MapReduce} vs. {HBase with Coprocessor}
       – {Loosely coupled} vs. {Built in}
       – E.g., for simple additive or aggregating operations such as summing and counting, pushing the computation down to the servers, where it can operate on the data directly without communication overhead, can give a dramatic performance improvement over HBase’s already good scanning performance.
     • To be a framework for flexible and generic extension, and for distributed computation directly within the HBase server processes.
       – Arbitrary code can run at each tablet in each HBase server.
       – Provides a very flexible model for building distributed services.
       – Automatic scaling, load balancing, and request routing for applications.
  3. 3. Motivation (cont.)
     • To be a data-driven distributed and parallel service platform.
       – Distributed parallel computation framework.
       – Distributed application service platform.
     • High-level call interface for clients
       – Calls are addressed to rows or ranges of rows, and the coprocessor client library resolves them to actual locations.
       – Calls across multiple rows are automatically split into multiple parallelized RPCs.
     • Origin
       – Inspired by Google’s Bigtable coprocessors.
       – Jeff Dean gave a talk at LADIS’09:
         • http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf, pages 66-67
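The row-addressed call interface described above can be pictured with a small, self-contained sketch. It is not the HBase client library: `RegionLookupSketch`, its methods, and the server names are hypothetical, and regions are modelled simply as a sorted map from region start key to hosting server. A single-row call resolves to one region; a range call resolves to every region that overlaps the range, and each of those would receive its own RPC.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified model of client-side region lookup (not the HBase API).
final class RegionLookupSketch {
    // Region start key -> name of the server hosting that region.
    private final TreeMap<String, String> serverByStartKey = new TreeMap<>();

    void addRegion(String startKey, String server) {
        serverByStartKey.put(startKey, server);
    }

    // A row belongs to the region with the greatest start key <= row.
    String locate(String row) {
        Map.Entry<String, String> region = serverByStartKey.floorEntry(row);
        if (region == null) {
            throw new IllegalArgumentException("no region covers row " + row);
        }
        return region.getValue();
    }

    // Start keys of every region overlapping [startRow, endRow]; one sub-call each.
    List<String> regionsInRange(String startRow, String endRow) {
        String first = serverByStartKey.floorKey(startRow);
        if (first == null) {
            throw new IllegalArgumentException("no region covers row " + startRow);
        }
        return new ArrayList<>(serverByStartKey.subMap(first, true, endRow, true).keySet());
    }

    public static void main(String[] args) {
        RegionLookupSketch table = new RegionLookupSketch();
        table.addRegion("", "rs1");      // the first region starts at the empty key
        table.addRegion("bbbb", "rs1");
        table.addRegion("cccc", "rs2");
        table.addRegion("dddd", "rs2");
        System.out.println(table.locate("bzzz"));
        System.out.println(table.regionsInRange("b", "cccz"));
    }
}
```

A sorted map is used here because region boundaries are ordered row keys, so both lookups reduce to floor and sub-range queries.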
  4. 4. HBase vs. Google Bigtable
     • The HBase coprocessor is a framework that provides a library and runtime environment for executing user code within the HBase region server and master processes.
     • Google’s coprocessors, in contrast, run co-located with the tablet server but outside of its address space.
       – https://issues.apache.org/jira/browse/HBASE-4047
  5. 5. Google’s Bigtable Coprocessors
  6. 6. Overview of HBase Coprocessor
     • Two scopes
       – System: loaded globally on all tables and regions.
       – Per-table: loaded on all regions of a table.
     • Two types
       – Observers
         • Like triggers in conventional databases.
         • The idea behind observers is that we can insert user code by overriding upcall methods provided by the coprocessor framework. The callback functions are executed from core HBase code when certain events occur.
       – Endpoints
         • Dynamic RPC endpoints that resemble stored procedures.
         • One can invoke an endpoint at any time from the client. The endpoint implementation will then be executed remotely at the target region or regions, and the results of those executions will be returned to the client.
     • Difference between the two types
       – Only endpoints return results to the client.
  7. 7. Observers
     • Currently, three observer interfaces are provided:
       – RegionObserver
         • Provides hooks for data manipulation events: Get, Put, Delete, Scan, and so on. There is an instance of a RegionObserver coprocessor for every table region, and the scope of the observations it can make is constrained to that region.
       – WALObserver
         • Provides hooks for write-ahead log (WAL) related operations. This is a way to observe or intercept WAL writing and reconstruction events. A WALObserver runs in the context of WAL processing. There is one such context per region server.
       – MasterObserver
         • Provides hooks for DDL-type operations, i.e., create, delete, modify table, etc. The MasterObserver runs within the context of the HBase master.
     • Multiple observers are chained and executed sequentially, in order of their assigned priorities.
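The priority-ordered chaining mentioned above can be sketched in plain Java. This is an illustration only, not HBase code: `ObserverChainSketch` and its nested `Observer` interface are hypothetical stand-ins for the framework's hooks, and the sketch assumes that a lower number means a higher priority (runs earlier).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of a region that runs its observers in priority order.
final class ObserverChainSketch {
    // Stand-in for an observer hook such as preGet().
    interface Observer {
        void preGet(String row, List<String> trace);
    }

    private static final class Registered {
        final int priority;
        final Observer observer;

        Registered(int priority, Observer observer) {
            this.priority = priority;
            this.observer = observer;
        }
    }

    private final List<Registered> chain = new ArrayList<>();

    // Assumption: lower number = higher priority. The sort is stable, so
    // observers with equal priority keep their registration order.
    void register(int priority, Observer observer) {
        chain.add(new Registered(priority, observer));
        chain.sort(Comparator.comparingInt((Registered r) -> r.priority));
    }

    // Every pre-hook runs, in priority order, before the operation itself.
    List<String> get(String row) {
        List<String> trace = new ArrayList<>();
        for (Registered r : chain) {
            r.observer.preGet(row, trace);
        }
        trace.add("get:" + row);
        return trace;
    }

    public static void main(String[] args) {
        ObserverChainSketch region = new ObserverChainSketch();
        region.register(20, (row, trace) -> trace.add("user-observer"));
        region.register(10, (row, trace) -> trace.add("system-observer"));
        System.out.println(region.get("row1"));
    }
}
```

An observer that throws from its hook would abort the chain and the operation, which is how the access-control style of observer can veto a request.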
  8. 8. Observers: Example
  9. 9. Observers: Example Code
     package org.apache.hadoop.hbase.coprocessor;

     import java.io.IOException;
     import java.util.List;

     import org.apache.hadoop.hbase.KeyValue;
     import org.apache.hadoop.hbase.client.Get;
     // Package assumed; the location of this class varies by HBase version.
     import org.apache.hadoop.hbase.security.AccessDeniedException;

     // Sample access-control coprocessor. It uses RegionObserver and
     // intercepts the preXXX() methods to check user privileges for the
     // given table and column family.
     public class AccessControlCoprocessor extends BaseRegionObserver {
       @Override
       public void preGet(final ObserverContext<RegionCoprocessorEnvironment> c,
           final Get get, final List<KeyValue> result) throws IOException {
         // Check permissions. permissionGranted() is not defined on the slide.
         if (!permissionGranted()) {
           throw new AccessDeniedException("User is not allowed to access.");
         }
       }

       // override prePut(), preDelete(), etc.
     }
  10. 10. Endpoint
      • Resembles stored procedures.
      • An endpoint can be invoked at any time from the client.
      • The endpoint implementation will then be executed remotely at the target region or regions.
      • The results of those executions will be returned to the client.
      • Code implementation
        – Endpoint is an interface for dynamic RPC extension.
  11. 11. Endpoints: How to implement a custom Coprocessor?
      • Define a new protocol interface that extends CoprocessorProtocol.
      • Implement the Endpoint interface and the new protocol interface. The implementation will be loaded into and executed from the region context.
        – Extend the abstract class BaseEndpointCoprocessor. This convenience class hides some internal details that the implementer need not necessarily be concerned about, such as coprocessor class loading.
      • On the client side, the endpoints can be invoked through two new HBase client APIs:
        – Executing against a single region:
          • HTableInterface.coprocessorProxy(Class<T> protocol, byte[] row)
        – Executing against a range of regions:
          • HTableInterface.coprocessorExec(Class<T> protocol, byte[] startKey, byte[] endKey, Batch.Call<T,R> callable)
  12. 12. Endpoints: Example
      Client code:
        Map<byte[], Long> sumResults =
            table.coprocessorExec(ColumnAggregationProtocol.class, startRow, endRow,
                new Batch.Call<ColumnAggregationProtocol, Long>() {
                  public Long call(ColumnAggregationProtocol instance) throws IOException {
                    return instance.sum(FAMILY, QUALIFIER);
                  }
                });
      [Diagram: the HTable client sends coprocessorExec(...) with the new Batch.Call to all regions in the range. Region Server 1 hosts ColumnAggregationProtocol endpoints for regions "tableA,,12345678" and "tableA,bbbb,12345678"; Region Server 2 hosts them for "tableA,cccc,12345678" and "tableA,dddd,12345678". The batch results come back to the client as Map<byte[], Long> sumResults.]
      • Note that the HBase client is responsible for dispatching parallel endpoint invocations to the target regions, and for collecting the returned results to present to the application code.
      • Like a lightweight MapReduce job: the “map” is the endpoint execution performed in the region server on every target region, and the “reduce” is the final aggregation at the client.
      • The distributed systems programming details are hidden behind a clean API.
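The "lightweight MapReduce" analogy can be made concrete with a self-contained sketch that uses no HBase classes. Everything here is hypothetical: regions are modelled as in-memory arrays of values, a thread pool stands in for the parallel RPCs, and `ScatterGatherSketch` is not part of any library. The per-region sum plays the role of the endpoint ("map"), and the final loop at the caller plays the role of the client-side aggregation ("reduce").

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical scatter-gather model of an endpoint call across regions.
final class ScatterGatherSketch {
    // "Map" step: what the endpoint would compute inside one region.
    static long sumRegion(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Dispatch one task per region in parallel and collect region -> partial sum.
    static Map<String, Long> execOnRegions(Map<String, long[]> regions) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, regions.size()));
        try {
            Map<String, Future<Long>> pending = new TreeMap<>();
            for (Map.Entry<String, long[]> region : regions.entrySet()) {
                final long[] values = region.getValue();
                Callable<Long> task = () -> sumRegion(values);
                pending.put(region.getKey(), pool.submit(task));
            }
            Map<String, Long> partials = new TreeMap<>();
            for (Map.Entry<String, Future<Long>> p : pending.entrySet()) {
                partials.put(p.getKey(), p.getValue().get());
            }
            return partials;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // "Reduce" step: the final aggregation done by the caller.
    static long reduce(Map<String, Long> partials) {
        long total = 0;
        for (long partial : partials.values()) {
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, long[]> regions = new TreeMap<>();
        regions.put("", new long[] {1, 2, 3});
        regions.put("bbbb", new long[] {10});
        regions.put("cccc", new long[] {});
        regions.put("dddd", new long[] {4, 5});
        Map<String, Long> partials = execOnRegions(regions);
        System.out.println(partials + " total=" + reduce(partials));
    }
}
```

Only one number per region travels back to the caller, which is the source of the performance benefit described in the motivation slides.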
  13. 13. Step-1: Define protocol interface
      import java.io.IOException;

      import org.apache.hadoop.hbase.ipc.CoprocessorProtocol;

      /**
       * A sample protocol for performing aggregation at regions.
       */
      public interface ColumnAggregationProtocol extends CoprocessorProtocol {
        /**
         * Perform aggregation for a given column at the region. The aggregation
         * will include all the rows inside the region. It can be extended to allow
         * passing start and end rows for a fine-grained aggregation.
         *
         * @param family the column family
         * @param qualifier the column qualifier
         * @return Aggregation of the column.
         * @throws IOException if the aggregation fails
         */
        public long sum(byte[] family, byte[] qualifier) throws IOException;
      }
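  14. 14. Step-2: Implement endpoint and the interface
      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.List;

      import org.apache.hadoop.hbase.KeyValue;
      import org.apache.hadoop.hbase.client.Scan;
      import org.apache.hadoop.hbase.coprocessor.BaseEndpointCoprocessor;
      import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
      import org.apache.hadoop.hbase.regionserver.InternalScanner;
      import org.apache.hadoop.hbase.util.Bytes;

      public class ColumnAggregationEndpoint extends BaseEndpointCoprocessor
          implements ColumnAggregationProtocol {
        @Override
        public long sum(byte[] family, byte[] qualifier) throws IOException {
          // aggregate at each region
          Scan scan = new Scan();
          scan.addColumn(family, qualifier);
          long sumResult = 0;
          InternalScanner scanner = ((RegionCoprocessorEnvironment) getEnvironment())
              .getRegion().getScanner(scan);
          try {
            List<KeyValue> curVals = new ArrayList<KeyValue>();
            boolean hasMore = false;
            do {
              curVals.clear();
              // next() returns true while more rows remain in the region
              hasMore = scanner.next(curVals);
              // guard against an empty batch (e.g. a region with no matching cells)
              if (!curVals.isEmpty()) {
                KeyValue kv = curVals.get(0);
                sumResult += Bytes.toLong(kv.getBuffer(), kv.getValueOffset());
              }
            } while (hasMore);
          } finally {
            scanner.close();
          }
          return sumResult;
        }
      }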
  15. 15. Step-3: Deployment
      • Two choices
        – Load from configuration (hbase-site.xml, restart HBase)
        – Load from table attribute (disable and enable table)
          • From shell
  16. 16. Step-4: Invoking
      • On the client side, invoking the endpoint:
        HTable table = new HTable(util.getConfiguration(), TEST_TABLE);
        Map<byte[], Long> results;
        // scan: for all regions
        results = table.coprocessorExec(ColumnAggregationProtocol.class,
            ROWS[rowSeperator1 - 1], ROWS[rowSeperator2 + 1],
            new Batch.Call<ColumnAggregationProtocol, Long>() {
              public Long call(ColumnAggregationProtocol instance) throws IOException {
                return instance.sum(TEST_FAMILY, TEST_QUALIFIER);
              }
            });
        long sumResult = 0;
        long expectedResult = 0;
        for (Map.Entry<byte[], Long> e : results.entrySet()) {
          sumResult += e.getValue();
        }
  17. 17. Server side execution
      • The Region Server provides the environment to execute custom coprocessors in the region context.
      • Exec
        – Custom protocol name
        – Method name
        – Method parameters
      public interface HRegionInterface extends VersionedProtocol, Stoppable, Abortable {
        …
        ExecResult execCoprocessor(byte[] regionName, Exec call) throws IOException;
        …
      }
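How a server might turn an Exec (protocol name, method name, parameters) into an actual method call can be sketched with plain Java reflection. This is only a guess at the general shape, not the HBase implementation: `ExecDispatchSketch`, its nested `Exec` class, and `SumProtocol` are hypothetical names, and the lookup matches methods by name and parameter count only.

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of server-side dispatch for a dynamic RPC call.
final class ExecDispatchSketch {
    // Stand-in for a custom protocol interface.
    public interface SumProtocol {
        long sum(long a, long b);
    }

    // Stand-in for the call description sent by the client.
    static final class Exec {
        final String protocolName;
        final String methodName;
        final Object[] params;

        Exec(String protocolName, String methodName, Object... params) {
            this.protocolName = protocolName;
            this.methodName = methodName;
            this.params = params;
        }
    }

    private final Map<String, Class<?>> protocols = new HashMap<>();
    private final Map<String, Object> handlers = new HashMap<>();

    // Register an implementation under its protocol interface name.
    <T> void register(Class<T> protocol, T implementation) {
        protocols.put(protocol.getName(), protocol);
        handlers.put(protocol.getName(), implementation);
    }

    // Look up the handler by protocol name, then the method by name and arity.
    Object execCoprocessor(Exec call) {
        Class<?> protocol = protocols.get(call.protocolName);
        if (protocol == null) {
            throw new IllegalArgumentException("unknown protocol " + call.protocolName);
        }
        try {
            for (Method m : protocol.getMethods()) {
                if (m.getName().equals(call.methodName)
                        && m.getParameterTypes().length == call.params.length) {
                    return m.invoke(handlers.get(call.protocolName), call.params);
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
        throw new IllegalArgumentException("unknown method " + call.methodName);
    }

    public static void main(String[] args) {
        ExecDispatchSketch server = new ExecDispatchSketch();
        server.register(SumProtocol.class, (a, b) -> a + b);
        Exec call = new Exec(SumProtocol.class.getName(), "sum", 2L, 3L);
        System.out.println(server.execCoprocessor(call));
    }
}
```

Looking the method up on the protocol interface rather than on the implementation class keeps the call surface limited to what the protocol declares.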
  18. 18. Coprocessor Management
      • Build your own coprocessor
        – Write the server-side coprocessor code as in the example above, then compile and package it as a jar file.
          • CoprocessorProtocol (e.g. ColumnAggregationProtocol)
          • Endpoint implementation (e.g. ColumnAggregationEndpoint)
      • Coprocessor deployment
        – Load from configuration (hbase-site.xml, restart HBase)
          • The jar file must be in the classpath of the HBase servers.
          • Global for all regions of all tables (system coprocessors).
        – Load from table attribute (from shell)
          • On a per-table basis.
          • The jar file should first be put into HDFS or the HBase servers’ classpath, and then set in the table attribute.
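The table attribute that loads a per-table coprocessor is, as far as we recall from the HBase shell documentation, a single string of the form `path|class|priority|args`. The sketch below parses such a string in plain Java so the four parts are easy to see. `CoprocessorSpecSketch` is a hypothetical helper, not an HBase class; the jar path and class name in `main` are made-up examples, and treating the priority as mandatory is a simplifying assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for a "path|class|priority|args" coprocessor attribute.
final class CoprocessorSpecSketch {
    final String jarPath;
    final String className;
    final int priority;
    final Map<String, String> arguments;

    private CoprocessorSpecSketch(String jarPath, String className, int priority,
            Map<String, String> arguments) {
        this.jarPath = jarPath;
        this.className = className;
        this.priority = priority;
        this.arguments = arguments;
    }

    static CoprocessorSpecSketch parse(String spec) {
        // Limit -1 keeps trailing empty fields so the field count stays honest.
        String[] parts = spec.split("\\|", -1);
        if (parts.length < 3 || parts.length > 4) {
            throw new IllegalArgumentException("expected path|class|priority[|args]: " + spec);
        }
        Map<String, String> arguments = new LinkedHashMap<>();
        if (parts.length == 4 && !parts[3].trim().isEmpty()) {
            for (String pair : parts[3].split(",")) {
                String[] kv = pair.split("=", 2);
                if (kv.length != 2) {
                    throw new IllegalArgumentException("bad argument: " + pair);
                }
                arguments.put(kv[0].trim(), kv[1].trim());
            }
        }
        return new CoprocessorSpecSketch(parts[0].trim(), parts[1].trim(),
                Integer.parseInt(parts[2].trim()), arguments);
    }

    public static void main(String[] args) {
        // Made-up jar location and class name, for illustration only.
        CoprocessorSpecSketch spec =
                parse("hdfs:///foo.jar|com.foo.FooRegionObserver|1001|arg1=1,arg2=2");
        System.out.println(spec.jarPath + " " + spec.className + " "
                + spec.priority + " " + spec.arguments);
    }
}
```

The jar path is what lets a per-table coprocessor live in HDFS instead of on every server's classpath, which matches the two deployment options listed above.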
  19. 19. Future Work based on Coprocessors
      • Parallel computation framework (our first goal!)
        – Higher level of abstraction
        – E.g., APIs similar to MapReduce.
        – Integration and implementation of Dremel and/or the Dremel computation model in HBase.
      • Distributed application service platform (our second goal!?)
        – Higher level of abstraction
        – Data-driven distributed application architecture.
        – Avoid building similar distributed architectures repeatedly.
      • HBase system enhancements
        – HBase internal measurements and statistics for administration.
      • Support applications like Percolator
        – Observers and notifications.
      • Others
        – External Coprocessor Host (HBASE-4047)
          • separate processes
        – Code Weaving (HBASE-2058)
          • protect against malicious actions or faults accidentally introduced by a coprocessor.
        – …
  20. 20. Reference
      • https://blogs.apache.org/hbase/entry/coprocessor_introduction