HBaseCon, May 2012 | HBase Coprocessors | Lars George, Solutions Architect
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on the Cluster - Cloudera

Coprocessors, a newly added HBase feature, let the application designer move functionality closer to where the data resides. While this sounds like stored procedures as known in the RDBMS realm, coprocessors have a different set of properties. The distributed nature of HBase adds to the complexity of their implementation, but the client-side API allows easy, transparent access to their functionality across many servers. This session explains the concepts behind coprocessors and uses examples to show how they can be used to implement data-side extensions to the application code.


  1. HBaseCon, May 2012 | HBase Coprocessors | Lars George, Solutions Architect
  2. Revision History
         Version    Revised By    Description of Revision
         Version 1  Lars George   Initial version
     ©2011 Cloudera, Inc. All Rights Reserved. Confidential. Reproduction or redistribution without written permission is prohibited.
  3. Overview
     • Coprocessors were added to Bigtable
       – Mentioned during the LADIS 2009 talk
     • Run user code within each region of a table
       – Code splits and moves with the region
     • Define a high-level call interface for clients
     • Calls are addressed to rows or ranges of rows
     • Implicit automatic scaling, load balancing, and request routing
  4. Example Use Cases
     • Bigtable uses Coprocessors
       – Scalable metadata management
       – Distributed language model for machine translation
       – Distributed query processing for full-text index
       – Regular expression search in code repository
     • MapReduce jobs over HBase are often map-only jobs
       – Row keys are already sorted and distinct
     ➜ Could be replaced by Coprocessors
  5. HBase Coprocessors
     • Inspired by Google's Coprocessors
       – Not much information available, but the general idea is understood
     • Define various types of server-side code extensions
       – Associated with a table using a table property
       – Attribute is a path to a JAR file
       – JAR is loaded when the region is opened
       – Blends new functionality with existing
     • Can be chained with Priorities and Load Order
     ➜ Allows for dynamic RPC extensions
  6. Coprocessor Classes and Interfaces
     • The Coprocessor interface
       – All user code must implement this interface
     • The CoprocessorEnvironment interface
       – Retains state across invocations
       – Predefined classes
     • The CoprocessorHost class
       – Ties state and user code together
       – Predefined classes
  7. Coprocessor Priority
     • System or User
         /** Highest installation priority */
         static final int PRIORITY_HIGHEST = 0;
         /** High (system) installation priority */
         static final int PRIORITY_SYSTEM = Integer.MAX_VALUE / 4;
         /** Default installation priority for user coprocessors */
         static final int PRIORITY_USER = Integer.MAX_VALUE / 2;
         /** Lowest installation priority */
         static final int PRIORITY_LOWEST = Integer.MAX_VALUE;
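The constants above only fix numeric levels. The following is a minimal sketch of how a host could turn them into an execution order. It assumes that a lower value runs earlier and that ties are broken by load sequence, which is one reading of "Priorities and Load Order" on slide 5. The `Loaded` class and the `executionOrder` helper are hypothetical and not part of the HBase API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class PriorityOrderSketch {
    // Values copied from the slide.
    static final int PRIORITY_HIGHEST = 0;
    static final int PRIORITY_SYSTEM = Integer.MAX_VALUE / 4;
    static final int PRIORITY_USER = Integer.MAX_VALUE / 2;
    static final int PRIORITY_LOWEST = Integer.MAX_VALUE;

    /** Hypothetical record of one loaded coprocessor (not an HBase class). */
    static final class Loaded {
        final String className;
        final int priority;  // assumption: lower value runs earlier
        final int loadOrder; // assumption: sequence number assigned at load time

        Loaded(String className, int priority, int loadOrder) {
            this.className = className;
            this.priority = priority;
            this.loadOrder = loadOrder;
        }
    }

    /** Sorts by priority first, then by load order, and returns the class names. */
    static List<String> executionOrder(List<Loaded> loaded) {
        List<Loaded> sorted = new ArrayList<>(loaded);
        sorted.sort(Comparator
            .comparingInt((Loaded c) -> c.priority)
            .thenComparingInt(c -> c.loadOrder));
        List<String> names = new ArrayList<>();
        for (Loaded c : sorted) {
            names.add(c.className);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Loaded> loaded = Arrays.asList(
            new Loaded("user.SecondaryIndexer", PRIORITY_USER, 0),
            new Loaded("system.AccessChecker", PRIORITY_SYSTEM, 1),
            new Loaded("user.Auditor", PRIORITY_USER, 2));
        System.out.println(executionOrder(loaded));
    }
}
```

Under these assumptions the system-priority class runs before both user-priority classes even though it was loaded second, and the two user classes keep their load order.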
  8. Coprocessor Environment
     • Available Methods
  9. Coprocessor Host
     • Maintains all Coprocessor instances and their environments (state)
     • Concrete Classes
       – MasterCoprocessorHost
       – RegionCoprocessorHost
       – WALCoprocessorHost
     • Subclasses provide access to specialized Environment implementations
  10. Control Flow
  11. Coprocessor Interface
     • Base for all other types of Coprocessors
     • start() and stop() methods for lifecycle management
     • State as defined in the interface:
  12. Observer Classes
     • Comparable to database triggers
       – Callback functions/hooks for every explicit API method, but also all important internal calls
     • Concrete Implementations
       – MasterObserver
         • Hooks into HMaster API
       – RegionObserver
         • Hooks into Region related operations
       – WALObserver
         • Hooks into write-ahead log operations
  13. Region Observers
     • Can mediate (veto) actions
       – Used by the security policy extensions
       – Priority allows mediators to run first
     • Hooks into all CRUD+S API calls and more
       – get(), put(), delete(), scan(), increment(),…
       – checkAndPut(), checkAndDelete(),…
       – flush(), compact(), split(),…
     • Pre/Post Hooks for every call
     • Can be used to build secondary indexes, filters
  14. Endpoint Classes
     • Define a dynamic RPC protocol, used between client and region server
     • Executes arbitrary code, loaded in region server
       – Future development will add code weaving/inspection to deny any malicious code
     • Steps to add your own methods
       – Define and implement your own protocol
       – Implement endpoint coprocessor
       – Call HTable's coprocessorExec() or coprocessorProxy()
  15. Coprocessor Loading
     • There are two ways: dynamic or static
       – Static: use configuration files and table schema
       – Dynamic: not available (yet)
     • For static loading from configuration:
       – Order is important (defines the execution order)
       – Special property key for each host type
       – Region related classes are loaded for all regions and tables
       – Priority is always System
       – JAR must be on class path
  16. Loading from Configuration
     • Example:
         <property>
           <name>hbase.coprocessor.region.classes</name>
           <value>coprocessor.RegionObserverExample, coprocessor.AnotherCoprocessor</value>
         </property>
         <property>
           <name>hbase.coprocessor.master.classes</name>
           <value>coprocessor.MasterObserverExample</value>
         </property>
         <property>
           <name>hbase.coprocessor.wal.classes</name>
           <value>coprocessor.WALObserverExample, bar.foo.MyWALObserver</value>
         </property>
  17. Coprocessor Loading (cont.)
     • For static loading from table schema:
       – Definition per table
       – For all regions of the table
       – Only region related classes, not WAL or Master
       – Added to HTableDescriptor, when table is created or altered
       – Allows setting the priority and JAR path
         COPROCESSOR$<num> ➜ <path-to-jar>|<classname>|<priority>
  18. Loading from Table Schema
     • Example:
         COPROCESSOR$1 => hdfs://localhost:8020/users/leon/test.jar|coprocessor.Test|10
         COPROCESSOR$2 => /Users/laura/test2.jar|coprocessor.AnotherTest|1000
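The table attribute value packs three fields into one pipe-separated string. As a rough illustration of that layout, the sketch below splits such a value into JAR path, class name, and priority. The `Spec` class and `parse` method are hypothetical helpers written for this example; they are not how HBase itself reads the attribute.

```java
final class CoprocessorSpecSketch {
    /** Hypothetical holder for the three fields of a COPROCESSOR$<num> value. */
    static final class Spec {
        final String jarPath;
        final String className;
        final int priority;

        Spec(String jarPath, String className, int priority) {
            this.jarPath = jarPath;
            this.className = className;
            this.priority = priority;
        }
    }

    /** Splits "<path-to-jar>|<classname>|<priority>" into its three parts. */
    static Spec parse(String value) {
        String[] parts = value.split("\\|");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "expected <path-to-jar>|<classname>|<priority>, got: " + value);
        }
        return new Spec(parts[0].trim(), parts[1].trim(),
            Integer.parseInt(parts[2].trim()));
    }

    public static void main(String[] args) {
        // Value taken from the COPROCESSOR$1 example on the slide.
        Spec spec = parse(
            "hdfs://localhost:8020/users/leon/test.jar|coprocessor.Test|10");
        System.out.println(spec.className + " from " + spec.jarPath
            + " at priority " + spec.priority);
    }
}
```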
  19. Example: Add Coprocessor
         public static void main(String[] args) throws IOException {
           Configuration conf = HBaseConfiguration.create();
           FileSystem fs = FileSystem.get(conf);
           // Location of the JAR that contains the coprocessor class
           Path path = new Path(fs.getUri() + Path.SEPARATOR + "test.jar");
           HTableDescriptor htd = new HTableDescriptor("testtable");
           htd.addFamily(new HColumnDescriptor("colfam1"));
           // Attribute format: <path-to-jar>|<classname>|<priority>
           htd.setValue("COPROCESSOR$1", path.toString() +
             "|" + RegionObserverExample.class.getCanonicalName() +
             "|" + Coprocessor.PRIORITY_USER);
           HBaseAdmin admin = new HBaseAdmin(conf);
           admin.createTable(htd);
           System.out.println(admin.getTableDescriptor(
             Bytes.toBytes("testtable")));
         }
  20. Example Output
         {NAME => testtable, COPROCESSOR$1 =>
         file:/test.jar|coprocessor.RegionObserverExample|1073741823,
         FAMILIES => [{NAME => colfam1, BLOOMFILTER => NONE,
         REPLICATION_SCOPE => 0, COMPRESSION => NONE, VERSIONS => 3,
         TTL => 2147483647, BLOCKSIZE => 65536, IN_MEMORY => false,
         BLOCKCACHE => true}]}
  21. Region Observers
     • Handles all region related events
     • Hooks for two classes of operations:
       – Lifecycle changes
       – Client API Calls
     • All client API calls have a pre/post hook
       – Can be used to grant access on preGet()
       – Can be used to update secondary indexes on postPut()
  22. Handling Region Lifecycle Events
     • Hook into pending open, open, and pending close state changes
     • Called implicitly by the framework
       – preOpen(), postOpen(),…
     • Used to piggyback or fail the process, e.g.
       – Cache warm up after a region opens
       – Suppress region splitting, compactions, flushes
  23. Region Environment
  24. Special Hook Parameter
         public interface RegionObserver extends Coprocessor {

           /**
            * Called before the region is reported as open to the master.
            * @param c the environment provided by the region server
            */
           void preOpen(final
             ObserverContext<RegionCoprocessorEnvironment> c);

           /**
            * Called after the region is reported as open to the master.
            * @param c the environment provided by the region server
            */
           void postOpen(final
             ObserverContext<RegionCoprocessorEnvironment> c);
  25. ObserverContext
  26. Chain of Command
     • In particular, the complete() and bypass() methods allow changing the processing chain
       – complete() ends the chain at the current coprocessor
       – bypass() completes the pre/post chain but uses the last value returned by the coprocessors, possibly not calling the actual API method (for pre-hooks)
  27. Example: Pre-Hook Complete
         @Override
         public void preSplit(ObserverContext
             <RegionCoprocessorEnvironment> e) {
           e.complete();
         }
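To make the difference between complete() and bypass() concrete, here is a small stand-alone simulation of a pre-hook chain. It is a sketch under the reading given on slide 26: complete() stops later coprocessors from running, while bypass() lets the chain finish but skips the default action. `Context`, `PreHook`, `Outcome`, and `runPreHooks` are hypothetical names and only loosely modeled on ObserverContext.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HookChainSketch {
    /** Hypothetical per-call context, loosely modeled on ObserverContext. */
    static final class Context {
        private boolean bypass;
        private boolean complete;

        void bypass() { bypass = true; }
        void complete() { complete = true; }
    }

    /** Hypothetical pre-hook: a name plus the flags it sets on the context. */
    static final class PreHook {
        final String name;
        final boolean callsBypass;
        final boolean callsComplete;

        PreHook(String name, boolean callsBypass, boolean callsComplete) {
            this.name = name;
            this.callsBypass = callsBypass;
            this.callsComplete = callsComplete;
        }

        void run(Context ctx) {
            if (callsBypass) ctx.bypass();
            if (callsComplete) ctx.complete();
        }
    }

    /** Which hooks ran, and whether the default action would still execute. */
    static final class Outcome {
        final List<String> hooksRun;
        final boolean defaultActionRuns;

        Outcome(List<String> hooksRun, boolean defaultActionRuns) {
            this.hooksRun = hooksRun;
            this.defaultActionRuns = defaultActionRuns;
        }
    }

    /** Walks the chain in order, honoring complete() and bypass(). */
    static Outcome runPreHooks(List<PreHook> chain) {
        List<String> hooksRun = new ArrayList<>();
        boolean bypassed = false;
        for (PreHook hook : chain) {
            Context ctx = new Context();
            hook.run(ctx);
            hooksRun.add(hook.name);
            bypassed |= ctx.bypass;   // any hook may veto the default action
            if (ctx.complete) break;  // later hooks are skipped
        }
        return new Outcome(hooksRun, !bypassed);
    }

    public static void main(String[] args) {
        Outcome outcome = runPreHooks(Arrays.asList(
            new PreHook("first", false, false),
            new PreHook("second", false, true),   // calls complete()
            new PreHook("third", true, false)));  // never reached
        System.out.println(outcome.hooksRun + ", default action runs: "
            + outcome.defaultActionRuns);
    }
}
```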
  28. Master Observer
     • Handles all HMaster related events
       – DDL type calls, e.g. create table, add column
       – Region management calls, e.g. move, assign
     • Pre/post hooks with Context
     • Specialized environment provided
  29. Master Environment
  30. Master Services (cont.)
     • Very powerful features
       – Access the AssignmentManager to modify plans
       – Access the MasterFileSystem to create or access resources on HDFS
       – Access the ServerManager to get the list of known servers
       – Use the ExecutorService to run system-wide background processes
     • Be careful (for now)!
  31. Example: Master Post Hook
         public class MasterObserverExample
           extends BaseMasterObserver {
           @Override
           public void postCreateTable(
             ObserverContext<MasterCoprocessorEnvironment> env,
             HRegionInfo[] regions, boolean sync)
             throws IOException {
             String tableName =
               regions[0].getTableDesc().getNameAsString();
             MasterServices services =
               env.getEnvironment().getMasterServices();
             MasterFileSystem masterFileSystem =
               services.getMasterFileSystem();
             FileSystem fileSystem = masterFileSystem.getFileSystem();
             // Create a "<table>-blobs" directory once the table exists
             Path blobPath = new Path(tableName + "-blobs");
             fileSystem.mkdirs(blobPath);
           }
         }
  32. Example Output
         hbase(main):001:0> create 'testtable', 'colfam1'
         0 row(s) in 0.4300 seconds

         $ bin/hadoop dfs -ls
         Found 1 items
         drwxr-xr-x - larsgeorge supergroup 0 ... /user/larsgeorge/testtable-blobs
  33. Endpoints
     • Dynamic RPC extends server-side functionality
       – Useful for MapReduce-like implementations
       – Handles the Map part server-side, Reduce needs to be done client-side
     • Based on CoprocessorProtocol interface
     • Routing to regions is based on either single row keys, or row key ranges
       – Call is sent, no matter if row exists or not, since region start and end keys are coarse grained
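The routing rule above can be pictured with a sorted map of region start keys. The sketch below is only an illustration: it uses `String` keys in place of HBase's byte arrays, assumes the first region starts at the empty key, and the `RegionRoutingSketch` class with its region names is hypothetical. It shows why a call is routed to a region whether or not the given row is actually stored there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class RegionRoutingSketch {
    // Start key -> region name. Assumption: the first region starts at "".
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    /** The region whose start key is the greatest one not above the row key. */
    String regionForRow(String rowKey) {
        return regionsByStartKey.floorEntry(rowKey).getValue();
    }

    /** All regions overlapping [startRow, endRow]; requires startRow <= endRow. */
    List<String> regionsForRange(String startRow, String endRow) {
        String first = regionsByStartKey.floorKey(startRow);
        return new ArrayList<>(
            regionsByStartKey.subMap(first, true, endRow, true).values());
    }

    public static void main(String[] args) {
        RegionRoutingSketch table = new RegionRoutingSketch();
        table.addRegion("", "region-A");      // covers keys below "row3"
        table.addRegion("row3", "region-B");  // covers "row3" and above

        // The row does not need to exist; only the key ranges matter.
        System.out.println(table.regionForRow("row4"));
        System.out.println(table.regionForRow("no-such-row"));
        System.out.println(table.regionsForRange("row1", "row5"));
    }
}
```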
  34. Custom Endpoint Implementation
     • Involves two steps:
       – Extend the CoprocessorProtocol interface
         • Defines the actual protocol
       – Extend the BaseEndpointCoprocessor
         • Provides the server-side code and the dynamic RPC method
  35. Example: Row Count Protocol
         public interface RowCountProtocol
           extends CoprocessorProtocol {
           long getRowCount()
             throws IOException;
           long getRowCount(Filter filter)
             throws IOException;
           long getKeyValueCount()
             throws IOException;
         }
  36. Example: Endpoint for Row Count
         public class RowCountEndpoint
           extends BaseEndpointCoprocessor
           implements RowCountProtocol {

           private long getCount(Filter filter,
             boolean countKeyValues) throws IOException {
             Scan scan = new Scan();
             scan.setMaxVersions(1);
             if (filter != null) {
               scan.setFilter(filter);
             }
  37. Example: Endpoint for Row Count
             RegionCoprocessorEnvironment environment =
               (RegionCoprocessorEnvironment)
               getEnvironment();
             // use an internal scanner to perform
             // scanning.
             InternalScanner scanner =
               environment.getRegion().getScanner(scan);
             long result = 0;
  38. Example: Endpoint for Row Count
             try {
               List<KeyValue> curVals =
                 new ArrayList<KeyValue>();
               boolean hasMore = false;
               do {
                 curVals.clear();
                 // next() returns true while more rows follow
                 hasMore = scanner.next(curVals);
                 // an empty batch (e.g. an empty region) is not a row
                 if (!curVals.isEmpty()) {
                   result += countKeyValues ? curVals.size() : 1;
                 }
               } while (hasMore);
             } finally {
               scanner.close();
             }
             return result;
           }
  39. Example: Endpoint for Row Count
           @Override
           public long getRowCount() throws IOException {
             return getRowCount(new FirstKeyOnlyFilter());
           }

           @Override
           public long getRowCount(Filter filter) throws IOException {
             return getCount(filter, false);
           }

           @Override
           public long getKeyValueCount() throws IOException {
             return getCount(null, true);
           }
         }
  40. Endpoint Invocation
     • There are two ways to invoke the call
       – By Proxy, using HTable.coprocessorProxy()
         • Uses a delayed model, i.e. the call is sent when the proxied method is invoked
       – By Exec, using HTable.coprocessorExec()
         • The call is sent in parallel to all regions and the results are collected immediately
     • The Batch.Call class is used by coprocessorExec() to wrap the calls per region
     • The optional Batch.Callback can be used to react upon completion of the remote call
  41. Exec vs. Proxy
  42. Example: Invocation by Exec
         public static void main(String[] args) throws IOException {
           Configuration conf = HBaseConfiguration.create();
           HTable table = new HTable(conf, "testtable");
           try {
             // null start and end keys: call every region of the table
             Map<byte[], Long> results =
               table.coprocessorExec(RowCountProtocol.class, null, null,
                 new Batch.Call<RowCountProtocol, Long>() {
                   @Override
                   public Long call(RowCountProtocol counter)
                     throws IOException {
                     return counter.getRowCount();
                   }
                 });
  43. Example: Invocation by Exec
             // client-side aggregation of the per-region results
             long total = 0;
             for (Map.Entry<byte[], Long> entry :
                 results.entrySet()) {
               total += entry.getValue().longValue();
               System.out.println("Region: " +
                 Bytes.toString(entry.getKey()) +
                 ", Count: " + entry.getValue());
             }
             System.out.println("Total Count: " + total);
           } catch (Throwable throwable) {
             throwable.printStackTrace();
           }
         }
  44. Example Output
         Region: testtable,,1303417572005.51f9e2251c...cbcb0c66858f., Count: 2
         Region: testtable,row3,1303417572005.7f3df4dcba...dbc99fce5d87., Count: 3
         Total Count: 5
  45. Batch Convenience
     • Batch.forMethod() helps to quickly map a protocol function into a Batch.Call
     • Useful for single method calls to the servers
     • Uses the Java reflection API to retrieve the named method
     • Saves you from implementing the anonymous inline class
  46. Batch Convenience
         Batch.Call<RowCountProtocol, Long> call =
           Batch.forMethod(
             RowCountProtocol.class,
             "getKeyValueCount");
         Map<byte[], Long> results =
           table.coprocessorExec(
             RowCountProtocol.class,
             null, null, call);
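Slide 45 notes that Batch.forMethod() relies on Java reflection to look up the named method. The sketch below shows that general technique in isolation. The `Call` and `Counter` interfaces and this `forMethod` are hypothetical stand-ins written for the example; the real Batch class may differ in details such as argument handling and exceptions.

```java
import java.lang.reflect.Method;

final class ForMethodSketch {
    /** Hypothetical stand-in for a per-region call wrapper. */
    interface Call<T, R> {
        R call(T instance);
    }

    /** Hypothetical protocol with two no-argument methods. */
    interface Counter {
        long getRowCount();
        long getKeyValueCount();
    }

    /** Looks up a no-argument method by name and wraps it in a Call. */
    static <T, R> Call<T, R> forMethod(Class<T> protocol, String methodName) {
        final Method method;
        try {
            method = protocol.getMethod(methodName);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("no such method: " + methodName, e);
        }
        return instance -> {
            try {
                // The caller chooses R; the cast is unchecked by design here.
                @SuppressWarnings("unchecked")
                R result = (R) method.invoke(instance);
                return result;
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        };
    }

    public static void main(String[] args) {
        Counter counter = new Counter() {
            @Override public long getRowCount() { return 5L; }
            @Override public long getKeyValueCount() { return 9L; }
        };
        Call<Counter, Long> call = forMethod(Counter.class, "getKeyValueCount");
        System.out.println(call.call(counter));
    }
}
```

The trade-off is the usual one with reflection: the method name is only checked at run time, so a typo surfaces as an exception instead of a compile error.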
  47. Call Multiple Endpoints
     • Sometimes you need to call more than one endpoint in a single round-trip call to the servers
     • This requires an anonymous inline class, since Batch.forMethod() cannot handle this
  48. Call Multiple Endpoints
         Map<byte[], Pair<Long, Long>> results =
           table.coprocessorExec(
             RowCountProtocol.class, null, null,
             new Batch.Call<RowCountProtocol,
                 Pair<Long, Long>>() {
               @Override
               public Pair<Long, Long> call(
                   RowCountProtocol counter)
                   throws IOException {
                 // both counts are computed in one call per region
                 return new Pair<Long, Long>(
                   counter.getRowCount(),
                   counter.getKeyValueCount());
               }
             });
  49. Example: Invocation by Proxy
         // the proxy is bound to the region that would hold "row4"
         RowCountProtocol protocol =
           table.coprocessorProxy(
             RowCountProtocol.class,
             Bytes.toBytes("row4"));
         // the remote call happens here, not when the proxy is created
         long rowsInRegion =
           protocol.getRowCount();
         System.out.println(
           "Region Row Count: " +
           rowsInRegion);
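The delayed model of the proxy call can be imitated with the JDK's own dynamic proxies. This is a sketch of the idea only, not of how HTable builds its proxy: the `RowCounter` interface, the `proxyFor` helper, and the canned return value are all hypothetical. It shows that creating the proxy sends nothing, and that the "remote" call is recorded only when a method is invoked.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

final class DelayedProxySketch {
    /** Hypothetical protocol with a single method. */
    interface RowCounter {
        long getRowCount();
    }

    /**
     * Returns a proxy bound to a row key. Each invocation is recorded in
     * `sent` (standing in for a remote call) and answers with a canned count.
     * Object methods such as toString() are not handled in this sketch.
     */
    static RowCounter proxyFor(String row, long cannedCount, List<String> sent) {
        InvocationHandler handler = (proxy, method, methodArgs) -> {
            sent.add(method.getName() + "@" + row);
            return cannedCount;
        };
        return (RowCounter) Proxy.newProxyInstance(
            RowCounter.class.getClassLoader(),
            new Class<?>[] { RowCounter.class },
            handler);
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        RowCounter counter = proxyFor("row4", 3L, sent);
        System.out.println("calls sent after creating the proxy: " + sent.size());
        long rows = counter.getRowCount();
        System.out.println("rows: " + rows + ", calls sent: " + sent);
    }
}
```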