Apache Hive Hook

I couldn't find enough information about Hive hooks, so I made this presentation.
I hope it will be useful when you want to use hooks.
It also includes some information about metastore event listeners.
It is based on the release-0.11 tag.

  • Hi,
    I saw your slides; they are quite nice, but I have a few questions. I want to write a listener that fires when a new partition is added to the Hive metastore.
    I placed the jar containing my event listener in usr/lib/hive/lib and added an entry to hive-site.xml, but the listener is not executed when a new partition is added. Do I need to do anything else?

Transcript

  • 1. Apache Hive Hook 2013. 8 Minwoo Kim michael.kim@nexr.com
  • 2. Apache Hive Hook • I made this because Ryan asked me about Hive hooks and I couldn’t find any information about them in the Hive wiki. • I hope it helps when you develop applications on Hive and want extra information while a query executes. • This document is based on the release-0.11 tag. • Source: https://github.com/apache/hive (mirror of Apache Hive)
  • 3. What is a hook? • As you know, this is about computer programming technique, but .. • Hooking - Techniques for intercepting function calls or messages or events in an operating system, applications, and other software components. • Hook - Code that handles intercepted function calls, events or messages
  • 4. Hive provides some hooking points • pre-execution • post-execution • execution-failure • pre- and post-driver-run • pre- and post-semantic-analyze • metastore-initialize
  • 5. How to set up hooks in Hive
    Setting the hook property in hive-site.xml:
    <property>
      <name>hive.exec.pre.hooks</name>
      <value></value>
      <description>Comma-separated list of pre-execution hooks to be invoked for each statement. A pre-execution hook is specified as the name of a Java class which implements the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface.</description>
    </property>
    Setting the path of the jars that contain the implementations of the hook interfaces or abstract classes:
    <property>
      <name>hive.aux.jars.path</name>
      <value></value>
    </property>
    You can use hive.added.jars.path instead of hive.aux.jars.path.
  • 6. Hive hook properties and interfaces
    Property | Interface or abstract class
    hive.exec.pre.hooks | org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext (PreExecute is deprecated)
    hive.exec.post.hooks | org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext (PostExecute is deprecated)
    hive.exec.failure.hooks | org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext
    hive.metastore.init.hooks | org.apache.hadoop.hive.metastore.MetaStoreInitListener
    hive.exec.driver.run.hooks | org.apache.hadoop.hive.ql.HiveDriverRunHook
    hive.semantic.analyzer.hook | org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook
  • 7. When do those hooks fire? • You can submit a query to Hive through the following entry points - CLIDriver main method (called by shell script) - HCatCli main method (called by shell script) - HiveServer (called by thrift client) - HiveServer2 (called by thrift client or beeline)
  • 8-10. CLIDriver, HCatCli, HiveServer
    CLIDriver: CLIDriver.main() ➔ run() ➔ executeDriver() ➔ processLine() ➔ processCmd() ➔ is remote?
      yes ↳ CliSessionState.getClient() ↳ HiveClient.execute() ➠
      no ➔ processLocalCmd() ➔ Driver.run() ➠
    HCatCli: HCatCli.main() ➔ processLine() ➔ processCmd() ➔ HCatDriver.run() ⤇ Driver.run() ➠
    HiveServer: HiveServer.execute() ➔ Driver.run() ➠
  • 11-12. HiveServer2
    ThriftCLIService.ExecuteStatement() ➔ CLIService.executeStatement()
    CLIService.executeStatement() ↳ SessionManager.getSession() ↳ HiveSession.executeStatement() ↳ OperationManager.newExecuteStatementOperation() ↳ SQLOperation.run() ➔ Driver.run() ➠
    • OperationManager.newExecuteStatementOperation() is a kind of factory - AddResourceOperation, DeleteResourceOperation, DfsOperation, GetCatalogsOperation, GetColumnsOperation, GetFunctionsOperation, GetSchemasOperation, GetTablesOperation, GetTableTypesOperation, GetTypeInfoOperation, SetOperation, SQLOperation
  • 13-14. ➠ Driver.run() ➔ Driver.runInternal() ↳ Driver.compile() ↳ ParseDriver.parse() ↝ HiveParser
    • HiveParser.g - SelectClauseParser.g - FromClauseParser.g - IdentifiersParser.g
    • ParseDriver.parse() - Command String ➡ root of the AST
  • 15. ➠ Driver.run() ➔ Driver.runInternal() ↳ Driver.compile() ↳ ParseDriver.parse() ↳ SemanticAnalyzer.analyze() • SemanticAnalyzerFactory.get(conf, ast) - SemanticAnalyzer, ColumnStatsSemanticAnalyzer, ExplainSemanticAnalyzer, ExportSemanticAnalyzer, FunctionSemanticAnalyzer, ImportSemanticAnalyzer, LoadSemanticAnalyzer, MacroSemanticAnalyzer
  • 16. ➠ Driver.run() ➔ Driver.runInternal() ↳ Driver.compile() ↳ ParseDriver.parse() ↳ SemanticAnalyzer.analyze() ➔ analyzeInternal() • processPositionAlias() • doPhase1() • getMetaData() • genPlan() • Optimizer.optimize() • MapReduceCompiler.compile() {
  • 17. ➠ Driver.run() ➔ Driver.runInternal() ↳ Driver.compile() ↳ ParseDriver.parse() ↳ SemanticAnalyzer.analyze() • FilterOperator • SelectOperator • ForwardOperator • FileSinkOperator • ScriptOperator • PTFOperator • ReduceSinkOperator • ExtractOperator • GroupByOperator • JoinOperator • MapJoinOperator • SMBMapJoinOperator • LimitOperator • TableScanOperator • UnionOperator • UDTFOperator • LateralViewJoinOperator • LateralViewForwardOperator • HashTableDummyOperator • HashTableSinkOperator • DummyStoreOperator • DemuxOperator • MuxOperator ➔ analyzeInternal() • processPositionAlias() • doPhase1() • getMetaData() • genPlan() • Optimizer.optimize() • MapReduceCompiler.compile() {
  • 18. ➠ Driver.run() ➔ Driver.runInternal() ↳ Driver.compile() ↳ ParseDriver.parse() ↳ SemanticAnalyzer.analyze() • PredicateTransitivePropagate • PredicatePushDown • PartitionPruner • PartitionConditionRemover • ListBucketingPruner • ListBucketingPruner • ColumnPruner • SkewJoinOptimizer • RewriteGBUsingIndex • GroupByOptimizer • SamplePruner • MapJoinProcessor • BucketMapJoinOptimizer • BucketMapJoinOptimizer • SortedMergeBucketMapJoinOptimizer • BucketingSortingReduceSinkOptimizer • UnionProcessor • JoinReorder • ReduceSinkDeDuplication • NonBlockingOpDeDupProc • GlobalLimitOptimizer • CorrelationOptimizer • SimpleFetchOptimizer ➔ analyzeInternal() • processPositionAlias() • doPhase1() • getMetaData() • genPlan() • Optimizer.optimize() • MapReduceCompiler.compile() {
  • 19. ➠ Driver.run() ➔ Driver.runInternal() ↳ Driver.compile() ↳ ParseDriver.parse() ↳ SemanticAnalyzer.analyze() • MapRedTask • FetchTask • ConditionalTask • ExplainTask • CopyTask • DDLTask • MoveTask • FunctionTask • StatsTask • ColumnStatsTask • DependencyCollectionTask ➔ analyzeInternal() • processPositionAlias() • doPhase1() • getMetaData() • genPlan() • Optimizer.optimize() • MapReduceCompiler.compile() {
  • 20. ➠ Driver.run() ➔ Driver.runInternal() ↳ Driver.compile() ↳ ParseDriver.parse() ↳ SemanticAnalyzer.analyze() ↳ Driver.execute() ➔ loop (List<Task>) ⟳ Driver.launchTask() ➔ TaskRunner.runSequential() ➔ Task.executeTask() ➔ Task.execute() ➔ analyzeInternal() • processPositionAlias() • doPhase1() • getMetaData() • genPlan() • Optimizer.optimize() • MapReduceCompiler.compile() {
  • 21. ➠ Driver.run() ➔ Driver.runInternal() ↳ Driver.compile() ↳ ParseDriver.parse() ↳ SemanticAnalyzer.analyze() ↳ Driver.execute() ➔ loop (List<Task>) ⟳ Driver.launchTask() ➔ TaskRunner.runSequential() ➔ Task.executeTask() ➔ Task.execute() ➔ analyzeInternal() • processPositionAlias() • doPhase1() • getMetaData() • genPlan() • Optimizer.optimize() • MapReduceCompiler.compile() { • ex) MapRedTask.execute() ⤇ ExecDriver.execute() ➔ JobClient.submitJob() ExecMapper, ExecReducer
  • 22. ➠ Driver.run() ➔ Driver.runInternal() ↳ Driver.compile() ↳ ParseDriver.parse() ↳ SemanticAnalyzer.analyze() ↳ Driver.execute() ➔ loop (List<Task>) ⟳ Driver.launchTask() ➔ TaskRunner.runSequential() ➔ Task.executeTask() ➔ Task.execute() PRE- and POST-DRIVER-RUN PRE- and POST-SEMANTIC-ANALYZE PRE-, POST-EXEC and ON-FAILURE
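The firing order marked on slide 22 can be sketched as a small simulation. This is not Hive source code: `HookOrderSketch` and its methods are hypothetical names, and each phase only records a label in the order the call graph above suggests. What happens to the post-driver-run hooks after a failed task is not stated on the slide, so the sketch simply stops there.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the hook points around Driver.run(); not Hive source code.
final class HookOrderSketch {
    // Labels of the hook points, in the order they fired.
    final List<String> fired = new ArrayList<>();

    // Stands in for Driver.run() -> Driver.runInternal().
    void run(boolean taskFails) {
        fired.add("pre-driver-run");
        compile();
        if (!execute(taskFails)) {
            // Assumption: the sketch stops here; the slide does not cover this path.
            return;
        }
        fired.add("post-driver-run");
    }

    // Stands in for Driver.compile(): semantic analysis wrapped by its two hooks.
    private void compile() {
        fired.add("pre-semantic-analyze");
        fired.add("post-semantic-analyze");
    }

    // Stands in for Driver.execute(): tasks run between the pre- and post-execution hooks.
    private boolean execute(boolean taskFails) {
        fired.add("pre-exec");
        if (taskFails) {
            fired.add("on-failure");
            return false;
        }
        fired.add("post-exec");
        return true;
    }

    public static void main(String[] args) {
        HookOrderSketch driver = new HookOrderSketch();
        driver.run(false);
        System.out.println(driver.fired);
    }
}
```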
  • 23. HiveServer2.main() ➔ HiveServer2.start() ➔ CLIService.start() ➔ new HiveMetaStoreClient() ➠
  • 24. HiveServer2.main() ➔ HiveServer2.start() ➔ CLIService.start() ➔ new HiveMetaStoreClient() ➠ ➔ HiveSession.getMetaStoreClient() ➔ new HiveMetaStoreClient() ➠ CLIService.executeStatement() ⇒ GetColumnsOperation.run() GetSchemasOperation.run() GetTablesOperation.run()
  • 25. HiveServer2.main() ➔ HiveServer2.start() ➔ CLIService.start() ➔ new HiveMetaStoreClient() ➠ ➔ HiveSession.getMetaStoreClient() ➔ new HiveMetaStoreClient() ➠ CLIService.executeStatement() ⇒ SemanticAnalyzer ↝ Hive ↝ getMSC() is invoked by many other methods in Hive object Hive.getMSC() ➔ Hive.createMetaStoreClient() ➔ RetryingHMSHandler.getProxy() ➠ GetColumnsOperation.run() GetSchemasOperation.run() GetTablesOperation.run()
  • 26. HiveServer2.main() ➔ HiveServer2.start() ➔ CLIService.start() ➔ new HiveMetaStoreClient() ➠ ➔ HiveSession.getMetaStoreClient() ➔ new HiveMetaStoreClient() ➠ ➠ new HiveMetaStoreClient() ➔ HiveMetaStore.newHMSHandler() ➔ RetryingHMSHandler.getProxy() ➔ new RetryingHMSHandler() ➔ new HMSHandler() ➔ HMSHandler.init() ➔ HiveMetaStore.init() CLIService.executeStatement() ⇒ METASTORE-INIT SemanticAnalyzer ↝ Hive ↝ getMSC() is invoked by many other methods in Hive object Hive.getMSC() ➔ Hive.createMetaStoreClient() ➔ RetryingHMSHandler.getProxy() ➠ GetColumnsOperation.run() GetSchemasOperation.run() GetTablesOperation.run()
  • 27. How Hive executes hooks • Hive can run multiple hooks at each hook point, for example in Driver.runInternal():
    List<HiveDriverRunHook> driverRunHooks;
    try {
      driverRunHooks = getHooks(HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS, HiveDriverRunHook.class);
      for (HiveDriverRunHook driverRunHook : driverRunHooks) {
        driverRunHook.preDriverRun(hookContext);
      }
    } catch (Exception e) {
      // ~ ellipsis ~
    }
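A minimal sketch of what a `getHooks`-style loader might do with a comma-separated property value: split it, instantiate each class by reflection, and return the instances in order. This is an illustration under assumptions, not the Hive implementation; `HookLoaderSketch`, `DriverRunHook`, `FirstHook`, and `SecondHook` are hypothetical names, and the stand-in interface takes a plain string instead of a hook context.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of loading hooks from a comma-separated property value; not Hive source code.
final class HookLoaderSketch {
    // Stand-in for a hook interface such as HiveDriverRunHook.
    interface DriverRunHook {
        void preDriverRun(String command) throws Exception;
    }

    // Records which hooks ran, so the order is observable.
    static final List<String> LOG = new ArrayList<>();

    public static final class FirstHook implements DriverRunHook {
        @Override
        public void preDriverRun(String command) {
            LOG.add("first:" + command);
        }
    }

    public static final class SecondHook implements DriverRunHook {
        @Override
        public void preDriverRun(String command) {
            LOG.add("second:" + command);
        }
    }

    // Splits the property value on commas and instantiates each named class by reflection.
    static <T> List<T> getHooks(String propertyValue, Class<T> type) throws Exception {
        List<T> hooks = new ArrayList<>();
        if (propertyValue == null || propertyValue.trim().isEmpty()) {
            return hooks; // no hooks configured
        }
        for (String className : propertyValue.split(",")) {
            Class<?> cls = Class.forName(className.trim());
            hooks.add(type.cast(cls.getDeclaredConstructor().newInstance()));
        }
        return hooks;
    }

    public static void main(String[] args) throws Exception {
        String value = FirstHook.class.getName() + ", " + SecondHook.class.getName();
        for (DriverRunHook hook : getHooks(value, DriverRunHook.class)) {
            hook.preDriverRun("SELECT 1");
        }
        System.out.println(LOG);
    }
}
```

Because the hooks run in list order, the order of class names in the property value decides which hook sees the command first.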
  • 28. 1. MetaStoreInitListener
    public abstract class MetaStoreInitListener implements Configurable {
      private Configuration conf;
      public MetaStoreInitListener(Configuration config) {
        this.conf = config;
      }
      public abstract void onInit(MetaStoreInitContext context) throws MetaException;
      @Override
      public Configuration getConf() {
        return this.conf;
      }
      @Override
      public void setConf(Configuration config) {
        this.conf = config;
      }
    }
  • 30. What MetaStoreInitContext has • Nothing! - This hook just notifies you when the metastore initializes (but you can, of course, get the HiveConf by calling getConf()). public class MetaStoreInitContext { }
  • 31. 2. HiveDriverRunHook • preDriverRun - Invoked before Hive begins any processing of a command in the Driver, before compilation • postDriverRun - Invoked after Hive performs any processing of a command, just before a response is returned to the entity calling Driver.run()
    public interface HiveDriverRunHook extends Hook {
      public void preDriverRun(HiveDriverRunHookContext hookContext) throws Exception;
      public void postDriverRun(HiveDriverRunHookContext hookContext) throws Exception;
    }
  • 32. What HiveDriverRunHookContext has • You can get the command string from this hook context. - This is the only thing that HiveDriverRunHookContext has.
    public interface HiveDriverRunHookContext extends Configurable {
      public String getCommand();
      public void setCommand(String command);
    }
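As an illustration of how little a driver-run hook needs, here is a sketch of an audit hook that records the command before and after the driver handles it. `RunContext` and `RunHook` are simplified stand-ins (assumptions) for HiveDriverRunHookContext and HiveDriverRunHook so that the example compiles without Hive on the classpath; `AuditHook` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a driver-run hook; the two interfaces are simplified stand-ins
// for HiveDriverRunHookContext and HiveDriverRunHook, not the real Hive types.
final class DriverRunHookSketch {
    interface RunContext {
        String getCommand();
    }

    interface RunHook {
        void preDriverRun(RunContext context) throws Exception;
        void postDriverRun(RunContext context) throws Exception;
    }

    // Keeps an audit trail of every command seen before and after the driver runs it.
    static final class AuditHook implements RunHook {
        final List<String> trail = new ArrayList<>();

        @Override
        public void preDriverRun(RunContext context) {
            trail.add("start: " + context.getCommand());
        }

        @Override
        public void postDriverRun(RunContext context) {
            trail.add("done: " + context.getCommand());
        }
    }

    public static void main(String[] args) {
        AuditHook hook = new AuditHook();
        RunContext context = () -> "SELECT count(*) FROM t";
        hook.preDriverRun(context);
        hook.postDriverRun(context);
        hook.trail.forEach(System.out::println);
    }
}
```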
  • 33. 3. AbstractSemanticAnalyzerHook • You can get - HiveSemanticAnalyzerHookContext and ASTNode (root node of the abstract syntax tree) before analysis. - HiveSemanticAnalyzerHookContext and List<Task> after analysis.
    public abstract class AbstractSemanticAnalyzerHook implements HiveSemanticAnalyzerHook {
      public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast) throws SemanticException {
        return ast;
      }
      public void postAnalyze(HiveSemanticAnalyzerHookContext context, List<Task<? extends Serializable>> rootTasks) throws SemanticException {
      }
    }
  • 34. What HiveSemanticAnalyzerHookContext has • Hive object - contains information about a set of data in HDFS organized for query processing (from the source comment). • ReadEntity, WriteEntity • The update method is invoked after the semantic analyzer completes.
    public interface HiveSemanticAnalyzerHookContext extends Configurable {
      public Hive getHive() throws HiveException;
      public void update(BaseSemanticAnalyzer sem);
      public Set<ReadEntity> getInputs();
      public Set<WriteEntity> getOutputs();
    }
  • 35. How Hive executes analyzer hooks
    List<AbstractSemanticAnalyzerHook> saHooks = getHooks(HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK, AbstractSemanticAnalyzerHook.class);
    // ~ ellipsis ~
    HiveSemanticAnalyzerHookContext hookCtx = new HiveSemanticAnalyzerHookContextImpl();
    hookCtx.setConf(conf);
    for (AbstractSemanticAnalyzerHook hook : saHooks) {
      tree = hook.preAnalyze(hookCtx, tree);
    }
    sem.analyze(tree, ctx);
    hookCtx.update(sem);
    for (AbstractSemanticAnalyzerHook hook : saHooks) {
      hook.postAnalyze(hookCtx, sem.getRootTasks());
    }
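The pre/post loop on this slide means a pre-analyze hook can replace the tree or abort the statement, while a post-analyze hook only observes the resulting tasks. The sketch below imitates that shape with stand-ins: plain strings replace ASTNode and Task, and `AnalyzerHook`, `SemanticError`, `NoDropHook`, and `TaskLogHook` are hypothetical names rather than Hive classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the pre/post analyze flow; strings stand in for ASTNode and Task.
final class AnalyzerHookSketch {
    static class SemanticError extends Exception {
        SemanticError(String message) {
            super(message);
        }
    }

    // Stand-in for AbstractSemanticAnalyzerHook: both methods default to doing nothing.
    static abstract class AnalyzerHook {
        String preAnalyze(String tree) throws SemanticError {
            return tree;
        }

        void postAnalyze(List<String> rootTasks) throws SemanticError {
        }
    }

    // Rejects DROP statements before analysis starts.
    static final class NoDropHook extends AnalyzerHook {
        @Override
        String preAnalyze(String tree) throws SemanticError {
            if (tree.trim().toUpperCase(Locale.ROOT).startsWith("DROP")) {
                throw new SemanticError("DROP is not allowed: " + tree);
            }
            return tree;
        }
    }

    // Collects the root tasks it is shown after analysis.
    static final class TaskLogHook extends AnalyzerHook {
        final List<String> seen = new ArrayList<>();

        @Override
        void postAnalyze(List<String> rootTasks) {
            seen.addAll(rootTasks);
        }
    }

    // Same shape as the slide's loop: pre hooks may replace the tree, post hooks see the tasks.
    static List<String> compile(String tree, List<AnalyzerHook> hooks) throws SemanticError {
        for (AnalyzerHook hook : hooks) {
            tree = hook.preAnalyze(tree);
        }
        List<String> rootTasks = Arrays.asList("task for: " + tree); // pretend analysis
        for (AnalyzerHook hook : hooks) {
            hook.postAnalyze(rootTasks);
        }
        return rootTasks;
    }
}
```

Throwing from the pre-analyze step is what stops the statement in this sketch; a hook that only needs to observe would override the post-analyze method alone.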
  • 40. 4. ExecuteWithHookContext • Can be used with the following properties - hive.exec.pre.hooks - hive.exec.post.hooks - hive.exec.failure.hooks
    public interface ExecuteWithHookContext extends Hook {
      /**
       * @param hookContext
       *          The hook context passed to each hook.
       * @throws Exception
       */
      void run(HookContext hookContext) throws Exception;
    }
  • 41. What HookContext has • HookType - PRE_EXEC_HOOK, POST_EXEC_HOOK, ON_FAILURE_HOOK • QueryPlan • HiveConf • LineageInfo • UserGroupInformation • OperationName • List<TaskRunner> completeTaskList • Set<ReadEntity> inputs • Set<WriteEntity> outputs • Map<String, ContentSummary> inputPathToContentSummary
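Since the same interface serves the pre-, post-, and failure properties, one class can be registered under all three and branch on the hook type. The following is a sketch under assumptions: `HookContextLite` is a cut-down, hypothetical stand-in for HookContext that keeps only the hook type, the operation name, and the input and output sets (as strings), and `SummaryHook` is an illustrative name.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: HookContextLite is a cut-down stand-in for HookContext, not the Hive class.
final class ExecHookSketch {
    enum HookType { PRE_EXEC_HOOK, POST_EXEC_HOOK, ON_FAILURE_HOOK }

    static final class HookContextLite {
        final HookType hookType;
        final String operationName;
        final Set<String> inputs = new LinkedHashSet<>();  // stands in for Set<ReadEntity>
        final Set<String> outputs = new LinkedHashSet<>(); // stands in for Set<WriteEntity>

        HookContextLite(HookType hookType, String operationName) {
            this.hookType = hookType;
            this.operationName = operationName;
        }
    }

    // Stand-in for ExecuteWithHookContext.
    interface ExecuteHook {
        void run(HookContextLite context) throws Exception;
    }

    // One class usable for all three properties; the hook type tells it which point fired.
    static final class SummaryHook implements ExecuteHook {
        String lastLine;

        @Override
        public void run(HookContextLite context) {
            String phase;
            switch (context.hookType) {
                case PRE_EXEC_HOOK:
                    phase = "starting";
                    break;
                case POST_EXEC_HOOK:
                    phase = "finished";
                    break;
                default:
                    phase = "failed";
                    break;
            }
            lastLine = phase + " " + context.operationName
                    + " reads=" + context.inputs + " writes=" + context.outputs;
        }
    }

    public static void main(String[] args) {
        HookContextLite context = new HookContextLite(HookType.POST_EXEC_HOOK, "QUERY");
        context.inputs.add("default@src");
        context.outputs.add("default@dst");
        SummaryHook hook = new SummaryHook();
        hook.run(context);
        System.out.println(hook.lastLine);
    }
}
```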
  • 42. How Hive fires hooks without executing query physically • This has the effect of causing the pre/post execute hooks to fire. ALTER TABLE table_name TOUCH [PARTITION partitionSpec];
  • 43. MetaStore Event Listeners
    Property | Abstract class (package org.apache.hadoop.hive.metastore)
    hive.metastore.pre.event.listeners | MetaStorePreEventListener
    hive.metastore.end.function.listeners | MetaStoreEndFunctionListener
    hive.metastore.event.listeners | MetaStoreEventListener
    • I think these listeners look like hooks. • From a quick look, I couldn’t find any particular difference between listeners and hooks. The only thing I found is that listeners can’t affect query processing; they can only read. • Anyway, they look useful for being notified when a metastore does something.
  • 44. MetaStoreEventListener • The following methods are invoked when the corresponding event occurs on a metastore. - onCreateTable - onDropTable - onAlterTable - onDropPartition - onAlterPartition - onCreateDatabase - onDropDatabase - onLoadPartitionDone • If you need more details, see org.apache.hadoop.hive.metastore.MetaStoreEventListener
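The listener pattern described here can be sketched without Hive as follows. Everything in the block is a stand-in: `EventListener` is a hypothetical simplification of MetaStoreEventListener with only three of the callbacks, plain strings replace the event objects, the no-op defaults are an assumption made for brevity, and `FakeMetastore` only exists to show that listeners are told about an operation after it happens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: EventListener is a simplified stand-in for MetaStoreEventListener,
// and plain strings stand in for the event objects passed to each callback.
final class MetastoreListenerSketch {
    static abstract class EventListener {
        // No-op defaults are an assumption made for brevity.
        void onCreateTable(String table) {
        }

        void onDropTable(String table) {
        }

        void onDropPartition(String table, String partition) {
        }
    }

    // Records the events it is told about; it has no way to change or veto them.
    static final class RecordingListener extends EventListener {
        final List<String> events = new ArrayList<>();

        @Override
        void onCreateTable(String table) {
            events.add("create " + table);
        }

        @Override
        void onDropPartition(String table, String partition) {
            events.add("drop partition " + table + "/" + partition);
        }
    }

    // Pretend metastore: performs the operation first, then notifies every listener.
    static final class FakeMetastore {
        final List<String> tables = new ArrayList<>();
        final List<EventListener> listeners = new ArrayList<>();

        void createTable(String table) {
            tables.add(table);
            for (EventListener listener : listeners) {
                listener.onCreateTable(table);
            }
        }
    }

    public static void main(String[] args) {
        FakeMetastore metastore = new FakeMetastore();
        RecordingListener listener = new RecordingListener();
        metastore.listeners.add(listener);
        metastore.createTable("sales");
        System.out.println(listener.events);
    }
}
```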
  • 45. Be careful! • Hooks - can be a critical failure point (you had better catch runtime exceptions). - are performed synchronously. - can affect query processing time.
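One way to act on these warnings is to wrap the real hook in a defensive shell that catches exceptions and measures how long the hook took. The sketch below assumes a simplified `Hook` interface in place of ExecuteWithHookContext; `SafeHook` and its fields are hypothetical names. Whether a failed hook should be swallowed or should fail the query depends on what the hook is for, so treat this as one option rather than a rule.

```java
// Hypothetical sketch of a defensive wrapper; Hook is a stand-in for ExecuteWithHookContext.
final class SafeHookSketch {
    interface Hook {
        void run(String context) throws Exception;
    }

    // Runs the delegate, swallows its exceptions, and records how long it took.
    static final class SafeHook implements Hook {
        private final Hook delegate;
        int failures;
        long lastElapsedNanos;

        SafeHook(Hook delegate) {
            this.delegate = delegate;
        }

        @Override
        public void run(String context) {
            long start = System.nanoTime();
            try {
                delegate.run(context);
            } catch (Exception e) {
                // The hook failed, but the query is allowed to continue.
                failures++;
                System.err.println("hook failed, query continues: " + e);
            } finally {
                // Hooks run synchronously, so this time is added to the query's latency.
                lastElapsedNanos = System.nanoTime() - start;
            }
        }
    }

    public static void main(String[] args) {
        SafeHook hook = new SafeHook(context -> {
            throw new IllegalStateException("boom");
        });
        hook.run("SELECT 1");
        System.out.println("failures=" + hook.failures);
    }
}
```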
  • 46. Let's try it out • Demo - Don’t be surprised if it doesn’t work. - That’s the way the demo is...
  • 47. Thanks! • Questions? • Resources - https://cwiki.apache.org/confluence/display/Hive/ - https://github.com/apache/hive