Apache Hive Hook
 


I couldn't find enough information about Hive hooks, so I made this.
I hope this presentation will be useful when you want to use hooks.
It also includes some information about metastore event listeners.
It was written based on the release-0.11 tag.

Usage Rights

© All Rights Reserved

    Apache Hive Hook Presentation Transcript

    • Apache Hive Hook
        2013. 8
        Minwoo Kim
        michael.kim@nexr.com
    • Apache Hive Hook
        • I made this because Ryan asked me about Hive hooks, and I couldn't find any information about hooks in the Hive wiki.
        • I hope it will help you develop applications on Hive when you want to get extra information while a query is executing.
        • This document is based on the release-0.11 tag.
        • Source: https://github.com/apache/hive (mirror of Apache Hive)
    • What is a hook?
        • As you know, this is a general computer programming technique.
        • Hooking: techniques for intercepting function calls, messages, or events in an operating system, application, or other software component.
        • Hook: code that handles the intercepted function calls, events, or messages.
    • Hive provides some hooking points
        • pre-execution
        • post-execution
        • execution-failure
        • pre- and post-driver-run
        • pre- and post-semantic-analyze
        • metastore-initialize
    • How to set up hooks in Hive
        • Set the hook property in hive-site.xml:

            <property>
              <name>hive.exec.pre.hooks</name>
              <value></value>
              <description>
                Comma-separated list of pre-execution hooks to be invoked for each statement.
                A pre-execution hook is specified as the name of a Java class which implements
                the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface.
              </description>
            </property>

        • Set the path of the jars that contain your implementations of the hook interfaces or abstract classes:

            <property>
              <name>hive.aux.jars.path</name>
              <value></value>
            </property>

        • You can use hive.added.jars.path instead of hive.aux.jars.path.
    • Hive hook properties and interfaces

        Property                        Interface or abstract class
        hive.exec.pre.hooks             org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext (PreExecute is deprecated)
        hive.exec.post.hooks            org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext (PostExecute is deprecated)
        hive.exec.failure.hooks         org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext
        hive.metastore.init.hooks       org.apache.hadoop.hive.metastore.MetaStoreInitListener
        hive.exec.driver.run.hooks      org.apache.hadoop.hive.ql.HiveDriverRunHook
        hive.semantic.analyzer.hook     org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook
    • When do those hooks fire?
        • You can submit a query to Hive through the following entry points:
            - CliDriver main method (called by the shell script)
            - HCatCli main method (called by the shell script)
            - HiveServer (called by a Thrift client)
            - HiveServer2 (called by a Thrift client or Beeline)
    • CliDriver, HCatCli, and HiveServer
        • CliDriver
            CliDriver.main() ➔ run() ➔ executeDriver() ➔ processLine() ➔ processCmd() ➔ is remote?
                yes: ↳ CliSessionState.getClient() ↳ HiveClient.execute() ➠
                no:  ➔ processLocalCmd() ➔ Driver.run() ➠
        • HCatCli
            HCatCli.main() ➔ processLine() ➔ processCmd() ➔ HCatDriver.run() ⤇ Driver.run() ➠
        • HiveServer
            HiveServer.execute() ➔ Driver.run() ➠
    • HiveServer2
        ThriftCLIService.ExecuteStatement() ➔ CLIService.executeStatement()
        CLIService.executeStatement()
            ↳ SessionManager.getSession()
            ↳ HiveSession.executeStatement()
            ↳ OperationManager.newExecuteStatementOperation()
            ↳ SQLOperation.run() ➔ Driver.run() ➠
        • OperationManager.newExecuteStatementOperation() acts as a kind of factory. It creates one of:
            - AddResourceOperation, DeleteResourceOperation, DfsOperation, GetCatalogsOperation, GetColumnsOperation, GetFunctionsOperation, GetSchemasOperation, GetTablesOperation, GetTableTypesOperation, GetTypeInfoOperation, SetOperation, SQLOperation
    • ➠ Driver.run() ➔ Driver.runInternal()
        ↳ Driver.compile()
            ↳ ParseDriver.parse()
            ↳ SemanticAnalyzer.analyze()
        ↳ Driver.execute()
    • ParseDriver.parse() ↝ HiveParser
        • Grammar: HiveParser.g, with SelectClauseParser.g, FromClauseParser.g, and IdentifiersParser.g
        • ParseDriver.parse() turns the command string into the root of the AST.
    • SemanticAnalyzer.analyze()
        • SemanticAnalyzerFactory.get(conf, ast) returns one of:
            - SemanticAnalyzer, ColumnStatsSemanticAnalyzer, ExplainSemanticAnalyzer, ExportSemanticAnalyzer, FunctionSemanticAnalyzer, ImportSemanticAnalyzer, LoadSemanticAnalyzer, MacroSemanticAnalyzer
        • analyze() ➔ analyzeInternal() runs these steps:
            - processPositionAlias()
            - doPhase1()
            - getMetaData()
            - genPlan()
            - Optimizer.optimize()
            - MapReduceCompiler.compile()
        • Operators in the plan built by genPlan():
            - FilterOperator, SelectOperator, ForwardOperator, FileSinkOperator, ScriptOperator, PTFOperator, ReduceSinkOperator, ExtractOperator, GroupByOperator, JoinOperator, MapJoinOperator, SMBMapJoinOperator, LimitOperator, TableScanOperator, UnionOperator, UDTFOperator, LateralViewJoinOperator, LateralViewForwardOperator, HashTableDummyOperator, HashTableSinkOperator, DummyStoreOperator, DemuxOperator, MuxOperator
        • Transformations applied by Optimizer.optimize():
            - PredicateTransitivePropagate, PredicatePushDown, PartitionPruner, PartitionConditionRemover, ListBucketingPruner, ColumnPruner, SkewJoinOptimizer, RewriteGBUsingIndex, GroupByOptimizer, SamplePruner, MapJoinProcessor, BucketMapJoinOptimizer, SortedMergeBucketMapJoinOptimizer, BucketingSortingReduceSinkOptimizer, UnionProcessor, JoinReorder, ReduceSinkDeDuplication, NonBlockingOpDeDupProc, GlobalLimitOptimizer, CorrelationOptimizer, SimpleFetchOptimizer
        • Tasks generated by MapReduceCompiler.compile():
            - MapRedTask, FetchTask, ConditionalTask, ExplainTask, CopyTask, DDLTask, MoveTask, FunctionTask, StatsTask, ColumnStatsTask, DependencyCollectionTask
    • Driver.execute()
        • loop (List<Task>) ⟳ Driver.launchTask() ➔ TaskRunner.runSequential() ➔ Task.executeTask() ➔ Task.execute()
        • ex) MapRedTask.execute() ⤇ ExecDriver.execute() ➔ JobClient.submitJob() (ExecMapper, ExecReducer)
    • Where the hooks fire along this path
        • PRE- and POST-DRIVER-RUN: Driver.runInternal()
        • PRE- and POST-SEMANTIC-ANALYZE: around SemanticAnalyzer.analyze() in Driver.compile()
        • PRE-EXEC, POST-EXEC, and ON-FAILURE: Driver.execute()
    • HiveServer2 and metastore initialization (METASTORE-INIT)
        • At startup:
            HiveServer2.main() ➔ HiveServer2.start() ➔ CLIService.start() ➔ new HiveMetaStoreClient() ➠
        • CLIService.executeStatement() ⇒ GetColumnsOperation.run(), GetSchemasOperation.run(), GetTablesOperation.run()
            ➔ HiveSession.getMetaStoreClient() ➔ new HiveMetaStoreClient() ➠
        • CLIService.executeStatement() ⇒ SemanticAnalyzer ↝ Hive ↝ getMSC()
            Hive.getMSC() ➔ Hive.createMetaStoreClient() ➔ RetryingHMSHandler.getProxy() ➠
            (getMSC() is invoked by many other methods in the Hive object)
        • ➠ new HiveMetaStoreClient() ➔ HiveMetaStore.newHMSHandler() ➔ RetryingHMSHandler.getProxy() ➔ new RetryingHMSHandler() ➔ new HMSHandler() ➔ HMSHandler.init() ➔ HiveMetaStore.init() [METASTORE-INIT]
    • How Hive executes hooks
        • Hive can run multiple hooks at each hook point. Example from Driver.runInternal():

            List<HiveDriverRunHook> driverRunHooks;
            try {
              driverRunHooks = getHooks(HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS, HiveDriverRunHook.class);
              for (HiveDriverRunHook driverRunHook : driverRunHooks) {
                driverRunHook.preDriverRun(hookContext);
              }
            } catch (Exception e) {
              // ...
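To make the "comma-separated list of class names" idea concrete, here is a minimal, self-contained sketch of how such a property value might be turned into hook instances and run in order. It is a simplified model, not Hive's code: `DemoHook`, `AuditHook`, `TimingHook`, and `HookLoaderSketch` are hypothetical stand-ins, and this `getHooks` only imitates the shape of the call shown on the slide.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for a Hive hook interface; NOT a real Hive type. */
interface DemoHook {
    void run(String command);
}

/** Hypothetical hook that records an audit entry. */
class AuditHook implements DemoHook {
    @Override
    public void run(String command) {
        HookLoaderSketch.LOG.add("audit:" + command);
    }
}

/** Hypothetical hook that records a timing entry. */
class TimingHook implements DemoHook {
    @Override
    public void run(String command) {
        HookLoaderSketch.LOG.add("timing:" + command);
    }
}

class HookLoaderSketch {
    /** Shared log so the demo can show the order in which hooks ran. */
    static final List<String> LOG = new ArrayList<>();

    /** Instantiates every class named in a comma-separated list, in list order. */
    static <T> List<T> getHooks(String csvClassNames, Class<T> type) {
        List<T> hooks = new ArrayList<>();
        if (csvClassNames == null || csvClassNames.trim().isEmpty()) {
            return hooks; // an empty property value means "no hooks"
        }
        for (String name : csvClassNames.split(",")) {
            try {
                Class<?> cls = Class.forName(name.trim());
                hooks.add(type.cast(cls.getDeclaredConstructor().newInstance()));
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("Cannot load hook " + name.trim(), e);
            }
        }
        return hooks;
    }

    public static void main(String[] args) {
        // Build the list from real class names so the demo works in any package.
        String csv = AuditHook.class.getName() + ", " + TimingHook.class.getName();
        for (DemoHook hook : getHooks(csv, DemoHook.class)) {
            hook.run("SELECT 1");
        }
        System.out.println(LOG);
    }
}
```

Because the hooks are created from the list in order, the order of the class names in the property value is the order in which the hooks run in this sketch.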
    • 1. MetaStoreInitListener

        public abstract class MetaStoreInitListener implements Configurable {

          private Configuration conf;

          public MetaStoreInitListener(Configuration config) {
            this.conf = config;
          }

          public abstract void onInit(MetaStoreInitContext context) throws MetaException;

          @Override
          public Configuration getConf() {
            return this.conf;
          }

          @Override
          public void setConf(Configuration config) {
            this.conf = config;
          }
        }
    • What MetaStoreInitContext has
        • Nothing!
            - This hook just notifies you when the metastore initializes. (You can, of course, still get the configuration by calling getConf().)

        public class MetaStoreInitContext {
        }
    • 2. HiveDriverRunHook
        • preDriverRun
            - Invoked before Hive begins any processing of a command in the Driver, before compilation.
        • postDriverRun
            - Invoked after Hive performs any processing of a command, just before a response is returned to the entity calling Driver.run().

        public interface HiveDriverRunHook extends Hook {
          public void preDriverRun(HiveDriverRunHookContext hookContext) throws Exception;
          public void postDriverRun(HiveDriverRunHookContext hookContext) throws Exception;
        }
    • What HiveDriverRunHookContext has
        • You can get the command string from this hook context.
            - This is the only thing that HiveDriverRunHookContext has.

        public interface HiveDriverRunHookContext extends Configurable {
          public String getCommand();
          public void setCommand(String command);
        }
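Since the command string is all the context offers, a typical driver-run hook might simply report how long each command took. The sketch below illustrates that idea with stand-in types declared inline: `DemoRunContext`, `DemoDriverRunHook`, and `CommandTimerHook` are hypothetical and only mirror the method names on the slides (the real Hive methods also declare `throws Exception`). It assumes the pre and post calls reach the same hook instance, and it takes an injected clock so the behaviour is deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Stand-in for HiveDriverRunHookContext; only the command getter is modeled. */
interface DemoRunContext {
    String getCommand();
}

/** Stand-in for HiveDriverRunHook; NOT the real Hive interface. */
interface DemoDriverRunHook {
    void preDriverRun(DemoRunContext ctx);
    void postDriverRun(DemoRunContext ctx);
}

/** Hypothetical hook that reports how long each command spent in the driver. */
class CommandTimerHook implements DemoDriverRunHook {
    final List<String> report = new ArrayList<>();
    private final LongSupplier clockMillis; // injected so the demo is deterministic
    private long startedAt;

    CommandTimerHook(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
    }

    @Override
    public void preDriverRun(DemoRunContext ctx) {
        startedAt = clockMillis.getAsLong();
    }

    @Override
    public void postDriverRun(DemoRunContext ctx) {
        long elapsed = clockMillis.getAsLong() - startedAt;
        report.add(ctx.getCommand() + " took " + elapsed + " ms");
    }
}

class DriverRunHookDemo {
    public static void main(String[] args) {
        CommandTimerHook hook = new CommandTimerHook(System::currentTimeMillis);
        DemoRunContext ctx = () -> "SELECT 1";
        hook.preDriverRun(ctx);
        // ... the driver would compile and execute the command here ...
        hook.postDriverRun(ctx);
        System.out.println(hook.report);
    }
}
```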
    • 3. AbstractSemanticAnalyzerHook
        • You can get:
            - HiveSemanticAnalyzerHookContext and ASTNode (the root node of the abstract syntax tree) before analysis.
            - HiveSemanticAnalyzerHookContext and List<Task> after analysis.

        public abstract class AbstractSemanticAnalyzerHook implements HiveSemanticAnalyzerHook {

          public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
              throws SemanticException {
            return ast;
          }

          public void postAnalyze(HiveSemanticAnalyzerHookContext context,
              List<Task<? extends Serializable>> rootTasks) throws SemanticException {
          }
        }
    • What HiveSemanticAnalyzerHookContext has
        • Hive object
            - Contains information about a set of data in HDFS organized for query processing (from the source comment).
        • ReadEntity, WriteEntity
        • The update method is invoked after the semantic analyzer completes.

        public interface HiveSemanticAnalyzerHookContext extends Configurable {
          public Hive getHive() throws HiveException;
          public void update(BaseSemanticAnalyzer sem);
          public Set<ReadEntity> getInputs();
          public Set<WriteEntity> getOutputs();
        }
    • How Hive executes analyzer hooks

        List<AbstractSemanticAnalyzerHook> saHooks =
            getHooks(HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK, AbstractSemanticAnalyzerHook.class);

        // ~ ellipsis ~

        HiveSemanticAnalyzerHookContext hookCtx = new HiveSemanticAnalyzerHookContextImpl();
        hookCtx.setConf(conf);
        for (AbstractSemanticAnalyzerHook hook : saHooks) {
          tree = hook.preAnalyze(hookCtx, tree);
        }
        sem.analyze(tree, ctx);
        hookCtx.update(sem);
        for (AbstractSemanticAnalyzerHook hook : saHooks) {
          hook.postAnalyze(hookCtx, sem.getRootTasks());
        }
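The loop above has two consequences: each preAnalyze call receives the tree returned by the previous hook, and an exception thrown from preAnalyze stops the statement before analysis. The following self-contained sketch models that chaining with stand-in types. The "tree" is a plain String, and `DemoAnalyzerHook`, `RejectDropHook`, `TaskCountingHook`, and `AnalyzerChainSketch` are all hypothetical names, not Hive classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for AbstractSemanticAnalyzerHook: the "AST" is a plain String here. */
abstract class DemoAnalyzerHook {
    String preAnalyze(String tree) { return tree; }   // default: pass the tree through
    void postAnalyze(List<String> rootTasks) { }      // default: do nothing
}

/** Hypothetical hook that vetoes DROP statements before analysis. */
class RejectDropHook extends DemoAnalyzerHook {
    @Override
    String preAnalyze(String tree) {
        if (tree.trim().toUpperCase().startsWith("DROP")) {
            throw new IllegalStateException("DROP is not allowed: " + tree);
        }
        return tree;
    }
}

/** Hypothetical hook that remembers how many root tasks the analyzer produced. */
class TaskCountingHook extends DemoAnalyzerHook {
    int lastTaskCount = -1;

    @Override
    void postAnalyze(List<String> rootTasks) {
        lastTaskCount = rootTasks.size();
    }
}

class AnalyzerChainSketch {
    /** Same shape as the loop on the slide: pre hooks, analyze, post hooks. */
    static List<String> compile(String tree, List<DemoAnalyzerHook> hooks) {
        for (DemoAnalyzerHook hook : hooks) {
            tree = hook.preAnalyze(tree); // each hook may replace the tree
        }
        List<String> rootTasks = new ArrayList<>();
        rootTasks.add("task for: " + tree); // placeholder for real semantic analysis
        for (DemoAnalyzerHook hook : hooks) {
            hook.postAnalyze(rootTasks);
        }
        return rootTasks;
    }

    public static void main(String[] args) {
        TaskCountingHook counter = new TaskCountingHook();
        List<DemoAnalyzerHook> hooks = new ArrayList<>();
        hooks.add(new RejectDropHook());
        hooks.add(counter);
        System.out.println(compile("SELECT * FROM t", hooks));
        System.out.println("root tasks seen by hook: " + counter.lastTaskCount);
    }
}
```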
    • 4. ExecuteWithHookContext
        • Can be used with the following properties:
            - hive.exec.pre.hooks
            - hive.exec.post.hooks
            - hive.exec.failure.hooks

        public interface ExecuteWithHookContext extends Hook {
          /**
           * @param hookContext
           *          The hook context passed to each hook.
           * @throws Exception
           */
          void run(HookContext hookContext) throws Exception;
        }
    • What HookContext has
        • HookType
            - PRE_EXEC_HOOK, POST_EXEC_HOOK, ON_FAILURE_HOOK
        • QueryPlan
        • HiveConf
        • LineageInfo
        • UserGroupInformation
        • OperationName
        • List<TaskRunner> completeTaskList
        • Set<ReadEntity> inputs
        • Set<WriteEntity> outputs
        • Map<String, ContentSummary> inputPathToContentSummary
    • How to fire hooks without physically executing a query
        • The following statement causes the pre/post execute hooks to fire:

            ALTER TABLE table_name TOUCH [PARTITION partitionSpec];
    • MetaStore Event Listeners

        Property                                 Abstract class
        hive.metastore.pre.event.listeners       MetaStorePreEventListener
        hive.metastore.end.function.listeners    MetaStoreEndFunctionListener
        hive.metastore.event.listeners           MetaStoreEventListener

        package: org.apache.hadoop.hive.metastore

        • I think these listeners look like hooks.
        • From a quick look, I couldn't find any particular difference between listeners and hooks. The only thing I found is that listeners can't affect query processing; they can only read.
        • Anyway, they look useful for being notified when a metastore does something.
    • MetaStoreEventListener
        • The following methods are invoked when the corresponding event occurs on a metastore:
            - onCreateTable
            - onDropTable
            - onAlterTable
            - onDropPartition
            - onAlterPartition
            - onCreateDatabase
            - onDropDatabase
            - onLoadPartitionDone
        • If you need more details, see org.apache.hadoop.hive.metastore.MetaStoreEventListener.
    • Be careful!
        • Hooks
            - can be a critical failure point! (You had better catch runtime exceptions.)
            - are performed synchronously.
            - can affect query processing time.
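One way to act on these warnings is to keep the hook body inside a guard that catches runtime exceptions and notices slow runs, so a bug in a hook is logged instead of failing the query. The sketch below shows the idea in plain Java. `SafeHookSketch`, `runSafely`, and the `SLOW_MILLIS` threshold are hypothetical names and values, and in a real hook this guard would live inside the hook's own method.

```java
import java.util.function.Consumer;

class SafeHookSketch {
    /** Hypothetical threshold above which a hook run is reported as slow. */
    static final long SLOW_MILLIS = 500;

    /**
     * Runs the hook body, logging (not rethrowing) runtime exceptions.
     * Returns true if the body completed normally.
     */
    static <C> boolean runSafely(String hookName, C context, Consumer<C> hookBody) {
        long start = System.nanoTime();
        try {
            hookBody.accept(context);
            return true;
        } catch (RuntimeException e) {
            // A failing hook should not take the query down with it.
            System.err.println("hook " + hookName + " failed, ignoring: " + e);
            return false;
        } finally {
            // Hooks run synchronously, so their time is added to the query time.
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMillis > SLOW_MILLIS) {
                System.err.println("hook " + hookName + " was slow: " + elapsedMillis + " ms");
            }
        }
    }

    public static void main(String[] args) {
        boolean ok = runSafely("audit", "SELECT 1", cmd -> System.out.println("audited " + cmd));
        boolean bad = runSafely("broken", "SELECT 1", cmd -> {
            throw new IllegalStateException("boom");
        });
        System.out.println(ok + " " + bad);
    }
}
```

Swallowing the exception is a design choice for hooks that only observe. A hook that is meant to veto a query would rethrow instead.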
    • Let's try it out
        • Demo
            - Don't be surprised if it doesn't work.
            - That's the way demos are...
    • Thanks!
        • Questions?
        • Resources
            - https://cwiki.apache.org/confluence/display/Hive/
            - https://github.com/apache/hive