Published in: Technology

  1. HiveServer2 (Oct. 2013, Schubert Zhang)
  2. Hive Evolution
     • Originally
       • Let users express their queries in a high-level language without having to write MapReduce programs.
       • Mainly targeted at ad-hoc queries.
       • As a data tool, usually run in CLI mode.
     • Now, more …
       • A parallel SQL DBMS that happens to use Hadoop for its storage and execution layers.
       • Ad-hoc + regular queries
       • As a service …
  3. Introduction
     • Limitations of HiveServer1
       • Concurrency
       • Security
       • Client Interface
       • Stability
     • Sessions/Concurrency
       • The old Thrift API and server implementation didn't support concurrency.
     • xDBC
       • The old Thrift API didn't support common xDBC.
     • Authentication/Authorization
       • Incomplete implementations
     • Auditing/Logging
     HiveServer2:
     • Available from hive-0.11 / CDH4.1
     • Redesigned and re-implemented (HIVE-2935).
     • HiveServer2 is a container for the Hive execution engine (Driver).
     • For each client connection, it creates a new execution context (Connection and Session) that serves Hive SQL requests from the client.
     • The new RPC interface enables the server to associate this Hive execution context with the thread serving the client's request.
  4. Architecture
     • System architecture (diagram). In fact, the Driver lives in the Operation context.
     • Authentication architecture (diagram; not discussed here).
     • Diagrams @Cloudera
  5. Architecture: Internal (diagram)
     • Core Contexts
       • Connections
       • Sessions
       • Operations
     • Operation Path
       • Client-1, Client-2, … connect through the Thrift RPC Iface.
       • hiveServer2 (main entry) starts thriftCLIService (TThreadPoolServer, implements the client RPC Iface), which calls listen() and accept() for new client connections and processes each one in its own thread (threads for client connections).
       • These threads call cliService through the ICLIService internal interface. cliService holds the real implementations of the various operations: open/close sessions, run operations in existing sessions …
       • sessionManager (handleToSessionMap) holds the sessions (HiveSession interface), each with its own HiveConf and SessionState. runAsync work goes to the backgroundOperationPool (threads for async operations).
       • operationManager (handleToOperationMap) creates and runs operations. A SQLOp runs sync or async and creates and runs a Hive Driver.
     • Operation types
       • SQLOp/SetOp/DfsOp/AddResourceOp/DeleteResourceOp …
       • GetTypeInfoOp/GetCatalogsOp/GetSchemasOp/GetTablesOp/GetTableTypesOp/GetColumnsOp/GetFunctionsOp …
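The runAsync path on this slide (submit the work to a background pool, hand a handle back at once, let the client poll) might be sketched as below. This is an illustrative Python model, not Hive code: `BackgroundRunner` and its methods are hypothetical names, and the mapping from a future's status to operation states is an assumption made for the sketch.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class BackgroundRunner:
    """Hypothetical stand-in for backgroundOperationPool plus a handle map."""

    def __init__(self, max_threads=50):
        self._pool = ThreadPoolExecutor(max_workers=max_threads)
        self._futures = {}  # operation handle -> Future

    def submit(self, work):
        """Queue `work` (a zero-argument callable) and return a handle at once."""
        handle = uuid.uuid4()
        self._futures[handle] = self._pool.submit(work)
        return handle

    def state(self, handle):
        """Translate the Future's status into an operation-state name."""
        future = self._futures[handle]
        if future.cancelled():
            return "CANCELED"
        if future.done():
            return "ERROR" if future.exception() else "FINISHED"
        return "RUNNING" if future.running() else "PENDING"

    def result(self, handle, timeout=None):
        """Block until the operation finishes, then return its result."""
        return self._futures[handle].result(timeout=timeout)

    def shutdown(self):
        """Wait for queued work to finish and release the threads."""
        self._pool.shutdown(wait=True)
```

A caller would keep the handle returned by `submit` and poll `state` until it reports FINISHED or ERROR before fetching results.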
  6. Architecture: Server Context
     • Client
     • Connection (Thread)
     • Session (-> HiveConf, SessionState)
     • Operation (-> Driver)
     • Usually, a client opens only one Session in a Connection. (Refer to the JDBC HiveDriver: HiveConnection.)
     • Diagram labels: Client-1, Connection-1 (Thread); Client-2, Connection-2 (Thread); Session-11, Session-12; Op-121 (SQL) -> Driver, Op-122, Op-123 (SQL) -> Driver
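The nesting of contexts on this slide (sessions looked up by handle, operations looked up by handle and owned by a session) can be modelled in a few lines. The sketch below is a simplified Python illustration under assumed names (`ServerContext`, `Session`, `Operation`); it does not reproduce the Hive classes, and it leaves out threads and the Driver entirely.

```python
import uuid


class Operation:
    """One unit of work; in HiveServer2 a SQL operation would own a Driver."""

    def __init__(self, session_handle, statement):
        self.handle = uuid.uuid4()
        self.session_handle = session_handle
        self.statement = statement


class Session:
    """Per-session state: its own configuration copy and its operation handles."""

    def __init__(self, overrides=None):
        self.handle = uuid.uuid4()
        self.conf = dict(overrides or {})
        self.op_handles = set()


class ServerContext:
    """Hypothetical holder of the two handle maps."""

    def __init__(self):
        self.handle_to_session = {}
        self.handle_to_operation = {}

    def open_session(self, overrides=None):
        session = Session(overrides)
        self.handle_to_session[session.handle] = session
        return session.handle

    def execute_statement(self, session_handle, statement):
        session = self.handle_to_session[session_handle]  # KeyError if unknown
        op = Operation(session_handle, statement)
        self.handle_to_operation[op.handle] = op
        session.op_handles.add(op.handle)
        return op.handle

    def close_operation(self, op_handle):
        op = self.handle_to_operation.pop(op_handle)
        self.handle_to_session[op.session_handle].op_handles.discard(op_handle)

    def close_session(self, session_handle):
        # Closing a session also closes and removes all of its operations.
        session = self.handle_to_session.pop(session_handle)
        for op_handle in list(session.op_handles):
            self.handle_to_operation.pop(op_handle, None)
        session.op_handles.clear()
```

Keeping both maps flat and keyed by handle is what lets any RPC call find its session or operation from the handle alone.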
  7. New Client API
     • TCLIService.thrift
     • Complete API
       • Complete Database API
         • Think about JDBC/ODBC
         • To be compatible with existing DB software
       • Hive Specific API
     • Best Practice
     • Client API vs. Internal API
       • Converting and Isolation
     API calls:
     • Session
       • OpenSession: Client request to open a new session. A new HiveSession is created in the server and a unique SessionHandler (UUID) is returned. All other calls depend on this session.
       • CloseSession: Client request to close the session. Also closes and removes all operations in this session.
     • SQL and Hive Command Operation
       • ExecuteStatement: Execute an HQL statement.
         • SQLOp: some SQL statements can be tagged "runAsync"; they are then executed in a dedicated thread and the call returns immediately.
         • Hive commands: SetOp, DfsOp, AddResourceOp, DeleteResourceOp.
     • DB Metadata Operation
       • GetInfo *: Get various global variables of Hive. (Key-Type -> Value)
       • GetTypeInfo: Get the detailed description and constraints of the data types.
       • GetCatalogs: Does nothing so far.
       • GetSchemas: Get schemas from the metastore.
       • GetTables: Get table schemas from the metastore.
       • GetTableTypes: Get the table types, e.g. MANAGED_TABLE, EXTERNAL_TABLE, VIRTUAL_VIEW, INDEX_TABLE.
       • GetColumns: Get the columns of a table from the metastore.
       • GetFunctions: Get the UDF functions.
     • Operation for Operation
       • GetOperationStatus: Get the state of an operation by opHandler: INITIALIZED/RUNNING/FINISHED/CANCELED/CLOSED/ERROR/UNKNOWN/PENDING.
       • CancelOperation: Cancel a RUNNING or PENDING operation by opHandler. For SQLOp, do cleanup: close and destroy the Hive Driver, delete temp output files, and cancel the task running in the background thread …
       • CloseOperation: Remove this operation and close it: for SQLOp, do cleanup; for HiveCommandOp, tearDownSessionIO.
     • Get Result
       • GetResultSetMetadata: Get the result set's schema, such as the column titles.
       • FetchResults: Fetch the result rows from the real result set.
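The operation states returned by GetOperationStatus, and the rule that CancelOperation applies only to RUNNING or PENDING operations, can be written down as a small state check. This is a minimal Python sketch, not the Hive implementation: the enum mirrors the state names on the slide, while `cancel` and `CANCELLABLE` are hypothetical names introduced for illustration.

```python
from enum import Enum


class OperationState(Enum):
    INITIALIZED = "INITIALIZED"
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    CANCELED = "CANCELED"
    CLOSED = "CLOSED"
    ERROR = "ERROR"
    UNKNOWN = "UNKNOWN"


# Only these states may be cancelled, as stated for CancelOperation.
CANCELLABLE = frozenset({OperationState.RUNNING, OperationState.PENDING})


def cancel(state):
    """Return the state after a cancel request, or raise if it is not allowed."""
    if state not in CANCELLABLE:
        raise ValueError(f"cannot cancel an operation in state {state.value}")
    return OperationState.CANCELED
```

A client polling GetOperationStatus would stop once it sees FINISHED, CANCELED, CLOSED, or ERROR, then call FetchResults or CloseOperation as appropriate.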
  8. Code
     • Packages
       • org.apache.hive.service …, top project of apache…
     • Pros
       • Clear implementation
       • Decoupling of HiveServer2 and HiveCore
       • Decoupling of the Thrift Client API and internal code
     • Cons
       • Too many design patterns.
       • Inconsistent principles in some places.
       • HiveServer2 and HiveCore are still not completely decoupled.
       • The JDBC Driver package/jar still relies on much other core code, such as Hive -> Hadoop and the libs … (maybe because of the support for Embedded Mode).
  9. Code (class diagram)
     • Service: +init() +start() +stop() +register(): StateChangeListener
     • AbstractService: +state, +HiveConf (global, set by init())
     • CompositeService: +serviceList, +addService() +removeService()
     • HiveServer2: +main() (entry point)
     • TCLIService.Iface: +OpenSession() +CloseSession() +GetInfo() +ExecuteStatement() +...() +FetchResults()
     • ThriftCLIService / ThriftBinaryService: +cliService, TThreadPoolServer
     • ICLIService: +openSession() +closeSession() +getInfo() +executeStatement() +...() +fetchResults()
     • CLIService: +sessionManager
     • SessionManager: +handleToSession: HashMap, +operationManager, +backgroundOperationPool (FixedThreadPool), +openSession() +closeSession() +getSession() +...() +submitBackgroundOperation()
     • OperationManager: +handleToOperation: HashMap, +newExecuteStatementOperation() +newGetTypeInfoOperation() +...() +addOperation() +removeOperation() +getOperation() +getOperationState() +cancelOperation() +closeOperation() +getOperationNextRowSet() +...()
     • HiveSession / HiveSessionImpl: +sessionHandle, +hiveConf (new for each), +sessionState (new for each), +opHandleSet, +getSessionHandle() +getInfo() +executeStatement() +executeStatementAsync() +...() +fetchResults()
     • Operation: +opHandle, +parentSession, +state, +getState() +setState() +run() +getNextRowSet() +close() +cancel() +...()
       • GetInfoOperation, GetSchemasOperation, XXXOperation
       • ExecuteStatementOperation: SQLOperation, AddResourceOperation, DeleteResourceOperation, DfsOperation, SetOperation
     This is just a quick view; it may not be exact in some details and intentionally omits some less important items.
  10. HiveCore and Depending Hive Env.?
     • HiveConf
       • Global instance
       • Instance for each Session.
       • The client can inject additional key-value configurations at OpenSession.
       • Set an explicit session name (id) to control the download directory name.
     • Hive SessionState
       • Instance for each Session.
     • Hive Driver
       • Instance for each SQL Operation.
     • Global static variables?
       • ??
     • SetOperation -> SetProcessor
       • set env: env variables cannot be set.
       • set system: global, System.getProperties().setProperty(..)
         • Should we forbid system settings? Or allow only an administrator to make them?
       • set hiveconf: instanced (per session).
       • set hivevar: instanced (per session).
       • set: instanced (per session).
     • AddResourceOperation and DeleteResourceOperation
       • SessionState.add_resource/delete_resource
       • DOWNLOADED_RESOURCES_DIR("hive.downloaded.resources.dir", System.getProperty("") + File.separator + "${}_resources")
     • DfsOperation
       • Auth. with HDFS?
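The scoping rules listed for the set command (env is read-only, system is process-wide, hiveconf and hivevar live on the session) could be modelled as follows. This is a hedged Python sketch rather than the SetProcessor code: `SessionVars`, `apply_set`, and `SYSTEM_PROPERTIES` are hypothetical names, and sending an unprefixed key to the session's hiveconf is an assumption of this example.

```python
# Stands in for the JVM-wide System properties shared by every session.
SYSTEM_PROPERTIES = {}


class SessionVars:
    """Per-session variable stores (one instance per session)."""

    def __init__(self):
        self.hiveconf = {}
        self.hivevar = {}


def apply_set(session, assignment):
    """Apply 'key=value' and report whether the change is global or per session."""
    key, sep, value = assignment.partition("=")
    if not sep:
        raise ValueError("expected key=value")
    key, value = key.strip(), value.strip()
    if key.startswith("env:"):
        raise PermissionError("env variables cannot be set")
    if key.startswith("system:"):
        SYSTEM_PROPERTIES[key[len("system:"):]] = value
        return "global"
    if key.startswith("hivevar:"):
        session.hivevar[key[len("hivevar:"):]] = value
        return "session"
    if key.startswith("hiveconf:"):
        key = key[len("hiveconf:"):]
    # Assumption: an unprefixed key is treated like a hiveconf key.
    session.hiveconf[key] = value
    return "session"
```

The "global" return value marks exactly the case the slide questions: one session changing a system property changes it for every other session in the same server process.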
  11. Handler (Identifier)
     • SessionHandler
     • OperationHandler
     • Use UUID
     Thrift IDL:
       struct THandleIdentifier {
         // 16 byte globally unique identifier
         // This is the public ID of the handle and
         // can be used for reporting.
         1: required binary guid,
         // 16 byte secret generated by the server
         // and used to verify that the handle is not
         // being hijacked by another user.
         2: required binary secret,
       }
     Now, only the public ID is used; it's OK.
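The two fields of THandleIdentifier suggest a simple check: look up the handle by its public guid, then compare the secret. The Python sketch below shows one way such a check might look. It is an assumption-laden illustration (the slide notes that only the public ID is used for now), and `HandleIdentifier`, `HandleRegistry`, `issue`, and `verify` are hypothetical names.

```python
import hmac
import secrets
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class HandleIdentifier:
    guid: bytes    # 16-byte public ID, usable for reporting
    secret: bytes  # 16-byte server-generated secret


class HandleRegistry:
    """Hypothetical server-side store of issued handles."""

    def __init__(self):
        self._secrets = {}  # guid -> secret

    def issue(self):
        """Create a new handle and remember its secret."""
        handle = HandleIdentifier(guid=uuid.uuid4().bytes,
                                  secret=secrets.token_bytes(16))
        self._secrets[handle.guid] = handle.secret
        return handle

    def verify(self, handle):
        """True only if the guid is known and the secret matches."""
        expected = self._secrets.get(handle.guid)
        if expected is None:
            return False
        # Constant-time comparison avoids leaking how many bytes matched.
        return hmac.compare_digest(expected, handle.secret)
```

With a check like this, knowing a guid from a log or report would not be enough to reuse another user's session or operation.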
  12. Configurations and Run
     Config:
     • hive.server2.transport.mode = binary | http | https
     • hive.server2.thrift.port = 10000
     • hive.server2.thrift.min.worker.threads = 5
     • hive.server2.thrift.max.worker.threads = 500
     • hive.server2.async.exec.threads = 50
     • hive.server2.async.exec.shutdown.timeout = 10 (seconds)
     • = true ???
     • hive.zookeeper.quorum =
     • …
     Run:
     • Start HiveServer2
       • bin/hiveserver2 &
     • Start CLI (uses standard JDBC)
       • bin/beeline
       • !connect jdbc:hive2://localhost:10000
       • show tables;
       • select * from tablename limit 10;
  13. Interface and Clients
     • RPC (TCLIService.thrift)
       • Binary protocol
       • HTTP/HTTPS protocol (to be researched)
     • New JDBC Driver
       • org.apache.hive.jdbc.HiveDriver
       • URL: jdbc:hive2://hostname:10000/dbname… (e.g. jdbc:hive2://localhost:10000/default)
       • Implements more API features.
     Third-party clients over JDBC:
     • CLI
       • Beeline, based on SQLine
     • IDE: SQuirreL SQL Client
     • Web client (e.g. H2 Web, etc.)
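The basic shape of the JDBC URL on this slide (scheme, host, port, database name) can be pulled apart with the standard library. The Python helper below is a small sketch: `parse_hive2_url` is a hypothetical name, falling back to port 10000 and database "default" is an assumption taken from the examples in these slides, and anything after the database name (the "…" in the URL pattern) is not handled.

```python
from urllib.parse import urlsplit


def parse_hive2_url(url):
    """Split jdbc:hive2://host:port/dbname into its basic parts."""
    prefix = "jdbc:"
    if not url.startswith(prefix + "hive2://"):
        raise ValueError("expected a URL starting with jdbc:hive2://")
    # Drop the "jdbc:" prefix so the rest parses like an ordinary URL.
    parts = urlsplit(url[len(prefix):])
    return {
        "host": parts.hostname,
        "port": parts.port or 10000,  # assumed default, from the config slide
        "database": parts.path.lstrip("/") or "default",  # assumed default
    }
```

For example, `parse_hive2_url("jdbc:hive2://localhost:10000/default")` yields the host `localhost`, the port `10000`, and the database `default`.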
  14. Client Tools: CLI (SQLine, Beeline)
  15. Client Tools: IDE (SQuirreL SQL Client)
  16. Client Tools: Web Client
  17. Think More …
     • Thinking of XX as a Platform
       • Standard JDBC/ODBC
       • RESTful API over HTTP, Web Service
       • AWS Redshift, SimpleDB …
     • Hive as a Service?
       • Request a cluster, run SQL ad hoc and regularly, with workflow and scheduling.
     • Language
       • SQL, R, Pig
     • Computing of Estimation, Probability …
  18. Thank You!