HiveServer2
HiveServer2 Presentation Transcript

  • HiveServer2, Oct. 2013, Schubert Zhang
  • Hive Evolution
    • Original
        • Let users express their queries in a high-level language without having to write MapReduce programs.
        • Mainly targeted at ad-hoc queries.
        • As a data tool, usually used in CLI mode.
    • Now more …
        • A parallel SQL DBMS that happens to use Hadoop for its storage and execution layers.
        • Ad-hoc + regular queries
        • As a service …
  • Introduction
    • Limitations of HiveServer1
        • Concurrency
        • Security
        • Client Interface
        • Stability
    • Sessions/Concurrency
        • The old Thrift API and server implementation didn't support concurrency.
    • xDBC
        • The old Thrift API didn't support common xDBC.
    • Authentication/Authorization
        • Incomplete implementations
    • Auditing/Logging
    • HiveServer2:
        • Available from hive-0.11 / CDH4.1
        • Reconstructed and re-implemented (HIVE-2935).
        • HiveServer2 is a container for the Hive execution engine (Driver).
        • For each client connection, it creates a new execution context (Connection and Session) that serves Hive SQL requests from the client.
        • The new RPC interface enables the server to associate this Hive execution context with the thread serving the client's request.
  • Architecture
    • [Figure: System Architecture. Annotation: in fact, the Driver lives in the Operation context.]
    • [Figure: Authentication Architecture (not discussed here)]
    • Source: http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive/ @Cloudera
  • Architecture: Internal
    • Core Contexts
        • Connections
        • Sessions
        • Operations
    • Operation Path (reconstructed from the flattened diagram)
        • HiveServer2 (main entry) starts the services.
        • Client-1, Client-2, … connect through the Thrift RPC Iface.
        • thriftCLIService (TThreadPoolServer, implements the client RPC Iface): listen() and accept() new client connections, and process each in its own thread (threads for client connections).
        • thriftCLIService calls cliService through the ICLIService internal interface.
        • cliService holds the real implementations of the various operations: open/close sessions, run operations in existing sessions, …
        • sessionManager (handleToSessionMap) holds the sessions. Each session (HiveSession interface) has its own HiveConf and SessionState.
        • sessionManager also owns the backgroundOperationPool (threads for async operations, used by runAsync).
        • operationManager (handleToOperationMap) creates and runs operations.
        • An SQLOp, sync or async, creates and runs a Hive Driver.
    • Operation types
        • SQLOp/SetOp/DfsOp/AddResourceOp/DeleteResourceOp …
        • GetTypeInfoOp/GetCatalogsOp/GetSchemasOp/GetTablesOp/GetTableTypesOp/GetColumnsOp/GetFunctionsOp …
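The bookkeeping described on this slide (a handle-to-session map, a handle-to-operation map, and a background pool for async operations) can be sketched in a few lines of Python. This is only an illustrative model, not Hive's code: the class and method names below are hypothetical, and the `run()` body stands in for creating and running a real Hive Driver.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class Operation:
    """One unit of work inside a session (hypothetical stand-in for SQLOp etc.)."""

    def __init__(self, session, statement):
        self.handle = uuid.uuid4()
        self.session = session
        self.statement = statement
        self.state = "INITIALIZED"
        self.result = None

    def run(self):
        self.state = "RUNNING"
        # A real SQLOp would create and run a Hive Driver here.
        self.result = f"ran: {self.statement}"
        self.state = "FINISHED"
        return self.result


class OperationManager:
    """Keeps the handle -> operation map."""

    def __init__(self):
        self.handle_to_operation = {}

    def new_operation(self, session, statement):
        op = Operation(session, statement)
        self.handle_to_operation[op.handle] = op
        return op

    def close_operation(self, handle):
        self.handle_to_operation.pop(handle).state = "CLOSED"


class Session:
    """Per-session context: its own configuration copy and its operation handles."""

    def __init__(self, conf):
        self.handle = uuid.uuid4()
        self.conf = dict(conf)
        self.op_handles = set()


class SessionManager:
    """Keeps the handle -> session map and the pool for async operations."""

    def __init__(self, async_threads=2):
        self.handle_to_session = {}
        self.operation_manager = OperationManager()
        self.background_pool = ThreadPoolExecutor(max_workers=async_threads)

    def open_session(self, conf=None):
        session = Session(conf or {})
        self.handle_to_session[session.handle] = session
        return session.handle

    def execute(self, session_handle, statement, run_async=False):
        session = self.handle_to_session[session_handle]
        op = self.operation_manager.new_operation(session, statement)
        session.op_handles.add(op.handle)
        if run_async:
            # Return immediately; the work runs on a background thread.
            return op.handle, self.background_pool.submit(op.run)
        op.run()
        return op.handle, None

    def close_session(self, session_handle):
        # Closing a session also closes and removes all of its operations.
        session = self.handle_to_session.pop(session_handle)
        for handle in list(session.op_handles):
            self.operation_manager.close_operation(handle)
        session.op_handles.clear()
```

A caller would open a session, pass the returned handle to `execute()`, and finally call `close_session()`, which mirrors the rule that closing a session removes its operations.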
  • Architecture: Server Context
    • Context hierarchy
        • Client
        • Connection (Thread)
        • Session (-> HiveConf, SessionState)
        • Operation (-> Driver)
    • Usually, a client only opens one Session in a Connection. (Refer to JDBC HiveDriver: HiveConnection.)
    • [Figure labels: Client-1, Connection-1 (Thread); Client-2, Connection-2 (Thread); Session-11, Session-12; Op-121 (SQL, Driver), Op-122, Op-123 (SQL, Driver)]
  • New Client API
    • TCLIService.thrift
    • Complete API
        • Complete Database API
            • Think about JDBC/ODBC
            • To be compatible with existing DB software
        • Hive Specific API
    • Best Practice
        • Client API vs. Internal API
        • Converting and Isolation
    • API calls (grouping reconstructed from the flattened table)
        • Session
            • OpenSession: Client request to open a new session. A new HiveSession is created in the server and a unique SessionHandle (UUID) is returned. All other calls depend on this session.
            • CloseSession: Client request to close the session. Also closes and removes all operations in this session.
        • SQL and Hive Command Operation
            • ExecuteStatement: Execute an HQL statement.
                • SQLOp: Some SQL statements can be tagged "runAsync"; they are then executed in a dedicated thread and the call returns immediately.
                • Hive command operations: SetOp, DfsOp, AddResourceOp, DeleteResourceOp
        • DB Metadata Operation
            • GetInfo *: Get various global variables of Hive. (Key-Type->Value)
            • GetTypeInfo: Get the detailed description and constraints of data types.
            • GetCatalogs: Does nothing so far.
            • GetSchemas: Get schemas from the metastore.
            • GetTables: Get table schemas from the metastore.
            • GetTableTypes: Get the table types, e.g. MANAGED_TABLE, EXTERNAL_TABLE, VIRTUAL_VIEW, INDEX_TABLE.
            • GetColumns: Get the columns of a table from the metastore.
            • GetFunctions: Get the UDF functions.
        • Operation for Operation
            • GetOperationStatus: Get the state of an operation by its OperationHandle: INITIALIZED/RUNNING/FINISHED/CANCELED/CLOSED/ERROR/UNKNOWN/PENDING.
            • CancelOperation: Cancel a RUNNING or PENDING operation by its OperationHandle. For SQLOp, do cleanup: close and destroy the Hive Driver, delete temp output files, and cancel the task running in the background thread…
            • CloseOperation: Remove this operation and close it: for SQLOp, do cleanup; for HiveCommandOp, tearDownSessionIO.
        • Get Result
            • GetResultSetMetadata: Get the result set's schema, such as the column titles.
            • FetchResults: Fetch the result rows from the real result set.
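The operation states named above, and the rule that only a RUNNING or PENDING operation can be cancelled, can be modelled as a small state machine. The sketch below is in Python and is hypothetical throughout: the state names and the cancel rule come from the slide, but the `ALLOWED` transition table and the `TrackedOperation` class are assumptions made for illustration and are not taken from Hive's source.

```python
from enum import Enum


class OpState(Enum):
    INITIALIZED = "INITIALIZED"
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    CANCELED = "CANCELED"
    CLOSED = "CLOSED"
    ERROR = "ERROR"
    UNKNOWN = "UNKNOWN"


# Assumed transition table, for illustration only.
ALLOWED = {
    OpState.INITIALIZED: {OpState.PENDING, OpState.RUNNING, OpState.CLOSED},
    OpState.PENDING: {OpState.RUNNING, OpState.CANCELED, OpState.ERROR, OpState.CLOSED},
    OpState.RUNNING: {OpState.FINISHED, OpState.CANCELED, OpState.ERROR, OpState.CLOSED},
    OpState.FINISHED: {OpState.CLOSED},
    OpState.CANCELED: {OpState.CLOSED},
    OpState.ERROR: {OpState.CLOSED},
    OpState.CLOSED: set(),
    OpState.UNKNOWN: set(),
}


class TrackedOperation:
    """Hypothetical operation that only tracks its state."""

    def __init__(self):
        self.state = OpState.INITIALIZED

    def set_state(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state

    def cancel(self):
        # Per the slide: only RUNNING or PENDING operations can be cancelled.
        if self.state not in (OpState.RUNNING, OpState.PENDING):
            raise ValueError(f"cannot cancel an operation in state {self.state.name}")
        self.set_state(OpState.CANCELED)

    def close(self):
        self.set_state(OpState.CLOSED)
```

A client polling GetOperationStatus would see one of these state names; a finished operation rejects `cancel()` but can still be closed.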
  • Code
    • Packages
        • org.apache.hive.service …, top project of apache…
    • Pros
        • Clear implementation
        • Decoupling of HiveServer2 and HiveCore
        • Decoupling of the Thrift Client API and internal code
    • Cons
        • Too many design patterns.
        • Inconsistent principles in some places.
        • HiveServer2 and HiveCore are still not completely decoupled.
        • The JDBC Driver package/jar still relies on much other core code, such as Hive->Hadoop and the libs… (maybe because of the support for Embedded Mode).
  • Code: class overview (flattened class diagram; the grouping of members below is reconstructed and may not match the slide exactly)
    • Service, AbstractService: +state, +HiveConf: global, set by init(); +init(), +start(), +stop(), +register(): StateChangeListener
    • CompositeService: +serviceList; +addService(), +removeService()
    • HiveServer2: +main(): entry point
    • TCLIService.Iface, ThriftCLIService, ThriftBinaryService: +cliService, TThreadPoolServer; +OpenSession(), +CloseSession(), +GetInfo(), +ExecuteStatement(), +...(), +FetchResults()
    • ICLIService, CLIService: +sessionManager; +openSession(), +closeSession(), +getInfo(), +executeStatement(), +...(), +fetchResults()
    • SessionManager: +handleToSession: HashMap, +operationManager, +backgroundOperationPool (FixedThreadPool); +openSession(), +closeSession(), +getSession(), +...(), +submitBackgroundOperation()
    • HiveSession, HiveSessionImpl: +sessionHandle, +hiveConf: new for each, +sessionState: new for each, +opHandleSet; +getSessionHandle(), +getInfo(), +executeStatement(), +executeStatementAsync(), +...(), +fetchResults()
    • OperationManager: +handleToOperation: HashMap; +newExecuteStatementOperation(), +newGetTypeInfoOperation(), +...(), +addOperation(), +removeOperation(), +getOperation(), +getOperationState(), +cancelOperation(), +closeOperation(), +getOperationNextRowSet(), +...()
    • Operation: +opHandle, +parentSession, +state; +getState(), +setState(), +run(), +getNextRowSet(), +close(), +cancel(), +...()
    • Operation classes: GetInfoOperation, ExecuteStatementOperation, SQLOperation, AddResourceOperation, DeleteResourceOperation, DfsOperation, SetOperation, GetSchemasOperation, XXXOperation
    • Note: This is just a quick view. It may not be exact in some details, and it intentionally leaves out less important parts.
  • HiveCore and Depending Hive Env.?
    • HiveConf
        • Global instance
        • Instance for each Session.
        • Client can inject additional key-value style configurations at OpenSession.
        • Set an explicit session name (id) to control the download directory name.
    • Hive SessionState
        • Instance for each Session.
    • Hive Driver
        • Instance for each SQL Operation.
    • Global static variables?
        • ??
    • SetOperation -> SetProcessor
        • set env: variables cannot be set.
        • set system: global, System.getProperties().setProperty(..)
            • We may forbid system setting? Or, only administrator can do it?
        • set hiveconf: instanced (per session).
        • set hivevar: instanced (per session).
        • set: instanced (per session).
    • AddResourceOperation and DeleteResourceOperation
        • SessionState.add_resource/delete_resource
        • DOWNLOADED_RESOURCES_DIR("hive.downloaded.resources.dir", System.getProperty("java.io.tmpdir") + File.separator + "${hive.session.id}_resources")
    • DfsOperation
        • Auth. with HDFS?
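The scoping rules listed for the set command (env: read-only, system: global, hiveconf: and hivevar: per session) can be illustrated with a small dispatcher. This Python sketch is a model only: `SessionVars` and `SetCommandError` are hypothetical names, and sending a bare key to the session's hiveconf map is an assumption, since the slide only says that a plain set is "instanced".

```python
class SetCommandError(Exception):
    """Raised when a variable cannot be set (hypothetical error type)."""


class SessionVars:
    """Per-session variable scopes plus one shared, process-wide scope."""

    def __init__(self, system_props):
        self.system_props = system_props  # shared by every session
        self.hiveconf = {}                # private to this session
        self.hivevar = {}                 # private to this session

    def set(self, key, value):
        if key.startswith("env:"):
            raise SetCommandError("env: variables cannot be set")
        if key.startswith("system:"):
            # Global side effect: visible to all sessions in the server.
            self.system_props[key[len("system:"):]] = value
        elif key.startswith("hiveconf:"):
            self.hiveconf[key[len("hiveconf:"):]] = value
        elif key.startswith("hivevar:"):
            self.hivevar[key[len("hivevar:"):]] = value
        else:
            # Assumption: a bare key is treated as session-level configuration.
            self.hiveconf[key] = value
```

Two sessions built over the same `system_props` dictionary show why the slide asks whether system settings should be forbidden or restricted to administrators: a `system:` set in one session is visible in the other, while `hiveconf:` and `hivevar:` values are not.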
  • Handle (Identifier)
    • SessionHandle
    • OperationHandle
    • Use UUID
    • Thrift IDL:
        struct THandleIdentifier {
          // 16 byte globally unique identifier
          // This is the public ID of the handle and
          // can be used for reporting.
          1: required binary guid,

          // 16 byte secret generated by the server
          // and used to verify that the handle is not
          // being hijacked by another user.
          2: required binary secret,
        }
    • Now, only the public ID is used; that is OK.
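To make the two fields concrete, here is a minimal Python sketch of issuing a handle with a 16-byte public guid and a 16-byte secret, and of the secret check that the IDL comment describes. It is illustrative only: `HandleIdentifier` and `HandleRegistry` are hypothetical names, and since the slide notes that only the public ID is used at the moment, the `verify()` step shows what such a check could look like rather than what the server does.

```python
import hmac
import secrets
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HandleIdentifier:
    # 16-byte public ID, safe to log and report.
    guid: bytes = field(default_factory=lambda: uuid.uuid4().bytes)
    # 16-byte secret known only to the server and the owning client.
    secret: bytes = field(default_factory=lambda: secrets.token_bytes(16))

    def public_id(self) -> str:
        """Render the guid in the usual UUID text form."""
        return str(uuid.UUID(bytes=self.guid))


class HandleRegistry:
    """Server-side table of issued handles (hypothetical)."""

    def __init__(self):
        self._by_guid = {}

    def issue(self) -> HandleIdentifier:
        handle = HandleIdentifier()
        self._by_guid[handle.guid] = handle
        return handle

    def verify(self, guid: bytes, secret: bytes) -> bool:
        known = self._by_guid.get(guid)
        # Constant-time comparison so the check does not leak the secret.
        return known is not None and hmac.compare_digest(known.secret, secret)
```

A client that presents the right guid with the wrong secret is rejected, which is the hijacking case the IDL comment refers to.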
  • Configurations and Run
    • Config:
        • hive.server2.transport.mode = binary | http | https
        • hive.server2.thrift.port = 10000
        • hive.server2.thrift.bind.host
        • hive.server2.thrift.min.worker.threads = 5
        • hive.server2.thrift.max.worker.threads = 500
        • hive.server2.async.exec.threads = 50
        • hive.server2.async.exec.shutdown.timeout = 10 (seconds)
        • hive.support.concurrency = true ???
        • hive.zookeeper.quorum =
        • …
    • Run:
        • Start HiveServer2
            • bin/hiveserver2 &
        • Start CLI (use standard JDBC)
            • bin/beeline
            • !connect jdbc:hive2://localhost:10000
            • show tables;
            • select * from tablename limit 10;
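These settings normally live in hive-site.xml as property entries with a name and a value. As a hedged illustration, the Python sketch below renders a dictionary of the values shown on the slide into that layout; `render_hive_site` is a hypothetical helper, and the chosen values are simply the ones listed above, not recommendations.

```python
import xml.etree.ElementTree as ET


def render_hive_site(settings):
    """Render a name -> value mapping as a hive-site.xml style document."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Values copied from the slide; transport mode "binary" is one of the listed options.
SLIDE_SETTINGS = {
    "hive.server2.transport.mode": "binary",
    "hive.server2.thrift.port": 10000,
    "hive.server2.thrift.min.worker.threads": 5,
    "hive.server2.thrift.max.worker.threads": 500,
    "hive.server2.async.exec.threads": 50,
    "hive.server2.async.exec.shutdown.timeout": 10,
}

if __name__ == "__main__":
    print(render_hive_site(SLIDE_SETTINGS))
```

Settings the slide leaves open (the bind host, the ZooKeeper quorum, and the questioned hive.support.concurrency) are deliberately left out of the dictionary.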
  • Interface and Clients
    • RPC (TCLIService.thrift)
        • Binary protocol
        • HTTP/HTTPS protocol (to be researched)
    • New JDBC Driver
        • org.apache.hive.jdbc.HiveDriver
        • URL: jdbc:hive2://hostname:10000/dbname… (jdbc:hive2://localhost:10000/default)
        • Implements more API features.
    • Third-party clients over JDBC:
        • CLI
            • Beeline, based on SQLine
        • IDE: SQuirreL SQL Client
        • Web Client (e.g. H2 Web, etc.)
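The URL shape shown above (jdbc:hive2://hostname:10000/dbname) can be built and taken apart with a few lines of Python. This is a sketch under assumptions: both helper functions are hypothetical, falling back to port 10000 and database "default" mirrors the examples on the slides, and the extra parameters that the slide elides after "dbname…" are not handled.

```python
from urllib.parse import urlsplit

PREFIX = "jdbc:hive2://"


def build_hive2_url(host, port=10000, database="default"):
    """Build a HiveServer2 JDBC URL in the form shown on the slide."""
    return f"{PREFIX}{host}:{port}/{database}"


def parse_hive2_url(url):
    """Split a simple jdbc:hive2:// URL into host, port and database."""
    if not url.startswith(PREFIX):
        raise ValueError(f"not a jdbc:hive2 URL: {url!r}")
    # Reuse the generic URL parser on the part after the JDBC prefix.
    parts = urlsplit("//" + url[len(PREFIX):])
    return {
        "host": parts.hostname,
        "port": parts.port or 10000,                    # assumed default
        "database": parts.path.lstrip("/") or "default",  # assumed default
    }
```

For example, `parse_hive2_url("jdbc:hive2://localhost:10000")`, the string used with Beeline's !connect earlier, yields the host, the port, and the fallback database name.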
  • Client Tools: CLI SQLine, Beeline
  • Client Tools: IDE SQuirreL SQL Client
  • Client Tools: Web Client
  • Think More …
    • Thinking of XX as Platform
        • Standard JDBC/ODBC
        • RESTful API over HTTP, Web Service
        • AWS Redshift, SimpleDB …
    • Hive as a Service?
        • http://www.qubole.com/
        • Request a cluster, run SQL ad-hoc and regularly, workflow and scheduling.
    • Language
        • SQL, R, Pig
        • Computing of Estimation, Probability …
  • Thank You!