Apache Hive micro guide - ConfusedCoders

A small hive walk through for absolute beginner introduction to Apache Hive.



Transcript

  • 1. Apache Hive Walkthrough YASH SHARMA - ConfusedCoders
  • 2. Table of Contents

    Introducing Apache Hive
      Starting Notes
      What is Apache Hive?
      Motivation for Hive
      Some History
    Deep Dive into Hive
      Hive Architecture
      Hive Data Model
      Hive Query Language (HiveQL)
      Hive Web Interface
      Hive Authorizations
    Getting Hands Dirty – Hive Hands On
      Hive Installation
      Sample Hive Queries
      Hive SerDe
      Hive JDBC
      Changing Hive default Data store
      Hive User Defined Functions (UDF)
  • 3. Introducing Apache Hive

    Starting Notes
    This is a Hive kick-start guide that covers basic Hive concepts and shows how Hive works on top of Hadoop. The tutorial expects you to be familiar with Hadoop basics and assumes a preliminary understanding of Map-Reduce programming.

    What is Apache Hive?
    Apache Hive is an open source data warehouse system for Hadoop. Hive is an abstraction over Apache Hadoop that gives users an SQL-like query interface. The user sees only table-like data and fires Hive queries against it; Hive internally creates, plans and executes Map-Reduce jobs and returns the desired results. Hive is suitable for both structured and semi-structured data.

    Motivation for Hive
    The prime motivation for Hive was to let users quickly build business solutions in a familiar language, rather than having to think in Map-Reduce for every problem. Hive uses its own query language, HiveQL, which is very similar to traditional SQL.

    Some History
    Hive is written in Java, was initially developed at Facebook, is released under the Apache 2.0 License, and is now also developed by companies such as Netflix. At the time of writing, the current version of Hive is 0.9.0, which is compatible with Hadoop 0.20.x and 1.x.
  • 4. Small Dive into Hive

    Hive Architecture
    Below is a diagram showing the Hive architecture and its components. Hive works on top of Hadoop and needs Apache Hadoop to be running on your box. A quick note on the components:
    • Hadoop – Hive needs Hadoop as the base framework to operate.
    • Driver – Hive has its own driver to communicate with the Hadoop world.
    • Command Line Interface (CLI) – the Hive CLI is the console for firing Hive queries; it is what we will use for operating on our data.
    • Web Interface – Hive also provides a web interface to monitor and administer Hive jobs.
    • Metastore – the Metastore holds all the structural information (schemas, partitions, locations) of the various tables and partitions in Hive.
    • Thrift Server – Hive ships with a Thrift server through which Hive can be exposed as a service and then connected to via JDBC/ODBC etc.
  • 5. Apart from the components above, a few more vital actors are responsible for query execution:
    • Hive Compiler – responsible for the semantic analysis of the input query and for creating an execution plan. The execution plan is a DAG of stages.
    • Hive Execution Engine – executes the plan created by the Hive Compiler.
    • Optimizer – Hive also has an optimizer that rewrites the execution plan for efficiency.

    Hive Data Model
    Data in Hive is organized into three data models:
    • Tables – similar to tables in a relational database.
    • Partitions – every table can have one or more partition keys, and the data is stored in separate files based on the partition key. Without a partition the entire data set is fed to the MR job, whereas specifying a partition key passes only a small subset of the data to the MR jobs. Hive creates a separate directory for each partition to hold its data.
    • Buckets – the data in each partition may again be divided into buckets based on hash values. Each bucket is stored as a file in the partition directory.

    Hive Query Language (HiveQL)
    HiveQL is very similar to SQL, and users can use familiar SQL syntax for loading and querying tables. Hive queries are checked by the compiler for correctness and an execution plan is created from them; the Hive execution engine then runs the plan. The Hive Language Manual can be found here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual

    Hive Web Interface
    The Hive web interface is an alternative to the command line interface and can be used for administering and monitoring Hive jobs. By default you can view it here: http://localhost:9999/hwi
    Note: your Hive server must be running in order to view the Hive Web Interface.
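    A sketch tying the data-model concepts above together: a hypothetical table that uses both a partition key and buckets. All table, column and bucket-count choices here are illustrative, not from the original deck:

    ```sql
    -- PARTITIONED BY dt: Hive creates one directory per distinct dt value
    -- CLUSTERED BY userid INTO 32 BUCKETS: rows within each partition are
    -- hashed on userid into 32 files inside the partition directory
    CREATE TABLE page_views (
      userid  BIGINT,
      url     STRING,
      ip      STRING
    )
    PARTITIONED BY (dt STRING)
    CLUSTERED BY (userid) INTO 32 BUCKETS
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    ```

    A query that filters on dt (e.g. WHERE dt = '2012-01-01') then reads only that partition's directory instead of scanning the whole table.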
  • 6. Hive Authorizations
    The Hive authorization system consists of users, groups and roles. A role is a named set of grants that can be reused; a role may be assigned to users, to groups, and to other roles. Hive roles must be created manually before being used. Users and groups need not be created manually: the Metastore determines the username of the connecting user and the groups associated with her. More on users, groups and roles can be found here: https://cwiki.apache.org/Hive/languagemanual-auth.html
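    Assuming authorization is enabled in your Hive configuration, creating a role and granting it to a user might look like the sketch below. The role name 'analyst' and user name 'hiveuser' are made up for illustration:

    ```sql
    -- Roles must be created manually before use
    CREATE ROLE analyst;
    -- Attach a grant to the role, then hand the role to a user
    GRANT SELECT ON TABLE movies TO ROLE analyst;
    GRANT ROLE analyst TO USER hiveuser;
    -- Inspect what the role can do
    SHOW GRANT ROLE analyst ON TABLE movies;
    ```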
  • 7. Getting Hands dirty – Hive Hands On Hive Installation This is a short crisp guide to installing Hive on top of your Hadoop setup. Apache Hadoop is a pre- requisite for Apache Hive and must be installed on your box before we can proceed. 1. Download Hive: You can download ‘Apache Hive Here’. 2. Extract Hive: Extract your Hive archive to any location of your choice. 3. Export Environment Variables & Path: Hive needs three environment variables set: • JAVA_HOME: the path where your java is present. • HADOOP_HOME: path where your Hadoop is present. • HIVE_HOME: path where you’ve just extracted your hive. Set these environment variables accordingly to continue. You can export environment variables by the shell command: $> export HADOOP_HOME=/home/ubuntu/hadoop/ $> export HIVE_HOME=/home/ubuntu/hive/ $> export PATH=$PATH:$HADOOP_PATH/bin $> export PATH=$PATH:$HIVE_PATH/bin 4. Create Warehouse directory for Hive: Hive stores all its data in a directory /user/hive/warehouse/. So let’s create the path for it. $> sudo mkdir -P /user/hive/warehouse 5. Start Hadoop: Hive needs Hadoop running for its operations. Let’s start Hadoop by the start-all script. Since you have set the HADOOP_HOME/bin to PATH, you must be able to call the start-all.sh directly. Else you might have to go to the directory and issue the start-all.sh (inside bin). $> start-all.sh 6. Start Hive: Finally start hive by issuing the command ‘hive’. $> hive7Page hive> show databases;
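    The exports from step 3 can be collected into one sourceable snippet. All paths here are illustrative (JAVA_HOME in particular depends on where your JVM lives); adjust them to your own install locations:

    ```shell
    #!/bin/sh
    # Illustrative paths - point these at your actual Java, Hadoop and Hive installs
    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
    export HADOOP_HOME=/home/ubuntu/hadoop
    export HIVE_HOME=/home/ubuntu/hive
    # Put the hadoop and hive launchers on the PATH
    export PATH="$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin"
    ```

    Sourcing this from your shell profile saves re-exporting the variables in every new terminal.
    
    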
  • 8. You can also start the Hive server with the command:
       $> hive --service hiveserver

    Sample Hive Queries
    Below are sample Hive queries to create tables and import data into them. The queries assume the data files are present on your local file system, with the following formats:

    File: movies.dat
    movie_id:movie_name:tags

    File: ratings.dat
    user_id:movie_id:rating:timestamp

    Download the MovieLens dataset here: http://www.grouplens.org/node/73

    CREATE TABLES:
    CREATE TABLE movies (movie_id int, movie_name string, tags string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';

    CREATE TABLE ratings (user_id string, movie_id string, rating float, tmstmp string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';

    LOAD DATA FROM LOCAL PATH (not HDFS):
    LOAD DATA LOCAL INPATH '/home/ubuntu/workspace/data/movies.dat' OVERWRITE INTO TABLE movies;
    LOAD DATA LOCAL INPATH '/home/ubuntu/workspace/data/ratings.dat' OVERWRITE INTO TABLE ratings;

    VERIFY:
    DESCRIBE movies;
    DESCRIBE ratings;
  • 9. Or:
    SELECT * FROM movies LIMIT 10;
    SELECT * FROM ratings LIMIT 10;

    Hive SerDe
    The queries above work fine with single-character delimiters, but many a time we face situations with a complex or multi-character delimiter. In these scenarios we need a Hive SerDe to get our data into Hive tables. Here is a sample SerDe usage for a USER file that has '::' as the field delimiter. The data in the USER file has this format:

    id::gender::age::occupation::zipcode

    With the RegexSerDe we specify a regular expression that divides each line of data into fields.

    Query using SerDe:
    CREATE TABLE USER (id INT, gender STRING, age INT, occupation STRING, zipcode INT)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "(.*)::(.*)::(.*)::(.*)::(.*)"
    );

    Hive Partitions
    Hive partitions can be created on any particular column. Partitions make your queries faster because Hive keeps the data for each partition in a separate directory and reads only the directories a query needs. We can create a partitioned table with the query below; here we choose the date column as the partition key. Note that the partition column is declared only in the PARTITIONED BY clause, not in the regular column list:

    create table table_name (
      id int,
      name string
    )
    partitioned by (date string);

    Hive JDBC
    After starting the Hive Thrift server, Hive is exposed as a service and we can connect to it via JDBC. Connecting to Hive via JDBC is similar to connecting to any relational DB like MySQL. Below is sample code demonstrating a few common Hive queries:

    import java.sql.SQLException;
    import java.sql.Connection;
    import java.sql.ResultSet;
  • 10.
    import java.sql.Statement;
    import java.sql.DriverManager;

    public class HiveJDBC {
        private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

        public static void main(String[] args) throws SQLException {
            try {
                Class.forName(driverName);
            } catch (ClassNotFoundException e) {
                e.printStackTrace();
                System.exit(1);
            }
            Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();
            String tableName = "testHiveDriverTable";
            stmt.executeQuery("drop table " + tableName);
            ResultSet res = stmt.executeQuery("create table " + tableName + " (city string, temperature int)");

            // show tables
            String sql = "show tables '" + tableName + "'";
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);
            if (res.next()) {
                System.out.println(res.getString(1));
            }

            // describe table
            sql = "describe " + tableName;
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);
            while (res.next()) {
                System.out.println(res.getString(1) + "\t" + res.getString(2));
            }

            // load data into the table
            // NOTE: the file path has to be local to the Hive server
            // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
            String filepath = "/home/ubuntu/yash_workspace/data";
            sql = "load data local inpath '" + filepath + "' into table " + tableName;
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);

            // select * query
            sql = "select * from " + tableName;
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);
            while (res.next()) {
                System.out.println(res.getString(1) + "\t" + res.getInt(2));
            }

            // regular hive query
            sql = "select count(1) from " + tableName;
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);
            while (res.next()) {
                System.out.println(res.getString(1));
            }
            stmt.close();
            con.close();
        }
    }
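    Stepping back to the RegexSerDe from slide 9: its "input.regex" pattern can be sanity-checked outside Hive with plain Java, since the SerDe splits each line with exactly this kind of regex match. The sample row below is made up for illustration:

    ```java
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexSerDeCheck {
        public static void main(String[] args) {
            // The same pattern used in the SERDEPROPERTIES above
            Pattern rowPattern = Pattern.compile("(.*)::(.*)::(.*)::(.*)::(.*)");
            // A hypothetical USER line: id::gender::age::occupation::zipcode
            Matcher m = rowPattern.matcher("1::F::23::engineer::90210");
            if (m.matches()) {
                // Each capture group becomes one column of the Hive table
                for (int i = 1; i <= m.groupCount(); i++) {
                    System.out.println("field " + i + " = " + m.group(i));
                }
            }
        }
    }
    ```

    Checking the regex this way before creating the table saves a round trip through a failed Hive load.
    
    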
  • 11. This JDBC example fires two select queries on a newly created table in Hive and prints the output to the console. Other, more complex queries are left for the reader to explore.

    Changing Hive default Data store
    Hive by default uses the Derby database as its data store, and many a time we may need to change this to some other database. A couple of configuration changes are needed; here we will change the default data store to MySQL. Below are the steps for using a MySQL database as Hive's data store:

    • Download the MySQL JDBC driver and copy the jar file into Hive's lib folder.
    • Create a 'metastore_db' database in MySQL.
    • Create a user 'hiveuser' in MySQL.
    • Grant all permissions to 'hiveuser':
      GRANT ALL ON *.* TO 'hiveuser'@'localhost' IDENTIFIED BY 'your_password';
    • Add the following configuration tags to hive-site.xml:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost/metastore_db?createDatabaseIfNotExists=true</value>
      <description>The JDBC connection string for the metastore_db you just created</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
      <description>Driver class name for JDBC</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
      <description>The DB username we just created in MySQL</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>your_password</value>
  • 12.
      <description>Password for your user – hiveuser</description>
    </property>

    Hive User Defined Functions (UDF)
    Hive allows users to create their own user defined functions and use them in Hive queries. You need to extend Hive's UDF class to create your own UDF. Below is a small UDF for auto-increment functionality; it can also be used to get the ROWNUM of a table in Hive:

    package com.confusedcoders.hive.udf;

    import org.apache.hadoop.hive.ql.exec.UDF;

    public class AutoIncrUdf extends UDF {
        int lastValue;

        public int evaluate() {
            lastValue++;
            return lastValue;
        }
    }

    USAGE:
    add jar /home/ubuntu/Desktop/HiveUdf.jar;
    create temporary function incr as 'com.confusedcoders.hive.udf.AutoIncrUdf';
    SELECT userid, incr() as rownum FROM USER LIMIT 10;
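    The counter logic of the UDF above can be exercised outside Hive with a minimal plain-Java harness (no Hive dependency; the class below simply mirrors evaluate()). One caveat worth knowing: the counter lives in a single UDF instance, so it only yields a contiguous ROWNUM when the query runs in a single task:

    ```java
    public class AutoIncrCheck {
        // Mirrors AutoIncrUdf: a per-instance counter starting at 0
        private int lastValue;

        public int evaluate() {
            lastValue++;
            return lastValue;
        }

        public static void main(String[] args) {
            AutoIncrCheck incr = new AutoIncrCheck();
            // Each call bumps the counter, just as Hive calls evaluate() per row
            for (int i = 0; i < 3; i++) {
                System.out.println(incr.evaluate()); // prints 1, 2, 3
            }
        }
    }
    ```
    
    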