Published in: Technology
  1. Prepared by Vetri.V

     What is Hive?
       • Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.
       • Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL.
       • HiveQL also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
       • Hive is an abstraction on top of MapReduce: it allows users to query data in the Hadoop cluster without knowing Java or MapReduce. It uses the HiveQL language, which is very similar to SQL.

     Some points about Hive:
       • Hive was originally developed at Facebook
       • Provides a very SQL-like language
       • Can be used by people who know SQL
       • Enabling Hive requires almost no extra work by the system administrator
       • Hive 'layers' table definitions on top of data in HDFS
       • Hive tables are stored in Hive's 'warehouse' directory in HDFS (by default, /user/hive/warehouse)
       • Tables are stored in subdirectories of the warehouse directory
       • Actual data is stored in flat files: control character-delimited text, or SequenceFiles

     Hive is a data warehousing tool on top of Hadoop. Its language resembles SQL:
       • SQL-like queries
       • SHOW TABLES, DESCRIBE, DROP TABLE
  2.   • CREATE TABLE, ALTER TABLE
       • SELECT, INSERT

     Hive Limitations:
       • Not all 'standard' SQL is supported
       • No support for UPDATE or DELETE
       • No support for INSERTing single rows
       • Relatively limited number of built-in functions
       • No datatypes for date or time; use the STRING datatype instead. Newer versions add date and time datatypes.

     Hive Architecture

     Metastore:
       • Stores the system catalog.
     Driver:
       • Manages the life cycle of a HiveQL query as it moves through Hive; also manages the session handle and session statistics.
     Query compiler:
       • Compiles HiveQL into a directed acyclic graph of map/reduce tasks.
     Execution engine:
       • Executes the tasks in proper dependency order; interacts with Hadoop.
     Hive Server:
  3.   • Provides a Thrift interface and JDBC/ODBC for integrating other applications.
     Client components:
       • CLI, web interface, JDBC/ODBC interface.
     Extensibility interfaces:
       • SerDe, User Defined Functions and User Defined Aggregate Functions.

     Hive Installation and Metastore with MySQL:
       • Store Hive metadata in an RDBMS.
       • The Metastore stores all the information about the tables: their partitions, schemas, columns and column types, table locations, etc.
       • This information can be queried or modified through a Thrift interface, so it can be called from clients written in different programming languages.
       • Because this information must be served quickly to the compiler, the Hive developers chose to store it in a traditional RDBMS.
       • By default, Hive uses an embedded Derby database to store metadata. The metastore is the "glue" between Hive and HDFS.
       • It tells Hive where your data files live in HDFS, what type of data they contain, what tables they belong to, etc.
       • The Metastore is an application that runs on an RDBMS and uses an open source ORM layer called DataNucleus to convert object representations into a relational schema and vice versa.
       • This approach was chosen over storing the information in HDFS because the Metastore needs very low latency. The DataNucleus layer also makes it possible to plug in many different RDBMS technologies.
       • Facebook uses MySQL to store metadata.
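The role of the metastore described above (mapping a table name to its HDFS location and typed columns) can be illustrated with a toy catalog. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in RDBMS, and the two-table schema and the function names build_toy_metastore, register_table and describe_table are assumptions made for illustration, not the real Hive metastore schema or API.

```python
import sqlite3


def build_toy_metastore():
    """Create an in-memory catalog. The schema is a simplified assumption,
    not the schema that Hive/DataNucleus actually generates."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tables (
            tbl_id   INTEGER PRIMARY KEY,
            tbl_name TEXT UNIQUE,
            location TEXT
        );
        CREATE TABLE columns (
            tbl_id    INTEGER,
            idx       INTEGER,
            col_name  TEXT,
            type_name TEXT
        );
        """
    )
    return conn


def register_table(conn, name, location, columns):
    """Record where a table's files live and what its columns are."""
    cur = conn.execute(
        "INSERT INTO tables (tbl_name, location) VALUES (?, ?)", (name, location)
    )
    tbl_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO columns VALUES (?, ?, ?, ?)",
        [(tbl_id, i, col, typ) for i, (col, typ) in enumerate(columns)],
    )
    conn.commit()


def describe_table(conn, name):
    """Return (location, [(column, type), ...]) for a table name."""
    row = conn.execute(
        "SELECT tbl_id, location FROM tables WHERE tbl_name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    tbl_id, location = row
    cols = conn.execute(
        "SELECT col_name, type_name FROM columns WHERE tbl_id = ? ORDER BY idx",
        (tbl_id,),
    ).fetchall()
    return location, cols


if __name__ == "__main__":
    catalog = build_toy_metastore()
    register_table(catalog, "temp", "/user/hive/warehouse/temp", [("info", "int")])
    print(describe_table(catalog, "temp"))
```

The point of the sketch is the lookup direction: a query compiler asks the catalog for a table name and gets back a file location plus column types, which is why a low-latency relational store suits the job.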
     Store Hive metadata in MySQL:
       • Add the following to /etc/profile:

           export JAVA_HOME=/usr/java/jdk1.7.0_17
           export PATH=$PATH:/usr/java/jdk1.7.0_17/bin
           export HADOOP_HOME=/opt/hadoop
           export PATH=$PATH:/opt/hadoop/bin
           export HIVE_HOME=/opt/hive
           export PATH=$HADOOP_HOME/bin:$HIVE_HOME/bin:$PATH

       • Go to the Hadoop home directory and create the HDFS directories that Hive needs, making them group-writable:

           $ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
           $ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
           $ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
           $ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
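The HDFS directory setup above can also be scripted. A minimal sketch, assuming the hadoop binary lives under $HADOOP_HOME/bin as in the exports above; the names hdfs_setup_commands and run_setup are illustrative helpers, not part of Hadoop or Hive. It only uses the fs -mkdir and fs -chmod g+w invocations already shown, and defaults to a dry run that prints the commands instead of executing them.

```python
import subprocess

# Directories Hive expects in HDFS, taken from the setup steps above.
HIVE_HDFS_DIRS = ["/tmp", "/user/hive/warehouse"]


def hdfs_setup_commands(hadoop_home):
    """Build the mkdir and chmod commands as argument lists."""
    hadoop = hadoop_home.rstrip("/") + "/bin/hadoop"
    commands = [[hadoop, "fs", "-mkdir", d] for d in HIVE_HDFS_DIRS]
    commands += [[hadoop, "fs", "-chmod", "g+w", d] for d in HIVE_HDFS_DIRS]
    return commands


def run_setup(hadoop_home, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on the first failure."""
    for cmd in hdfs_setup_commands(hadoop_home):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # /opt/hadoop matches the HADOOP_HOME exported above.
    run_setup("/opt/hadoop", dry_run=True)
```

Passing dry_run=False would run the same four commands against a live cluster; check=True makes the script stop if any of them fails.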
  4. We need to perform the following steps to set up a metastore in a MySQL server.

      1. Install mysql-server:

           sudo apt-get install mysql-server
           (or, on CentOS)
           yum install mysql-server

      2. Use the su command. You will see a password prompt. Type your root user password.
      3. If your password is correct, you will see a prompt like this: root@yourdesktopname#
      4. Now start the MySQL server:

           /etc/init.d/mysql start

      5. When the MySQL server has started, type:

           mysql -u root mysql

      6. Next, create a new MySQL user for hadoop/hive:

           mysql> CREATE USER 'hadoop'@'localhost' IDENTIFIED BY 'hadoop';
           mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'localhost' WITH GRANT OPTION;
           mysql> exit;

      7. Switch to that user (in my case userName is hadoop):

           su - userName

      8. Log in to MySQL as that user:

           mysql -h localhost -u userName -p

      9. Now we need to change the Hive configuration so it can use MySQL.
     10. Go to the hive/conf directory, then open hive-default.xml and make the following changes.

           <!-- In local metastore setup, each Hive Client will open a connection
                to the datastore and make SQL queries against it. -->
           <property>
             <name>hive.metastore.local</name>
             <value>true</value>
           </property>
           <property>
             <name>javax.jdo.option.ConnectionURL</name>
             <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
           </property>
           <property>
             <name>javax.jdo.option.ConnectionDriverName</name>
             <value>com.mysql.jdbc.Driver</value>
           </property>
           <property>
  5.         <name>javax.jdo.option.ConnectionUserName</name>
             <value>Username</value><!-- In my case UserName is hadoop -->
           </property>
           <property>
             <name>javax.jdo.option.ConnectionPassword</name>
             <value>UserPassword</value><!-- In my case password is hadoop -->
           </property>

         (OR) create hive-site.xml and paste the following. If your hive-default.xml carries a warning that changes to it are ignored, use this option.

           <?xml version="1.0"?>
           <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
           <configuration>
           <!-- Hive Execution Parameters -->
           <!-- In local metastore setup, each Hive Client will open a connection
                to the datastore and make SQL queries against it. -->
           <property>
             <name>hive.metastore.local</name>
             <value>true</value>
           </property>
           <property>
             <name>javax.jdo.option.ConnectionURL</name>
             <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
           </property>
           <property>
             <name>javax.jdo.option.ConnectionDriverName</name>
             <value>com.mysql.jdbc.Driver</value>
           </property>
           <property>
             <name>javax.jdo.option.ConnectionUserName</name>
             <value>hadoop</value><!-- In my case UserName is hadoop -->
           </property>
           <property>
  6.         <name>javax.jdo.option.ConnectionPassword</name>
             <value>hadoop</value><!-- In my case password is hadoop -->
           </property>
           </configuration>

     11. Hive needs the MySQL JDBC driver, so download it and copy it to the hive/lib folder.
     12. If all settings are correct, we can do this:

           impetus@ubuntu:~$ cd $HIVE_HOME
           impetus@ubuntu:~/hive$ bin/hive
           Hive history file=/tmp/ankit/hive_job_log_ankit_201102211937_456962737.txt
           hive> create table temp(info INT);
           OK
           Time taken: 4.745 seconds
           hive> show tables;
           OK
           temp
           Time taken: 0.137 seconds
           hive>

     13. Log in to the MySQL server again:

           vetri@ubuntu:~$ mysql -h localhost -u hadoop -p
           Enter password:
           Welcome to the MySQL monitor. Commands end with ; or \g.
           Your MySQL connection id is 84
           Server version: 5.1.41-3ubuntu12.8 (Ubuntu)
           Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
           mysql>

     14. Use the following command to view all existing databases:

           mysql> show databases;
           +--------------------+
           | Database           |
           +--------------------+
           | information_schema |
  7.       | hive               |
           | mysql              |
           +--------------------+
           3 rows in set (0.05 sec)

           mysql> use hive;
           Reading table information for completion of table and column names
           You can turn off this feature to get a quicker startup with -A
           Database changed
           mysql>

         The use command selects the database.

     15. Use the following command to view all existing tables:

           mysql> show tables;
           +--------------------+
           | Tables_in_hive     |
           +--------------------+
           | BUCKETING_COLS     |
           | COLUMNS            |
           | DBS                |
           | NUCLEUS_TABLES     |
           | PARTITIONS         |
           | PARTITION_KEYS     |
           | PARTITION_KEY_VALS |
           | PARTITION_PARAMS   |
           | SDS                |
           | SD_PARAMS          |
           | SEQUENCE_TABLE     |
           | SERDES             |
           | SERDE_PARAMS       |
           | SORT_COLS          |
           | TABLE_PARAMS       |
           | TBLS               |
           +--------------------+
           16 rows in set (0.00 sec)
           mysql>

     16. Use the following command to view the metadata of Hive table columns:
  8.       mysql> select * from COLUMNS;
           +-------+---------+-------------+-----------+-------------+
           | SD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
           +-------+---------+-------------+-----------+-------------+
           |     1 | NULL    | info        | int       |           0 |
           +-------+---------+-------------+-----------+-------------+
           1 row in set (0.00 sec)
           mysql>

         Here info is the column of the temp table.

     17. Similarly, you can view the information about partitions, bucketing, tables, etc.

     Hive Applications, Components, Model and Layout

     Hive Applications:
       • Log processing
       • Text mining
       • Document indexing
       • Customer-facing business intelligence (e.g., Google Analytics)
       • Predictive modeling, hypothesis testing

     Hive Components:
       • Shell: allows interactive queries, like a MySQL shell connected to a database. Also supports web and JDBC clients.
       • Driver: session handles, fetch, execute
       • Compiler: parse, plan, optimize
       • Execution engine: DAG of stages (M/R, HDFS, or metadata)
       • Metastore: schema, location in HDFS, SerDe

     Data Model:
       • Tables: typed columns (int, float, string, date, boolean); also list and map types (for JSON-like data)
       • Partitions: e.g., to range-partition tables by date
       • Buckets: hash partitions within ranges (useful for sampling, join optimization)

     Metastore:
       • Database: namespace containing a set of tables
       • Holds table definitions (column types, physical layout)
       • Partition data
  9.   • Uses the JPOX ORM (since renamed DataNucleus) for implementation; can be stored in Derby, MySQL, and many other relational databases

     Physical Layout:
       • Warehouse directory in HDFS, e.g., /user/hive/warehouse
       • Tables are stored in subdirectories of the warehouse; partitions and buckets form subdirectories of tables
       • Actual data is stored in flat files: control char-delimited text, or SequenceFiles. With a custom SerDe, an arbitrary format can be used.

     Hive Command Line Interface:
     To launch the Hive shell, start a terminal and run:

           $ hive
           hive>

     Note: example is the table name used in all queries.

     Hive: Creating Tables

           hive> CREATE TABLE example (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
           hive> DESCRIBE example;
           hive> SHOW TABLES;

     Hive: Loading Data Into Hive
     Data is loaded into Hive with the LOAD DATA INPATH statement, which assumes that the data is already in HDFS:

           hive> LOAD DATA INPATH "file_txtdata.txt" INTO TABLE example;

     If the data is on the local filesystem, use LOAD DATA LOCAL INPATH, which automatically loads it into HDFS:

           hive> LOAD DATA LOCAL INPATH "file_txtdata.txt" INTO TABLE example;

     Hive: SELECT Queries
     Hive supports most familiar SELECT syntax:

           hive> SELECT * FROM example LIMIT 10;
           hive> SELECT * FROM example WHERE id > 100 ORDER BY name ASC LIMIT 10;

     Joining Tables

           -- The filtered column is missing from the original slide; e.id is assumed here.
           SELECT e.id, e.dep FROM example e JOIN sample s ON (e.dep = s.dep) WHERE e.id >= 20;

     Creating User-Defined Functions
     Custom scripts can be plugged into a query with TRANSFORM; the command given to USING receives the selected rows and emits the output columns:

           INSERT OVERWRITE TABLE u_data_new
           SELECT TRANSFORM (userid, movieid, rating, unixtime)
           USING 'python'
           AS (userid, movieid, rating, weekday)
           FROM u_data;
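The TRANSFORM query above streams rows to an external command, which by default receives them as tab-separated lines on standard input and writes tab-separated lines back. The slide does not name the script, so the following is a sketch under stated assumptions: a hypothetical file weekday_mapper.py that turns the unixtime column into a weekday number (1 = Monday), interpreting timestamps as UTC. The function names map_line and main are illustrative.

```python
import io
import sys
from datetime import datetime, timezone


def map_line(line):
    """Turn 'userid<TAB>movieid<TAB>rating<TAB>unixtime' into
    'userid<TAB>movieid<TAB>rating<TAB>weekday'."""
    userid, movieid, rating, unixtime = line.rstrip("\n").split("\t")
    # Assumption: timestamps are interpreted as UTC; isoweekday() gives 1..7, Monday = 1.
    weekday = datetime.fromtimestamp(float(unixtime), tz=timezone.utc).isoweekday()
    return "\t".join([userid, movieid, rating, str(weekday)])


def main(stream_in, stream_out):
    """Map every non-empty input line and write one output line per row."""
    for line in stream_in:
        if line.strip():
            stream_out.write(map_line(line) + "\n")


if __name__ == "__main__":
    # Demo on an in-memory sample row; a real TRANSFORM script would
    # call main(sys.stdin, sys.stdout) here instead.
    sample = io.StringIO("196\t242\t3\t0\n")
    main(sample, sys.stdout)
```

With such a file registered in the session (ADD FILE weekday_mapper.py;), the USING clause would name it, for example USING 'python weekday_mapper.py'. The file name is hypothetical; any script that follows the same stdin/stdout convention would work.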
  10. Join Query: sample

      1. Create the table:

           CREATE TABLE example (ID int, SUBJECT string, PRODUCT string, PERIOD int, START_TIME int, OPERATION string)
           ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
           STORED AS TEXTFILE;

      2. Load the data (save the file in the working directory first):

           hive> LOAD DATA LOCAL INPATH "file_txtdata.txt" INTO TABLE example;

      3. Join query: for each id, select the row with the largest start_time below 25.

           SELECT A.*
           FROM example A
           JOIN (
             SELECT id, max(start_time) AS start_time
             FROM example B
             WHERE start_time < 25
             GROUP BY id
           ) MAXSP
           ON A.id = MAXSP.id AND A.start_time = MAXSP.start_time;

      Using NOT IN / IN in a Hive query

           SELECT * FROM example WHERE NOT array_contains(array(7,6,5,4,2,12), id);

      ---THANK YOU---